Purpose This study evaluates the performance of Claude and GPT LLM Vision APIs for automated clinical questionnaire processing in spine surgery by comparing accuracy, efficiency, reproducibility, and cost-effectiveness.
Methods Clinical questionnaires from 56 patients (336 total pages) were processed using a Python 3.12-based system incorporating PDF preprocessing, image enhancement via OpenCV, and direct LLM Vision analysis. Both models were evaluated on 26 questionnaire items (1,456 data points) using accuracy comparison, processing time measurement, token utilization analysis, and intra-class correlation coefficient (ICC) assessment through three independent iterations.
Results GPT achieved 98.83% accuracy (1,439/1,456) compared to Claude's 97.94% (1,426/1,456). Both models processed questionnaires in 27 seconds per set, a 68% time reduction versus manual entry (85 seconds). GPT demonstrated a 59% cost advantage ($0.023 vs. $0.056 per questionnaire), while Claude showed superior reproducibility (ICC 0.98 vs. 0.96). GPT achieved 100% accuracy on 21 items versus 17 items for Claude. Error analysis showed that errors arose predominantly from handwriting recognition (52%) and image quality issues (28%), and 89% of errors were successfully flagged for review.
Conclusions Both models achieve clinical-grade performance exceeding 90% accuracy. GPT demonstrates superior accuracy and cost-effectiveness, while Claude provides better reproducibility. Model selection should be guided by institutional priorities regarding accuracy, reproducibility, and operational scale.
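The headline percentages in the Results above follow directly from the reported counts, times, and costs. The short Python sketch below recomputes them; the helper names `pct` and `relative_reduction` are illustrative and not part of the study's system.

```python
# Recompute the percentages quoted in the Results from the reported raw figures.
# Helper names are hypothetical; only the input numbers come from the abstract.

TOTAL_POINTS = 56 * 26  # 56 patients x 26 questionnaire items = 1,456 data points


def pct(part, whole):
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


def relative_reduction(baseline, new):
    """Return the percentage by which `new` is lower than `baseline`."""
    return round(100 * (baseline - new) / baseline, 1)


gpt_accuracy = pct(1439, TOTAL_POINTS)             # correct data points, GPT
claude_accuracy = pct(1426, TOTAL_POINTS)          # correct data points, Claude
time_saving = relative_reduction(85, 27)           # seconds: manual vs. automated
cost_advantage = relative_reduction(0.056, 0.023)  # USD per questionnaire

print(gpt_accuracy, claude_accuracy, time_saving, cost_advantage)
```

The unrounded results (about 68.2% and 58.9%) agree with the 68% and 59% reported in the abstract.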
Purpose To assess the volume of the fusion mass after posterior lumbar interbody fusion (PLIF) using a Hounsfield unit-based method.
Methods The present study was conducted within the framework of a prospective observational cohort study comparing the surgical outcomes of single-level PLIF for lumbar spinal stenosis (LSS) between a local bone group (LbG) and a local bone plus hydroxyapatite group (LbHa). The fusion material for each case was determined by the amount of available local bone. After the fusion material was chosen, patients were assigned to either the LbG group (n=20) or the LbHa group (n=20). The primary outcome was the fusion mass volume in each group.
Results We used a new Hounsfield unit-based method for volumetric assessment of the interbody fusion mass. There was no difference in fusion rate or fusion mass volume between the two groups.
Conclusions The Hounsfield unit method, that is, a CT-based summation method using cross-sectional slices, can be usefully applied to other areas of orthopaedics.
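One way to read the CT-based summation method is as a voxel count: on each cross-sectional slice, count the pixels in the region of interest whose Hounsfield unit (HU) value reaches a bone threshold, then multiply the total by the voxel size. The sketch below illustrates that idea only. The abstract does not give the threshold, the region of interest, or the voxel dimensions, so the function name, the threshold of 200 HU, and the toy data are all assumptions.

```python
# Minimal sketch of a slice-by-slice summation of voxels above an HU threshold.
# All names and numeric values are hypothetical, not taken from the study.

def fusion_mass_volume_mm3(slices, pixel_area_mm2, slice_thickness_mm, hu_threshold):
    """Sum, over all slices, the voxels whose HU value is >= hu_threshold,
    and convert the voxel count to a volume in cubic millimetres."""
    total_voxels = 0
    for hu_slice in slices:          # one 2-D grid of HU values per CT slice
        for row in hu_slice:
            total_voxels += sum(1 for hu in row if hu >= hu_threshold)
    return total_voxels * pixel_area_mm2 * slice_thickness_mm


# Toy region of interest: two slices of 2 x 3 pixels each (made-up HU values).
toy_slices = [
    [[50, 250, 300], [10, 220, 90]],
    [[400, 150, 199], [200, 30, 500]],
]

volume = fusion_mass_volume_mm3(
    toy_slices,
    pixel_area_mm2=0.25,     # assumed 0.5 mm x 0.5 mm pixels
    slice_thickness_mm=1.0,  # assumed slice thickness
    hu_threshold=200,        # assumed bone threshold
)
print(volume)
```

In practice the region of interest would be limited to the interbody space so that vertebral bone is not counted as fusion mass.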
Objectives The aim of this study was to monitor the quality of pedicle screw fixation using the cumulative summation (CUSUM) test.
Overview of Literature The CUSUM test has already been used in several surgical settings, including the assessment of outcomes in transplant, laparoscopic, and total hip replacement surgeries. However, no data are available on CUSUM analysis in spine surgery.
Methods Patients with lumbar spinal stenosis who underwent lumbar fusion surgery were included in this study. The primary outcome was the CUSUM analysis of pedicle screw insertion accuracy, used as a quality-control measure.
Results Seven of the 100 pedicle screw insertions in the lumbar fusion surgeries were considered to have failed. Throughout the monitoring period, the CUSUM test gave no indication that the quality of the pedicle screw fixation procedure was inadequate.
Conclusions This study demonstrates that the CUSUM test can be a useful tool for monitoring the quality of procedures related to spine surgery.
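The CUSUM monitoring described in this abstract can be illustrated with one common formulation. After each screw insertion, the running score rises by (1 - p0) for a failure and falls by p0 for a success, where p0 is the acceptable failure rate. An alarm is raised if the score reaches a decision limit. The abstract does not report p0, the decision limit, or the order of the failures, so the function names, parameter values, and toy sequences below are all assumptions.

```python
# Minimal CUSUM sketch for a binary success/failure procedure.
# Parameter values and outcome sequences are hypothetical.

def cusum_path(outcomes, acceptable_failure_rate):
    """Return the running CUSUM score after each procedure.

    `outcomes` is an iterable of booleans, True meaning the insertion failed.
    """
    path = []
    score = 0.0
    for failed in outcomes:
        score += (1 if failed else 0) - acceptable_failure_rate
        path.append(score)
    return path


def first_alarm(path, decision_limit):
    """Return the 1-based index of the first score at or above the limit,
    or None if the limit is never reached."""
    for index, score in enumerate(path, start=1):
        if score >= decision_limit:
            return index
    return None


# Toy series: ten insertions with a single failure at the end.
in_control = cusum_path([False] * 9 + [True], acceptable_failure_rate=0.1)
print(first_alarm(in_control, decision_limit=2.0))

# Toy series: three consecutive failures push the score past the limit.
out_of_control = cusum_path([True] * 3, acceptable_failure_rate=0.1)
print(first_alarm(out_of_control, decision_limit=2.0))
```

A score that drifts downward or hovers near zero indicates performance at or better than the acceptable rate, which corresponds to the "no indication of inadequate quality" finding reported above.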