GPT-4 increases screening accuracy in clinical trials and reduces costs

In a recent study published in the new monthly journal NEJM AI, a group of researchers from the United States evaluated the utility of a Generative Pre-trained Transformer (GPT)-4 system with retrieval-augmented generation (RAG) in improving the accuracy, efficiency, and reliability of screening participants for clinical trials involving patients with symptomatic heart failure.

Study: Retrieval-Augmented Generation-Enabled GPT-4 for Clinical Trial Screening. Image source: Treecha / Shutterstock

Background

Screening potential clinical trial participants is crucial to ensure eligibility based on specific criteria. Traditionally, this manual process requires the involvement of research staff and healthcare professionals, making it prone to human error, resource-intensive, and time-consuming. Natural language processing (NLP) can automate the extraction and analysis of data from electronic health records (EHRs) to boost accuracy and efficiency. However, conventional NLP struggles with complex, unstructured EHR data. Large language models (LLMs) such as GPT-4 have shown promise in medical applications. Further research is needed to refine the implementation of GPT-4 with RAG to ensure scalability, accuracy, and integration into various clinical trial settings.

About the study

In this study, the RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review) system was evaluated within the Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) study, which compared two remote care strategies for the treatment of patients with heart failure. Conventional cohort identification involved EHR searches and manual chart reviews by non-clinical licensed staff to assess six inclusion criteria and 17 exclusion criteria. RECTIFIER focused on one inclusion criterion and 12 exclusion criteria derived from unstructured data, resulting in 14 prompts.

Using Microsoft Dynamics 365, yes/no values for the criteria were captured during screening. An expert clinician provided "gold standard" answers for the 13 target criteria. Starting from a pool of 3,000 patients, the data were divided into development, validation, and test sets: 282 patients were used for validation, and 1,894 patients were included in the test set.

GPT-4 and GPT-3.5 Turbo were used, and a RAG architecture enabled effective handling of lengthy clinical notes. The notes were chunked and retrieved using a custom Python program and LangChain's recursive text-splitting strategy. Numerical vector representations (embeddings) were generated and indexed for similarity search using the Facebook AI Similarity Search (FAISS) library.
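The study describes this pipeline only at a high level. The sketch below illustrates one way such chunking and retrieval could be wired together with LangChain and FAISS; the chunk size, overlap, embedding model, and function names are illustrative assumptions, not the authors' code.

```python
# Illustrative RAG retrieval sketch: chunk a clinical note, embed the chunks,
# and pull back the passages most relevant to one eligibility question.
# Chunk size, overlap, and the embedding model are assumptions.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def build_note_index(note_text: str) -> FAISS:
    """Split one patient's notes into ~1,000-token chunks and index them."""
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000,   # the 1,000-token setting performed best for most criteria
        chunk_overlap=100,  # overlap is an assumed value
    )
    chunks = splitter.split_text(note_text)
    return FAISS.from_texts(chunks, OpenAIEmbeddings())

def retrieve_context(index: FAISS, criterion_question: str, k: int = 4) -> str:
    """Return the k note chunks most similar to an eligibility question."""
    docs = index.similarity_search(criterion_question, k=k)
    return "\n\n".join(doc.page_content for doc in docs)
```

The retrieved context would then be inserted into one of the criterion prompts before querying the model.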

Each of the 14 prompts was designed to elicit a "Yes" or "No" response. Statistical analysis included calculation of sensitivity, specificity, and accuracy, with the Matthews correlation coefficient (MCC) as the primary outcome measure. Costs were also analyzed, and performance was compared across demographic groups.
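As a concrete illustration of these statistics, the snippet below computes sensitivity, specificity, positive predictive value, accuracy, and the MCC from paired yes/no answers using scikit-learn; the function and variable names are hypothetical, not taken from the study.

```python
# Hypothetical metric calculation for one eligibility criterion.
# `gold` holds the expert clinician's answers (1 = Yes, 0 = No) and
# `predicted` holds RECTIFIER's (or the study staff's) answers.
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def screening_metrics(gold, predicted):
    tn, fp, fn, tp = confusion_matrix(gold, predicted, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),                      # positive predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": matthews_corrcoef(gold, predicted),  # primary outcome measure
    }

# Toy example: one disagreement out of six answers.
print(screening_metrics([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0]))
```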

Findings

In the validation set, note lengths ranged from 8 to 7,097 words, with 75.1% of notes containing 500 words or fewer and 92% containing 1,500 words or fewer. In the test set, the clinical notes of 26% of patients exceeded GPT-4's 128K-token context window limit. A 1,000-token chunk size outperformed a 500-token chunk size for 10 of the 13 criteria. Consistency analysis of the validation set showed agreement percentages ranging from 99.16% to 100%, with an accuracy standard deviation of 0% to 0.86%, indicating minimal variability and high consistency.

In the test set, both the COPILOT-HF study staff and RECTIFIER showed high sensitivity and specificity for the 13 target criteria. Sensitivity for individual questions ranged from 66.7% to 100% for study staff and from 75% to 100% for RECTIFIER. Specificity ranged from 82.1% to 100% for study staff and from 92.1% to 100% for RECTIFIER. The positive predictive value ranged from 50% to 100% for study staff and from 75% to 100% for RECTIFIER. Responses in both cases closely matched those of the expert clinician, with accuracy ranging from 91.7% to 100% (MCC, 0.644 to 1) for study staff and from 97.9% to 100% (MCC, 0.837 to 1) for RECTIFIER. RECTIFIER performed better on the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs. 91.7% and an MCC of 0.924 vs. 0.721.

Overall, sensitivity and specificity for determining eligibility were 90.1% and 83.6% for study staff and 92.3% and 93.9% for RECTIFIER. Sensitivity and specificity decreased when the inclusion and exclusion questions were combined into two prompts, or when GPT-3.5 with the same RAG architecture was used instead of GPT-4. Using GPT-4 without RAG in 35 patients, 15 of whom had been misclassified by RECTIFIER for the symptomatic heart failure criterion, slightly improved accuracy from 57.1% to 62.9%. There were no statistically significant differences in results based on race, ethnicity, or gender.

The cost per patient using RECTIFIER was 11 cents with the individual-question method and 2 cents with the combined-question method. Because of the much larger volume of note text that had to be processed, using GPT-4 and GPT-3.5 without RAG resulted in higher costs of $15.88 and $1.59 per patient, respectively.
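The cost gap follows directly from how much text each approach sends to the model. A rough cost model is sketched below; the per-token prices and token counts are placeholder assumptions for illustration, not figures reported in the study.

```python
# Back-of-the-envelope API cost model (all prices and token counts here are
# illustrative assumptions, not numbers from the study).
def cost_per_patient(prompt_tokens: int, completion_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the dollar cost of screening one patient."""
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (completion_tokens / 1000) * output_price_per_1k

# With RAG, only a handful of retrieved chunks accompany each criterion prompt,
# so the total prompt stays small; without RAG, the full note set is resent and
# cost scales roughly linearly with note length.
print(cost_per_patient(8_000, 150, 0.01, 0.03))    # hypothetical RAG-style usage
print(cost_per_patient(800_000, 150, 0.01, 0.03))  # hypothetical full-note usage
```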

Conclusions

Overall, RECTIFIER demonstrated high accuracy in screening patients for clinical trials, in some respects outperforming the conventional methods used by research staff, at a cost of just 11 cents per patient. In contrast, conventional screening methods in a Phase 3 trial can cost approximately $34.75 per patient. These findings suggest significant potential improvements in the efficiency of patient recruitment for clinical trials. However, automating screening raises concerns about potential risks, such as the loss of detailed patient and clinical-risk context, which requires careful implementation to balance benefits and risks.

