AI models often play a role in medical diagnosis, especially in the analysis of images such as X-rays. However, studies have found that these models don’t always perform well across demographic groups, usually performing worse for women and people of color.
These models have also been shown to develop some surprising abilities. In 2022, MIT researchers reported that artificial intelligence models could accurately predict a patient’s race from their chest X-rays, something that even the most skilled radiologists cannot do.
The research team has now found that the models that are most accurate at predicting demographics also show the largest “fairness gaps,” that is, discrepancies in their ability to accurately diagnose images of people of different races or genders. The findings suggest that these models may be taking “demographic shortcuts” when making diagnostic judgments, leading to incorrect results for women, Black patients, and other groups, the researchers say.
“It is well established that high-capacity machine-learning models are good predictors of human demographics, such as self-reported race, gender, or age. This paper re-demonstrates that capacity and then connects it to the lack of performance across groups, something that has never been done before,” says Marzyeh Ghassemi, an MIT assistant professor of electrical engineering and computer science, a member of MIT’s Institute for Medical Engineering and Science, and senior author of the study.
The researchers also found that they could retrain the models in ways that improved their fairness. However, these “debiasing” approaches worked best when the models were tested on the same types of patients they were trained on, such as patients from the same hospital. When the models were applied to patients from different hospitals, the fairness gaps re-emerged.
“I think the main takeaways are, first, that you should thoroughly evaluate any external models on your own data, because any fairness guarantees that model developers provide on their training data may not transfer to your population, and second, that whenever sufficient data is available, you should train models on your own data,” says Haoran Zhang, an MIT graduate student and one of the lead authors of the paper.
MIT graduate student Yuzhe Yang is also a lead author of the paper, which will appear in Nature Medicine. Judy Gichoya, an assistant professor of radiology and imaging sciences at Emory University School of Medicine, and Dina Katabi, the Thuan and Nicole Pham Professor of Electrical Engineering and Computer Science at MIT, are also authors of the paper.
Removing bias
As of May 2024, the FDA has approved 882 AI-enabled medical devices, 671 of which are designed for use in radiology. Since 2022, when Ghassemi and her colleagues showed that these diagnostic models can accurately predict race, they and other researchers have shown that such models are also very good at predicting gender and age, even though the models are not trained for those tasks.
“Many popular machine-learning models have superhuman demographic prediction ability: radiologists cannot detect self-reported race from a chest X-ray,” Ghassemi says. “These are models that are good at predicting disease, but during training they learn to predict other things that may not be desirable.”

In this study, the researchers set out to investigate why these models don’t perform as well for certain groups. In particular, they wanted to see whether the models were using demographic shortcuts to make predictions that ended up being less accurate for some groups. These shortcuts can arise in AI models when they use demographic attributes to determine whether a condition is present, rather than relying on other features of the images.
Using publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center (BIDMC) in Boston, the researchers trained models to predict whether patients had one of three medical conditions: fluid accumulation in the lungs, a collapsed lung, or an enlarged heart. They then tested the models on X-ray images that were held out from the training data.
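For readers who want a concrete picture of this kind of setup, the sketch below shows a minimal version: a standard image classifier fine-tuned to flag the three findings. The model choice (a DenseNet backbone), label names, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a multi-label chest X-ray classifier of the kind
# described above. Architecture, labels, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import models

# Fluid in the lungs, collapsed lung, enlarged heart (illustrative label names)
FINDINGS = ["pulmonary_edema", "pneumothorax", "cardiomegaly"]

model = models.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Linear(model.classifier.in_features, len(FINDINGS))

criterion = nn.BCEWithLogitsLoss()  # one independent yes/no prediction per finding
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """images: (B, 3, 224, 224) float tensor; labels: (B, 3) float tensor of 0/1."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```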
Overall, the models performed well, but most of them showed “fairness gaps,” that is, discrepancies between their accuracy rates for men and women, and for white and Black patients.
The models were also able to predict the gender, race, and age of the people whose X-rays they analyzed. Moreover, there was a significant correlation between each model’s accuracy at demographic prediction and the size of its fairness gap. This suggests that the models may be using demographic categorization as a shortcut for predicting disease.
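One way to make the idea of a fairness gap concrete is to compute, for a given model, the difference in a performance metric between subgroups. The sketch below does this with AUC on synthetic data; the choice of metric, the variable names, and the data are assumptions for illustration, not the study’s exact analysis.

```python
# Illustrative fairness-gap computation on synthetic data (not the study's data).
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_gap(y_true, y_score, group):
    """Largest difference in AUC across subgroups (e.g. female vs. male patients)."""
    aucs = [roc_auc_score(y_true[group == g], y_score[group == g])
            for g in np.unique(group)]
    return max(aucs) - min(aucs)

# Tiny synthetic example: scores that are systematically noisier for group 1,
# which lowers that group's AUC and opens up a gap.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)          # ground-truth labels
group = rng.integers(0, 2, 2000)      # subgroup membership
scores = y + rng.normal(0, np.where(group == 1, 1.5, 0.5))
print(f"fairness gap (AUC): {fairness_gap(y, scores, group):.3f}")
```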
The researchers then tried to reduce the fairness gaps using two types of strategies. For one set of models, they trained them to optimize “subgroup robustness,” meaning that the models are rewarded for better performance on the subgroup on which they perform worst, and penalized if their error rate for one group is higher than for the others.
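A minimal sketch of what such a worst-group objective can look like in code is shown below. It illustrates the general idea of subgroup-robust training, assuming a simple worst-group cross-entropy loss; it is not the paper’s exact method.

```python
# Minimal worst-group loss: each batch's loss is taken from the subgroup with
# the highest average error, pushing the model to improve where it does worst.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, groups):
    """logits: (B, C); labels: (B,) class indices; groups: (B,) integer subgroup ids."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [per_sample[groups == g].mean() for g in torch.unique(groups)]
    return torch.stack(group_losses).max()

# Inside a training loop (hypothetical variable names):
#   loss = worst_group_loss(model(x), y, g)
#   loss.backward(); optimizer.step()
```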
In another set of models, the researchers forced them to remove any demographic information from the images, using “group adversarial” approaches. The researchers found that both strategies worked quite well.
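A common way to implement the group adversarial idea is to add an auxiliary head that tries to predict the demographic group from the model’s internal features, with a gradient reversal layer that pushes the feature extractor to defeat that prediction. The sketch below illustrates this pattern; the architecture, class names, and loss weighting are assumptions, not the paper’s implementation.

```python
# Gradient-reversal sketch of group adversarial debiasing (illustrative only).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # flip gradients flowing to the encoder

class AdversarialDebiaser(nn.Module):
    def __init__(self, encoder, feat_dim, n_classes, n_groups, lam=1.0):
        super().__init__()
        self.encoder = encoder                            # e.g. a CNN backbone
        self.task_head = nn.Linear(feat_dim, n_classes)   # disease prediction
        self.group_head = nn.Linear(feat_dim, n_groups)   # demographic prediction
        self.lam = lam

    def forward(self, x):
        z = self.encoder(x)
        task_logits = self.task_head(z)
        group_logits = self.group_head(GradReverse.apply(z, self.lam))
        return task_logits, group_logits

# Training minimizes task loss + group loss; because of the reversal, the
# encoder effectively maximizes the group loss, stripping demographic
# information from its features.
```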
“For in-distribution data, existing state-of-the-art methods can be used to reduce fairness gaps without significant trade-offs in overall performance,” Ghassemi says. “Subgroup robust methods force models to be sensitive to mispredicting a specific group, while group adversarial methods try to remove group information entirely.”
It’s not always fairer
However, these approaches only worked when the models were tested on data from the same types of patients they were trained on – for example, only patients from the Beth Israel Deaconess Medical Center dataset.
When the researchers tested the models that had been “debiased” using BIDMC data on patients from five other hospital datasets, they found that the models’ overall accuracy remained high, but some of them exhibited large fairness gaps.
“If you debias the model on one group of patients, that fairness does not necessarily hold when you move to a new group of patients from a different hospital in a different location,” Zhang says.
This is worrying because in many cases hospitals use models developed from other hospitals’ data, especially when they buy an off-the-shelf model, the researchers said.
“We found that even state-of-the-art models that perform optimally on data similar to their training sets are not optimal in novel settings, that is, they do not make the best trade-off between overall and subgroup performance,” Ghassemi says. “Unfortunately, this is likely how models will actually be deployed. Most models are trained and validated on data from a single hospital or source and then deployed broadly.”
The researchers found that models debiased using group adversarial approaches showed slightly more fairness when tested on new groups of patients than models debiased using subgroup robustness methods. They now plan to develop and test additional methods to see whether they can create models that make fairer predictions on new datasets.
The results suggest that hospitals using these types of AI models should evaluate them on their own patient populations before deploying them to ensure they do not produce incorrect results for certain groups.
The research was funded by the Google Research Scholar Award Program, the Harold Amos Healthcare Workforce Development Program of the Robert Wood Johnson Foundation, RSNA Health Disparities, the Lacuna Fund, the Gordon and Betty Moore Foundation, the National Institute of Biomedical Imaging and Bioengineering, and the National Heart, Lung, and Blood Institute.
Journal Reference:
Yang, Y., et al. (2024). The limits of fair medical imaging AI in real-world generalization. Nature Medicine. doi.org/10.1038/s41591-024-03113-4.