Machine learning for medical imaging: methodological failures and recommendations for the future


https://www.nature.com/articles/s41746-022-00592-y
