Machine learning for medical imaging: methodological failures and recommendations for the future

https://www.nature.com/articles/s41746-022-00592-y