SMU Data Science Review


Breast cancer is prevalent among women in the United States. Screening is standard practice, but a radiologist must review the screening images to make a diagnosis. Diagnosis through mammography, the traditional screening method, currently has an accuracy of about 78% for women of all ages and demographics. A more recent and precise technique, Digital Breast Tomosynthesis (DBT), has shown more promise but is less well studied. A machine learning model trained on DBT images has the potential to improve the identification of breast cancer and reduce the time needed to diagnose a patient, leading to faster treatment. In this study, a Convolutional Neural Network (CNN) was trained on an open-source dataset from Duke of DBT images from patients with no tumor, a benign tumor, or a malignant tumor. The model was designed to identify the presence of a tumor (either malignant or benign) or its absence. Robust open-source datasets of medical images are scarce because deidentifying medical images is very time-intensive and labeling them requires the expertise of a medical professional, in this case a radiologist. The open-source dataset was small and imbalanced, so transfer learning, under-sampling of the more prevalent healthy-patient class, and image augmentation were used to improve prediction accuracy. Training a CNN is computationally expensive, so a high-compute VM environment with extensive RAM was created to facilitate learning the CNN's weights.
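
As a rough illustration of the under-sampling step mentioned above, the following minimal sketch balances a binary label list by randomly discarding examples from the more prevalent class. It is not the study's code: the function name `undersample`, the toy label counts, and the choice to reduce every class to the size of the rarest one are all assumptions made for illustration, since the abstract does not state the sampling ratio that was actually used.

```python
import random
from collections import defaultdict


def undersample(labels, seed=0):
    """Return sorted indices that keep each class at the size of the rarest class.

    Hypothetical helper for illustration; the study's actual ratio is not stated.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is cut down to the size of the smallest class.
    target = min(len(ids) for ids in by_class.values())

    kept = []
    for ids in by_class.values():
        kept.extend(rng.sample(ids, target))  # sample without replacement
    return sorted(kept)


# Toy labels (assumed): 0 = no tumor, 1 = tumor present (benign or malignant).
labels = [0] * 90 + [1] * 10
kept = undersample(labels)
balanced = [labels[i] for i in kept]
```

In this sketch the returned indices would be used to select the matching image files before training, so the discarded healthy-patient images are never loaded.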

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License