Overview
This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.
De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.
Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility.
Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
Description
Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.
Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (\<5 years of experience) and two senior radiologists (\>15 years) will independently annotate images per ACR BI-RADS v2025 with arbitration by a fifth expert for discordant cases.
Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) Temperature-stability pre-experiment across parameter settings; (3) Thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
Eligibility
Inclusion Criteria:
- B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
- Image quality adequate for clinical diagnosis with clear visualization of the region of interest
- Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with \>15 years of breast ultrasound experience (for the normal group)
- Complete de-identification with removal of all personally identifiable information
Exclusion Criteria:
- Severely degraded image quality precluding meaningful BI-RADS assessment
- Duplicate images from the same patient (only the most representative image retained per lesion)
- Images with residual personally identifiable information after de-identification processing
- Cases with ambiguous, disputed, or unavailable pathological results
- Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging
