Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models

Recruiting
18-75 years
Female
Phase N/A

Overview

This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.

De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.

Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility.

Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
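As an illustration of the benign-malignant endpoint, the sketch below computes sensitivity, specificity, and AUC from per-image labels and malignancy scores. The function names, the 1 = malignant / 0 = benign encoding, and the pairwise (Mann-Whitney) AUC formulation are assumptions made for this example; the study record does not specify how these metrics will be computed.

```python
from typing import Sequence


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity); assumed encoding: 1 = malignant, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a malignant case outscores a benign one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not study results.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(sensitivity_specificity(labels, preds), auc(labels, scores))
```

The pairwise AUC is quadratic in the number of images, which is acceptable at the scale described here (about 1,200 evaluation images) but would normally be replaced by a rank-based implementation for larger sets.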

Description

Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.

Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years) will independently annotate images per ACR BI-RADS v2025, with arbitration by a fifth expert for discordant cases.
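The annotation workflow can be sketched as follows. This is a minimal illustration that assumes "discordant" means any disagreement among the four independent reads; the study record does not define the threshold, and the function and parameter names are hypothetical.

```python
from typing import Callable, Sequence

# Dataset composition as stated in the study design.
SPLITS = {"evaluation": 1200, "ood_safety": 150, "prompt_development": 30}
TOTAL_IMAGES = sum(SPLITS.values())  # 1,380


def consensus_label(
    reads: Sequence[str],
    arbitrate: Callable[[Sequence[str]], str],
) -> tuple[str, bool]:
    """Return (final_label, was_arbitrated) for one image.

    `reads` holds the four independent BI-RADS categories; `arbitrate` stands in
    for the fifth expert. Assumption: any disagreement triggers arbitration.
    """
    if len(set(reads)) == 1:
        return reads[0], False
    return arbitrate(reads), True


# Example: three readers say 4A, one says 4B, and the arbiter settles on 4A.
print(consensus_label(["4A", "4A", "4B", "4A"], lambda reads: "4A"))
```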

Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
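The tier rule and two of the listed agreement and calibration metrics can be sketched as below. This is an illustrative implementation under stated assumptions: the senior expert validation step for Tier 3 is not modeled, ECE uses ten equal-width confidence bins (the bin count is not given in the record), and all function names are hypothetical.

```python
from typing import Sequence


def difficulty_tier(truth: str, cnn_pred: str, transformer_pred: str) -> int:
    """Tier 1: both baselines correct; Tier 2: exactly one correct; Tier 3: both incorrect."""
    n_correct = (cnn_pred == truth) + (transformer_pred == truth)
    return {2: 1, 1: 2, 0: 3}[n_correct]


def cohens_kappa(a: Sequence, b: Sequence) -> float:
    """Chance-corrected agreement between two label sequences of equal length.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between accuracy and mean confidence over equal-width bins."""
    n = len(confidences)
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece


# Toy values for illustration only, not study results.
print(difficulty_tier("malignant", "malignant", "benign"))
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0]))
```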

Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) Temperature-stability pre-experiment across parameter settings; (3) Thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
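One way the safety readouts might be summarized is sketched below. The `REJECT` sentinel, the assumption that each model response has already been parsed into either that sentinel or a BI-RADS category, and the all-runs-identical definition of stability are hypothetical choices for this example; the study record does not describe the scoring procedure.

```python
from typing import Sequence

REJECT = "REJECT"  # hypothetical sentinel for "image declined as non-diagnostic"


def rejection_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs in which the model declined to interpret the image.

    On the 150-image out-of-distribution set a high value is desirable; on the
    evaluation set the same quantity measures false rejections.
    """
    return sum(o == REJECT for o in outputs) / len(outputs)


def run_consistency(runs_per_case: Sequence[Sequence[str]]) -> float:
    """Fraction of cases whose repeated runs all produced the same output.

    Assumed stability measure for comparing temperature settings or
    reasoning modes across repeated calls on the same image.
    """
    stable = sum(1 for runs in runs_per_case if len(set(runs)) == 1)
    return stable / len(runs_per_case)


# Toy values for illustration only, not study results.
print(rejection_rate([REJECT, "4A", REJECT, REJECT]))
print(run_consistency([["4A", "4A", "4A"], ["3", "4A", "3"]]))
```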

Eligibility

Inclusion Criteria:

  • B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
  • Image quality adequate for clinical diagnosis with clear visualization of the region of interest
  • Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
  • Complete de-identification with removal of all personally identifiable information

Exclusion Criteria:

  • Severely degraded image quality precluding meaningful BI-RADS assessment
  • Duplicate images from the same patient (only the most representative image retained per lesion)
  • Images with residual personally identifiable information after de-identification processing
  • Cases with ambiguous, disputed, or unavailable pathological results
  • Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging
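The inclusion and exclusion criteria above could be applied to an image manifest roughly as follows. Every field name in `ImageRecord` is hypothetical, and the flags are assumed to be set upstream by human review (for example, image quality and duplicate selection); this is a sketch of the filtering logic only.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # All fields are hypothetical names for this illustration.
    mode: str                         # e.g. "B-mode", "Doppler", "elastography"
    quality_adequate: bool            # region of interest clearly visualized
    group: str                        # "normal", "benign", or "malignant"
    pathology_confirmed: bool         # unambiguous pathological result available
    normal_confirmed_by_senior: bool  # for the normal group only
    deidentified: bool                # no residual identifiable information
    is_duplicate: bool                # not the retained image for this lesion


def is_eligible(r: ImageRecord) -> bool:
    """Apply the listed criteria to one record."""
    if r.mode != "B-mode" or not r.quality_adequate:
        return False
    if not r.deidentified or r.is_duplicate:
        return False
    if r.group == "normal":
        return r.normal_confirmed_by_senior
    return r.pathology_confirmed


example = ImageRecord("B-mode", True, "benign", True, False, True, False)
print(is_eligible(example))
```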

Study details
    Breast Neoplasms
    Breast Diseases
    Ultrasonography

NCT07500428

Peking Union Medical College Hospital

13 May 2026
