Multisite, multi-diagnosis benchmarks for evaluating frontier radiology AI models and vision-language models.
RSNA Benchmarks is a community-driven initiative to establish standardized, reproducible evaluation frameworks for frontier radiology AI models. As vision-language models rapidly advance, the field needs rigorous, multi-center benchmarks that reflect real-world clinical complexity.
Our benchmarks are designed to be multisite and multi-diagnosis, drawing data and expertise from institutions worldwide. Each benchmark targets a specific clinical domain with carefully curated cases, consensus ground truth, and transparent evaluation metrics. Critically, our datasets are assembled to be representative of real-world clinical populations — capturing the diversity of pathology, patient demographics, and imaging conditions that practitioners encounter in daily practice.
By providing an open, community-governed resource grounded in clinical realism, we aim to accelerate responsible development and deployment of AI in radiology.
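As a concrete illustration of what "consensus ground truth" and "transparent evaluation metrics" can mean in practice, the sketch below derives a majority-vote consensus label from multiple readers and scores a model's per-diagnosis sensitivity against it. All case IDs, labels, and function names are hypothetical assumptions for illustration; the actual RSNA protocols and adjudication rules may differ.

```python
from collections import Counter

# Hypothetical multi-reader labels per case; names and values are
# illustrative, not drawn from any actual RSNA Benchmarks release.
reader_labels = {
    "case_001": ["appendicitis", "appendicitis", "normal"],
    "case_002": ["diverticulitis", "diverticulitis", "diverticulitis"],
    "case_003": ["normal", "cholecystitis", "normal"],
}

def majority_consensus(labels, min_agreement=2):
    """Return the modal label when enough readers agree, else None
    (no-consensus cases would typically go to adjudication)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

consensus = {case: majority_consensus(lbls) for case, lbls in reader_labels.items()}

def sensitivity(predictions, truth, diagnosis):
    """Fraction of consensus-positive cases for `diagnosis` that the
    model also called `diagnosis` (a simple, transparent metric)."""
    positives = [c for c, t in truth.items() if t == diagnosis]
    hits = sum(1 for c in positives if predictions.get(c) == diagnosis)
    return hits / len(positives) if positives else float("nan")

# Hypothetical model output, again purely for illustration.
predictions = {"case_001": "appendicitis", "case_002": "normal", "case_003": "normal"}
print(sensitivity(predictions, consensus, "appendicitis"))  # -> 1.0
```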
Each benchmark is a structured evaluation covering specific clinical domains, modalities, and diagnostic tasks; a sketch of what a single case record might look like follows the benchmark descriptions below.
The inaugural RSNA Benchmark: a comprehensive evaluation framework for AI models interpreting emergency abdominal CT cases. It focuses on acute diagnoses encountered in clinical practice, such as appendicitis, diverticulitis, and cholecystitis, spanning pathologies across liver, kidney, pancreas, bowel, and vascular structures, with multi-reader consensus ground truth.
Multi-center evaluation of AI performance on frontal and lateral chest radiographs across a spectrum of thoracic pathology.
Structured assessment of AI interpretation across neuro MRI sequences for common and critical neurological diagnoses.
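To make "structured evaluation" concrete, here is a minimal sketch of how one benchmark case might be recorded, with fields for clinical domain, modality, contributing site, and consensus diagnosis. The field names and values are assumptions for illustration, not the published RSNA schema.

```python
from dataclasses import dataclass, field

# A hypothetical record for one benchmark case, showing how clinical
# domain, modality, and diagnostic task might be encoded. Field names
# are assumptions for illustration, not the published RSNA schema.
@dataclass
class BenchmarkCase:
    case_id: str
    domain: str                 # e.g. "emergency abdominal imaging"
    modality: str               # e.g. "CT", "radiograph", "MRI"
    site: str                   # anonymized contributing institution
    consensus_diagnosis: str    # multi-reader consensus ground truth
    differential: list = field(default_factory=list)

case = BenchmarkCase(
    case_id="abd-ct-0001",
    domain="emergency abdominal imaging",
    modality="CT",
    site="site-07",
    consensus_diagnosis="acute appendicitis",
    differential=["mesenteric adenitis", "right-sided diverticulitis"],
)
print(case.consensus_diagnosis)
```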
RSNA Benchmarks is an open initiative. We welcome contributions from radiologists, AI researchers, radiology AI vendors, regulators, and institutions worldwide.
Share anonymized cases from your institution to strengthen benchmark diversity and clinical representativeness.
Help design evaluation frameworks, define ground truth protocols, and build the technical infrastructure.
Run your models against our benchmarks and contribute results to the growing body of evaluation data (coming soon).