Background
Emergency abdominal CT is one of the highest-volume, most clinically consequential imaging workflows in radiology. Millions of abdominal CT studies are performed annually in emergency departments worldwide, often under time pressure and with the expectation of rapid, accurate interpretation that directly guides surgical and medical decision-making.
As frontier AI models — particularly multimodal vision-language models — begin to demonstrate capability in medical image interpretation, the radiology community needs a principled way to evaluate whether these systems can perform reliably in this high-stakes clinical context. Existing benchmarks are often limited to single institutions, narrow diagnostic categories, or curated datasets that fail to capture the complexity and variability of real emergency practice.
The CT Abdomen Benchmark addresses this gap directly.
Objective
This benchmark aims to provide a standardized, multi-center evaluation framework that measures how well AI models can identify and characterize acute abdominal pathology on CT — the same diagnoses that emergency radiologists encounter daily. The goal is not to test AI on cherry-picked teaching cases, but to evaluate performance against a dataset that faithfully represents the clinical reality of emergency abdominal imaging.
Clinical Scope
The benchmark covers 10 predefined diagnoses drawn from the acute abdominal presentations that constitute the core workload of emergency radiology CT interpretation. Multi-label cases are explicitly permitted, reflecting the clinical reality that overlapping pathology is common; a label-encoding sketch follows the list below.
- Normal: No acute abdominal pathology identified
- Abdominal Aortic Aneurysm: AAA with or without rupture or dissection
- Appendicitis: Acute appendicitis, perforation, abscess, and mimics
- Cholecystitis: Acute cholecystitis, cholelithiasis, biliary pathology
- Diverticulitis: Acute diverticulitis, complicated disease, abscess, perforation
- Free Air: Pneumoperitoneum indicating visceral perforation
- Hemoperitoneum: Intraperitoneal hemorrhage from traumatic or atraumatic causes
- Hydronephrosis: Renal collecting system dilatation and ureteral obstruction
- Pancreatitis: Acute pancreatitis, necrosis, peripancreatic collections
- Small Bowel Obstruction: SBO with transition point identification
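For concreteness, the ten labels and the multi-label encoding might look like the following; the identifier strings are illustrative assumptions, not the benchmark's official vocabulary.

```python
# Illustrative label vocabulary for the 10 diagnoses. The identifier strings
# are assumptions for the sketches below, not the benchmark's official names.
DIAGNOSES = [
    "normal", "aaa", "appendicitis", "cholecystitis", "diverticulitis",
    "free_air", "hemoperitoneum", "hydronephrosis", "pancreatitis", "sbo",
]

def to_indicator(labels: list[str]) -> list[int]:
    """Encode a (possibly multi-label) case as a binary indicator vector."""
    return [1 if d in labels else 0 for d in DIAGNOSES]

# A case with both appendicitis and free air sets two positions to 1.
print(to_indicator(["appendicitis", "free_air"]))
# [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```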
Dataset Design
Cases are sourced from 5 to 40 participating institutions to ensure diversity in scanner hardware, imaging protocols, contrast phases, patient demographics, and disease prevalence. The dataset targets 200 to 1,600 total cases, with a design target of 500 to 1,000 cases to ensure adequate statistical power. Each site contributes a minimum of 4 cases per diagnosis (2 Easy and 2 Hard); every case carries a binary difficulty label (Easy/Hard) and a supplementary numeric difficulty score from 1 to 10, as illustrated in the record sketch below.
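As an illustration of this design, a single case record might look like the following; all field names are hypothetical, as the actual schema is defined by the benchmark protocol.

```python
# Hypothetical case record reflecting the dataset design above. Field names
# and values are illustrative, not the benchmark's actual schema.
example_case = {
    "case_id": "SITE03-0017",                # anonymized, site-scoped identifier
    "site_id": "SITE03",                     # contributing institution
    "labels": ["appendicitis", "free_air"],  # multi-label: overlap is permitted
    "difficulty": "hard",                    # binary difficulty label (easy/hard)
    "difficulty_score": 7,                   # supplementary numeric score, 1-10
    "contrast_phase": "portal_venous",       # acquisition metadata
}
```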
Each case undergoes blinded consensus review by three independent board-certified radiologists. Any case that does not achieve unanimous agreement on the primary diagnosis is escalated to the Benchmark Committee for adjudication.
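The escalation rule reduces to a simple unanimity check, sketched below under the assumption that each reader records a single primary diagnosis per case.

```python
def needs_adjudication(reads: list[str]) -> bool:
    """Return True when the three independent blinded reads do not fully
    agree on the primary diagnosis, triggering Committee escalation.
    (Illustrative sketch; the real workflow is defined by the protocol.)"""
    assert len(reads) == 3, "consensus review uses three independent readers"
    return len(set(reads)) > 1

# Two readers call appendicitis, one calls diverticulitis -> escalate.
print(needs_adjudication(["appendicitis", "appendicitis", "diverticulitis"]))  # True
```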
Evaluation Framework
The primary metric is macro-averaged one-vs-rest ROC AUC across all 10 diagnoses, measuring ranking quality independent of threshold choice and giving equal weight to each condition regardless of prevalence. Secondary metrics include the following (a scoring sketch follows the list):
- Top-1 Primary-Diagnosis Accuracy — Can the model correctly identify the single most likely diagnosis?
- Log Loss — How well-calibrated are the model's predicted probabilities?
- F1 Score — Reported in both macro and micro-averaged formulations to capture precision-recall trade-offs.
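A minimal sketch of how these metrics might be computed with scikit-learn is shown below, assuming predictions arrive as an (n_cases, 10) array of per-diagnosis probabilities, ground truth as a binary indicator matrix of the same shape, and a separate primary-diagnosis index per case. The 0.5 decision threshold for F1 is an illustrative assumption, not the official scoring rule.

```python
# Minimal metric sketch, assuming y_prob is an (n_cases, 10) probability
# array, y_true is a binary multi-label indicator matrix of the same shape,
# and y_primary_true holds the annotated primary-diagnosis index per case.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

def score_submission(y_true, y_prob, y_primary_true):
    # Primary metric: macro-averaged one-vs-rest ROC AUC over the 10 labels.
    macro_auc = roc_auc_score(y_true, y_prob, average="macro")

    # Top-1 primary-diagnosis accuracy: highest-probability label vs. the
    # annotated primary diagnosis.
    top1 = accuracy_score(y_primary_true, y_prob.argmax(axis=1))

    # Calibration: mean of the per-label binary log losses.
    ll = np.mean([
        log_loss(y_true[:, j], y_prob[:, j], labels=[0, 1])
        for j in range(y_true.shape[1])
    ])

    # Precision-recall trade-off: macro- and micro-averaged F1 at the
    # assumed 0.5 threshold.
    y_pred = (y_prob >= 0.5).astype(int)
    f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
    f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)

    return {"macro_auc": macro_auc, "top1_accuracy": top1, "log_loss": ll,
            "f1_macro": f1_macro, "f1_micro": f1_micro}
```

Macro averaging in both the AUC and F1 keeps low-prevalence diagnoses from being drowned out by common ones, which is why it anchors the primary metric.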
The dataset is partitioned equally into a Public Development Set (available for model training and validation, with a permissive commercial-use license) and a Private Test Set (held exclusively by the Benchmark organizers for official scoring and longitudinal comparability).
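One plausible way to realize an equal partition while preserving multi-center diversity is a per-site 50/50 split, sketched below; the grouping key, seed, and procedure are assumptions rather than the organizers' actual method.

```python
# Illustrative equal public/private partition, stratified by contributing
# site so both halves retain multi-center diversity. The grouping key and
# seed are assumptions, not the organizers' actual procedure.
import random
from collections import defaultdict

def split_cases(cases, seed=0):
    by_site = defaultdict(list)
    for case in cases:
        by_site[case["site_id"]].append(case)

    rng = random.Random(seed)
    public, private = [], []
    for site_cases in by_site.values():
        rng.shuffle(site_cases)
        half = len(site_cases) // 2
        public.extend(site_cases[:half])
        private.extend(site_cases[half:])
    return public, private
```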
All evaluation metrics, scoring rubrics, and statistical methods are fully documented and reproducible.
Timeline
1. Initiative formation and benchmark design
2. Protocol finalization, site recruitment, data collection, and annotation
3. Pilot evaluation with initial model submissions
4. Public benchmark release and open evaluation
Participation
The CT Abdomen Benchmark is currently in active development. Participating sites receive authorship or acknowledgement on all resulting publications, early access to the public validation dataset, and direct influence over benchmark design. If your institution is interested in contributing anonymized cases or if you represent an AI vendor wishing to evaluate against this benchmark, please get in touch.