Background

Emergency abdominal CT is one of the highest-volume, most clinically consequential imaging workflows in radiology. Millions of abdominal CT studies are performed annually in emergency departments worldwide, often under time pressure and with the expectation of rapid, accurate interpretation that directly guides surgical and medical decision-making.

As frontier AI models — particularly multimodal vision-language models — begin to demonstrate capability in medical image interpretation, the radiology community needs a principled way to evaluate whether these systems can perform reliably in this high-stakes clinical context. Existing benchmarks are often limited to single institutions, narrow diagnostic categories, or curated datasets that fail to capture the complexity and variability of real emergency practice.

The CT Abdomen Benchmark addresses this gap directly.

Objective

This benchmark aims to provide a standardized, multi-center evaluation framework that measures how well AI models can identify and characterize acute abdominal pathology on CT — the same diagnoses that emergency radiologists encounter daily. The goal is not to test AI on cherry-picked teaching cases, but to evaluate performance against a dataset that faithfully represents the clinical reality of emergency abdominal imaging.

Clinical Scope

The benchmark focuses on 10 predefined diagnoses drawn from the acute abdominal presentations that constitute the core workload of emergency radiology CT interpretation. Multi-label cases are explicitly permitted, reflecting the reality that overlapping pathology is common in clinical practice.

  • Normal: No acute abdominal pathology identified
  • Abdominal Aortic Aneurysm: AAA with or without rupture or dissection
  • Appendicitis: Acute appendicitis, perforation, abscess, and mimics
  • Cholecystitis: Acute cholecystitis, cholelithiasis, biliary pathology
  • Diverticulitis: Acute diverticulitis, complicated disease, abscess, perforation
  • Free Air: Pneumoperitoneum indicating visceral perforation
  • Hemoperitoneum: Intraperitoneal hemorrhage from traumatic or atraumatic causes
  • Hydronephrosis: Renal collecting system dilatation and ureteral obstruction
  • Pancreatitis: Acute pancreatitis, necrosis, peripancreatic collections
  • Small Bowel Obstruction: SBO with transition point identification

Dataset Design

Cases are sourced from 5 to 40 participating institutions to ensure diversity in scanner hardware, imaging protocols, contrast phases, patient demographics, and disease prevalence. The dataset targets 200 to 1,600 total cases, with a preferred range of 500 to 1,000 cases to ensure adequate statistical power. Each site contributes a minimum of 4 cases per diagnosis (2 Easy and 2 Hard); each case receives a binary difficulty label plus a supplementary numeric difficulty score from 1 to 10.
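The stated case range follows directly from the per-site minimum, as a quick check shows:

```python
# Dataset size bounds implied by the contribution rules above
SITES_MIN, SITES_MAX = 5, 40
DIAGNOSES = 10
CASES_PER_SITE_PER_DX = 4  # 2 Easy + 2 Hard

min_cases = SITES_MIN * DIAGNOSES * CASES_PER_SITE_PER_DX  # 5 * 10 * 4 = 200
max_cases = SITES_MAX * DIAGNOSES * CASES_PER_SITE_PER_DX  # 40 * 10 * 4 = 1,600
print(min_cases, max_cases)  # 200 1600
```

Sites are of course free to contribute more than the minimum, which is how the preferred 500–1,000 range is reached with mid-size consortia.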

Each case undergoes blinded consensus review by three independent board-certified radiologists. Any case that does not achieve full agreement on the primary diagnosis is escalated to the Benchmark Committee for adjudication.
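The escalation rule reduces to a simple unanimity check. A minimal sketch (the function name and string labels are illustrative, not part of the benchmark specification):

```python
def needs_adjudication(primary_reads: list[str]) -> bool:
    """Return True when the three blinded reads do not all agree
    on the primary diagnosis, triggering committee adjudication."""
    assert len(primary_reads) == 3, "consensus review uses exactly three readers"
    return len(set(primary_reads)) > 1

# Two readers call appendicitis, one calls diverticulitis -> escalate
print(needs_adjudication(["appendicitis", "appendicitis", "diverticulitis"]))  # True
# Unanimous agreement -> accepted into the dataset
print(needs_adjudication(["normal", "normal", "normal"]))  # False
```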

Evaluation Framework

The primary metric is macro-averaged one-vs-rest ROC AUC across all 10 diagnoses, measuring ranking quality independent of threshold choice and giving equal weight to each condition regardless of prevalence. Secondary metrics include:

  • Top-1 Primary-Diagnosis Accuracy — Can the model correctly identify the single most likely diagnosis?
  • Log Loss — How well-calibrated are the model's predicted probabilities?
  • F1 Score — Reported in both macro and micro-averaged formulations to capture precision-recall trade-offs.
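Because multi-label cases are permitted, these metrics operate on a binary indicator matrix (cases × diagnoses) of ground-truth labels and a matching matrix of predicted probabilities. A sketch using scikit-learn, with toy data covering only 3 of the 10 labels and an assumed 0.5 decision threshold for F1 (the benchmark's official thresholds and scoring code may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, log_loss

# Toy multi-label ground truth: 4 cases, 3 diagnosis labels shown
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],   # a multi-label case
                   [0, 1, 0],
                   [1, 0, 1]])
# Model-predicted probabilities, one column per label
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.1, 0.8, 0.7],
                    [0.3, 0.6, 0.2],
                    [0.7, 0.1, 0.6]])

# Primary metric: macro-averaged one-vs-rest ROC AUC
macro_auc = roc_auc_score(y_true, y_score, average="macro")

# Secondary metrics
y_pred = (y_score >= 0.5).astype(int)            # assumed threshold
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
# Per-label binary log loss, averaged across labels
mean_log_loss = np.mean(
    [log_loss(y_true[:, k], y_score[:, k]) for k in range(y_true.shape[1])]
)
```

Top-1 primary-diagnosis accuracy would additionally compare the argmax of each probability row against a separately annotated primary-diagnosis label, which is not shown in this toy example.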

The dataset is partitioned equally into a Public Development Set (available for model training and validation, with a permissive commercial-use license) and a Private Test Set (held exclusively by the Benchmark organizers for official scoring and longitudinal comparability).

All evaluation metrics, scoring rubrics, and statistical methods are fully documented and reproducible.

Timeline

2025

Initiative formation and benchmark design

2026

Protocol finalization, site recruitment, data collection, and annotation

2026–2027

Pilot evaluation with initial model submissions

2027

Public benchmark release and open evaluation

Participation

The CT Abdomen Benchmark is currently in active development. Participating sites receive authorship or acknowledgement on all resulting publications, early access to the public validation dataset, and direct influence over benchmark design. If your institution is interested in contributing anonymized cases or if you represent an AI vendor wishing to evaluate against this benchmark, please get in touch.