LVO Detection in Published Literature: What Sensitivity and Specificity Numbers Actually Mean for Your Department

Figure: LVO detection sensitivity and specificity benchmarks across published radiology AI studies.

When a radiology AI vendor publishes sensitivity and specificity numbers for LVO detection, those numbers can be accurate and still be profoundly context-dependent. The same algorithm, evaluated on different patient populations, with different CT scanner parameters, at different clinical sites, and against different reader adjudication standards, can produce sensitivity estimates ranging from 87% to 97% across the published literature. Both ends of that range can be true at once, and neither is the number that matters most for your department's deployment decision.

This article gives radiology department leaders and hospital CMOs a working framework for reading published AI performance benchmarks. The aim is not to dismiss the evidence base but to read it accurately.

What the Published LVO Literature Actually Shows

Large vessel occlusion detection on CT angiography is among the most mature areas of radiology AI clinical evidence. Viz.ai's LVO detection system received FDA 510(k) clearance (K193482), and multiple peer-reviewed studies have examined its performance characteristics. Published validation studies in Stroke and related journals have reported sensitivities in the range of 91–95% for proximal MCA occlusion in retrospective cohorts with well-characterized ground truth. Specificity in the same cohorts tends to run 88–94%, depending on threshold selection.

Those numbers reflect performance in curated retrospective datasets from tertiary stroke centers — typically single-site or multi-site series where patients received CT angiography as part of a suspected acute stroke protocol, where neuroradiologist adjudication was used as ground truth, and where imaging was acquired on modern scanners with consistent CTA protocols. These are favorable conditions for algorithm performance.

The harder question is what happens when those algorithms meet the imaging reality of a community hospital: variable scanner generations; inconsistent protocols, including the ED attending who orders a non-contrast head CT when they meant to order CTA; motion artifact from uncooperative patients; and the incidental CTA ordered for suspected vertebral artery disease, where the MCA was not the clinical question but the study happens to show a subtle M1 occlusion.

The Three Variables That Determine Which Number Applies to You

Cohort Selection Bias

Published LVO studies almost universally select patients who were clinically suspected of having a large vessel occlusion, that is, patients whose CT angiography was ordered by a stroke team. As a result, LVO prevalence in the study cohort is substantially higher than in the general CTA-head population at a community hospital, where CTA is also ordered for headache workup, suspected aneurysm, vertebral artery symptoms, and other indications with much lower LVO prevalence.

In principle, sensitivity and specificity do not change with prevalence, but positive predictive value does, dramatically. A model with 95% sensitivity and 90% specificity in a cohort where LVO prevalence is 40% has a PPV of about 86%. The same model deployed in a population where LVO prevalence is 5%, which is more representative of general community hospital CTA volume, has a PPV of approximately 33%: out of 1,000 studies it flags roughly 48 of the 50 true LVOs but also 95 of the 950 studies without one. Two-thirds of its alerts in the real deployment environment would be false positives, even at the published performance numbers.
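The arithmetic behind those PPV figures is Bayes' rule, and it is easy to rerun with your own numbers. The snippet below is a minimal sketch: the function name `ppv` and the intermediate prevalence values in the loop are our own illustrative choices, not figures from any study.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule.

    All three inputs are fractions between 0 and 1.
    """
    true_pos = sensitivity * prevalence                    # share of studies correctly flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # share of studies wrongly flagged
    return true_pos / (true_pos + false_pos)


# Same published operating point (95% / 90%), different LVO prevalence.
# The 20% and 10% rows are illustrative intermediate values.
for prevalence in (0.40, 0.20, 0.10, 0.05):
    print(f"prevalence {prevalence:.0%}: PPV {ppv(0.95, 0.90, prevalence):.0%}")
```

Substituting your department's own estimate of LVO prevalence among CTA-head studies gives a first approximation of the false-alert share to expect, assuming the published sensitivity and specificity transfer unchanged.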

Threshold Selection and the ROC Curve

Published benchmarks are typically reported at a single operating point on the ROC curve — often the point that maximizes the Youden index or achieves a sensitivity target (e.g., 95% sensitivity for safety reasons). But deployed systems can operate at different thresholds, and vendors often configure thresholds based on their clinical alert philosophy. The sensitivity/specificity pair you see in a published study may not correspond to the threshold the vendor configures for deployment.

When evaluating an AI vendor, ask explicitly: what threshold configuration produces the published numbers, and what threshold do you deploy by default? If the vendor cannot answer this or treats it as proprietary, that is itself informative.
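To make the operating-point issue concrete, the sketch below sweeps thresholds over a handful of made-up model scores and shows that the Youden-optimal threshold and a sensitivity-target threshold can land in different places. The scores, labels, and function names are hypothetical illustrations, not output from any vendor's system.

```python
def operating_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A study is flagged when its score is >= threshold.
    labels: 1 = LVO present per the reference standard, 0 = absent.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        points.append((threshold, tp / positives, tn / negatives))
    return points


def youden_point(points):
    """Operating point that maximizes J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1.0)


def sensitivity_target_point(points, target):
    """Highest threshold whose sensitivity still meets the target."""
    eligible = [p for p in points if p[1] >= target]
    return max(eligible, key=lambda p: p[0])


# Made-up scores for ten studies; not data from any real system.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

points = operating_points(scores, labels)
print("Youden-optimal point:      ", youden_point(points))
print("100% sensitivity target:   ", sensitivity_target_point(points, 1.0))
```

On this toy data the two rules pick different thresholds with different sensitivity/specificity pairs, which is why the threshold behind a published number and the threshold shipped by default are separate questions.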

Ground Truth Adjudication

LVO adjudication in published studies is typically performed by board-certified neuroradiologists reviewing the CTA with full clinical context. In community hospital practice, the overnight radiologist whose read serves as the local reference may be a general radiologist without dedicated neuroradiology subspecialty training. The "ground truth" against which the AI is compared locally therefore has different characteristics than the study adjudication standard. This can artificially inflate or deflate apparent sensitivity, depending on whether the AI model or the human reader tends to be more conservative on borderline cases.
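One way to see the size of this effect is to compute what sensitivity the AI would appear to have when scored against an imperfect local read. The sketch below assumes the AI's errors and the reader's errors are conditionally independent given the true disease status, which is a simplification, and the reader accuracy figures are hypothetical. Under that assumption the effect is deflation; correlated errors between model and reader could push the apparent number the other way.

```python
def apparent_sensitivity(prevalence, ai_sens, ai_spec, reader_sens, reader_spec):
    """P(AI flags the study | local reader calls LVO).

    Assumes AI and reader errors are conditionally independent given the
    true disease status. This is a simplifying assumption, not a claim
    about how real models and readers behave.
    """
    both_pos_with_lvo = prevalence * ai_sens * reader_sens
    both_pos_without_lvo = (1 - prevalence) * (1 - ai_spec) * (1 - reader_spec)
    reader_pos = prevalence * reader_sens + (1 - prevalence) * (1 - reader_spec)
    return (both_pos_with_lvo + both_pos_without_lvo) / reader_pos


# Hypothetical inputs: 5% prevalence, AI at 95% / 90%,
# local reader at 85% sensitivity and 98% specificity.
perfect_reference = apparent_sensitivity(0.05, 0.95, 0.90, 1.0, 1.0)
imperfect_reference = apparent_sensitivity(0.05, 0.95, 0.90, 0.85, 0.98)
print(f"against a perfect reference:   {perfect_reference:.1%}")
print(f"against an imperfect reference: {imperfect_reference:.1%}")
```

With a perfect reference the function returns the AI's true sensitivity. With the imperfect reader, the reader's false-positive calls enter the denominator, and the AI looks worse than it is without any change to the model.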

What to Ask a Vendor Before Purchasing

Given these variables, published sensitivity and specificity numbers should be viewed as proof of concept, not as deployment performance guarantees. The questions that matter more for community hospital procurement decisions include:

In what imaging environment was the validation performed — academic stroke center or community hospital? What was the scanner vendor and generation mix?
What was the CTA protocol? Did the validation cohort include non-contrast head CT ordered in error that the algorithm had to process?
What was the LVO prevalence in the validation cohort? How does that compare to your expected community hospital CTA-head volume?
What is the alert volume per 100 CTA-head studies at the default deployment threshold?
Has the vendor tracked alert-to-action conversion rates in deployed community hospital environments?
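The alert-volume question can be sanity-checked before the vendor answers it. As a rough sketch, and assuming published sensitivity and specificity carry over unchanged, the expected alerts per 100 studies follow directly from your own prevalence estimate. The function name and the 5% prevalence below are illustrative assumptions.

```python
def expected_alerts_per_100(sensitivity, specificity, prevalence):
    """Expected (true alerts, false alerts) per 100 studies."""
    true_alerts = 100 * sensitivity * prevalence
    false_alerts = 100 * (1 - specificity) * (1 - prevalence)
    return true_alerts, false_alerts


true_alerts, false_alerts = expected_alerts_per_100(0.95, 0.90, 0.05)
print(f"true alerts per 100 studies:  {true_alerts:.2f}")
print(f"false alerts per 100 studies: {false_alerts:.2f}")
print(f"total alerts per 100 studies: {true_alerts + false_alerts:.2f}")
```

If a vendor's reported alert volume at the default threshold differs substantially from this estimate, that points to a different deployed threshold or a different case mix, and either one is worth asking about.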

We're not saying published literature is misleading. We're saying it answers a different question than "how will this perform in my specific radiology department?" Reading the literature critically means understanding what question was being asked in the study, and whether that question matches your deployment context.

How Pacslens Approaches LVO Benchmarking

Pacslens's LVO triage component is designed against the published literature as a performance target, not as a validated claim. We reference the Viz.ai K193482 published evidence and the prospective Stroke journal validation data as context for the detection architecture, not as a representation of Pacslens's own cleared performance. Our 510(k) submissions reference these cleared predicate devices in the intended use and predicate device sections, as is standard practice under 21 CFR Part 807.

For community hospital pilots, we track the metrics that matter for operational utility: alert volume per 100 CTA-head studies, time from study arrival at PACS to triage score assignment, and — in collaboration with participating hospitals — alert-to-read-order conversion as a proxy for radiologist engagement with the triage ranking. These metrics tell us whether the system is being used as intended, which is a precondition for clinical impact.
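As an illustration of how these three metrics could be computed from a pilot log, here is a minimal sketch. The `StudyRecord` fields, in particular `read_promoted` as a stand-in for alert-to-read-order conversion, are hypothetical names chosen for this example and do not describe an actual Pacslens schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class StudyRecord:
    arrived_at_pacs: datetime
    scored_at: datetime
    alerted: bool
    read_promoted: bool  # hypothetical flag: study was read ahead of its arrival order


def pilot_metrics(records):
    """Summarize alert volume, scoring latency, and alert-to-read conversion."""
    if not records:
        raise ValueError("no study records to summarize")
    alerted = [r for r in records if r.alerted]
    latencies = [(r.scored_at - r.arrived_at_pacs).total_seconds() for r in records]
    conversion = sum(r.read_promoted for r in alerted) / len(alerted) if alerted else None
    return {
        "alerts_per_100_studies": 100 * len(alerted) / len(records),
        "median_seconds_to_score": median(latencies),
        "alert_to_read_conversion": conversion,
    }


# Four made-up overnight studies.
t0 = datetime(2024, 1, 1, 2, 0, 0)
records = [
    StudyRecord(t0, t0 + timedelta(seconds=60), True, True),
    StudyRecord(t0 + timedelta(minutes=10), t0 + timedelta(minutes=10, seconds=90), False, False),
    StudyRecord(t0 + timedelta(minutes=20), t0 + timedelta(minutes=20, seconds=120), True, False),
    StudyRecord(t0 + timedelta(minutes=30), t0 + timedelta(minutes=30, seconds=150), False, False),
]
print(pilot_metrics(records))
```

Conversion is computed only over alerted studies, so a site with no alerts in the period reports it as undefined rather than zero.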

The published sensitivity/specificity literature for LVO is genuine evidence that AI-assisted LVO detection can work at clinically meaningful performance levels. The gap between that evidence and real-world deployment performance is a function of cohort, threshold, and environment — and closing that gap requires prospective deployment monitoring, not just citing the retrospective benchmark.

Have specific questions about how LVO triage scoring works in a community hospital CTA volume? Contact our clinical team or review our published evidence summary.