Ever wondered how far large language models have crept into hands-on lab science? A new test called the Virology Capabilities Test (VCT) just set the bar, and the results are eye-opening.

The Quick Story

Researchers at SecureBio and the Center for AI Safety rounded up dozens of PhD-level virologists to build 322 complex questions, many featuring electron microscopy images, plaque-assay photos, and real-life troubleshooting puzzles. Then they pitted today’s best AI models against human experts.

  • Human experts with internet access averaged 22% in their own specialty areas.
  • OpenAI’s latest “o3” model scored 44%, outperforming 94% of the virologists.
  • Several other frontier models also cleared the human bar.

And remember: this paper is still a pre-print. It hasn’t gone through peer review yet.

Why Should Clinicians and Public-Health Folks Care?

  1. LLMs are crossing from paperwork into wet-lab know-how.
    If a chatbot can already out-advise seasoned virologists on troubleshooting plaque assays, imagine what next year will bring for molecular diagnostics—or for people with bad intentions.
  2. We finally have a benchmark that feels like the real lab.
    These questions test tacit “lab lore” you learn only after you’ve ruined a few experiments—not the stuff that sits in a textbook. The mix of images and open-ended answers makes it much harder for a model to guess.
  3. Governance is playing catch-up.
    The authors argue that “expert-level virology coaching” should itself be treated as a dual-use technology: open access for everyday PCR tips, but gated for anything that could juice up a dangerous pathogen.

Limitations to Keep in Mind

  • Benchmark ≠ bench-top. We still need real “uplift” studies, where human technicians run experiments with and without AI help, to see whether the models’ advice actually works.
  • Images weren’t always essential. In some questions, models guessed correctly even after the picture was removed—future versions will need even tighter visual tests.
  • The AI race moves fast. The scores you’re reading today could look quaint after the next model release cycle.

Take-Home for Your Next Team Meeting

  • Start talking risk management: Could your lab’s LLM-powered assistant accidentally reveal too much?
  • Don’t fear the tools—pilot them safely. Properly fenced, the same models could shorten the time from “failed assay” to “clear result,” especially in resource-limited settings.
  • Stay tuned: The VCT group plans to share a refusal policy playbook that platform providers can bake into future models.

Want To Read the Source?

Götting J, Medeiros P, Sanders JG, et al. Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark. SecureBio / Center for AI Safety, April 2025. Pre-print available at virologytest.ai.

AI-generated content disclaimer

This blog post was drafted with the assistance of a large language model, then reviewed and edited by a human curator for accuracy and tone. It is provided for educational purposes only and does not constitute medical or biosafety advice.