
Ever wondered how far large language models have crept into hands-on lab science? A new benchmark called the Virology Capabilities Test (VCT) just set the bar, and the results are eye-opening.
The Quick Story
Researchers at SecureBio and the Center for AI Safety rounded up dozens of PhD-level virologists to build 322 complex questions—many built around electron microscopy images, plaque photos, and real-life troubleshooting puzzles. Then they pitted today’s best AI models against human experts.
- Human experts with access to the internet averaged 22% within their own specialty areas.
- OpenAI’s latest “o3” model scored 44%, better than 94% of the virologists (the sketch below shows how a percentile like that is computed).
- Several other frontier models also cleared the human bar.
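For intuition, a percentile claim like that is just the model’s score stacked against each expert’s score on their own specialty questions. Here is a minimal Python sketch of the arithmetic; the expert scores below are invented for illustration, not the paper’s data.

```python
# Hypothetical illustration of "model beats X% of experts".
# These per-expert accuracies are made up; see the pre-print for real data.
expert_scores = [0.12, 0.18, 0.20, 0.22, 0.25, 0.31, 0.38, 0.47]
model_score = 0.44

beaten = sum(score < model_score for score in expert_scores)
percentile = 100 * beaten / len(expert_scores)
print(f"Model outperforms {percentile:.0f}% of the sampled experts")  # 88% on this toy data
```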
And remember: this paper is still a pre-print. It hasn’t gone through peer review yet.
Why Should Clinicians and Public-Health Folks Care?
- LLMs are crossing from paperwork into wet-lab know-how. If a chatbot can already out-advise seasoned virologists on troubleshooting plaque assays, imagine what next year will bring for molecular diagnostics—or for people with bad intentions.
- We finally have a benchmark that feels like the real lab. These questions test tacit “lab lore” you learn only after you’ve ruined a few experiments—not the stuff that sits in a textbook. The mix of images and open-ended answers makes it much harder for a model to guess.
- Governance is playing catch-up. The authors argue that “expert-level virology coaching” should itself be treated as a dual-use technology: open access for everyday PCR tips, but gated for anything that could juice up a dangerous pathogen. The sketch after this list shows a toy version of that kind of gate.
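To make the tiered-access idea concrete, here is a purely illustrative Python sketch. The categories, the classify_query stand-in, and the policy are all hypothetical; the VCT authors describe the principle, not an implementation, and any real gate would rely on trained classifiers and human review rather than keyword matching.

```python
# Purely illustrative tiered-access gate; every name and category here is hypothetical.
ROUTINE = {"pcr_troubleshooting", "cell_culture_basics"}
GATED = {"pathogen_enhancement"}

def classify_query(text: str) -> str:
    """Stand-in for a real trained classifier; keyword matching is not adequate in practice."""
    if "enhance" in text.lower() or "transmissib" in text.lower():
        return "pathogen_enhancement"
    return "pcr_troubleshooting"

def gate(query: str, user_is_verified: bool) -> str:
    category = classify_query(query)
    if category in GATED and not user_is_verified:
        return "Refused: this topic requires verified institutional access."
    return f"Allowed (category: {category})"

print(gate("My PCR bands are smeared, what should I check?", user_is_verified=False))
```

The point is the policy split between routine help and gated coaching, not the classifier itself.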
Limitations to Keep in Mind
- Benchmark ≠ bench-top. We still need real “uplift” studies where human technicians run experiments with and without AI help to see if the model advice actually works.
- Images weren’t always essential. In some questions, models guessed correctly even after the picture was removed—future versions will need even tighter visual tests. (A simple version of that image-ablation check appears after this list.)
- The AI race moves fast. The scores you’re reading today could look quaint after the next model release cycle.
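That image check boils down to an ablation: grade each question with and without its picture, and flag the items a model gets right from text alone. Below is a minimal sketch of the bookkeeping; ask_model, the questions, and the canned answers are all toy stand-ins, not the VCT harness.

```python
# Minimal image-ablation check. ask_model() and the data are toy stand-ins.
questions = [
    {"id": "q1", "text": "Plaques look turbid and irregular; likely cause?",
     "image": "plaque_assay.png", "answer": "contamination"},
    {"id": "q2", "text": "Identify the virion morphology shown.",
     "image": "em_virion.png", "answer": "enveloped"},
]

def ask_model(text, image=None):
    # Toy stand-in for a multimodal-model call: it "knows" the plaque question
    # from text alone, but needs the image for the morphology question.
    if "plaque" in text.lower():
        return "contamination"
    return "enveloped" if image is not None else "unknown"

suspect = []
for q in questions:
    right_with = ask_model(q["text"], image=q["image"]) == q["answer"]
    right_without = ask_model(q["text"], image=None) == q["answer"]
    if right_with and right_without:
        suspect.append(q["id"])  # the image may not be load-bearing here

print(f"{len(suspect)} question(s) answerable without the image: {suspect}")  # ['q1']
```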
Take-Home for Your Next Team Meeting
- Start talking risk management: Could your lab’s LLM-powered assistant accidentally reveal too much?
- Don’t fear the tools—pilot them safely. Properly fenced, the same models could shorten the time from “failed assay” to “clear result,” especially in resource-limited settings.
- Stay tuned: The VCT group plans to share a refusal policy playbook that platform providers can bake into future models.
Want To Read the Source?
Götting J, Medeiros P, Sanders JG, et al. Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark. SecureBio / Center for AI Safety, April 2025. Pre-print available at virologytest.ai.
AI-generated content disclaimer
This blog post was drafted with the assistance of a large language model, then reviewed and edited by a human curator for accuracy and tone. It is provided for educational purposes only and does not constitute medical or biosafety advice.