Ever wondered how far large language models have crept into hands-on lab science? A new test called the Virology Capabilities Test (VCT) just set the bar, and the results are eye-opening.

The Quick Story

Researchers at SecureBio and the Center for AI Safety rounded up dozens of PhD-level virologists to build 322 complex questions, many featuring electron microscopy images, plaque-assay photos, and real-life troubleshooting puzzles. Then they pitted today’s best AI models against human experts.

  • Human experts with internet access averaged 22% in their own specialty areas.
  • OpenAI’s latest “o3” model scored 44%, outperforming 94% of the virologists.
  • Several other frontier models also cleared the human bar.

And remember: this paper is still a pre-print. It hasn’t gone through peer review yet.

Why Should Clinicians and Public-Health Folks Care?

  1. LLMs are crossing from paperwork into wet-lab know-how.
    If a chatbot can already out-advise seasoned virologists on troubleshooting plaque assays, imagine what next year will bring for molecular diagnostics—or for people with bad intentions.
  2. We finally have a benchmark that feels like the real lab.
    These questions test tacit “lab lore” you learn only after you’ve ruined a few experiments—not the stuff that sits in a textbook. The mix of images and open-ended answers makes it much harder for a model to guess.
  3. Governance is playing catch-up.
    The authors argue that “expert-level virology coaching” should itself be treated as a dual-use technology: open access for everyday PCR tips, but gated for anything that could juice up a dangerous pathogen.

Limitations to Keep in Mind

  • Benchmark ≠ bench-top. We still need real “uplift” studies, where human technicians run experiments with and without AI help, to see whether the models’ advice actually works.
  • Images weren’t always essential. In some questions, models guessed correctly even after the picture was removed—future versions will need even tighter visual tests.
  • The AI race moves fast. The scores you’re reading today could look quaint after the next model release cycle.

Take-Home for Your Next Team Meeting

  • Start talking risk management: Could your lab’s LLM-powered assistant accidentally reveal too much?
  • Don’t fear the tools—pilot them safely. Properly fenced, the same models could shorten the time from “failed assay” to “clear result,” especially in resource-limited settings.
  • Stay tuned: The VCT group plans to share a refusal policy playbook that platform providers can bake into future models.

Want To Read the Source?

Götting J, Medeiros P, Sanders JG, et al. Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark. SecureBio / Center for AI Safety, April 2025. Pre-print available at virologytest.ai.

AI-generated content disclaimer

This blog post was drafted with the assistance of a large language model, then reviewed and edited by a human curator for accuracy and tone. It is provided for educational purposes only and does not constitute medical or biosafety advice.