OpenAI Launches LifeSciBench: 750-Task Expert-Written Benchmark for Life Science Research

Summary

OpenAI has released LifeSciBench, a new benchmark designed in collaboration with 173 Ph.D.-level scientists to evaluate AI performance on real-world life science and drug discovery workflows. Spanning seven biological domains and seven research workflows, the suite features 750 complex, free-response tasks. Rather than testing simple textbook biology, LifeSciBench challenges models with multi-step reasoning and multimodal data, including 1,062 scientific attachments such as chemical structures, tables, and PDFs. Initial evaluations show that even the top-performing model, GPT-Rosalind (a science-tuned variant of GPT-5.5), scored just 36.1%, highlighting the gap between current AI capabilities and autonomous scientific research.

What happened?

Official Release: OpenAI announced LifeSciBench on June 18, 2026, alongside a technical preprint detailing its methodology.
Expert Authorship: The benchmark was co-developed with 173 Ph.D. scientists with extensive experience in the biotechnology and pharmaceutical industries.
Benchmark Composition: LifeSciBench contains 750 tasks accompanied by 1,062 external research artifacts. Approximately 79% of the tasks require multi-step reasoning.
Model Performance: GPT-Rosalind led the evaluations with a normalized score of 36.1% (171 tasks passed); standard GPT-5.5 followed at 25.7%. With the best model scoring well under 50%, the benchmark is far from saturated.
Key Bottlenecks: Models performed worst on exact numerical calculations (14.8% pass rate) and experimental design tasks (30.7% pass rate).
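
The headline figures can be cross-checked with simple arithmetic. The sketch below is only a sanity check, assuming that "tasks passed" is counted out of all 750 tasks; the article does not define how the normalized score is computed, and the function name `raw_pass_rate` is illustrative.

```python
# Figures as reported in the article for GPT-Rosalind.
TOTAL_TASKS = 750
TASKS_PASSED = 171
NORMALIZED_SCORE = 0.361


def raw_pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks passed, assuming every task counts equally."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


rate = raw_pass_rate(TASKS_PASSED, TOTAL_TASKS)
implied_passes = NORMALIZED_SCORE * TOTAL_TASKS

print(f"raw pass rate from 171/750: {rate:.1%}")
print(f"passes a 36.1% raw rate would imply: {implied_passes:.2f}")
```

Under that assumption, 171 of 750 works out to 22.8%, not 36.1%, so the normalized score presumably reflects a different scoring scheme than a raw pass count. Readers comparing models should check the preprint for the exact definition.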

Why it matters

Previous AI benchmarks in science primarily relied on multiple-choice questions testing fact retrieval. LifeSciBench instead simulates the day-to-day tasks of working researchers in applied biology. The results expose a crucial limitation: while frontier models do comparatively well at synthesizing literature and writing scientific communications, they struggle with the mathematical rigor and logical design required for laboratory work. This suggests that current AI systems are better suited to the role of research assistant than that of autonomous scientific agent.

Evidence

OpenAI Announcement: The official OpenAI blog post detailing the taxonomy of the benchmark and its design philosophy.
Technical Preprint: The paper “LifeSciBench: Evaluating Language Models on Realistic, Expert-Level Tasks in the Life Sciences” outlining detailed model evaluations.
Industry Coverage: Reports from MarkTechPost and AI Weekly confirming the collaborative authoring process and performance metrics.

Analysis

The design of LifeSciBench highlights why the life sciences represent an exceptionally high bar for artificial intelligence. Scientific workflows require the integration of diverse, highly specific data modalities, such as SMILES strings for chemical structures, genomic sequences, and raw experimental data. That more than a third of the tasks yielded a pass rate below 20% across all tested models suggests that general pre-training alone is insufficient. Future breakthroughs in scientific AI will likely depend on tighter integration with symbolic computation tools, chemical simulation engines, and bioinformatics databases.

Practical Takeaways

For research organizations and developers integrating AI into biopharma pipelines, the findings suggest:

Maintain Human-in-the-Loop: Do not automate critical tasks like experimental design, protocol drafting, or dosage calculations without expert verification.
Deploy on Communication and Evidence Review: Leverage AI where it currently performs best, such as literature synthesis (evidence handling) and scientific writing (communication).
Integrate External Tools: Equip LLM agents with specialized computational tools (e.g., RDKit, BLAST APIs) rather than relying on the models’ native mathematical or chemical reasoning.
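
The first and third takeaways can be combined in a small sketch: numerical steps are routed to deterministic code instead of the model's own arithmetic, and results from critical tools are flagged for expert sign-off. Everything here is hypothetical and for illustration only (the `TOOLS` registry, `dilution_volume`, `run_tool`, and the `needs_review` flag are not from the benchmark or any OpenAI API); a real pipeline might register wrappers around RDKit or a BLAST service in the same way.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def dilution_volume(stock_conc: float, target_conc: float, final_volume: float) -> float:
    """Volume of stock needed, from C1 * V1 = C2 * V2 (units must match)."""
    if stock_conc <= 0 or target_conc <= 0 or final_volume <= 0:
        raise ValueError("all quantities must be positive")
    if target_conc > stock_conc:
        raise ValueError("cannot dilute to a higher concentration")
    return target_conc * final_volume / stock_conc


@dataclass
class ToolResult:
    tool: str
    value: float
    needs_review: bool  # True when an expert must verify before use


# Hypothetical registry: tool name -> deterministic implementation.
TOOLS: Dict[str, Callable[..., float]] = {"dilution_volume": dilution_volume}

# Tools whose output feeds critical decisions (assumed policy, not from the source).
REVIEW_REQUIRED = {"dilution_volume"}


def run_tool(name: str, **kwargs: float) -> ToolResult:
    """Dispatch a tool call requested by the model and tag it for review."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    value = TOOLS[name](**kwargs)
    return ToolResult(tool=name, value=value, needs_review=name in REVIEW_REQUIRED)


# Example: dilute a 10 mM stock to 2 mM in a final volume of 50 mL.
result = run_tool("dilution_volume", stock_conc=10.0, target_conc=2.0, final_volume=50.0)
print(result)
```

The model only chooses which tool to call and with which arguments; the arithmetic itself is done by code that can be unit-tested, and the `needs_review` flag keeps a human in the loop for outputs such as dosing or dilution values.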

Open Questions

How will OpenAI navigate the biosecurity and safety risks associated with models that eventually achieve high scores on these advanced biological benchmarks?
Will open-weight models (such as specialized Llama variants) catch up to proprietary models like GPT-Rosalind through targeted fine-tuning on academic literature?
