CHI-Bench: Frontier Agents Struggle with Complex Healthcare Workflows

Summary

A new healthcare-specific benchmark, CHI-Bench, highlights the limits of current frontier agents. While AI models often excel at isolated tasks, the study finds they fail 72% of the realistic, multi-step clinical workflows it tests. Evaluation is shifting from single-task scores to reliability across many steps, tools, and policy gates.

What happened?

Researchers have released CHI-Bench, a benchmark that tests AI agents in realistic, long-horizon medical scenarios. Instead of just answering medical questions, agents must coordinate tasks over many steps, use various tools, and comply with regulatory requirements (policy gates). The result: leading models such as Claude, GPT, and Gemini complete only a minority of these end-to-end processes, failing 72% of them.

Why it matters

The result reflects a turning point in AI evaluation. Simple benchmarks are increasingly saturated, so they no longer separate strong models from weak ones. CHI-Bench addresses the “long-horizon” problem: an agent’s ability to act consistently and safely over hours or days without losing context or violating safety guidelines. In regulated industries like healthcare, this form of reliability is crucial for real-world deployment.

Evidence

72% Failure Rate: Frontier models fail in the majority of tested U.S. healthcare workflows.
Complexity Focus: The benchmark includes 163 clinical workflows with an average of 12 steps per task.
Tool Usage: Agents must integrate databases, appointment calendars, and medical records.
Regulatory Compliance: Adherence to HIPAA and other guidelines is part of the evaluation.

Analysis

The failure of agents in CHI-Bench suggests that “agentic reasoning” capabilities are not yet mature enough for high-stakes, multi-step processes. The issue is often not a lack of knowledge, but the loss of context over a long chain of actions. The rise of such benchmarks shows that the industry is moving from excitement over “what is possible” to the rigorous testing of “what works reliably.”
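
One way to see why long chains of actions are hard is compounding: small per-step error rates multiply across a workflow. The sketch below is a back-of-the-envelope illustration, not part of CHI-Bench. It assumes every step succeeds independently with the same probability, which the benchmark does not claim, and the function names are made up for this example. It uses the figures quoted above (72% failure, an average of 12 steps).

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    return step_reliability ** steps


def implied_step_reliability(workflow_success: float, steps: int) -> float:
    """Per-step reliability that would produce the given end-to-end success rate."""
    if not 0.0 <= workflow_success <= 1.0:
        raise ValueError("workflow_success must be between 0 and 1")
    return workflow_success ** (1.0 / steps)


if __name__ == "__main__":
    # Figures taken from the article: 72% failure rate, 12 steps on average.
    workflow_success = 1.0 - 0.72
    steps = 12

    per_step = implied_step_reliability(workflow_success, steps)
    print(f"Implied per-step reliability: {per_step:.3f}")

    # How end-to-end success degrades for a few hypothetical per-step rates.
    for rate in (0.90, 0.95, 0.99):
        print(f"{rate:.2f} per step -> {end_to_end_success(rate, steps):.3f} end to end")
```

Under these simplifying assumptions, a 28% end-to-end success rate over 12 steps corresponds to roughly 90% reliability per step, and even 95% per step would complete only about 54% of workflows. Real workflows have correlated errors and steps of uneven difficulty, so this is an intuition aid rather than an estimate of any model's actual per-step accuracy.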

Practical Takeaways

Companies should weigh results on long-horizon benchmarks more heavily than simple chat performance when selecting AI solutions.
Human-in-the-loop oversight remains essential for complex workflows at this stage.
Agent development must focus more heavily on handling policy gates and long-term context management.
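
The takeaways on policy gates and human oversight can be made concrete with a minimal sketch. Everything here is hypothetical: the `Step` fields, the `consent_gate` rule, and the `run_workflow` loop are illustrative names, not CHI-Bench's harness or any vendor's API. The idea shown is only that each step is checked against every gate before it runs, and the first blocked step halts the workflow and is handed to a human reviewer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    touches_patient_data: bool = False  # hypothetical flag for this sketch
    consent_on_file: bool = False       # hypothetical flag for this sketch


@dataclass
class RunLog:
    executed: List[str] = field(default_factory=list)
    escalated: List[str] = field(default_factory=list)


# A gate takes a step and returns True if the step may proceed.
PolicyGate = Callable[[Step], bool]


def consent_gate(step: Step) -> bool:
    """Illustrative rule: patient data may only be touched with consent on file."""
    return (not step.touches_patient_data) or step.consent_on_file


def run_workflow(steps: List[Step], gates: List[PolicyGate],
                 log: Optional[RunLog] = None) -> RunLog:
    """Run steps in order; stop at the first step any gate blocks and escalate it."""
    if log is None:
        log = RunLog()
    for step in steps:
        if all(gate(step) for gate in gates):
            log.executed.append(step.name)  # a real agent would call a tool here
        else:
            log.escalated.append(step.name)  # hand off to a human reviewer
            break
    return log


if __name__ == "__main__":
    workflow = [
        Step("look_up_schedule"),
        Step("read_medical_record", touches_patient_data=True),
        Step("book_appointment"),
    ]
    result = run_workflow(workflow, [consent_gate])
    print("executed:", result.executed)
    print("escalated:", result.escalated)
```

Stopping at the first blocked step is a deliberately conservative design choice for this sketch: later steps may depend on the blocked one, so continuing could compound the error rather than contain it.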

Open Questions

Will specialized “medical-only” models perform better on CHI-Bench than generic frontier models?
How quickly can reliability in multi-step tasks be increased through improved reasoning techniques?
Can multi-agent systems reduce failure rates through mutual oversight?