Microsoft MDASH: Multi-Agent Ensemble Tops Security Benchmarks

Summary

Microsoft has unveiled MDASH (Multi-Model Agentic Scanning Harness), an AI-powered security system that uses an ensemble of more than 100 specialized agents to discover and prove vulnerabilities. Unlike single-model approaches, MDASH orchestrates multiple frontier and distilled models that “debate” and validate each other’s findings, so it behaves more like a team of researchers than a chatbot. The system has already identified 16 new vulnerabilities in the Windows kernel and networking stack, including four critical remote code execution (RCE) flaws. Microsoft’s results amount to an early, large-scale production validation of multi-agent orchestration in high-stakes enterprise security.

What happened

The Microsoft Autonomous Code Security (ACS) team, which includes veterans of DARPA’s AI Cyber Challenge, developed MDASH to move AI vulnerability research into production engineering. MDASH operates in five stages: preparing target indices, scanning for candidate bugs, validating findings through “debater” agents, deduplicating results, and finally proving each bug by generating a triggering input. During its internal rollout, it found 16 previously unknown vulnerabilities in Windows, four of which were critical RCEs in the TCP/IP stack and the IKEv2 service.
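The five stages described above can be sketched as a small pipeline. This is a toy illustration, not MDASH’s implementation: every name here (Finding, prepare, scan, validate, deduplicate, prove, run_pipeline) is hypothetical, the “scanner” is a simple string match, and the “proof” is a placeholder input rather than a real trigger.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Finding:
    """A candidate bug. All fields are illustrative, not MDASH's schema."""
    path: str
    line_no: int
    bug_class: str
    proof_input: Optional[bytes] = None


def prepare(sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Stage 1: build a trivial index mapping each file to its lines."""
    return {path: text.splitlines() for path, text in sources.items()}


def scan(index: Dict[str, List[str]]) -> List[Finding]:
    """Stage 2: flag candidates. Toy heuristic: any line that calls memcpy."""
    candidates = []
    for path, lines in index.items():
        for i, line in enumerate(lines, start=1):
            if "memcpy(" in line:
                candidates.append(Finding(path, i, "unchecked-copy"))
                # Emit a duplicate on purpose so the dedup stage has work to do.
                candidates.append(Finding(path, i, "unchecked-copy"))
    return candidates


def validate(index: Dict[str, List[str]], candidates: List[Finding]) -> List[Finding]:
    """Stage 3: a stand-in 'debater' refutes a candidate when the
    preceding line contains a length check."""
    kept = []
    for f in candidates:
        lines = index[f.path]
        prev = lines[f.line_no - 2] if f.line_no >= 2 else ""
        if "if (len" not in prev:
            kept.append(f)
    return kept


def deduplicate(findings: List[Finding]) -> List[Finding]:
    """Stage 4: drop exact duplicates while preserving order."""
    return list(dict.fromkeys(findings))


def prove(findings: List[Finding]) -> List[Finding]:
    """Stage 5: attach a triggering input (a placeholder in this sketch)."""
    return [replace(f, proof_input=b"A" * 64) for f in findings]


def run_pipeline(sources: Dict[str, str]) -> List[Finding]:
    index = prepare(sources)
    candidates = scan(index)
    validated = validate(index, candidates)
    unique = deduplicate(validated)
    return prove(unique)


if __name__ == "__main__":
    demo = {"a.c": "if (len < 16)\nmemcpy(dst, src, len);\nmemcpy(buf, pkt, n);"}
    for finding in run_pipeline(demo):
        print(finding)
```

In the demo input, the first memcpy is refuted because a length check precedes it, the second survives validation, and its duplicate is removed before the proof stage.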

Why it matters

Traditional automated scanning often suffers from high false-positive rates and limited reasoning capabilities, and MDASH is aimed at both problems. Microsoft’s results suggest that in a multi-agent “ensemble”, the “agentic system” around the models matters more than any individual model. The approach delivers higher precision (zero false positives in a private, controlled test) and can reason about complex, proprietary codebases that were not part of any model’s training data.

Evidence

16 New Vulnerabilities: Previously unknown bugs identified in Windows networking and authentication stacks.
4 Critical RCEs: Found in the kernel TCP/IP stack and the IKEv2 service.
100% Recall: Achieved in tcpip.sys against five years of historical MSRC cases.
Benchmark Leader: Scored 88.45% on the public CyberGym benchmark, taking the top spot on the leaderboard.
Zero False Positives: In a controlled test with 21 planted vulnerabilities, MDASH found all 21 and reported no false positives.
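As a quick check on the terminology, the planted-vulnerability result above corresponds to a precision and recall of 1.0 each. The helper below is a generic sketch of the standard definitions; the function name is an assumption, and only the counts (21 found, 0 false positives, 0 missed) come from the figures listed above.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    reported = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Planted-vulnerability test: 21 planted, 21 found, nothing spurious reported.
precision, recall = precision_recall(true_pos=21, false_pos=0, false_neg=0)
print(precision, recall)  # 1.0 1.0
```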

Analysis

The success of MDASH highlights the maturity of “agentic workflows.” The system doesn’t just “ask an AI” to find bugs; it makes multiple agents compete and cooperate. “Auditor” agents flag suspect code, while “Debater” agents try to refute each flagged finding. If the debaters fail to disprove a finding, that finding’s credibility increases. This mimics the adversarial nature of real-world security research. Furthermore, because the architecture is model-agnostic, Microsoft can swap in better models as they emerge without rebuilding the entire pipeline.
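The auditor and debater dynamic can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Model type (any callable from prompt text to response text), the “REFUTED” reply convention, and the scoring rule (share of debaters that failed to refute) are hypothetical and are not taken from Microsoft’s description. Treating a model as a plain callable is one simple way to keep the harness model-agnostic.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a "model" is any function from prompt to reply,
# so backends can be swapped without touching the debate logic.
Model = Callable[[str], str]


@dataclass
class Candidate:
    """A finding flagged by an auditor, plus a tally of debate outcomes."""
    description: str
    survived: int = 0  # debaters that failed to refute the finding
    refuted: int = 0   # debaters that produced a refutation

    @property
    def credibility(self) -> float:
        total = self.survived + self.refuted
        return self.survived / total if total else 0.0


def debate(candidate: Candidate, debaters: List[Model]) -> Candidate:
    """Ask each debater to refute the finding and record the outcome.

    Assumed convention: a reply starting with 'REFUTED' counts as a
    successful refutation; anything else counts as a failed attempt.
    """
    for model in debaters:
        reply = model(f"Try to refute this finding: {candidate.description}")
        if reply.strip().upper().startswith("REFUTED"):
            candidate.refuted += 1
        else:
            candidate.survived += 1
    return candidate


if __name__ == "__main__":
    # Stub debaters standing in for real model calls.
    debaters: List[Model] = [
        lambda prompt: "REFUTED: a bounds check guards this path",
        lambda prompt: "NOT REFUTED: no guard found",
        lambda prompt: "Could not construct a counterargument",
    ]
    result = debate(Candidate("length field trusted in packet parser"), debaters)
    print(result.survived, result.refuted, round(result.credibility, 2))
```

A finding that survives most refutation attempts ends with a high score, which matches the idea that failed refutations raise a finding’s credibility.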

Practical takeaway

For enterprise security leaders, the message is that AI security is moving from “copilots” (assisting humans) to “autonomous systems” (performing end-to-end research). Organizations should look beyond single-model implementations and start investigating agentic orchestration for their own DevSecOps pipelines. While MDASH is currently in private preview, it indicates how high-value codebases are likely to be secured in the near future.

Open questions

Cost vs. Benefit: Running 100+ agents per scan is likely to be computationally expensive. How will Microsoft optimize this for broader commercial availability?
The AI Arms Race: How soon will offensive actors deploy similar multi-agent ensembles to find vulnerabilities before vendors can patch them?
Azure Availability: When will this harness be integrated directly into Azure DevOps or GitHub Advanced Security for external developers?