

Mercury Two Rewrites the Rules on Inference Speed_
There's a question that comes up a lot in production AI discussions that almost never gets asked in benchmark threads: what does it actually feel like to wait for a model? Not the MMLU score. Not the GPQA Diamond number. The raw, user-visible latency of sitting there while tokens trickle out one at a time.
For most of AI's recent history, that question had one answer: you wait, because that's just how language models work. They generate text the way a typewriter types — left to right, one character at a time, no skipping ahead. The industry's response has been to make that process faster through hardware, quantization, speculative decoding, and clever batching. Faster typewriters, essentially.
Inception Labs is asking a different question: what if we threw out the typewriter entirely?
What Mercury Two Actually Is
Mercury Two, released February 20, 2026, is a reasoning language model built on a diffusion architecture rather than the autoregressive transformer design that underlies virtually every other major LLM on the market. It generates over 1,000 tokens per second — Artificial Analysis independently measured 1,196 tokens per second via the Inception API — compared to roughly 89 tokens per second for Claude 4.5 Haiku and 71 for GPT-5 Mini. That is not a marginal improvement. That is a different category of speed.
Inception Labs isn't a newcomer doing something gimmicky. The company was founded by researchers from Stanford, UCLA, and Cornell, and CEO Stefano Ermon is a co-inventor of some of the core diffusion methods that power modern image and video generators. The founding team also contributed to FlashAttention, direct preference optimization, and decision transformers — techniques that are now considered foundational infrastructure in modern AI. These are people who know how the standard architecture works and are consciously choosing to build something different.
How Diffusion Language Models Work
If you've used Stable Diffusion or watched Sora generate a video, you already have an intuition for the core idea. Diffusion models don't build outputs piece by piece from left to right. Instead, they start with noise and iteratively refine it toward coherence.
Applied to text, this plays out roughly like this: rather than generating token 1, then token 2, then token 3 in strict sequence, a diffusion language model starts with a noisy, masked representation of the entire output sequence. It then runs multiple forward passes, each time refining the whole sequence simultaneously, gradually converging toward a coherent response. Inception Labs describes it as "less typewriter, more editor revising a full draft at once."
The critical implication: the number of forward passes the model needs does not scale linearly with output length the way autoregressive decoding does. An autoregressive model generating 1,000 tokens requires 1,000 sequential forward passes. A diffusion model runs a small number of refinement steps over the entire sequence, regardless of how long it is. That structural difference is where the throughput advantage comes from — and it's architectural, not a product of clever inference tricks that could be matched by better hardware alone.
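That scaling difference can be made concrete with back-of-envelope arithmetic. The diffusion step count below (20) is an illustrative assumption, not a figure published by Inception Labs:

```python
# Back-of-envelope pass counts for a 1,000-token response.
# The diffusion refinement step count (20) is an assumed,
# illustrative value, not a published figure.

def autoregressive_passes(n_tokens: int) -> int:
    return n_tokens  # one sequential forward pass per generated token

def diffusion_passes(n_tokens: int, refinement_steps: int = 20) -> int:
    return refinement_steps  # fixed step count, regardless of output length

n = 1_000
print(autoregressive_passes(n))                         # 1000
print(diffusion_passes(n))                              # 20
print(autoregressive_passes(n) // diffusion_passes(n))  # 50x fewer sequential passes
```

The ratio only grows with output length, which is why the gap widens on long generations.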
This iterative refinement also does something autoregressive models fundamentally cannot: it allows the model to revise earlier tokens during later refinement passes. In standard left-to-right generation, once a token is produced, it's committed. Diffusion generation has built-in error correction baked into the process, which has interesting implications for both reasoning quality and structured output generation.
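A toy sketch of that refinement loop, with a random scorer standing in for a real denoising model (which would score every masked slot from the full bidirectional context), purely to illustrate the commit-in-batches shape of the process:

```python
import random

# Toy sketch of diffusion-style decoding: start from a fully masked
# sequence and commit a batch of positions on each refinement pass.
# The "confidence" here is random and the predictions are hardwired
# to the target, purely illustrative, not a real denoiser.

MASK = "<mask>"
TARGET = "the quick brown fox jumps over the lazy dog".split()

def refine(seq, n_steps=3, rng=random.Random(0)):
    seq = list(seq)
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    per_step = max(1, len(masked) // n_steps)
    for _ in range(n_steps):
        if not masked:
            break
        # A real model would rank masked slots by predictive confidence.
        by_confidence = sorted(masked, key=lambda i: rng.random())
        for i in by_confidence[:per_step]:
            seq[i] = TARGET[i]  # pretend the denoiser predicted correctly
            masked.remove(i)
    for i in list(masked):  # final pass fills any remainder
        seq[i] = TARGET[i]
    return seq

print(" ".join(refine([MASK] * len(TARGET))))  # full sentence after 3 passes
```

Note that nothing forces left-to-right order: any position can be committed, and in a real model a committed token can still be revised on a later pass.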
The Benchmark Picture
Speed without quality is a toy. So how does Mercury Two actually perform?
| Benchmark | Mercury Two | Claude 4.5 Haiku | GPT-5 Mini |
|---|---|---|---|
| AIME 2025 | 91.1 | 84.0 | ~91 |
| GPQA Diamond | 73.6 | 67.0 | 80.0 |
| LiveCodeBench | 67.3 | 62.0 | 69.0 |
| IFBench | 71.3 | 54.0 | — |
| End-to-end latency | 1.7s | 23.4s | 14.4s* |
*GPT-5 Mini latency of 14.4s measured via Gemini 3 Flash with reasoning enabled.
Mercury Two is not uniformly best-in-class across every benchmark, and Inception Labs isn't claiming it is. What they're arguing — and what the numbers support — is that this level of quality, competitive with the best speed-optimized models in the world, is being delivered at end-to-end latency of 1.7 seconds.
One tunable feature worth highlighting: the reasoning_effort parameter lets you select from instant, low, medium, or high reasoning levels per request. More effort means more refinement passes, higher quality, slightly more latency — but still dramatically lower latency than autoregressive alternatives at equivalent reasoning depth.
Tip: Match `reasoning_effort` to task complexity rather than defaulting to `high`. For high-volume pipelines, `low` or `medium` can cut costs further while staying well within acceptable quality thresholds.
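One way to act on that tip is a small routing helper. The task labels and the mapping below are hypothetical; only the four effort levels (`instant`, `low`, `medium`, `high`) come from the feature itself:

```python
# Hypothetical routing helper: pick a reasoning_effort level per request
# from a coarse task label. The labels and mapping are illustrative;
# only the four effort level names come from the Mercury Two API.

EFFORT_BY_TASK = {
    "classification": "instant",       # single-label outputs need no deliberation
    "summarization": "low",
    "code_generation": "medium",
    "multi_step_reasoning": "high",
}

def pick_effort(task_type: str) -> str:
    # Default to "low" for unknown task types rather than "high",
    # keeping high-volume pipelines cheap by default.
    return EFFORT_BY_TASK.get(task_type, "low")

print(pick_effort("classification"))        # instant
print(pick_effort("multi_step_reasoning"))  # high
print(pick_effort("unknown_task"))          # low
```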
The Pricing Makes the Speed Argument Even Stronger
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Mercury Two | $0.25 | $0.75 |
| Mercury Two (cached) | $0.025 | $0.75 |
| Gemini 3 Flash | $0.50 | $3.00 |
| Claude 4.5 Haiku | $1.00 | $5.00 |
Mercury Two undercuts Haiku by 4x on input and more than 6x on output. For high-volume production workloads — customer support systems, agentic pipelines running hundreds of steps, coding assistants handling large contexts — the cost savings at scale are significant independent of the speed story.
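A quick cost sketch using the prices from the table above; the monthly token volumes are assumptions for illustration, not measured workloads:

```python
# Monthly cost sketch using the per-million-token prices from the
# pricing table. The 500M input / 100M output volumes are assumed
# for illustration.

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "Mercury Two":      (0.25, 0.75),
    "Gemini 3 Flash":   (0.50, 3.00),
    "Claude 4.5 Haiku": (1.00, 5.00),
}

def monthly_cost(model: str, input_tokens_m: float, output_tokens_m: float) -> float:
    inp, out = PRICES[model]
    return inp * input_tokens_m + out * output_tokens_m

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500, 100):,.2f}")
```

At those assumed volumes, the gap works out to roughly $200 a month on Mercury Two versus $1,000 on Haiku.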
Note: New API keys receive 10 million free tokens, which is enough to run serious evaluation workloads before committing to production spend.
Dropping It Into Your Stack
Mercury Two exposes an OpenAI-compatible API. Three changes from your existing code and you're running on a diffusion model:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="YOUR_INCEPTION_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain binary search trees"}],
    reasoning_effort="medium",
)

print(response.choices[0].message.content)
```

The `reasoning_effort` parameter is Mercury Two-specific, but everything else is standard OpenAI SDK syntax. The model supports:
- Tool use and structured JSON output
- RAG pipelines
- 128K context window
- LiteLLM, LangChain, and AISuite integrations
- Inception Platform directly, with AWS Bedrock and Azure Foundry expansion underway
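For the structured JSON path, here is a sketch of the request payload, assuming Mercury Two honors the standard OpenAI-style `response_format` parameter; the prompt and schema are illustrative:

```python
import json

# Sketch of a request payload for structured JSON output through the
# OpenAI-compatible chat completions endpoint. Assumes Mercury Two
# honors the standard response_format parameter; prompt is illustrative.

def build_json_request(prompt: str, effort: str = "low") -> dict:
    return {
        "model": "mercury-2",
        "messages": [{"role": "user", "content": prompt}],
        # Standard OpenAI-style switch for JSON-only responses.
        "response_format": {"type": "json_object"},
        "reasoning_effort": effort,
    }

payload = build_json_request("Return JSON with keys 'name' and 'type' for Haskell.")
print(json.dumps(payload, indent=2))
```

Because the payload shape is the standard one, it drops into LiteLLM or LangChain request paths without special casing.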
The use case where the speed advantage matters most is agentic workflows with multiple tool calls. When a model is orchestrating a sequence of browser lookups, API calls, or code executions, latency compounds at every step. A model that returns results in 1.7 seconds instead of 14 seconds doesn't just feel faster — it changes the economics of what's feasible to build.
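The compounding is easy to quantify. The per-call latencies below are the end-to-end figures quoted earlier in this post; the 20-step pipeline is an assumed workload, not a benchmark:

```python
# Latency compounding in a sequential agent loop. Per-call latencies
# are the end-to-end figures quoted in this post; the 20-step
# pipeline is an assumed workload.

def workflow_latency(per_call_s: float, n_steps: int) -> float:
    # Sequential tool use: each step blocks on the previous response.
    return per_call_s * n_steps

STEPS = 20
print(workflow_latency(1.7, STEPS))   # roughly 34 s end-to-end
print(workflow_latency(14.4, STEPS))  # roughly 288 s end-to-end
```

A half-minute agent run versus a five-minute one is the difference between an interactive product and a batch job.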
Where the Debate Sits
The ML community hasn't settled on whether diffusion is a genuine paradigm shift for language or a well-executed alternative for a specific niche. Autoregressive models have years of intense optimization work behind them — speculative decoding, continuous batching, quantization — and that research momentum continues. Skeptics reasonably point out that diffusion text models are earlier in their optimization curve, and that autoregressive inference may close the speed gap as it keeps improving.
The counterargument from Inception Labs' position is that their speed advantage is structural, not contingent on outrunning a competitor's optimization roadmap. You can't speculative-decode your way to a fundamentally different generation paradigm.
What's clear right now: Mercury Two is a production-ready reasoning model that is genuinely fast at a price point that's competitive with models that are a fraction of its speed. Whether diffusion becomes the dominant approach to language modeling or remains a powerful alternative for latency-sensitive workloads, this release forces a real conversation about the assumption that autoregressive is the only serious path forward.
If you build anything where generation speed matters — agentic systems, voice products, high-volume APIs, real-time coding tools — Mercury Two deserves a serious evaluation. The API is live, the pricing is transparent, and the benchmarks are public. The typewriter had a good run.