OTFotf
All posts

MiniMax M3: million-token context just got cheap — here's how to put it in your agent

D
DaveAuthor
6 min read
MiniMax M3: million-token context just got cheap — here's how to put it in your agent

MiniMax just dropped M3, and the headline is one most of us have been waiting for: million-token context that's actually cheap to run. The architecture behind it — sparse attention done right — is a genuinely hard problem, and MiniMax shipped it. If you write code with an agent, this is good news. Full stop.

Agentic coding is expensive for one specific reason: the agent re-reads a huge context every single turn. Prefill and decode over a long window is where the bill comes from. M3 attacks exactly that — and per early reports it's priced to match: around $0.60 per million input tokens, $2.40 output, $0.12 cache read, $0.75 cache write (listed, not yet final). Early testing opens tomorrow.

Let's give credit, show you how to wire it into your agent, and then talk about the part that doesn't change when the model does.

a clay developer happily plugging a glowing model chip into a laptop, warm celebratory moo

MiniMax earned this one

This isn't a lab dropping a benchmark and vanishing. The M-series has a track record. M1 shipped real open-weight lightning-attention. M2 and M2.7 are strong Mixture-of-Experts models — 230B total parameters, ~10B active, 256 experts with 8 active per token — MIT-licensed, open weights on day one, at roughly 8% of Claude Sonnet's price (MiniMax-M2 on GitHub). That's not a toy.

The detail that should earn your trust: MiniMax killed sparse attention in M2. They went back to full attention because their efficient-attention work "wasn't production-ready" yet. M3 reintroduces it only now that they believe it's ready (VentureBeat). Shipping a hard optimization, pulling it when it's not good enough, then shipping it again when it is — that's engineering discipline, not a hype cycle. Respect it.

Takeaway: the team has shipped cheap, open, capable models before. M3 is the next step on a real line, not a cold start.

What sparse attention actually buys you

The new piece is MiniMax Sparse Attention (MSA). Instead of attending densely across the entire context every turn, a lightweight index branch first selects which past key-value blocks matter, and sparse attention runs only on those (VentureBeat). It's built on grouped-query attention, and — importantly — it operates on real, uncompressed KV, unlike DeepSeek's MLA, which compresses into a latent space. Less information thrown away, in principle.

Why you care: that's the difference between "1M context exists on the spec sheet" and "1M context is affordable to actually use on every turn of an agent loop."

MiniMax's own figures put it at 9.7× faster prefill and 15.6× faster decoding at a 1,000,000-token context versus the predecessor. Those are vendor speed numbers, not independent benchmarks — and there's no published accuracy curve yet, so quality retention at long context is the thing to watch when the testing window opens tomorrow. But the direction is unambiguous: long context is getting cheap.

Takeaway: MSA is aimed squarely at the agentic / whole-codebase regime — the exact place token cost dominates your bill.

How to put it in your agent

Here's the part most write-ups skip. MiniMax keeps one access surface across model generations, so you can wire up the exact workflow now — rehearse on the shipping M2.7 today, and swap a single string when M3 testing opens tomorrow.

Claude Code (Anthropic-compatible endpoint):

export ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic"
export ANTHROPIC_AUTH_TOKEN="<your MiniMax key>"
export ANTHROPIC_MODEL="MiniMax-M2"          # → "MiniMax-M3" when testing opens
export ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2"
claude

Cursor / Codex / Cline (OpenAI-compatible): point the tool's custom base URL at MiniMax's OpenAI endpoint with your key and the model id (MiniMax-M2.7 today).

OpenRouter — one key, every tool:

# minimax/minimax-m2.7 — ~$0.26/M in, ~$1.20/M out, 205K context
# base URL: https://openrouter.ai/api/v1

Self-host: the M2 weights are MIT-licensed on Hugging Face (MiniMaxAI/MiniMax-M2), deployable via vLLM or SGLang.

The whole point of that stable surface: when M3 lands on the API, switching is ANTHROPIC_MODEL=MiniMax-M3 — no new wiring, no rewrite. Set the workflow up today, flip the string tomorrow.

Takeaway: you don't have to wait to prepare — the integration is a one-line swap away.

The part that doesn't change

Here's where it gets honest. The cost of tokens keeps falling — M3 is the latest proof, and it won't be the last. But the cost of an agent that can't read your codebase doesn't move at all. A faster, cheaper model still ships confident garbage into a repo it doesn't understand: it invents folder structures, picks a different state library than the three you already use, re-derives your auth flow from scratch because nobody told it how this project works.

a clay developer standing confidently next to a tall stack of neatly organized glowing cod

That's the gap a cheaper model doesn't close. It's the gap context closes — a readable codebase, a CLAUDE.md and .cursorrules that encode your conventions, tested prompts that say "here's how we add a feature." That layer is model-agnostic by design. It works the same whether the engine underneath is M3, Sonnet, or GPT.

OTF kits are exactly that layer: production-grade full-stack code plus the AI config and deploy scripts that turn any capable model into a teammate that ships to production — not a slot machine you keep re-pulling. M3 makes the tokens cheap. The kit is the part that makes them land.

Takeaway: bring whatever model is cheapest this month. The context is the durable asset.

Try it

Cheaper million-token context is a real tailwind for every builder — go use it. Wire M2.7 into your agent today, watch the quality benchmarks when the M3 testing window opens tomorrow, and swap the model string when you're happy. The model layer is becoming a commodity, and that's a gift: it means the thing that actually differentiates your output isn't which engine you rented this week. It's whether the codebase you point it at is one an agent can read, extend, and ship. Get that part right and every model release — M3 included — just makes you faster.

ai-toolsagentsarchitecture

On this page