JetBrains open-sources Mellum2 to challenge third-party API limitations

JetBrains’ Mellum2 open source coding model is a real step-change: a 12B-parameter model built for on-premises AI infrastructure and shipped open from day one. This isn’t a warmed-over code completion tool: Mellum2’s speed, scope, and native support for private deployment break new ground for agentic AI. On March 25, 2025, JetBrains announced Mellum2 with a clear ambition: go where API-locked models like Claude Code never will. For teams serious about AI infrastructure, not just code suggestions, this changes the map.

Mellum2 vs Mellum: how does the new model actually differ?

Mellum2 is not just “Mellum, scaled up.” The difference is structural. Mellum (late 2024) was a proprietary 4B-parameter model, delivered as a code completion engine for JetBrains IDEs — and eventually open-sourced in April 2025. Mellum2 multiplies the capacity (12B parameters), but the leap is in mandate: from single-task autocomplete to a practical infrastructure brain built for multi-agent AI systems.

Here’s what actually changes:

Parameter scale: Mellum2: 12B (with Mixture-of-Experts, see below). Original Mellum: 4B.
Capabilities: Mellum did code completion, period. Mellum2 was designed to coordinate sub-agent tasks, compress context for retrieval-heavy pipelines, and route queries across models.
Open from launch: Mellum2 shipped open-source “day one”, erasing the window between closed beta and broad adoption that limited “early Mellum.”
Variant support: Mellum2 ships base, “instruct,” and “thinking” models. The last is capable of explicit reasoning steps, covering agentic requirements beyond simple code-completion tasks.

From JetBrains’ own official announcement: "Mellum2 ... runs inference on infrastructure teams control themselves."

Takeaway: Mellum2 is built for the agentic AI layer, not just for writing code faster — a mandate shift as much as a scale jump.

Why Mellum2 actually enables on-premises AI — where Claude Code cannot

Most codegen AIs, including Anthropic’s Claude Code, demand third-party APIs. You send your codebase or context “to the cloud,” processing happens off-premises, and the outputs trickle back. For some orgs, this is fine; for regulated or privacy-minded infrastructure teams, it’s dead on arrival.

Mellum2 is open source and designed to run entirely on your hardware: no API calls, no vendor lock-in, no forced network hops. This enables:

Private, air-gapped deployments: Ship Mellum2 onto an internal cluster, leave the firewall closed.
Guaranteed data residency: Source never leaves your racks. No shared cloud cache, no cross-border risk.
Latency and control: The Mixture-of-Experts setup means only 2.5B of the 12B parameters go active per-token, keeping inference snappy enough for chained orchestration in high-frequency agent workflows.

In infrastructure overlays — routing, retrieval pipelines, delegated (“sub-agent”) jobs — the round-trip time alone often kills any benefit with a cloud-only model. Mellum2 is tailored for these workloads, where “close to the metal” wins over generality.
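
To make the round-trip argument concrete, here is a back-of-envelope sketch in Python. The step count, the 80 ms and 1 ms network round trips, and the 120 ms inference time are illustrative assumptions, not measurements of Mellum2 or of any hosted API.

```python
# Back-of-envelope latency for a chain of sequential agent calls.
# All numbers below are illustrative assumptions, not benchmarks.

def chain_latency_ms(steps: int, network_rtt_ms: float, inference_ms: float) -> float:
    """Total wall-clock time when each step must wait for the previous one."""
    return steps * (network_rtt_ms + inference_ms)

STEPS = 20             # hypothetical number of chained sub-agent calls
INFERENCE_MS = 120.0   # hypothetical per-call inference time, same in both setups
REMOTE_RTT_MS = 80.0   # hypothetical round trip to a third-party API
LOCAL_RTT_MS = 1.0     # hypothetical round trip inside your own network

remote = chain_latency_ms(STEPS, REMOTE_RTT_MS, INFERENCE_MS)
local = chain_latency_ms(STEPS, LOCAL_RTT_MS, INFERENCE_MS)

print(f"remote API: {remote:.0f} ms")
print(f"on-prem:    {local:.0f} ms")
print(f"network share of remote total: {STEPS * REMOTE_RTT_MS / remote:.0%}")
```

Because the calls are sequential, the network cost is paid once per step, so it grows with chain length even when inference speed is identical.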

Security and compliance pressure, especially in finance, bio, and critical infrastructure, makes this a must-have: Mellum2 lets you build agentic AI with zero third-party exposure.

Takeaway: Mellum2 is not just faster on-prem — it’s actually viable there, which API-locked models are not.


How to deploy and use Mellum2 now in real engineering stacks

Ready to run it? Mellum2’s open-source stack and local mode mean you’re not waiting for a commercial host. Here’s your starter workflow:

Get the model files:
JetBrains releases Mellum2 (base, instruct, thinking) under an open license. You’ll need enough disk for a 12B model and enough RAM/GPU memory for your preferred throughput. (Exact file names and download links depend on the release; check JetBrains’ official announcement.)
Run local inference:
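Standard practice is to use a serving engine like llama.cpp or vLLM, adapting for any Mellum2-specific quirks. Example launch (assuming vLLM on CUDA and a hypothetical local path /models/mellum2-instruct):
```
# Start vLLM's OpenAI-compatible server (listens on port 8000 by default).
# --served-model-name sets the model name that clients pass in requests.
vllm serve /models/mellum2-instruct \
  --served-model-name mellum2-instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.9
```
Adapt the model path, --dtype, and memory flags to your hardware.
To expose an instruct-mode endpoint, make sure you serve the instruct variant.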
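When choosing a dtype and memory settings, a rough weight-memory estimate is a useful first check. The sketch below simply multiplies parameter count by bytes per parameter. It counts weights only (no KV cache, activations, or runtime overhead), and the int8 and int4 rows assume quantized builds exist, which the announcement quoted here does not confirm.

```python
# Rough weight-only memory estimate for a 12B-parameter model.
# Excludes KV cache, activations, and serving-engine overhead.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gib(n_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

TOTAL_PARAMS = 12e9  # total parameter count quoted for Mellum2

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_gib(TOTAL_PARAMS, dtype):6.1f} GiB")
```

In typical Mixture-of-Experts serving setups, all experts stay resident in memory even though only some are active per token, so it is safer to size for the full 12B rather than the 2.5B active figure.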
Integrate agentic workloads:
For developers orchestrating sub-agents (e.g., task routing, context caching), plug Mellum2 in at the coordination layer. Use it for:
- Code completion and review within your IDE
- Retrieval compression for answering questions over large docs (indexing pipelines)
- Sub-agent management: Have Mellum2 broker, break down, and sequence jobs in your own job-queue system.
Here’s a simple Python client that uses the server’s OpenAI-compatible API (assuming a Mellum2 server is running locally and the openai package is version 1.0 or later):
```
from openai import OpenAI

# Point the client at the local server. The key is a placeholder:
# a local server started without an API key does not check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-local-")

resp = client.chat.completions.create(
    model="mellum2-instruct",  # must match the model name the server exposes
    messages=[
        {"role": "system", "content": "You are an expert code routing assistant."},
        {"role": "user", "content": "Decompose the following task into sub-agent jobs: implement distributed cache with LRU and sharding."},
    ],
)
print(resp.choices[0].message.content)
```
For JetBrains IDE integration, watch for pluggable language server protocol (LSP) or Mellum2-native plugin updates.
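
For the sub-agent management use case above, one possible shape is to ask the model for a JSON job list and order it by dependencies before pushing it onto your queue. The JSON schema (`id`, `task`, `depends_on`) and the canned reply below are hypothetical. The source does not specify Mellum2’s output format, so you would need to prompt for such a structure and validate it.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def order_jobs(model_reply: str) -> list[dict]:
    """Parse a JSON job list and return the jobs in a dependency-safe order."""
    jobs = {job["id"]: job for job in json.loads(model_reply)}
    # Map each job id to the set of job ids it depends on.
    graph = {job_id: set(job.get("depends_on", [])) for job_id, job in jobs.items()}
    # static_order() raises CycleError if the dependencies are circular.
    return [jobs[job_id] for job_id in TopologicalSorter(graph).static_order()]

# Canned stand-in for a model reply (hypothetical format, not real model output).
reply = """[
  {"id": "shard", "task": "design shard key and ring", "depends_on": []},
  {"id": "lru", "task": "implement per-shard LRU", "depends_on": ["shard"]},
  {"id": "tests", "task": "write integration tests", "depends_on": ["lru", "shard"]}
]"""

for job in order_jobs(reply):
    print(job["id"], "->", job["task"])
```

In a real pipeline you would validate the reply before trusting it: a dependency on an unknown id fails here with a KeyError, and malformed JSON fails in `json.loads`.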

Takeaway: Mellum2 can run today in your own racks, with the workflow as close to “point and serve” as a modern AI stack allows.

[Figure: schematic of an on-prem server running Mellum2 side-by-side with code editors and an agent]

What does JetBrains mean by a "focal model" — and how does Mellum2 fit?

JetBrains staff call Mellum2 a “focal model” — not a claim to out-benchmark frontier LLMs, but to nail one high-impact, high-frequency use case: software engineering orchestration. It’s not about broad generative smarts or world knowledge — it’s about:

Speed and specialization: Mixture-of-Experts means per-token inference uses only a working set of 2.5B active params. Real engineering stacks want wire speed and minimal context lag, not an extra few percent in synthetic “general coding” benchmarks.
Lean surface area: Focal models accept narrower mandates — but optimize hard within those bounds, so your orchestration (sub-agents, retrieval, task routing) becomes reliable infrastructure, not novelty.

As JetBrains’ engineers put it: "This specialization ensures the model excels in software engineering environments while remaining lean and fast." Frontier experiments continue — meanwhile, “focal” keeps infra predictable.
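
To illustrate why a Mixture-of-Experts model touches only part of its weights per token, here is a toy top-k router in plain Python. It is a generic sketch of the technique, not Mellum2’s architecture: the expert count, the value of k, and the gate scores are made up, and real experts are neural network layers rather than scalar functions.

```python
import math

def top_k_route(gate_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = {i: math.exp(gate_scores[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Toy "experts": each one just multiplies its input by a fixed factor.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)]

def moe_forward(x: float, gate_scores: list[float], k: int = 2) -> float:
    """Weighted sum over the selected experts only."""
    weights = top_k_route(gate_scores, k)
    # Only the selected experts run; the rest cost no compute for this token.
    return sum(w * experts[i](x) for i, w in weights.items())

scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # made-up gate outputs
routing = top_k_route(scores, k=2)
print("active experts:", sorted(routing))
print("output:", moe_forward(1.0, scores))
```

With the figures quoted above, 2.5B active out of 12B total means roughly 21% of the parameters take part in each token’s computation.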

Takeaway: Mellum2 is an infrastructure-biased model, built for tasks where latency and determinism matter more than general IQ.

The future: open source AI models as infrastructure control points

Mellum2’s open-source release is part of a visible trend: heavyweight AI models released open from the first commit, not after the fact. For emerging agentic AI, running open models on your own hardware is becoming the norm, not just a fallback for the risk-averse.

Industry is moving away from the “API-as-datacenter” pattern. It’s not just security rhetoric — control of source, weights, and inference enables:

Auditable, modifiable logic: Tweak, debug, or extend model behavior as needed.
Community-driven variants: Expect “instruct” and “thinking” derivatives to be tuned by integrators for very specific verticals, not just handed down from one vendor.
Ecosystem independence: Avoid getting trapped by upstream API changes, quota limits, or “region not supported.”

Where does OTF fit here? The value is in treating orchestration and routing chains as a durable substrate underneath the shifting landscape of models. Mellum2 may be central today, but with open models and open orchestration frameworks, your workflow persists regardless of which model has the advantage next year.

Takeaway: Open-source models like Mellum2 put engineering teams back in control of their infrastructure layer, propelling a wider trend away from vendor lock-in.

[Figure: code pipeline diagram showing context retrieval, Mellum2 focal worker, and alternative]

The Mellum2 open source coding model lands as something rare: a high-capacity, high-speed infrastructure model shipped open from the start. It’s not a mere code completion engine; it’s the first “focal model” to face agentic AI tasks head-on (routing, retrieval, orchestration), where API-bound tools like Claude Code cannot safely or practically go. Deployable on your hardware and modifiable by your own team, Mellum2 finally lets engineers prioritize privacy, speed, and specialization over lock-in and latency.
