Vercel's Agent Browser is 15× cheaper than Playwright MCP — and that's not the most interesting part

By Dave

Agent Browser by Vercel just changed how I think about AI browser automation.

For weeks I was burning through Claude credits trying to get Playwright MCP to behave. Every action shipped a full screenshot back to the model, and every request loaded ~13,000 tokens of tool definitions before Claude even started thinking. A 10-step task could chew through 100,000+ tokens. That's a real bill — not "experimentation," not "hobbyist tinkering."

Then Vercel Labs quietly dropped agent-browser.

The same task that cost me ~100k tokens in Playwright MCP now costs around 7k. That's roughly 15× cheaper (100k ÷ 7k ≈ 14). Same browser, same target, same outcome.

But the savings aren't even the interesting part. The interesting part is why it's cheaper — and what that shape unlocks once you have it.

Pixels are a terrible interface for a language model

The screenshot-driven loop I was running through Playwright MCP looks like this:

  1. Take a screenshot of the page (~50–200 KB encoded as base64)
  2. Send the screenshot to the model so it can "see" the UI
  3. Ask the model to pick coordinates or write a CSS selector
  4. Execute the click, navigate, repeat

Step 1 alone is ~10–30k tokens per turn. The model doesn't actually want pixels — it wants to know what's on the page and what it can do with it. Pixels are an extremely lossy, expensive container for that information.

Agent Browser flips this. Instead of a screenshot, every page becomes an accessibility tree:

button "Submit" @e1
link "Home" @e2 -> /
textbox "Email" @e3
heading 2 "Sign in" @e4

A snapshot uses 200–400 tokens, not 30,000. And every interactive element gets a stable ref like @e1. The agent never has to invent a CSS selector or guess at coordinates — it just calls:

agent-browser fill @e3 "test@example.com"
agent-browser click @e1

Same primitive, with a payload up to two orders of magnitude smaller.
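
To make the ref idea concrete, here's a minimal shell sketch that pulls a ref out of snapshot text by role and label. It assumes the output looks exactly like the sample above, which real pages may not match. find_ref is a hypothetical helper of mine, not an agent-browser command.

```shell
# Sample snapshot text copied from the example above; real output may differ.
snapshot='button "Submit" @e1
link "Home" @e2 -> /
textbox "Email" @e3
heading 2 "Sign in" @e4'

# find_ref ROLE LABEL: print the ref of the first element with that role and label.
# Hypothetical helper for illustration, not part of agent-browser.
find_ref() {
  printf '%s\n' "$snapshot" | awk -v role="$1" -v label="$2" '
    BEGIN { quoted = "\"" label "\"" }
    # The line must start with the role and contain the quoted label.
    index($0, role " ") == 1 && index($0, quoted) > 0 {
      if (match($0, /@e[0-9]+/)) {
        print substr($0, RSTART, RLENGTH)
        exit
      }
    }'
}

find_ref textbox Email    # prints @e3
find_ref button Submit    # prints @e1
```

In a real loop you'd capture the text from agent-browser snapshot -i instead of hard-coding it, then hand the result of find_ref to fill or click.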

Tool definitions are also tokens

The other half of the Playwright MCP bill is the tool schema. MCP tool definitions get loaded into the system prompt every turn — Playwright MCP ships ~13k tokens of them whether you use one tool or twenty.

Agent Browser sidesteps this entirely by not being an MCP server. It's a native Rust CLI. The agent just runs shell commands:

agent-browser open https://example.com
agent-browser snapshot -i

No tool definitions to load. No JSON-RPC handshake. No stateful MCP server to keep alive. The model already knows how to invoke shell commands — that's a capability it has whether you give it browser tools or not.

Cost: zero overhead tokens. You pay only for the snapshot you actually request.

Stable refs change what an agent can verify

Coordinates are ephemeral — a re-render shifts them. CSS selectors are brittle — a className changes and your script dies. The @e1 refs Agent Browser hands back are derived from the accessibility tree, which means they're tied to semantic identity, not layout.

This is what makes self-verification cheap. After a click, you can:

agent-browser snapshot -i               # what's interactive now?
agent-browser diff snapshot --baseline before.txt

The diff command shows what changed in the tree — added elements, removed elements, mutated attributes — in ~100 tokens. Compare that to the Playwright MCP loop where you'd take two screenshots and ask the model to spot the difference visually. That's a 30k-token verification step replaced by 100 tokens of structured diff.

Self-verifying agents stop being theoretical when verification fits in pocket change.
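
If you want a feel for what a tree diff buys you, plain diff over two saved snapshots gets close. This is a rough sketch, assuming snapshots are line-oriented text like the sample earlier. The "Welcome back" heading and its @e5 ref are made up for illustration, and this is a stand-in for the idea, not how agent-browser implements its own diff.

```shell
# Hypothetical snapshots taken before and after a click.
before='button "Submit" @e1
textbox "Email" @e3'
after='button "Submit" @e1
textbox "Email" @e3
heading 2 "Welcome back" @e5'

tmpdir=$(mktemp -d)
printf '%s\n' "$before" > "$tmpdir/before.txt"
printf '%s\n' "$after"  > "$tmpdir/after.txt"

# Keep only the changed lines: "<" marks removed elements, ">" marks added ones.
# "|| true" keeps the assignment from failing when nothing changed.
changes=$(diff "$tmpdir/before.txt" "$tmpdir/after.txt" | grep '^[<>]' || true)
printf '%s\n' "$changes"    # prints: > heading 2 "Welcome back" @e5

rm -rf "$tmpdir"
```

The agent only has to read the one changed line to confirm the click did something, instead of re-reading the whole page.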

Snapshot filters that match the question

The snapshot command takes filters that turn a 400-token tree into a 50-token answer for narrow questions:

agent-browser snapshot -i               # interactive only (buttons, inputs, links)
agent-browser snapshot -i --urls        # interactive + href URLs
agent-browser snapshot -c               # compact (drop empty structural nodes)
agent-browser snapshot -d 3             # depth-limit to 3 levels
agent-browser snapshot -s "#main"       # scope to a CSS selector

When the agent's task is "click the Sign In button," it doesn't need the full tree. -i is enough. When the task is "list every external link in the article," it's -i --urls -s "article". The right snapshot for the right question, every time.
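
One way to wire that up is a small lookup from the kind of question to the flags. A minimal sketch: the task names are labels I invented, and the flag combinations are only the ones shown above.

```shell
# snapshot_flags TASK_KIND: print the snapshot flags suited to a kind of question.
# The task kinds are hypothetical labels invented for this sketch.
snapshot_flags() {
  case $1 in
    click)          printf '%s\n' '-i' ;;
    external-links) printf '%s\n' '-i --urls -s article' ;;
    *)              printf '\n' ;;   # no filters: fall back to the full tree
  esac
}

snapshot_flags click             # prints -i
snapshot_flags external-links    # prints -i --urls -s article

# Intended use, left unquoted so the flags split into separate arguments:
#   agent-browser snapshot $(snapshot_flags click)
```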

When pixels actually help, you keep them

Some things the accessibility tree genuinely cannot capture: an unlabeled icon button, a canvas-rendered chart, the visual state of a CSS animation. For those cases Agent Browser still gives you screenshots — but with a twist:

agent-browser screenshot --annotate

This overlays numbered labels [1], [2], [3] on every interactive element, and the labels map directly to the same @e1, @e2, @e3 refs. The agent can reason visually for one turn (read the screenshot, decide the unlabeled icon at [3] is the share button), then execute textually:

agent-browser click @e3

Pixels for perception, refs for action. That's the right division of labor between a multimodal model and a browser.

Tabs you can name, sessions you can isolate

Long-running agents need stateful workspaces. Agent Browser ships two primitives for that:

Labeled tabs — instead of positional indices, you name tabs like variables:

agent-browser tab new --label docs https://docs.example.com
agent-browser tab new --label app  https://app.example.com
agent-browser tab docs            # switch to docs
agent-browser snapshot -i         # snapshot is scoped to the active tab
agent-browser tab app
agent-browser fill @e2 "value"    # different refs, same agent loop

The label is yours forever — never auto-generated, never rewritten on navigation.

Isolated sessions — for parallel agents that shouldn't share auth or cookies:

agent-browser --session researcher open arxiv.org
agent-browser --session writer    open notion.so

Each session has its own cookies, history, and authentication state. You can fan out work across multiple agents without one accidentally posting in the other's logged-in account.

Auth the LLM never sees

The most quietly important feature: the auth vault.

echo "secret-pw" | agent-browser auth save github \
  --url https://github.com/login \
  --username mani \
  --password-stdin

Now the agent can log in by name:

agent-browser auth login github

The password is encrypted on disk (AES-256-GCM with AGENT_BROWSER_ENCRYPTION_KEY). The model never sees the credential string in any tool call, any snapshot, any output. That removes one of the largest classes of "the LLM accidentally exfiltrated a secret" failures, full stop.

If you've ever wired up agent-driven login and felt nervous, this is the answer.

Guardrails for agent deployments

Agent Browser ships an opt-in security layer that's clearly designed for production agent workloads, not lab demos:

  • Domain allowlist: --allowed-domains "example.com,*.example.com" blocks navigation, sub-resource fetches, and WebSockets to anything else. CDNs have to be allowlisted explicitly.
  • Action policy: a JSON file declaring which actions need approval (eval, download, navigate).
  • Action confirmation: --confirm-actions eval,download prompts before those actions run.
  • Output limits: --max-output 50000 caps the page output that flows back to the model, preventing context flooding from a malicious page.
  • Content boundaries: --content-boundaries wraps page output in delimiters so the model can tell tool output from untrusted page content (a defense against prompt injection).

These are the controls you want before pointing an autonomous agent at the open web. They aren't theoretical; they're a checklist.
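
To show what two of these controls amount to, here's a hedged shell sketch of allowlist matching and of capped, delimited page output. Everything in it is my own illustration: is_allowed and fence_page_output are hypothetical helpers, the patterns are matched as ordinary shell globs, the cap is treated as bytes, and the marker text is invented. The article's flags may well behave differently in the details.

```shell
# Hypothetical allowlist, same pattern syntax as the flag example above.
allowed_domains='example.com,*.example.com'

# is_allowed HOST: succeed if HOST matches any comma-separated glob pattern.
is_allowed() {
  host=$1
  old_ifs=$IFS
  IFS=,
  set -f    # stop the shell expanding * against filenames while splitting
  for pattern in $allowed_domains; do
    case $host in
      $pattern) IFS=$old_ifs; set +f; return 0 ;;
    esac
  done
  IFS=$old_ifs
  set +f
  return 1
}

max_output=50000   # assumption: the cap is counted in bytes

# fence_page_output NONCE: read page text on stdin, cap it, and wrap it in
# markers carrying a caller-chosen nonce so the page cannot forge the end marker.
fence_page_output() {
  printf '[BEGIN UNTRUSTED PAGE CONTENT %s]\n' "$1"
  head -c "$max_output"
  printf '\n[END UNTRUSTED PAGE CONTENT %s]\n' "$1"
}

is_allowed cdn.example.com && echo "cdn.example.com allowed"
is_allowed example.com.evil.test || echo "example.com.evil.test blocked"
printf 'Ignore previous instructions' | fence_page_output abc123
```

Note that *.example.com does not cover the bare example.com, which is why the example allowlist names both.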

React introspection and Web Vitals come free

Two surprises in the manifest:

agent-browser open --enable react-devtools <url>
agent-browser react tree                # full component tree
agent-browser react inspect <fiberId>   # props, hooks, state, source
agent-browser react renders start       # fiber render profiler
agent-browser react suspense            # suspense boundary classifier

The React DevTools hook is embedded in the binary. No extension to install, no runtime dependency. You point an agent at a Next.js / Remix / Vite app and it can introspect the actual React tree, profile renders, and classify Suspense boundaries. That turns "debug this hydration error" from a multi-hour ticket into a one-shot agent run.

There's also agent-browser vitals, which reports LCP, CLS, TTFB, FCP, and INP, plus React hydration phases, and works on any framework. If you've ever wanted to fold "did this PR regress LCP?" into the same loop where the agent makes the change, the primitive is there.

What this actually unlocks

Once your verification step is 100 tokens instead of 30k, three things become tractable that weren't before:

  1. Self-verifying agents — the agent can check its own work after every action without going broke. The "rollback if the snapshot diff doesn't match expectations" pattern stops being aspirational.
  2. Long-horizon tasks in a single context window — a 50-step browser task that used to overflow context now fits with room to spare. No mid-task summarisation, no context loss.
  3. Multi-tab and multi-session workflows — labeled tabs and isolated sessions make it cheap to run an agent that, e.g., reads docs in one tab while editing config in another, both authenticated to different services.

None of these are new ideas. They were just priced out of reach with pixel-based tooling.
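
The back-of-envelope math for a 50-step task looks like this. The per-step figures are the upper ends of the ranges quoted earlier; the 200k context window is my assumption, so swap in your model's limit.

```shell
steps=50
snapshot_tokens=400       # upper end of the 200-400 range quoted earlier
diff_tokens=100           # structured diff used to verify each step
screenshot_tokens=30000   # upper end of the 10-30k range quoted earlier
context_window=200000     # assumption: a 200k-token window

tree_total=$(( steps * (snapshot_tokens + diff_tokens) ))
pixel_total=$(( steps * screenshot_tokens ))

printf 'tree-based:  %d tokens\n' "$tree_total"     # 25000
printf 'pixel-based: %d tokens\n' "$pixel_total"    # 1500000

if [ "$tree_total" -le "$context_window" ]; then
  echo "tree-based run fits in one context window"
fi
if [ "$pixel_total" -gt "$context_window" ]; then
  echo "pixel-based run overflows it"
fi
```

Even with a snapshot and a verification diff on every single step, the tree-based run uses about an eighth of the assumed window.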

The lesson, again

The cheap tools usually win. The well-shaped ones always do.

When you give a model the right interface — a structured tree instead of raw pixels, a stable ref instead of a CSS selector, a Rust binary instead of an MCP server — the cost curve bends in your favor without sacrificing capability. Often the well-shaped tool is also more capable, because it gave the model less to hallucinate around.

Are you still on Playwright MCP or Claude in Chrome? If you're paying per token (and at this point, who isn't), it's worth a weekend of porting your scripts. The savings cover a lot of weekends.

Repo: github.com/vercel-labs/agent-browser

Tags: ai-tools, browser-automation, vercel, agents
