Shipping with Your Agent: The Stay-in-IDE Workflow for Production Apps (CLAUDE.md, .cursorrules, Design-System Context)
The workflow is the product — a CLAUDE.md, a .cursorrules file, a design-token manifest, and a prompt library are the difference between an agent that regenerates and one that ships.
The agents are good now. Claude Code, Cursor, Codex — point any of them at a small, well-organized repo and they write code that compiles, reads like a human wrote it, and lands close to what you asked for. That part is mostly solved.
What is not solved is the second hour. The agent that wrote a clean component in session one rewrites a competing version of it in session three, because it never saw the first. It picks a fresh hex value instead of your brand color. It invents a date helper you already have three of. It "fixes" a thing you deliberately locked. The model didn't get worse — it never had the context that would have stopped it.
That gap is a workflow problem, not a model problem. And the workflow has a shape you can build deliberately: a handful of context files that live in your repo, get read on every session, and tell the agent what already exists, what the conventions are, and what it is not allowed to touch. This guide is the actual system — what goes in each file, why, and how the pieces fit. The thesis underneath it: the workflow is the product. The diff between an agent that thrashes and one that ships is not a better prompt. It is a repo the agent can read once and act on correctly.
Why most AI coding sessions produce throwaway output
Open a fresh session, ask for a feature, and the agent starts from nothing. It doesn't know your folder structure, your naming, the component you built last week, or the three architectural calls you'd never reverse. So it does the only thing it can: it infers. It guesses a reasonable structure, writes reasonable code, and the result is plausible but disconnected from everything around it.
Agents without codebase context regenerate instead of reuse
The default failure mode is regeneration. The agent needs a button, doesn't find yours (or doesn't look), and writes a new one. Now you have two buttons with different prop shapes. Next session it needs a modal and does the same. Each artifact is locally fine and globally wrong, because nothing told the agent "reuse this, don't rebuild it." The cost compounds: every regenerated piece is one more thing to reconcile, and code the agent regenerates every session instead of reusing is one of the largest sources of waste in an agentic workflow. It also feeds straight into the broader vibe coding cost trap: longer sessions, bigger context, more tokens, less to show.
The session-by-session memory loss problem
Agents don't remember. Every session is a cold start: the model has no durable record of what it decided yesterday or why. The conventions you established by hand in session one are gone by session four unless they are written down somewhere the agent reads. This is exactly why so much AI output ends up feeling like scaffolding you'll throw away: there's no continuity, so nothing accretes into a codebase you trust. We've written about why AI-coded projects start to feel disposable; the root cause is almost always missing persistent context, not the model.
How design-system context changes what the agent generates
The fix is to stop relying on the model's memory and start relying on the repo's. When the agent can read your tokens, your component list, and your locked decisions on every run, "infer a reasonable button" becomes "use Button from the kit with the primary variant." The output stops being plausible-and-disconnected and starts being correct-and-consistent. That shift — from the model guessing to the repo telling it — is what treating your design system as agent context actually buys you, and it's the foundation everything else in this guide sits on.
The three files that define what your agent can do
Most of the payoff comes from three files. None of them are exotic. All of them are plain text that lives in your repo and gets read on every session. Set them up once and the agent inherits them forever.
CLAUDE.md — the project memory file (and why it must stay under 200 lines)
CLAUDE.md is the file Claude Code reads top to bottom at the start of every session. It is the closest thing the agent has to long-term memory of your project, which makes it the highest-leverage file in the repo — and the easiest to ruin.
The failure mode is bloat. People treat CLAUDE.md as a wiki and stuff it with every detail of the system, and past a certain length the model stops meaningfully processing the back half. The discipline that works is keeping CLAUDE.md under 200 lines and treating it as a router, not an encyclopedia: it says "here is what exists and where to find it," not "here is everything you could ever need to know." Long sessions get compacted and later sections get summarized, but the top of the file stays verbatim, so the load-bearing rules go at the top.
What earns a place in CLAUDE.md:
# MyApp — agent context
## Commands
- Dev: `pnpm dev`
- Test: `pnpm test`
- Typecheck: `pnpm typecheck` (must pass before every commit)
## Locked decisions (do not change without asking)
- Auth lives in `lib/auth.ts` — one provider, do not add a second
- All money is integer cents, never floats
- DB access goes through `db/` repositories, never raw SQL in routes
## Conventions
- UI primitives import from the kit only — never raw stack imports in a screen
- Design tokens only, no hex literals in feature code
- New routes follow `app/<feature>/page.tsx`
## Deep context (read when relevant)
- Data model → docs/data-model.md
- Payment flow → docs/payments.md
Everything else (the long explanations, the edge cases, the history) goes in docs/ and gets referenced by one line. The agent reads the router, then pulls the deep file only when the task needs it. That keeps the always-on context small and the verbatim top of the file dense with rules that actually constrain behavior.
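The 200-line budget is easy to state and easy to let slip, so it can help to check it mechanically. The following is a minimal sketch, not part of any tool: `checkContextBudget` and `BudgetReport` are hypothetical names, and it assumes you pass in the file's text yourself (for example from a CI step). It counts lines and lists the headings in order, so you can confirm the locked decisions really do sit near the top.

```typescript
// Hypothetical helper: checks a context file such as CLAUDE.md against a line budget.
interface BudgetReport {
  lineCount: number;
  maxLines: number;
  withinBudget: boolean;
  // Markdown headings in file order, to eyeball what sits at the top.
  headings: string[];
}

function checkContextBudget(content: string, maxLines = 200): BudgetReport {
  // Drop one trailing newline so "a\nb\n" counts as two lines, not three.
  const trimmed = content.replace(/\r?\n$/, "");
  const lines = trimmed === "" ? [] : trimmed.split(/\r?\n/);

  const headings = lines
    .filter((line) => /^#{1,6}\s+/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim());

  return {
    lineCount: lines.length,
    maxLines,
    withinBudget: lines.length <= maxLines,
    headings,
  };
}
```

Failing the build when `withinBudget` is false is one way to force the "move it to docs/ and link it" habit before the file turns into a wiki.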
.cursorrules — the Cursor-specific conventions layer
If your team (or you) work in Cursor, .cursorrules plays the same role for that agent. It carries the conventions, the forbidden patterns, and the "always do it this way" rules into Cursor's context on every edit. The content overlaps heavily with CLAUDE.md — same locked decisions, same naming, same banned patterns — because the rules are about your codebase, not about which agent reads them.
Keep the two in sync. The simplest way is to write the conventions once and mirror them: the architectural decisions, the import rules, the "tokens not hex" constraint, the folder layout. A divergence between CLAUDE.md and .cursorrules is a slow-acting bug — one agent follows a rule the other doesn't, and the codebase drifts depending on who touched it last. Treat them as two readers of one source of truth.
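One way to treat the two files as readers of a single source of truth is to keep the shared rules in one structure and check both files against it. This is a sketch under assumptions: `RuleSet`, `renderRules`, and `missingRules` are hypothetical names, and the drift check is a plain substring match, which is crude but enough to catch a rule that was added to one file and forgotten in the other.

```typescript
// Hypothetical single source of truth for the rules both agents should see.
interface RuleSet {
  lockedDecisions: string[];
  conventions: string[];
}

// Renders the shared rules as a markdown section that can be pasted into
// (or generated into) both CLAUDE.md and .cursorrules.
function renderRules(title: string, rules: RuleSet): string {
  const section = (heading: string, items: string[]): string =>
    [`## ${heading}`, ...items.map((item) => `- ${item}`)].join("\n");

  return (
    [
      `# ${title}`,
      section("Locked decisions (do not change without asking)", rules.lockedDecisions),
      section("Conventions", rules.conventions),
    ].join("\n\n") + "\n"
  );
}

// Drift check: returns every rule in the source of truth that does not
// appear verbatim in the given file content.
function missingRules(fileContent: string, rules: RuleSet): string[] {
  const allRules = [...rules.lockedDecisions, ...rules.conventions];
  return allRules.filter((rule) => fileContent.indexOf(rule) === -1);
}
```

Run `missingRules` against each file in CI; a non-empty result for either one means the two readers no longer share the same rules.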
The design token manifest — what the agent uses instead of guessing hex values
The third file is the one most people skip, and it's the one that most visibly separates output that looks like a product from output that looks like a template. When an agent has no design system, it picks colors, spacing, and type sizes by feel — and "by feel" across thirty sessions is thirty slightly different blues.
A design token manifest fixes that by giving the agent named values to reach for instead of raw ones: bg-card instead of #fafafa, a spacing scale instead of padding: 14px, a type scale instead of font-size: 15px. Once those tokens exist and the conventions say "use tokens, never hex," the agent stops inventing values and starts composing from your system. This is the practical core of using your design system as the agent's context: the tokens are not just for humans reading the styleguide, they're the vocabulary the agent generates against. Ban hex literals in feature code, name the tokens, and consistency stops being something you police in review and starts being the default.
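To make the "tokens, never hex" rule concrete, here is a rough sketch of a manifest plus a check for stray hex literals. Everything in it is illustrative: the token names other than bg-card, all the values, and the `colorToken` and `findHexLiterals` helpers are assumptions, not a real library. The regular expression is a heuristic and will also flag things like `#123` in a comment or a CSS id made only of hex letters.

```typescript
// Hypothetical token manifest. Hex values live here and nowhere else.
const tokens = {
  color: { "bg-card": "#fafafa", "text-primary": "#111827" },
  spacing: { sm: 8, md: 16, lg: 24 },
  fontSize: { body: 16, heading: 24 },
} as const;

// Feature code asks for a color by name instead of writing a literal.
function colorToken(name: keyof typeof tokens.color): string {
  return tokens.color[name];
}

interface HexFinding {
  line: number; // 1-based line number in the scanned source
  literal: string; // the matched hex literal, e.g. "#fafafa"
}

// Heuristic scan for 3, 4, 6, or 8 digit hex color literals in feature code.
function findHexLiterals(source: string): HexFinding[] {
  const hexPattern = /#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b/g;
  const findings: HexFinding[] = [];

  source.split("\n").forEach((text, index) => {
    const matches = text.match(hexPattern) ?? [];
    for (const literal of matches) {
      findings.push({ line: index + 1, literal });
    }
  });
  return findings;
}
```

Pointed at feature files (and not at the manifest itself), a non-empty result is a cheap signal that a raw value slipped past the convention.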
Extending your codebase for agent reads
Context files tell the agent the rules. The next layer is making the code itself legible — structured so the agent can find what already exists and extend it instead of starting over.
The component registry economy — registering components so agents find them
A component the agent can't discover is a component the agent will rebuild. The fix is a registry: a machine-readable index of what exists, where it lives, and how to install or import it. This is the pattern behind the component registry economy — a structured manifest that an agent (or a CLI) can read to answer "do we already have a date picker?" without grepping the whole tree and guessing.
The registry does two jobs. It makes existing components findable, which kills regeneration. And it makes new components installable in a known shape, so when the agent does need to add one, it lands in the right place with the right metadata instead of somewhere arbitrary. A registry plus the convention "check the registry before building" solves a large fraction of the reuse problem.
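As an illustration of what "machine-readable index" can mean, here is a small sketch of a registry and a lookup. The entry shape, the component names, the paths, and `findComponents` are all hypothetical; a real registry would follow whatever schema your tooling expects.

```typescript
// Hypothetical registry entry: what exists, where it lives, how to import it.
interface RegistryEntry {
  name: string;
  path: string; // file that defines the component
  importFrom: string; // module specifier screens should import from
  tags: string[]; // lowercase search keywords
}

// Illustrative entries only.
const componentRegistry: RegistryEntry[] = [
  { name: "Button", path: "kit/button.tsx", importFrom: "@/kit", tags: ["button", "cta", "action"] },
  { name: "DatePicker", path: "kit/date-picker.tsx", importFrom: "@/kit", tags: ["date", "calendar", "picker"] },
];

// Answers "do we already have a date picker?" by matching any query word
// against the component name or its tags (case-insensitive substring match).
function findComponents(entries: RegistryEntry[], query: string): RegistryEntry[] {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  return entries.filter((entry) => {
    const haystack = [entry.name.toLowerCase(), ...entry.tags];
    return words.some((word) => haystack.some((text) => text.indexOf(word) !== -1));
  });
}
```

An empty result is the agent's cue that building something new is justified; anything else is a component to import, not rebuild.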
How to extend a codebase so the agent never reinvents it
Beyond components, the broader skill is structuring the repo so extension is the path of least resistance. That means clear seams: a place where new routes go, a place where new data models go, a repository layer the agent fills in rather than bypasses. When the structure is obvious, the agent follows it; when it's ambiguous, the agent invents. We go deeper on extending a codebase so the agent reads it once and builds on top — the short version is that legible structure is itself context. The folder layout is a message to the agent about where things belong, and a consistent layout is a message it can actually act on.
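A layout convention is easier to keep consistent when something checks it. The sketch below reads the `app/<feature>/page.tsx` convention from the CLAUDE.md example literally and assumes feature names are single lowercase kebab-case segments; both the assumption and the function names are hypothetical, so loosen the pattern if your routes nest.

```typescript
// Assumed convention: app/<feature>/page.tsx, with <feature> as one
// lowercase kebab-case segment. Adjust the pattern for nested routes.
const routePattern = /^app\/[a-z0-9]+(?:-[a-z0-9]+)*\/page\.tsx$/;

function followsRouteConvention(path: string): boolean {
  return routePattern.test(path);
}

// Returns page files that exist but sit outside the agreed layout.
function findMisplacedRoutes(paths: string[]): string[] {
  return paths.filter(
    (path) => /(^|\/)page\.tsx$/.test(path) && !followsRouteConvention(path),
  );
}
```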
Dynamic workflows and context walls
Even with perfect context files, agents hit a hard limit: the context window. On a long or branching task, the relevant information stops fitting, and the agent starts losing the thread mid-job.
When context windows run out mid-task
The symptom is recognizable — the agent was tracking the task fine, then halfway through it forgets a constraint it followed ten minutes ago, or re-asks something already answered, or quietly drops a requirement. That's a dynamic workflow hitting the context wall: the window filled with intermediate state and pushed the original instructions out of effective range. The defense is to keep the durable rules small and at the top (the CLAUDE.md-as-router discipline again) so they survive compaction, and to break large tasks into chunks that each fit comfortably.
Claude Code auto mode and the context trap
Autonomous, multi-step modes make this sharper. When the agent runs a long chain on its own, context fills faster and the failure is less visible — you're not watching every step, so by the time you notice, it's already drifted. The Claude Code auto mode context trap is exactly this: more autonomy means more accumulated state, which means the original constraints have to be more durable to survive the run. Auto mode is powerful, but it raises the bar on how well-anchored your context has to be — which is an argument for tighter context files, not looser ones.
Parallel agents and what Cursor 3 changed
Running several agents at once changes the geometry again. With Cursor 3's parallel agents, the constraint moves from "one agent's context window" to "do all of them share the same rules?" If every agent reads the same .cursorrules and design tokens, parallelism multiplies throughput. If they don't, parallelism multiplies divergence — three agents each making locally reasonable, mutually incompatible choices. The shared context files are what make parallel agents add up instead of cancel out.
Multi-tool agentic setups — when to use more than one agent
You don't have to standardize on one tool. Plenty of real workflows route different jobs to different agents — one for planning, one for implementation, one wired into a knowledge base. The trick is making the handoffs clean.
Notion + Claude Code + Codex in a single workflow
A working multi-tool setup might keep planning and specs in one surface, drive implementation with a filesystem agent, and use a second agent for parallel or specialized work — a pattern we've broken down in running Notion, Claude Code, and Codex together. It works when there's a shared source of truth (the same spec, the same conventions) and breaks when each tool carries its own private version of the rules. The same principle as parallel agents: the context has to be shared, or the tools pull in different directions.
The tool pipeline security wall and how to reason about it
Wiring agents into tools and data also wires in a risk surface. An agent that can read your repo, hit your APIs, and run commands is an agent whose inputs need to be trusted. The agent tool-pipeline security wall is worth understanding before you connect an agent to anything with side effects: know what each tool in the chain can touch, and don't grant the agent reach it doesn't need for the task. This is workflow design too — least privilege is a context decision as much as a security one.
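Least privilege can be written down as data rather than left as an intention. The following is a minimal sketch with made-up task and tool names (`review`, `implement-feature`, `call_payments_api`, and so on are assumptions, not any agent's real permission model): each task type lists the tools it may use, and anything not listed is denied.

```typescript
// Hypothetical tool names; substitute whatever your pipeline actually exposes.
type Tool = "read_repo" | "write_repo" | "run_tests" | "call_payments_api" | "deploy";

// Each task type gets only the tools it needs for the job.
const taskPermissions: Record<string, readonly Tool[]> = {
  review: ["read_repo"],
  "implement-feature": ["read_repo", "write_repo", "run_tests"],
};

function isToolAllowed(task: string, tool: Tool): boolean {
  // Deny by default: an unknown task gets no tools at all.
  const allowed = taskPermissions[task] ?? [];
  return allowed.indexOf(tool) !== -1;
}
```

The useful property is the default: a task nobody thought about gets no reach, instead of inheriting everything.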
The production hardening layer — what the workflow needs beyond code generation
Generating code is the first half. Shipping it to production, where it survives real users, real payments, and real failure modes, is the second half, and it's the part of the workflow where agents help most when you give them a repeatable shape to follow.
Error remediation as a repeatable workflow step
Errors are not a detour from the workflow; they're a stage in it. The agents that ship treat a failing build or a runtime error as a structured loop: read the error, locate the cause, fix, verify the fix actually ran. That discipline is the difference between systematic error remediation and the thrash where the agent claims a fix that never landed. Pairing the deep loop with a fast agent-driven error-fix pattern for the small stuff keeps the workflow moving; the key in both is "verify it ran," not "assume it's fixed."
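The read, fix, verify loop can be sketched as a small driver. This is an illustration under assumptions, not how any particular agent implements it: `runCheck` stands in for whatever produces the error (a build, a test run), `applyFix` stands in for the agent's change, and both are supplied by the caller. The point it demonstrates is that success is only ever reported from a fresh check, never from the fix step itself.

```typescript
interface CheckResult {
  ok: boolean;
  output: string; // error text when ok is false
}

interface RemediationResult {
  fixed: boolean;
  attempts: number; // number of fixes applied
  lastOutput: string;
}

function remediate(
  runCheck: () => CheckResult,
  applyFix: (errorOutput: string) => void,
  maxAttempts = 3,
): RemediationResult {
  let result = runCheck(); // read the error
  let attempts = 0;

  while (!result.ok && attempts < maxAttempts) {
    applyFix(result.output); // locate the cause and change something
    attempts += 1;
    result = runCheck(); // verify by re-running; never assume the fix landed
  }

  return { fixed: result.ok, attempts, lastOutput: result.output };
}
```

Capping attempts matters as much as the verify step: a loop that cannot give up is just thrash with better bookkeeping.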
Idempotency, webhooks, and infrastructure that survives production
Some production requirements an agent will not infer unless you tell it. Idempotency is the canonical example: a webhook that fires twice should not charge a customer twice, and an agent writing a payment handler from a bare prompt usually won't build that in. The lesson of how webhook idempotency saved a production system is that these hardening requirements belong in your context files as locked rules, so the agent applies them by default instead of learning them the expensive way. "Every external event handler must be idempotent" is one line in CLAUDE.md that prevents a class of outages.
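For readers who want to see what the rule asks for, here is a minimal sketch of an idempotent event handler. The event shape and `createIdempotentHandler` are hypothetical, and the in-memory `Set` is a stand-in chosen to keep the example self-contained: a production system would need a durable store with an atomic uniqueness check, since an in-memory set is lost on restart and does not protect against two deliveries processed concurrently.

```typescript
// Hypothetical event shape; amounts are integer cents, never floats.
interface WebhookEvent {
  id: string; // unique delivery ID assigned by the sender
  type: string;
  amountCents: number;
}

type HandleOutcome = "processed" | "duplicate";

function createIdempotentHandler(
  applyEvent: (event: WebhookEvent) => void,
  seenEventIds: Set<string> = new Set<string>(),
) {
  return function handle(event: WebhookEvent): HandleOutcome {
    // A repeat delivery of the same event ID is acknowledged but not re-applied.
    if (seenEventIds.has(event.id)) return "duplicate";

    applyEvent(event);
    // Record the ID only after applyEvent returns, so a delivery whose
    // processing threw can still be retried later.
    seenEventIds.add(event.id);
    return "processed";
  };
}
```

Used with a charge handler, the second delivery of the same event returns "duplicate" and the customer is charged once.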
OTA updates as part of the mobile production workflow
For mobile, shipping doesn't end at the store submission — it includes how you push fixes after launch. Over-the-air updates let you ship a JS-layer fix without a full app-store round trip, and wiring that into the workflow from the start is what makes a mobile app maintainable in production. OTA updates as part of the production workflow is the kind of infrastructure decision that's cheap to set up early and painful to retrofit — so it goes in the workflow plan, not the someday list.
The workflow template — what a pre-wired codebase includes
Step back and the system is small: a handful of files, a legible structure, a few locked rules. But assembling it from scratch for every new project is real work — and most of it is the same work every time.
CLAUDE.md + .cursorrules + design tokens + 20+ prompts
A complete workflow scaffold is four things. A CLAUDE.md that routes: under 200 lines, locked decisions at the top. A .cursorrules that mirrors it for Cursor users. A design-token manifest the agent generates against instead of guessing hex values. And a prompt library: 20+ tested prompts for the operations you'll actually run (add an entity, add a screen, wire a payment flow, add an auth step). The prompt library matters more than it looks: it encodes how to ask this codebase for a change, so you're not re-deriving the right prompt every time. Together these four are the workflow, made portable.
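A prompt library entry is essentially a tested template with a few blanks. As a rough sketch, assuming a `{{name}}` placeholder syntax and a hypothetical `renderPrompt` helper (neither comes from any specific tool), filling one in might look like this. Throwing on a missing variable is deliberate: a prompt sent with an unfilled blank is worse than no prompt.

```typescript
// Fills {{name}} placeholders in a prompt template.
// Throws if the template references a variable that was not provided.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`Missing prompt variable: ${key}`);
    }
    return value;
  });
}

// Illustrative template for the "add an entity" operation.
const addEntityPrompt =
  "Add an entity named {{entity}}. Check the registry before building, " +
  "use design tokens only, and put data access in {{repositoryDir}}.";
```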
Why this is the diff between an agent that thrashes and one that ships
None of these files make the agent smarter. They make the agent informed, and an informed agent on an average model beats a blind agent on a great one, because most of what goes wrong in an agentic session is missing context, not missing capability. This is also the deeper reason a filesystem agent with a wired-up repo outruns a sandbox tool that can't see your conventions at all: in the comparison of sandboxed versus filesystem agents, the wired-up repo is the whole advantage. The context files are the moat. They're what the model can't bring and the sandbox can't reach.
Building vs buying the workflow scaffolding
You can build all of this yourself, and for a codebase you'll live in for years, you should understand every piece of it — which is what this guide is for. The other path is to start from a codebase where the scaffold is already wired: the CLAUDE.md, the .cursorrules, the tokens, and the prompts shipped with the code, tuned to the stack. That's a build-versus-buy call that depends on your timeline and which tool you've committed to, and it's worth weighing against an honest comparison of what each AI coding tool can actually do before you decide. Either way, the lesson holds: the workflow is the product, and the context files are the workflow.
Where OTF fits
OTF kits ship the workflow pre-wired. Every kit comes with a CLAUDE.md tuned to its architecture, a .cursorrules that mirrors it, a design-token system the agent generates against instead of guessing, and 20+ tested prompts in ai/prompts/ for the operations you'll actually run. You buy a codebase your agent can read in one pass — locked decisions at the top, conventions enforced, the reuse-not-regenerate path made the obvious one — so your sessions stay short and your output stays consistent.
The point isn't the kit. The point is the workflow underneath it: the same component on web and mobile, the context files that keep your agent on rails, and a structure it extends instead of reinvents. Build that yourself with this guide, or start from a kit that has it wired — but ship with the agent you already use, on a codebase it can actually read.