How AI coding tools boost writing but face hurdles in shipping software

By Dave
7 min read
How AI coding tools impact productivity: from writing code to shipping software

Generative AI is writing a staggering share of the world’s software, yet total software output, measured in code shipped and real products delivered, has barely budged. The paradox is not just academic: as the tools race ahead, their aggregate productivity effects have become a top concern for engineering leaders, policymakers, and hands-on developers. Data from over 100,000 GitHub users, tracked across three generations of AI tools, now shows where the productivity gains land and where they stall. The takeaway is both promising and sobering: AI makes individual coding tasks much faster, but bottlenecks in review, integration, and release mean the leap from raw code written to working software remains stubbornly human.

What are the productivity effects of AI coding tools?

AI coding tools enable dramatic gains in code-writing activity, but the translation to end-to-end productivity is smaller than the headline numbers suggest. Those numbers are real: experimental studies cited in the CEPR 2024 report find 15-50% speed-ups in software development tasks. That is a meaningful saving on any single assignment, though it shortens a multi-day task rather than collapsing it into hours. On the ground, tool-level adoption across 100,000+ GitHub developers shows that every generation of AI tool, from code-completion assistants to code-gen copilots, measurably boosts coding activity. Push frequency, lines changed, and pull request openings all climb where AI is used.

But there’s a distinction the data makes painfully clear: writing code is only the first mile. The aggregate productivity gains of generative AI in software development are much more muted when measured by code reviewed, merged, or shipped to production. Most of the acceleration is isolated at the code-writing stage. Final output — the code that ships, the features that reach users — increases only modestly.

The CEPR (2024) study is precise: AI coding tool usage correlates with higher activity, but aggregate software output remains restrained by bottlenecks further down the pipeline. The jump in code written outpaces the jump in code shipped. The productivity effects of AI coding tools are real, but they aren’t (yet) a panacea for software throughput.
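A rough way to see why a large task-level speed-up shrinks at the pipeline level is an Amdahl-style estimate, sketched below. The function name, the 30% writing share, and the reading of "50% speed-up" as a 1.5x rate are illustrative assumptions, not figures from the CEPR report.

```shell
#!/usr/bin/env bash
# Amdahl-style estimate: only the code-writing share of cycle time gets faster.
# Usage: end_to_end_speedup WRITING_SHARE WRITING_SPEEDUP
#   WRITING_SHARE   fraction of total cycle time spent writing code (0..1), assumed
#   WRITING_SPEEDUP factor by which writing gets faster (1.5 = 50% faster)
end_to_end_speedup() {
  awk -v share="$1" -v s="$2" 'BEGIN {
    new_time = (1 - share) + share / s   # old total cycle time is normalized to 1
    printf "%.2f\n", 1 / new_time        # overall speed-up factor
  }'
}

# Hypothetical team: writing is 30% of cycle time and becomes 1.5x faster.
end_to_end_speedup 0.30 1.5   # prints 1.11
```

Under these assumed inputs, a 50% faster writing stage yields roughly an 11% faster cycle overall, because the other 70% of the time is untouched.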

How do successive generations of AI tools differ in productivity impact?

Each generation of AI coding tool drives bigger code-writing spikes — but the pattern flattens further down the stack. The CEPR 2024 dataset breaks out three distinct generations:

  • First: simple code completions (inline suggestions, autocomplete).
  • Second: generative code snippets, templates, or copilot-style agents.
  • Third: integrated AI systems that handle larger chunks, refactoring suggestions, or code transformations.

With every step, the data shows larger immediate productivity gains in code written: adoption of third-generation tools often coincides with surges in project activity, measured in more PRs, more lines changed, and faster iteration cycles. Second-generation tools deliver a notably larger gain than first-generation ones, and the leap to the third is larger still.

But, as the authors show, diminishing returns set in hard as you move from writing code to integrating, testing, and shipping it. Why? The constraint is not just the tool’s ability to generate code, but also everything downstream that makes software work at scale. Integration conflicts, human review, test coverage, and deployment delays become proportionally more salient as the raw code-writing bottleneck eases.

Using data from 100,000+ developers, the study finds the “activity gains” (code written, PRs opened) compound, but the “final output” (code merged, software released) doesn’t follow at the same multiple. Task-level performance shoots up with each generation, but aggregate software productivity does not. Put simply: AI is an accelerant for writing, not for shipping.

Why doesn’t more code writing equal more software output?

Human bottlenecks at critical stages mean that more code written by AI does not equal more shipped, working software. The core obstacle is the “O-ring” or bottleneck effect (Kremer 1993, adapted to AI by Jones 2026 and Aghion et al. 2019): real-world production involves multiple interdependent stages. If AI speeds up only some, the slowest — often review, integration, or approval — sets the ceiling for throughput.
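The bottleneck effect can be sketched in a few lines: if each stage has a weekly capacity, the pipeline ships at the rate of its slowest stage. The function name and all capacity numbers below are hypothetical and only illustrate the mechanism.

```shell
#!/usr/bin/env bash
# Pipeline throughput is capped by the slowest stage.
# Usage: pipeline_throughput RATE [RATE...]
#   Each RATE is an integer number of changes a stage can handle per week.
pipeline_throughput() {
  [ "$#" -gt 0 ] || return 1
  local min=$1 rate
  for rate in "$@"; do
    if (( rate < min )); then
      min=$rate
    fi
  done
  echo "$min"
}

# Hypothetical capacities, in order: write, review, integrate, release.
pipeline_throughput 10 12 15 20   # before AI: writing is the limit, prints 10
pipeline_throughput 30 12 15 20   # writing tripled: review is the limit, prints 12
```

In this made-up example, tripling writing capacity lifts shipped output from 10 to 12 changes per week, because review becomes the binding stage.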

The CEPR 2024 findings are blunt: code review, testing, integration, and release are still mostly human, with tools for these stages lagging behind the pace of code-writing assistants. Even as automated test suggestion and static analysis improve, the challenges of making code safe, maintainable, and aligned with product needs are far from solved by LLMs. Each new PR written by AI creates more review load and more integration complexity. More suggestions from Copilot, or its successors, mean more human sorting, analysis, and triage.

Scarce consumer attention adds a second-order constraint — not every feature or fix, no matter how fast it’s coded, is valuable to end users. Output incentives remain bound by human judgment, business strategy, and product-market fit, none of which can be brute-forced by generating more code.

This is the Solow Paradox, version 2.0. Nearly forty years after Solow noted, "You can see the computer age everywhere but in the productivity statistics," the same gap is visible with AI in software development. Task-level gains, documented in studies like Brynjolfsson et al. (2025), are real; aggregate output is gated by the slowest, most human-dependent links.

How can developers use AI coding tools to improve productivity today?

The winners use AI coding tools not just for writing, but as part of a streamlined flow from pull request to production. The research points one way: the largest productivity gains come from combining AI-generated code with solid automation and human-in-the-loop practices downstream.

First, treat AI as a co-pilot, not an autopilot. Use code generation and completion aggressively for rapid prototyping, spike solutions, and low-risk boilerplate. But immediately pipe PRs through automated linting, static analysis, and test suites: trust, but verify. Configure CI/CD to flag integration issues and regressions before they reach review. For code review, pair AI assistance with rules-based reviewers or selective human gating, so that the most complex or risky changes always get flagged.
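One possible shape for selective human gating is a small rule that looks at how large a change is and which paths it touches. The sketch below reads `git diff --numstat` output; the function name, the 400-line threshold, the path patterns, and the tier labels are all hypothetical choices a team would tune for itself.

```shell
#!/usr/bin/env bash
# Route a change to a review tier based on its size and the paths it touches.
# Reads `git diff --numstat` output (added<TAB>deleted<TAB>path) on stdin.
# Usage: review_tier [MAX_LINES]   (default threshold of 400 is an assumption)
review_tier() {
  awk -F'\t' -v max_lines="${1:-400}" '
    $1 != "-" { total += $1 + $2 }                          # binary files report "-" counts
    $3 ~ /^(migrations|auth|billing)\// { sensitive = 1 }   # hypothetical risky paths
    END {
      if (sensitive || total + 0 > max_lines + 0) print "needs-human-review"
      else print "standard-review"
    }'
}

# Example with a made-up small change outside the sensitive paths:
printf '10\t2\tsrc/app.ts\n' | review_tier   # prints standard-review
```

In a repository this could be fed from something like `git diff --numstat main...HEAD | review_tier`, where `main` is assumed to be the base branch.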

Second, invest in automated release and deployment pipelines. Let AI take on tedious merge chores (e.g., trivial conflict resolution, rebase suggestions), but keep release decisions closely guarded. Where possible, use generative AI to generate or update tests, but require human signoff for final merges.

Examples from industry reports — and hard lessons from those 100,000+ developers — show that teams that pair rapid AI-supported coding with disciplined review, test, and deploy automation ship more reliably. The pattern is always the same: most of the value leaks if raw code generation isn’t bolted to high-coverage, fast-feedback automation.

A practical flow looks like:

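# Step 1: Write code with an AI assistant (e.g., Copilot) on a feature branch
# ("feature-x" is a placeholder branch name)
git switch -c feature-x
git add .
git commit -m "AI: implement feature X"
git push -u origin feature-x

# Step 2: Open a PR; CI runs static analysis and tests on it
gh pr create --title "Add feature X" --body "AI-generated, review needed"

# Step 3: Watch the checks until they finish
gh pr checks --watch

# Step 4: Queue the merge; it completes once required checks and reviews pass
# (auto-merge must be enabled on the repository; deploys can trigger on merge)
gh pr merge --auto --rebase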

Treat the review, test, and deploy steps as inviolable. The gains compound only when the bottlenecks are addressed.

What do productivity forecasts say about AI’s impact on software development?

The industry’s forecasts for aggregate productivity effects of AI coding tools are wildly divergent. Some (Acemoglu 2025) project percentage-point boosts in national or global productivity, betting that compounding software efficiency will translate to more products, faster. Others (Jones 2026; Filippucci et al. 2024) find that, so far, the net effects are modest and often lost in the statistical noise.

This divergence is not just academic. At the firm level, the CEPR 2024 report finds that real-world companies see only modest gains in delivered software per developer, even as code-writing activity explodes. Why the gap? All the evidence points to the bottleneck hypothesis: if AI accelerates only writing, but review and deployment remain human-driven, growth is bounded. Policymakers and developers betting on an AI-driven software boom should temper their expectations, plan for incremental improvement, and measure outcomes across the whole lifecycle, not just code written.
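Measuring across the lifecycle can start with something as simple as comparing pull requests opened against pull requests merged. The helper below is a minimal sketch; its name and the sample counts are hypothetical, and a merge ratio is only one possible proxy for "shipped".

```shell
#!/usr/bin/env bash
# Share of opened pull requests that were merged, as a whole-number percentage.
# Usage: ship_ratio OPENED MERGED   (integer counts; result is truncated)
ship_ratio() {
  local opened=$1 merged=$2
  if (( opened == 0 )); then
    echo "n/a"
    return 0
  fi
  echo "$(( merged * 100 / opened ))%"
}

# Hypothetical quarter-over-quarter counts: writing activity rises, merges barely move.
ship_ratio 120 78   # prints 65%
ship_ratio 200 84   # prints 42%

# With the GitHub CLI, the counts could come from something like the following
# (capped by --limit, so treat it as an approximation):
#   gh pr list --state all    --limit 500 --json number --jq 'length'
#   gh pr list --state merged --limit 500 --json number --jq 'length'
```

A falling ratio alongside rising PR volume is the pattern the bottleneck hypothesis predicts: more code written, about the same amount shipped.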

For now, forecasts depend critically on which part of the pipeline gets AI-enhanced next. If review, test, and release automation can meaningfully catch up, expectations may reset sharply higher.

Shipping code is harder than writing it — that won’t change soon

AI coding tools have slashed the effort needed to write code and close simple tickets — the productivity effects of AI coding tools are real at the task level. But shipping software is still a marathon of review, integration, and human judgment. The gains are gated by lasting bottlenecks, not just by model improvements. Until more of the pipeline is automated, expect spectacular demos but incremental throughput. The teams that win will be the ones pairing AI writing with solid review and deploy automation — and measuring themselves by code shipped, not code written.
