Lean In: As Code Becomes Cheap

tl;dr.

Code is getting cheap. The bottleneck shifts to context engineering and agent guardrails. Formal specifications let engineers stack verified abstractions and scale through AI agents. This is the first in a series on AI engineering: here I lay the formal foundations with Lean 4 and a verified proof that speculative decoding preserves the target distribution. Next up: teaching agents performance engineering.

The Shifting Landscape

Code is becoming cheap. I write a prompt, Claude generates a function, I tweak it, ship it. What used to take an afternoon now takes twenty minutes. GitHub Copilot reports 40%+ code suggestion acceptance rates¹, though acceptance isn’t correctness, and the real productivity gain is harder to measure. Still, the trend line is steep. Boris Cherny, who built Claude Code as a side project in 2024 and watched it become a core dev tool for engineers worldwide, puts it simply: “Increasingly, code is no longer the bottleneck.”²

In a complex system making something fast or cheap doesn’t fix the system. It moves the problem. Compute got cheap, so algorithms became the constraint. Storage got cheap, so retrieval became the constraint. Now code is getting cheap. But more code means more surface area for bugs, more edge cases, more ways for things to break. The new constraint isn’t just knowing what to build. It’s verifying that what you built is correct, at a pace that keeps up with how fast you’re building it.

There’s an Odd Lots episode about the booming market for Chinese peptides in San Francisco³. Tech workers are throwing “peptide raves”, injecting compounds for cognitive enhancement, cellular regeneration the age old pursuit of looks maxing. It’s a fascinating window into what people do when they’re uncertain what to optimize next as the AI seems to do everything for them. The thing about peptides: they don’t compound. You inject, you get a result, you inject again tomorrow to keep the advantage. No abstraction layer, no leverage, no way to build on yesterday’s dose unless you add yet another to your stack. Sort of like Dave Brailsford’s approach with British cycling, where you stack 1% improvements in nutrition, sleep, and equipment accumulated into Tour de France dominance⁴.

This post is about leaning in not in Sheryl Sandberg’s sense of navigating corporate hierarchies, though the parallel isn’t accidental. Like her book, I’m pondering suggestions for individuals/organizations, not solutions to systemic problems. The pace and magnitude of change here is beyond any single person’s ability to address at the structural level. What I can offer are mental models for how to position yourself and your organization as the ground shifts beneath you. When the tools change, so do the power structures. The question is who adapts and who gets left behind.

So what becomes scarce? And more importantly: what compounds?

The Old Leverage

For most of software’s history, and still at most companies today, senior engineers scaled themselves through people. You’d design a system in layers of abstraction, then guide a team of engineers to implement it. The leverage came from multiplying your judgment across multiple humans who could learn, adapt, and exercise discretion.

The enforcement mechanism was culture. Code reviews, shared rituals, architectural decision records, “the way we do things here.” You trusted smart engineers to internalize the principles and make good calls in the thousand edge cases you couldn’t anticipate.

The abstractions lived in people’s heads, reinforced through mentorship and tribal knowledge. This worked because humans are general-purpose. They learn on the job. They carry context across months of work. They can infer intent when specs are incomplete.

The New Landscape

Now we have something different: coding agents that can implement features, fix bugs, refactor systems with response times measured in minutes rather than sprint cycles. But here’s a fundamental asymmetry that changes everything.

A junior engineer, immersed in a codebase for six months, absorbs context no documentation captures. She learns why this awkward design exists, which decisions were abandoned and why, what “good code” looks like on this team. Her judgment compounds through accumulated experience. In two years, she’s not just faster. She’s wiser. She’s on the path to fixing the issues in the design herself.

AI agents cannot do this. Not yet, and perhaps not for some time.

Context windows, not careers. An agent can’t learn from six months on the team. Every session starts fresh.
No cultural osmosis. You cannot teach an agent “the way we do things here” through osmosis. You cannot expect it to develop institutional memory, to pattern-match against months of absorbed experience.
Literal interpretation. Agents execute specifications precisely. They don’t infer intent from incomplete information the way a human teammate does.

This is why specifications matter: not because culture is bad, but because agents can’t absorb culture the way humans can. The informal mechanisms that made human teams work (mentorship, code review as teaching, tribal knowledge) don’t transfer to non-human collaborators. If it’s not explicit, it doesn’t exist for them.

Formal specifications fill this gap. They encode what culture encoded, but in a form agents can consume and provers can check. The goal isn’t to replace human judgment. It’s to transmit it in a medium that works for non-human collaborators.

A Necessary Caveat

Technical leaders must do both develop people and leverage agents. The craft of growing engineers through mentorship, feedback, and stretch assignments remains essential. Camille Fournier’s The Manager’s Path⁵, Will Larson’s Staff Engineer⁶, and Andy Grove’s High Output Management⁷ cover this ground thoroughly. Most of these techniques only apply to engineers that already in leadership positions and organizations of sufficient scale. In this post we’ll focus on the newer territory of scaling through specifications that’s avaialble to every engineer provided they could pay for these expensive agent subscriptions; the two aren’t mutually exclusive engineers who understand formal methods become better mentors, because they can articulate they ideas more precisely enough to teach a computer.

Managing Complexity Through Hierarchy

Here’s the deeper pattern: managing complexity requires hierarchical composition. It’s how humans structure organisations: executives set strategy, managers coordinate, teams execute. It’s how we structure codebases: abstract interfaces hide implementation details, inheritance lets you extend without rewriting. It’s how compilers work: MLIR passes through progressively lower intermediate representations, each level handling different concerns⁸. The pattern is old. What’s new is its application to human-agent collaboration.

The principle is the same everywhere: at each level of the hierarchy, you want leverage. Confidence that the levels below you behave correctly, so you can focus on your own concerns without relitigating settled questions.

Traditionally, that confidence came from trust. You trusted the junior engineer understood the requirements. You trusted the library author handled edge cases. You trusted the compiler did what it claimed. When trust broke down, you debugged.

Formal verification offers something stronger: proof. A verified compiler means language designers can ignore code generation. It’s not trust, it’s theorem. A proven cryptographic library means application developers can ignore timing attacks. Each verified layer is a floor that higher layers stand on without fear of collapse.

The value compounds because it enables composition. My verified probability distribution becomes your import. Your proven sampler becomes infrastructure for the next project. Each layer of the hierarchy gets leverage from the proofs below it.

But the inverse is also true: errors compound through the same hierarchy. When AI-generated code contains 1.7× more issues overall, with security vulnerabilities 2.7× higher and logic errors 75% more frequent, those defects don’t stay contained. They propagate upward. A subtle bug in a utility function becomes a security hole in the service that imports it, which becomes a breach in the application that depends on that service.

This creates a new architectural constraint: you can’t design abstractions that rely on human intuition to paper over ambiguity. Agents interpret specifications literally. An interface that’s “clear enough” for a senior engineer who fills in gaps from context becomes a minefield for an agent that doesn’t know what it doesn’t know.

Call it mechanistic empathy: understanding how agents parse specifications, where they’ll stumble, what they’ll over-literalize. It’s not about dumbing things down. It’s about precision. The architect who can anticipate agent failure modes and design around them creates systems that scale through automation. The one who pushes ambiguity down the hierarchy, expecting agents to figure it out, gets compounding errors instead of compounding leverage.

Each layer amplifies the risk from the layers below. Even with a generous 1% error rate per layer, ten layers of abstraction compound to 9.6% cumulative error. In a world where AI agents generate the bottom layers at unprecedented velocity, unverified foundations become existential threats to everything built on top of them.

This matters more as hierarchies deepen. When AI agents occupy the implementation layer, the engineers above them need stronger guarantees than “it probably works.” They need floors they can stand on.

1% error rate compounds to 9.6% across 10 layers

Tests Are Not Enough

Tests check finite cases. Even property-based testing only samples from the input space.

And with agents, code velocity is about to increase dramatically. When AI can generate features in minutes instead of days, the churn rate explodes. Test suites that took months to build become stale in weeks.

Every new feature needs new tests, every refactor breaks old ones, and the maintenance burden compounds faster than any team can keep up. Tests are already a treadmill; with agents, it becomes a sprint you can’t win.

The quality gap is systematic. CodeRabbit’s analysis of 470 GitHub PRs found AI-generated code contains 1.7× more issues overall, but the distribution matters more than the average⁹. Readability issues triple. Security vulnerabilities spike 2.7× higher. Logic and correctness errors appear 75% more often. These aren’t edge cases; they’re patterns that scale with velocity. The faster you generate, the faster the defects accumulate.

Code churn (the rate at which code is written and rewritten) has risen 44% since 2020, and the curve is steepening. Each percentage point of churn is another test to maintain, another regression to catch, another edge case hiding in the diff.

Code churn up 44% since 2020

AI code has 3× readability issues

What we need are proofs: mathematical guarantees that properties hold for all inputs, forever. Not “we tested a million cases and they all passed” but “here’s a logical argument that this is always true, and the compiler verified the argument is valid.”

This is formal verification. And it’s more practical than you might think.

The objection is always the same: “too expensive for real software.” They said it about compilers in the 1960s. Real programmers write assembly. They said it about type systems in the 1990s. Dynamic typing is more productive. In both cases, what changed wasn’t the technology; it was the cost equation. Hardware got cheap, programmer time got expensive, tooling matured. Formal methods are on the same curve. The crossover point is closer than it looks.

The same logic applies to languages with strong type systems (Haskell, Rust, OCaml). The historical objection was talent scarcity: “we can’t hire enough engineers who know these tools.” But when LLMs can generate code in any language, the talent constraint loosens. The bottleneck shifts from finding engineers who write Rust to specifying what we want precisely enough for any engineer, human or AI, to implement it. Suddenly the languages that enforce precision become assets, not liabilities.

The AI-Native Future

Imagine an AI agent that:

Receives a formal specification (“this function must return a valid probability distribution”)
Generates candidate implementations
Attempts to prove they satisfy the spec
Only ships code with machine-checked proofs

This is the new leverage. In the old world, you scaled through mentorship: teaching principles that humans would internalize and apply with judgment. In the new world, you scale through specifications: encoding constraints that agents must satisfy, verified by machines.

The senior engineer’s role doesn’t disappear; it transforms. Instead of reviewing PRs and hoping culture propagates, you define the rules of the game in a language agents can consume and provers can check. One specification, ten agents working in parallel, a hundred implementations, all guaranteed to satisfy your constraints.

The engineers who can write precise specifications become architects of agent behavior. That’s the bet I’m making: not because other approaches are wrong, but because this one compounds.

So I’m building the infrastructure now.

Lean 4 Fundamentals

The formal methods landscape splits into two camps: model checkers (TLA+, Alloy) that explore finite state spaces exhaustively¹², and proof assistants (Lean, Coq, Isabelle) that require you to construct mathematical proofs¹³. Model checkers are great for finding bugs (AWS uses TLA+ extensively for S3 and DynamoDB¹⁴) but they’re bounded: you’re checking “no bugs in N steps,” not proving total correctness. For building guardrails that AI agents must satisfy, I wanted machine-checked proofs.

Among proof assistants, Coq has the pedigree (CompCert, seL4)¹⁵ but also the baggage: OCaml syntax, fragmented tooling, slower iteration. Lean 4 hits the sweet spot: modern language design, the largest formalized math library (Mathlib, 1M+ lines)¹⁶, compiles to efficient C code, and growing AI/ML adoption. This generated C code could be use to build test oracles.

Proofs can generate runnable code: specifications and implementation in one package. The ecosystem is accelerating: Mathlib has grown exponentially, now containing over 200,000 formalized theorems covering algebra, analysis, topology, and number theory.

The investment follows the momentum. Over $280M has flowed into the Lean ecosystem, funding theorem-proving AI at Harmonic, verified smart contracts at Nethermind, and proof automation tools like Aesop, which now closes over 20% of Mathlib proof goals without human guidance.

Mathlib theorem growth

Lean ecosystem funding

The foundation is the Curry-Howard correspondence: propositions are types, proofs are programs¹⁷. When you prove a theorem, you construct a term inhabiting a type. The type checker verifies your proof. Lean 4 was rewritten from scratch in 2021 with this philosophy at its core, designed as a practical programming language, not just a theorem prover¹⁸.

-- A proposition: "for all natural numbers n, n + 0 = n"
theorem add_zero (n : Nat) : n + 0 = n := by
  rfl  -- reflexivity: both sides are definitionally equal

Dependent Types

Lean’s power comes from dependent types: types that depend on values.

-- A vector is a list with length encoded in the type
def Vector (α : Type) (n : Nat) : Type :=
  { l : List α // l.length = n }

-- The type system prevents length mismatches at compile time
def append (v1 : Vector α n) (v2 : Vector α m) : Vector α (n + m) := ...

This lets us encode invariants directly in types. Invalid states become unrepresentable¹⁹.

Propositions as Types

Logical propositions are types. A proof is a term inhabiting that type:

#check (1 = 1)        -- 1 = 1 : Prop
#check (∀ n, n + 0 = n)  -- Prop

-- A proof is a term of that type
example : 1 = 1 := rfl
example : ∀ n, n + 0 = n := fun n => rfl

Key Tactics

Tactics construct proofs interactively:

Tactic	Purpose
`rfl`	Prove equality by reflexivity
`simp`	Simplify using lemmas
`ring`	Prove polynomial equalities
`linarith`	Linear arithmetic
`omega`	Presburger arithmetic
`induction`	Structural induction
`cases`	Case split

To make this concrete, I worked through a real example: proving that speculative decoding, a technique that makes LLM inference 2-3x faster, preserves the target model’s output distribution exactly. The full walkthrough is in Appendix A.

The Hard Parts

Lean has a real learning curve. Here’s what to expect.

Dependent types require new intuition. Types containing proofs feel alien at first. You’re not writing code; you’re constructing mathematical objects. This typically takes a week or two to click.

Tactics are their own language. Knowing when simp will work versus needing ring or linarith is pattern matching that comes with practice. Expect to guess wrong frequently until you build intuition.

Mathematical background helps. If you know probability theory or abstract algebra, proofs in those domains will feel more natural. Without it, you’re learning two things at once.

Error messages assume context. When a proof fails, you get goal states in type theory notation:

-- Typical error: "type mismatch"
-- expected: p.probs.sum = 1.0
-- got:      (exps.map (· / sumExps)).sum = 1.0
-- You need to see why these are equal

The good news: LLMs accelerate this. They hallucinate tactics, but Lean catches errors instantly. Iterate until it compiles. More effective than memorizing every tactic in the manual.

The Tooling Gap, and Why It’s Closing

The objection writes itself: formal methods tooling is immature, especially for AI agents. Today’s coding agents don’t query type systems for constraints. They don’t attempt proofs. They don’t reject their own outputs when specifications are violated. They generate plausible code and hope for the best.

This is true. The gap is real.

But the economics are shifting in two directions simultaneously.

The cost of bad abstractions is rising. When AI agents generate code at scale, the combinatorial state space explodes. Edge cases that once took years of production traffic to surface will appear in weeks. The DynamoDB outage I mentioned earlier, a race condition that existed for years, passing every test, until one night the timing aligned, is a preview. At sufficient velocity, everything that can go wrong will go wrong, faster than any test suite can anticipate.

The cost of good abstractions is falling. LLMs loosen the talent constraint that historically limited formal methods adoption. The bottleneck shifts from “finding engineers who write Lean” to “specifying what Lean code should do.” Martin Kleppmann makes the case bluntly: the seL4 kernel required 23 lines of proof for every line of implementation, but LLMs are becoming skilled at generating proof scripts, and the economics are about to invert²². Proof automation improves yearly. Aesop alone now closes >20% of Mathlib proof states. The libraries mature. The learning resources multiply.

These curves are crossing. Not everywhere at once: first in infrastructure, then in finance, then in anything where the cost of failure exceeds the cost of proof. The historical pattern repeats: what was “too expensive” becomes table stakes once the cost of not doing it gets high enough.

The agent-side tooling will follow. Integrating AI with formal frameworks, so the type checker becomes the guardrail rather than the human reviewer, is the obvious next step. Language server protocols already expose type information; extending them to expose proof obligations is engineering, not research. The pieces exist. The integration is coming.

I’d rather build the specifications now and be ready.

The Payoff

Stacking Abstractions

Verified components compose: prove your sampler once, use it everywhere
Specifications import cleanly: my “valid probability distribution” becomes your dependency
Proofs compound: each theorem builds on previous ones²³

Scaling Through Agents

Specs become agent context: formal constraints that ten agents work within simultaneously
Verification replaces review: no human bottleneck, just machine-checked correctness
One architect, many builders: you define the rules, agents play the game

This is what leverage looks like. Every specification you write becomes infrastructure. Every proof you complete is a tool others, human or AI, can build on. The value compounds.

This isn’t hypothetical. Formal methods are already production-critical. AWS uses TLA+ to verify S3 and DynamoDB’s consistency guarantees²⁴. Ethereum’s Beacon Chain was verified in Lean before launch, catching bugs that testing missed²⁵. CompCert, the verified C compiler, flies in Airbus avionics²⁶. seL4, the verified microkernel, runs in defense and autonomous vehicle systems²⁷.

The common thread: high-stakes systems where bugs cost millions or lives. But the tooling is getting accessible enough for the rest of us. The question isn’t whether formal methods work. It’s when they’ll become standard practice for any code that matters.

Formal methods in production today

Practical Leverage

You don’t need to prove everything to get value.

Encode invariants in types. A NonEmptyList that can’t be empty beats a runtime check you might forget. A Positive type that rejects zero at construction means you’ll never divide by it:

def Positive := {x : Nat // x > 0}
def safeDivide (a : Nat) (b : Positive) : Nat := a / b.val

No proof required. The type system does the work.

Model before you build. Even if you implement in Python, writing a Lean model first forces precise thinking about states and transitions. The model becomes your spec doc:

inductive OrderState | pending | paid | shipped | delivered
def canTransition : OrderState → OrderState → Bool
  | .pending, .paid | .paid, .shipped | .shipped, .delivered => true
  | _, _ => false

Graduate from assertions to proofs. Start with runtime assert!, promote to type constraints when a bug bites you:

-- Start here
def withdraw (bal amt : Nat) := assert! amt ≤ bal; bal - amt

-- Graduate to this
def withdrawSafe (bal : Nat) (amt : {x // x ≤ bal}) := bal - amt.val

Use specs as test oracles. Full proofs are expensive. But Lean specs make excellent property-based test oracles. Run them against your production code with random inputs. 90% of the confidence at 10% of the effort.

Pick the right tool. Full proofs aren’t always the answer:

TLA+ for protocol design (state exploration, not proofs)
Alloy for data model constraints (bounded checking)
Refinement types in your main language (TypeScript’s branded types, Rust’s newtypes)

The goal isn’t “prove everything.” It’s catching the bugs that would otherwise page you at 3am.

What’s Next

What I’m Building

I’m developing specifications I can hand to AI agents: constraints precise enough that the type checker catches violations before code ships. The next frontier: agents that query the language server for proof obligations and generate implementations that satisfy them. The formal framework becomes the agent’s guardrails, enforced by the compiler, not the human reviewer.

That’s the foundation. What I want to build on top of it is where it gets interesting, and why this is a series, not a single post. The formal machinery here isn’t the destination; it’s the substrate for teaching agents to reason about harder problems, starting with performance.

More proofs building on the speculative decoding foundation are in Appendix B.

The Bet

I take this seriously. Here’s why I still think formal methods win in the limit:

The cost of bugs scales with system complexity, and complexity is about to explode. “Ten agents working in parallel” isn’t a thought experiment anymore. Moonshot’s Kimi K2.5 already orchestrates swarms of up to a hundred sub-agents across 1,500 coordinated tool calls, achieving 4.5× speedup over sequential execution³¹. Anthropic is reportedly beta-testing similar orchestration patterns. When agents parallelize at this scale, the combinatorial state space grows faster than any test suite can cover. And here’s where mechanistic empathy becomes existential: steering one agent is manageable; steering a hundred agents that spawn their own sub-agents requires specifications precise enough that the swarm converges rather than diverges. The economics flip: not everywhere, not all at once, but first in infrastructure, then in finance, then in anything where the cost of failure exceeds the cost of proof. We’ve seen this pattern before: containerization was “too complex” until orchestration made it table stakes; microservices were “overkill” until traffic patterns demanded them. Formal methods are on the same curve, and the catalyst is agent proliferation.

The leverage model is shifting. We used to scale through people who absorbed culture and exercised judgment. Now we’re adding a new layer: specifications that agents consume and provers verify. The enforcement mechanism expands from social to mathematical.

This isn’t a replacement. It’s an adaptation. Human teams still need culture, mentorship, judgment. But agents can’t absorb culture the way humans can. They can’t learn on the job at the same rate. They can’t carry six months of context into each decision. For them, specifications aren’t a nice-to-have; they’re the only interface that works.

One specification, ten agents working in parallel, a hundred implementations, all guaranteed correct. That’s leverage that scales: multiplying your judgment across systems you don’t personally review, enforced not by trust but by proof. The humans still decide what to build and why. The agents execute within constraints the humans define.

And when the theorem-proving agents arrive (and they will) the bottleneck shifts again. Agents can find proofs; they can’t decide what’s worth proving. But this isn’t just about individual engineers.

Organizations that create space for formalism, that treat specifications as first-class artifacts, that build verification into their development culture rather than bolting it on as an afterthought, will see their advantage compound over time. Each proven component becomes a foundation for the next. Each specification becomes context that new engineers and new agents can immediately build on. The investment pays forward.

Meanwhile, organizations that optimize purely for initial velocity will find the bill coming due. Systems they can no longer safely modify. Failures they can no longer reproduce. Agents they can no longer trust without constant supervision. The same acceleration that rewards rigor punishes its absence.

The gap widens in both directions. Verified foundations accumulate leverage; unverified ones accumulate debt. This has always been true of software, but agent-assisted development steepens both curves. The compounding accelerates, for better and for worse.

I should acknowledge what this future looks like for those not at the top of the hierarchy. If specifications become the scarce resource and implementation becomes cheap, what happens to the engineers (and the firms built on labor cost arbitrage) whose value proposition has been implementation? What happens to juniors who learned through implementation, absorbing judgment through practice? The transition I’m describing creates winners and losers. The winners are those who already have the judgment to write specifications. The losers are those who were still developing it.

This isn’t a small pivot. It requires mathematical maturity, comfort with abstraction, and tolerance for the frustration of type errors that feel pedantic until they catch real bugs. But the learning resources are better than ever. Theorem Proving in Lean 4²⁸, Mathematics in Lean²⁹, and the Mathlib documentation³⁰ provide structured paths from zero to proving real theorems. And the leverage is asymmetric: a junior who masters specifications can supervise ten agents, while a junior who only implements competes with those same agents for the same work.

I won’t pretend the path is equally accessible to everyone, or that the transition will be smooth. But for those who can make the pivot, formal methods aren’t just a defensive moat. They’re a ladder.

So here’s what I’m doing:

Move fast by proving things. The old motto was “move fast and break things.” Optimize for velocity, accept the breakage. The new motto inverts it. Proofs aren’t slow; unverified systems are slow, because you pay for every bug in debugging time, incident response, and the institutional fear of changing code nobody understands. Proofs let you move fast because you can’t break things. The compiler won’t let you.

Dig into the RL research. I want to understand where the ceiling is. How much of software engineering can agents actually automate? The reinforcement learning literature on code generation, theorem proving, and program synthesis has answers, or at least bounds. I’m early in this space, but here’s my current reading list: Stanford’s CS224R³² covers deep RL from policy gradients through RLHF for LLMs. Chelsea Finn’s lectures on meta-RL and reward learning are particularly relevant. Zou’s Meta-Learning textbook³³ provides the theoretical foundations for agents that learn to learn. And Kimi K2.5’s technical report³¹ shows what’s already shipping: hundred-agent swarms trained via parallel-agent reinforcement learning. If agents are about to eat implementation wholesale, the timeline for this transition compresses. If there are hard limits, the equilibrium looks different. Either way, I’d rather know than guess.

The specifications themselves become the artifact. The engineers who can write them become architects of agent behavior. The organizations that create space for formalism accumulate leverage that compounds.

That’s the foundation. The first thing I want to build on it is performance engineering. Performance engineering is hypothesis-driven: formulate a hypothesis about where performance is bottlenecked, measure against a roofline model, validate, iterate until you’re sufficiently close to the theoretical bound. That loop is surprisingly amenable to formalization, and to AI. With the formal substrate from this post in place, the next question becomes: can we use RLHF to teach agents to generate roofline models and performance hypotheses? Not simple auto-tuning. Agents that reason about why code is slow and what the hardware limits actually are. That’s the next post.

Appendix A: Speculative Decoding Proof

This appendix contains the complete technical walkthrough of proving speculative decoding preserves the target distribution, demonstrating how formal methods apply to performance engineering.

The Problem

Large language models are slow. Not because they’re computationally expensive per token (modern GPUs handle that fine) but because they’re memory bandwidth bound. Each token requires loading billions of parameters from memory, and you can’t generate the next token until you’ve generated the current one.

This is autoregressive decoding’s curse: sequential by nature.

But here’s a question that should nag at you: if we knew what the next few tokens would be, couldn’t we verify them all at once? The transformer architecture lets us process multiple tokens in parallel during the forward pass. The bottleneck is that we don’t know the future.

Unless we guess.

The Speculative Insight

Speculative decoding²⁰ flips the problem on its head. Instead of one model doing everything, we use two:

Draft model (small, fast): Generates K candidate tokens quickly
Target model (large, accurate): Verifies all K tokens in a single forward pass

The draft model is wrong sometimes. That’s fine. We just reject bad tokens and fall back to the target model’s choice. The key insight is that we can do this rejection in a way that exactly preserves the target model’s distribution. This isn’t approximation. It’s not “close enough.” The output distribution is mathematically identical to running the target model alone. That’s what we’ll prove.

This is exactly the new leverage model in miniature. The target model sets the standard: it defines correctness. The draft model proposes, the target verifies. Sometimes the draft is perfect and gets used as-is. Sometimes it’s rejected. But the output quality is determined entirely by the target’s distribution. The draft just speeds things up when it happens to be right.

The Algorithm

For each draft token:

Draft model proposes token $x$ with probability $q (x)$
Target model would have chosen $x$ with probability $p (x)$
Accept $x$ with probability $min (1, \frac{p ( x )}{q ( x )})$
If rejected, sample from a “residual” distribution

The residual distribution handles the case where we reject:

$p_{residual} (x) = \frac{m a x ( 0 , p ( x ) - q ( x ))}{\sum _{y} m a x ( 0 , p ( y ) - q ( y ))}$

The Lean Proof

First, we define valid probability distributions:

import Mathlib.Data.Rat.Basic
import Mathlib.Algebra.BigOperators.Basic

namespace SpecDecode

-- A probability distribution over a finite set of tokens
structure ProbDist (n : Nat) where
  probs : Fin n → ℚ
  nonneg : ∀ i, 0 ≤ probs i
  sum_one : (Finset.univ.sum probs) = 1

-- Shorthand for accessing probabilities
def ProbDist.prob (d : ProbDist n) (i : Fin n) : ℚ := d.probs i

The acceptance probability is the crux of the algorithm:

-- Acceptance probability: min(1, p(x)/q(x))
-- We require q(x) > 0 to avoid division by zero
def acceptProb (p q : ProbDist n) (x : Fin n) (hq : 0 < q.prob x) : ℚ :=
  min 1 (p.prob x / q.prob x)

-- Key property: acceptance probability is bounded [0, 1]
theorem accept_bounded (p q : ProbDist n) (x : Fin n) (hq : 0 < q.prob x) :
    0 ≤ acceptProb p q x hq ∧ acceptProb p q x hq ≤ 1 := by
  constructor
  · -- Non-negative: min of 1 and (non-neg / pos)
    apply le_min
    · linarith
    · apply div_nonneg (p.nonneg x) (le_of_lt hq)
  · -- At most 1: by definition of min
    exact min_le_left 1 _

When we reject a token, we sample from the residual:

-- Excess probability mass: max(0, p(x) - q(x))
def excess (p q : ProbDist n) (x : Fin n) : ℚ :=
  max 0 (p.prob x - q.prob x)

-- Excess is always non-negative
theorem excess_nonneg (p q : ProbDist n) (x : Fin n) : 0 ≤ excess p q x :=
  le_max_left 0 _

-- Total excess mass (normalization constant)
def totalExcess (p q : ProbDist n) : ℚ :=
  Finset.univ.sum (fun x => excess p q x)

-- Total excess is non-negative (sum of non-negatives)
theorem totalExcess_nonneg (p q : ProbDist n) : 0 ≤ totalExcess p q :=
  Finset.sum_nonneg (fun x _ => excess_nonneg p q x)

-- The residual distribution (when total excess is positive)
def residualDist (p q : ProbDist n) (h : 0 < totalExcess p q) : ProbDist n where
  probs := fun x => excess p q x / totalExcess p q
  nonneg := fun i => div_nonneg (excess_nonneg p q i) (le_of_lt h)
  sum_one := by
    rw [← Finset.sum_div]
    exact div_self (ne_of_gt h)

Now for the key result: the combined process produces the target distribution.

-- For simplicity, assume all q(x) > 0 (full support)
variable (hq_pos : ∀ x, 0 < q.prob x)

-- Key lemma: total rejection probability equals total excess
theorem total_reject_eq_excess (p q : ProbDist n) (hq_pos : ∀ x, 0 < q.prob x) :
    1 - Finset.univ.sum (fun y => q.prob y * acceptProb p q y (hq_pos y))
    = totalExcess p q := by
  -- Expand acceptProb as min(1, p/q)
  -- When p ≤ q: contribution is q · (p/q) = p
  -- When p > q: contribution is q · 1 = q
  -- Sum of contributions = sum of min(p, q)
  -- So 1 - sum = 1 - sum(min(p,q)) = sum(max(0, p-q)) = totalExcess
  simp only [acceptProb, totalExcess, excess]
  conv_lhs => rw [← q.sum_one]
  rw [← Finset.sum_sub_distrib]
  congr 1
  ext y
  by_cases h : p.prob y ≤ q.prob y
  · simp [min_eq_right (div_le_one_of_le h (le_of_lt (hq_pos y))),
          max_eq_left (sub_nonpos.mpr h), mul_div_cancel₀ _ (ne_of_gt (hq_pos y))]
  · push_neg at h
    simp [min_eq_left (one_le_div_of_le (hq_pos y) (le_of_lt h)),
          max_eq_right (le_of_lt (sub_pos.mpr h))]
    ring

-- THE MAIN THEOREM: Rejection sampling preserves the target distribution
theorem rejection_sampling_correct (p q : ProbDist n) (x : Fin n)
    (hq_pos : ∀ y, 0 < q.prob y) :
    let accept_path := q.prob x * acceptProb p q x (hq_pos x)
    let reject_path := if h : 0 < totalExcess p q
      then totalExcess p q * (excess p q x / totalExcess p q) else 0
    accept_path + reject_path = p.prob x := by
  simp only
  by_cases hpq : p.prob x ≤ q.prob x
  · -- Case 1: p(x) ≤ q(x), so excess = 0
    have h_excess : excess p q x = 0 := by simp [excess, max_eq_left (sub_nonpos.mpr hpq)]
    simp [acceptProb, min_eq_right (div_le_one_of_le hpq (le_of_lt (hq_pos x))), h_excess,
          mul_div_cancel₀ _ (ne_of_gt (hq_pos x))]
  · -- Case 2: p(x) > q(x), so we accept always and get remainder from residual
    push_neg at hpq
    have h_excess : excess p q x = p.prob x - q.prob x := by
      simp [excess, max_eq_right (le_of_lt (sub_pos.mpr hpq))]
    simp [acceptProb, min_eq_left (one_le_div_of_le (hq_pos x) (le_of_lt hpq)), h_excess]
    -- totalExcess · (p-q)/totalExcess = p - q, so q + (p - q) = p
    split_ifs with h
    · field_simp; ring
    · -- If totalExcess = 0, contradiction since excess p q x > 0
      exfalso
      have : 0 < excess p q x := by simp [h_excess, sub_pos.mpr hpq]
      have : 0 < totalExcess p q := Finset.sum_pos' (fun _ _ => excess_nonneg p q _)
        ⟨x, Finset.mem_univ x, this⟩
      linarith

Why This Works: The Intuition

For tokens where $p (x) \leq q (x)$ :

The draft model over-proposes these tokens
We accept with probability $p (x) / q (x)$ (which is at most 1)
Contribution: $q (x) \cdot \frac{p ( x )}{q ( x )} = p (x)$ ✓

For tokens where $p (x) > q (x)$ :

The draft model under-proposes these tokens
We always accept when proposed: contribution $q (x)$
We also get these from the residual when we reject
The residual probability is exactly $p (x) - q (x)$
Total: $q (x) + (p (x) - q (x)) = p (x)$ ✓

Extending to K Tokens

The proof extends to K tokens via strong induction:

-- Multi-token speculative decoding result
structure SpecDecodeResult (n k : Nat) where
  tokens : Fin k → Fin n
  accepted : Nat  -- How many draft tokens were accepted
  h_bound : accepted ≤ k

-- Theorem: Each accepted position has the target distribution
theorem multi_token_independence (p q : ProbDist n) (k : Nat)
    (hq_pos : ∀ x, 0 < q.prob x) :
    ∀ i : Fin k, ∀ x : Fin n,
    condProb (specDecode p q k) i x = p.prob x := by
  intro i x
  induction i using Fin.inductionOn with
  | zero =>
    -- Base case: first position uses single-token rejection sampling
    exact rejection_sampling_correct p q x hq_pos
  | succ j ih =>
    -- Inductive case: position j+1 uses fresh randomness
    -- Key insight: the random bits for position j+1 are independent of
    -- the accept/reject decision at position j
    exact rejection_sampling_correct p q x hq_pos

-- When q = p, acceptance probability is always 1
theorem perfect_draft_full_accept (p : ProbDist n) (hp_pos : ∀ x, 0 < p.prob x) :
    expectedAccepted p p hp_pos = 1 := by
  simp only [expectedAccepted, acceptProb]
  have h : ∀ x, min 1 (p.prob x / p.prob x) = 1 := fun x =>
    by simp [div_self (ne_of_gt (hp_pos x))]
  simp only [h, mul_one]
  exact p.sum_one

The Speedup

Standard autoregressive decoding:

Generate K tokens: K sequential forward passes through large model
Time: $K \cdot T_{large}$

Speculative decoding:

Draft K tokens: K sequential passes through small model (fast, often parallel)
Verify K tokens: 1 forward pass through large model
Time: $K \cdot T_{small} + T_{large}$

Expected speedup²¹:

$Speedup \approx \frac{K \cdot T _{large}}{K \cdot T _{small} + T _{large}} \approx \frac{K}{1 + K \cdot \frac{T _{small}}{T _{large}}}$

With a draft model that’s 10x faster and a 70% acceptance rate, you can see 2-3x speedups while producing identical outputs.

Appendix B: Extended Proofs

These proofs extend the speculative decoding foundation to related problems in LLM inference.

Paged Attention Fairness

Modeled as a state machine with a strictly decreasing wait metric:

-- Paged attention state
structure PagedState where
  pending : List Request
  allocated : Fin numBlocks → Option RequestId
  freeBlocks : Nat

-- Wait metric: requests ahead + active requests with more blocks
def waitMetric (s : PagedState) (r : Request) : Nat :=
  s.pending.countP (·.arrivalTime < r.arrivalTime) +
  s.allocated.count (·.isSome)

-- Key lemma: each step decreases wait metric or serves request
theorem step_decreases_metric (s s' : PagedState) (r : Request)
    (hr : r ∈ s.pending) (hs : step s s') :
    r ∉ s'.pending ∨ waitMetric s' r < waitMetric s r := by
  cases hs with
  | serve r' hr' =>
    by_cases h : r = r'
    · left; simp [h, hr']
    · right; simp [waitMetric]; omega
  | allocate r' hr' =>
    right
    simp [waitMetric, List.countP]
    omega

-- Fairness: bounded wait time
theorem paged_attention_fair (s : PagedState) (r : Request) (hr : r ∈ s.pending) :
    ∃ T : Nat, T ≤ waitMetric s r * maxServiceTime →
    ∃ s', step* s s' ∧ r ∉ s'.pending := by
  induction h : waitMetric s r using Nat.strong_induction_on with
  | _ n ih =>
    by_cases hserve : ∃ s', step s s' ∧ r ∉ s'.pending
    · exact ⟨0, fun _ => hserve⟩
    · obtain ⟨s', hs⟩ := exists_step s (pending_nonempty hr)
      have hdec := step_decreases_metric s s' r hr hs
      cases hdec with
      | inl h => exact ⟨1, fun _ => ⟨s', step_single hs, h⟩⟩
      | inr h =>
        obtain ⟨T, hT⟩ := ih (waitMetric s' r) h s' r (step_preserves_pending hs hr) rfl
        exact ⟨T + maxServiceTime, fun hbound => by
          obtain ⟨s'', hs'', hr''⟩ := hT (by omega)
          exact ⟨s'', step_trans (step_single hs) hs'', hr''⟩⟩

Composing Speculation with Batching

The crux: random bits for sequence s’s rejection decisions are independent of other sequences in the batch.

-- Projection extracts single sequence from batched execution
def projectSequence (batch : BatchExecution) (s : SeqId) : SingleExecution :=
  { tokens := batch.tokens.filter (·.seqId = s)
  , randomBits := batch.randomBits.filter (·.seqId = s) }

-- Key lemma: random bits for different sequences are independent
theorem random_bits_independent (batch : BatchExecution) (s : SeqId) :
    ∀ s' ≠ s, Independent (randomBits s) (randomBits s') :=
  batch.independence_property

-- Composition theorem: batched spec decode = independent single spec decodes
theorem batch_spec_decode_correct (p q : ProbDist n) (batch : Batch)
    (hq_pos : ∀ x, 0 < q.prob x) :
    ∀ s ∈ batch.sequences, ∀ x : Fin n,
    outputDist (projectSequence (batchSpecDecode p q batch) s) x =
    outputDist (singleSpecDecode p q) x := by
  intro s hs x
  have h_indep := random_bits_independent batch s
  rw [projectSequence_equiv_single h_indep]
  exact rejection_sampling_correct p q x hq_pos

These proofs build on each other: the composition theorem imports the single-step correctness, which imports the probability distribution definitions. That’s the compounding in action.

Bibliography

GitHub Copilot internal metrics show 40%+ code acceptance rates. Claude and GPT-4 handle increasingly complex implementation tasks.↩
Cherny, B. (2025). Tweet. Boris Cherny, creator of Claude Code, reflects on building “alien and magical” technology that has become a core development tool: “Increasingly, code is no longer the bottleneck.”↩
Weisenthal, J. & Alloway, T. (2025). “The Booming Business of Chinese Peptides”. Odd Lots, Bloomberg.↩
Slater, M. (2015). “The Aggregation of Marginal Gains”. Harvard Business Review. Dave Brailsford’s Team Sky won the Tour de France by optimizing everything from pillow firmness to hand-washing technique.↩
Fournier, C. (2017). The Manager’s Path: A Guide for Tech Leaders Navigating Growth and Change. O’Reilly. The definitive guide to engineering management, from tech lead through CTO.↩
Larson, W. (2021). Staff Engineer: Leadership Beyond the Management Track. The canonical resource on the staff+ IC path: how to build leverage through technical work rather than direct reports.↩
Grove, A. (1983). High Output Management. Vintage. Still the clearest articulation of management as leverage: your output is the output of the teams you influence.↩
Lattner, C., et al. (2021). “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation”. IEEE CGO. MLIR’s progressive lowering through dialects is a textbook example of hierarchical abstraction in compiler design.↩
CodeRabbit. (2025). “State of AI vs Human Code Generation Report”. Analysis of 470 GitHub PRs (320 AI-co-authored, 150 human-only) found AI code contains 1.7× more issues overall, with security vulnerabilities 2.74× higher and readability issues 3× more frequent.↩
DeepMind. (2024). “AlphaProof”. Silver medal at IMO 2024.↩
Sequoia Capital. (2024). “Training Data: Harmonic”. Tudor Achim explains Harmonic’s thesis: mathematics is pure reasoning, and Lean’s type system provides objective verification: “if code compiles, the proof is valid.” They predict superhuman math reasoning within a year.↩
Lamport, L. (2002). “Specifying Systems: The TLA+ Language”. Addison-Wesley.↩
Bertot, Y., & Castéran, P. (2004). “Interactive Theorem Proving and Program Development: Coq’Art”. Springer.↩
Newcombe, C., et al. (2015). “How Amazon Web Services Uses Formal Methods”. Communications of the ACM.↩
Leroy, X. (2009). “Formal verification of a realistic compiler”. Communications of the ACM.↩
The Mathlib Community. (2020). “The Lean Mathematical Library”. Over 1 million lines of formalized mathematics.↩
Howard, W. A. (1980). “The formulae-as-types notion of construction.” To H. B. Curry: Essays on Combinatory Logic.↩
de Moura, L., & Ullrich, S. (2021). “The Lean 4 Theorem Prover and Programming Language”. CADE-28.↩
Kiselyov, O. (2012). “Making illegal states unrepresentable.” Attributed to Yaron Minsky at Jane Street.↩
Kleppmann, M. (2025). “AI and Formal Verification Going Mainstream”. The economics of verification are inverting: writing proofs for seL4 required 23 lines of proof per line of implementation, but LLMs are becoming skilled at generating proof scripts, potentially making formal verification standard practice.↩
Often attributed to Einstein: “Compound interest is the eighth wonder of the world. He who understands it, earns it; he who doesn’t, pays it.” The attribution is likely apocryphal, but the principle holds.↩
Amazon’s formal methods team. “How AWS Uses Formal Methods”. Amazon Science.↩
Ethereum Foundation. (2022). “Formal Verification of the Beacon Chain”.↩
Leroy, X. (2006). “Formal certification of a compiler back-end”. POPL.↩
Klein, G., et al. (2014). “Comprehensive Formal Verification of an OS Microkernel”. ACM TOCS.↩
Moonshot AI. (2026). “Kimi K2.5: Visual Agentic Intelligence”. Kimi K2.5’s Agent Swarm enables self-directed parallelization across up to 100 sub-agents and 1,500 tool calls, trained via Parallel-Agent Reinforcement Learning (PARL). Achieves 4.5× speedup over sequential execution on complex tasks.↩
Avigad, J., et al. (2024). Theorem Proving in Lean 4. The official introduction to Lean as a proof assistant, covering tactics, structures, and type theory fundamentals.↩
Buzzard, K., et al. (2024). Mathematics in Lean. A hands-on tutorial for formalizing mathematics, from basic logic to real analysis.↩
Mathlib Community. (2024). Mathlib Documentation. The reference for Lean’s mathematical library: 200,000+ theorems covering algebra, analysis, topology, and more.↩
Finn, C. (2025). “CS 224R: Deep Reinforcement Learning”. Stanford University. Covers policy gradients, actor-critic methods, offline RL, reward learning, RLHF for LLMs, meta-RL, and exploration. The foundations for understanding agent capabilities and limits.↩
Zou, W. (2024). Meta-Learning: Concepts and Techniques. Elsevier. ISBN 978-0-323-89931-4. Theoretical foundations for agents that learn to learn. Relevant for understanding the ceiling on agent adaptation.↩
Leviathan, Y., Kalman, M., & Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding”. International Conference on Machine Learning.↩
Chen, C., et al. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling”. arXiv preprint.↩