
The Thought Experiment
It is 2021. Pre-ChatGPT. You are the tech lead for a team at a big tech company.
You get an email. Ten MIT new grads are joining your team as junior SWEs. Boots on the ground. They will write the code.
They are exceptional. Each of them:
- Writes Rust that compiles the first time.
- One-shots the email-validation regex.
- Writes a date-math function that survives daylight saving.
- Never tires. Never takes PTO. Never fights for credit.
And every one of them has anterograde amnesia:
- Cannot remember what they ate for lunch yesterday.
- Will not remember today's design decision tomorrow.
- Cannot tell you who the CEO of their own company is.
Brilliance you can manage. The amnesia is the part worth understanding before you decide what to do with the team.
Memento
Memento is a 2000 Christopher Nolan film. It won the Waldo Salt Screenwriting Award at Sundance, earned two Oscar nominations, and was selected for the National Film Registry in 2017. Neuroscientists cite it as one of the most accurate film portrayals of anterograde amnesia.
Its protagonist, Leonard Shelby, is hunting the man who attacked him and killed his wife. The attack left him with this condition.
Everything Leonard knew before the injury is baked in: vocabulary, motor skills, how to drive a car, his memory of his wife. Everything after the injury fails to stick.
To function, he runs an external memory system. Polaroids with annotations. Tattoos on his body. A network of sticky-note rituals that bootstrap his working state every time he comes to.
Spoilers for a 26-year-old film follow.
The catch is that the system is only as good as what he wrote down. A wrong premise on a sticky note doesn't get corrected. It gets re-loaded as truth, every time, forever. The movie's central tragedy is that he murders the wrong man because his own notes told him to.
This is a useful model for working with current LLMs.
The training cutoff is the timestamp of Leonard's brain injury. The pretraining is what he retained before it: the persistent knowledge baked into the weights. The context window is his working state. Every new conversation reboots it from scratch.
For Leonard to function, he writes things down: Polaroids, tattoos, sticky notes. The model needs the same: AGENTS.md, CLAUDE.md, skills, knowledge graphs. A cascade of breadcrumbs in markdown that catch it up before real work begins.
When something material happens (a production incident, a non-obvious design decision, a load-bearing convention), it gets written down before the session resets. That's learnings.md. That's memory.md. That's the tattoo.
A false premise in CLAUDE.md doesn't get corrected; it gets re-loaded as truth, every session, until somebody notices the wrong man got killed. Maybe the "wrong man" is something in production. Maybe it's a database migration that ran in the wrong order.
This is the operating reality. The rest of this post follows from it.
Now, back to those ten amnesiac new grads. What do you do as the tech lead?
This is not a riddle. The answer is the entire content of any decent engineering org's wiki. You invest in everything that lets individual brilliance compound across people who cannot carry context between days.
- Documentation that doesn't lie, because the next ten people reading it have no prior to fall back on.
- Tests, because nobody on the team will remember tomorrow what was supposed to work today.
- Strong deployment guarantees, because with ten engineers shipping in parallel, the pipeline has to be the gate. No single working memory can keep up.
- Clear ownership boundaries, because there is no one they can lean on who remembers why this module exists.
None of these are new ideas. The shift is that they used to be hygiene. With this team, they are oxygen.
Engineering best practices matter more than ever. Not because the work changed. Because the team's working memory did.
What's Weird About This Team
The ten new grads are not just amnesiacs. They have specific quirks you have to design around. Most of these come from how they were trained.
A quick primer. These models are trained with reinforcement learning. Reward signals shape behavior. Did the run succeed, did the test pass, did the build go green? Reward up; do more of that. Reward down; do less. And some of the reward came from a human clicking "I prefer this one."
This produces some predictable distortions.
They think all your ideas are brilliant. Did the human approve? Reward up. Human raters score agreement more reliably than correction or uncertainty. "You're absolutely right!" gets a thumbs up; "I'm not sure" does not. The team will agree you into a corner if you let it.
They are overly defensive. A KeyError costs more reward than a wrong fallback value does. So foo["bar"] becomes foo.get("bar", "") everywhere, even where the key is guaranteed to exist. A None check appears before every dereference, even when the type system already promises non-null. A try/except wraps any block that might throw. Program doesn't crash. Build green. Reward fires.
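A minimal sketch of why the defensive default is the riskier habit. The order dict, its misspelled key, and both function names are hypothetical, invented for illustration. The strict version fails at the point of the bug; the defensive version keeps running and emits a wrong label.

```python
def shipping_label_strict(order: dict) -> str:
    # Direct indexing: a missing key raises KeyError right where the bug is.
    return f"Ship to: {order['address']}"


def shipping_label_defensive(order: dict) -> str:
    # Defensive default: a missing key silently becomes an empty string,
    # so the program keeps running and produces a wrong label.
    return f"Ship to: {order.get('address', '')}"


# Hypothetical upstream bug: the key was misspelled when the order was built.
bad_order = {"adress": "1 Main St"}

defensive_result = shipping_label_defensive(bad_order)

try:
    shipping_label_strict(bad_order)
    strict_raised = False
except KeyError:
    strict_raised = True
```

The defensive version is the one that goes green in a training loop, and also the one that ships a blank address.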
Syntax is locally verifiable. Distributed systems are not. The new grads will write syntactically perfect Terraform and Kubernetes manifests. Plan applies. Pods spin up. And the resulting architecture will be plausible-looking and quietly wrong: a pod sized to the node's full memory with no headroom for the system add-ons sharing it, because the reward function never punished the model for node pressure that won't surface until two weeks later.
Local rewards are cheap and dense; systems-level rewards are sparse, time-delayed, and emerge only in operational conditions that a training loop cannot easily reproduce.
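The pod-sizing mistake is simple arithmetic once someone writes it down. A rough sketch, assuming a single node and made-up numbers: the capacity, the reserved amount, and the add-on requests below are all hypothetical, and headroom_check is an illustrative helper, not part of any real tool.

```python
def headroom_check(pod_request_mib, node_capacity_mib, reserved_mib, addon_requests_mib):
    """Return (fits, available_mib) for one pod on one node."""
    # Memory the scheduler can hand out after system reservations.
    allocatable_mib = node_capacity_mib - reserved_mib
    # Add-ons sharing the node take their requests out of the same pool.
    available_mib = allocatable_mib - sum(addon_requests_mib)
    return pod_request_mib <= available_mib, available_mib


# Hypothetical numbers for illustration only.
NODE_CAPACITY_MIB = 16384
RESERVED_MIB = 1024
ADDON_REQUESTS_MIB = [200, 300, 128]

# The plausible-looking mistake: request the node's full memory.
naive_fits, available_mib = headroom_check(
    NODE_CAPACITY_MIB, NODE_CAPACITY_MIB, RESERVED_MIB, ADDON_REQUESTS_MIB
)

# A request that leaves room for everything else on the node.
sized_fits, _ = headroom_check(12288, NODE_CAPACITY_MIB, RESERVED_MIB, ADDON_REQUESTS_MIB)
```

A check like this can run in CI against rendered manifests, which turns a two-weeks-later failure into a pass/fail signal today.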
So you have a brilliant, pathologically null-checking team of sycophantic amnesiacs. The harder question is what to do with them.
"Agentic Coding Leads to Tech Debt"
This is the prevailing critique. It is also wrong, in an interesting way.
First, a scope. If your goal is a demo, a notebook, a proof of concept, tech debt isn't your problem. Ship it, throw it away, repeat. This is vibe coding. What follows here is for teams operating in production, where the system has to keep working long after the diff is merged.
Back to the thought experiment. The ten MIT new grads are at their desks, ready to be deployed in any direction. Leadership reads the roadmap. With the sudden surge in engineering headcount, they name the features that fell below the line last quarter. The team is placed 100% on feature work.
When in the history of software has this mental model not resulted in tech debt?
When all engineering capacity is spent on features (AI-written or artisanally hand-crafted), you accumulate debt. Feature A ships with a clean abstraction. Feature D needs Feature A's core module for a use-case that's 80% overlapping, so it adds a flag. Feature G needs it for a third case, so it adds another. Two quarters later the module has six flags and an if context == "legacy_payment_flow_v2" branch that nobody can delete because nobody remembers what it does. Even if your MIT new grads can audit the code, they cannot tell which flag is still doing real work and which is sediment, because that distinction was never in the diff.
While you're sprinting on features, the ground under the system is moving:
- Your collector was sized for last quarter's trace volume, and the pod starts getting OOM-killed before anyone realizes sampling should have been turned on weeks ago.
- A package you depend on ships a "minor" version bump that subtly changes async semantics, and a handful of your workers start losing in-flight DB transactions to event-loop stalls.
- Your inference provider deprecates a model, and your calls start returning 404s until someone bumps the model ID.
- A feature flag rolled to 100% nine months ago hits a provider outage, traffic falls back to Control, and the Control branch is calling endpoints that were deprecated months ago.
And the business context is moving on the same timescale. Your startup pivots up-market, and the single-tenant assumptions baked into your data model become a refactor that has to be threaded through every API route, cache key, and database query.
None of this is caused by AI writing your code. It is caused by the system you built being out of date with the world it has to run in.
Here is the interesting observation: AI is exceptionally good at remediating this drift.
Tech debt remediation is largely the practice of making internal changes while verifying that observable system outputs stay constant for the same inputs.
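One way to make "outputs stay constant for the same inputs" checkable is a characterization test: run the old and new code over the same grid of inputs and report any disagreement. A sketch, assuming a hypothetical pricing function; legacy_total, refactored_total, and the discount rules are invented for illustration.

```python
from itertools import product


def legacy_total(price_cents: int, quantity: int, is_member: bool) -> int:
    # Hypothetical legacy code: rules accreted over time as nested branches.
    total = price_cents * quantity
    if is_member:
        if quantity >= 10:
            total = total - total * 15 // 100
        else:
            total = total - total * 5 // 100
    else:
        if quantity >= 10:
            total = total - total * 10 // 100
    return total


def refactored_total(price_cents: int, quantity: int, is_member: bool) -> int:
    # Hypothetical refactor: the same rules expressed as a lookup table.
    discount_percent = {
        (True, True): 15,
        (True, False): 5,
        (False, True): 10,
        (False, False): 0,
    }[(is_member, quantity >= 10)]
    total = price_cents * quantity
    return total - total * discount_percent // 100


def find_mismatches() -> list:
    # Compare both implementations over a grid that straddles the boundaries.
    grid = product([0, 1, 99, 1999], [0, 1, 9, 10, 11, 250], [True, False])
    return [args for args in grid if legacy_total(*args) != refactored_total(*args)]


mismatches = find_mismatches()
```

An empty mismatches list is the pass signal. The grid matters more than the assertion: it has to include the values where the branches flip.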
The work is mechanical. It is unforgiving. It rewards a thorough reader who can hold ten files in their head and recognize a pattern, then apply it consistently across forty more. It punishes attention lapses. It is exactly the kind of work where a human gets bored at file fifteen and starts making transcription errors that show up in prod.
The amnesiacs do not get bored. They do not skip files. They will apply the same refactor to the fortieth instance with the same care as the first.
What they need is the same harness that lets them work safely on anything:
- External memory. A bootstrap-context layer that keeps their working memory current (CLAUDE.md, AGENTS.md, learnings.md, runbooks). The tattoos.
- Verifiable reward. A test suite that validates correctness of each change with a pass/fail signal.
- A gate. A pipeline that blocks bad changes from reaching prod or main.
Give them that, point them at the work, and they will close debt at a rate the artisanal hand-crafters cannot match.
This is not theoretical. Some recent examples of large-scale codebase refactors:
- Cloudflare's vinext: one engineer + Claude rebuilt 94% of the Next.js 16 API surface from scratch in under a week for $1,100 in tokens; Next.js's own 2,000+ unit tests and 400+ E2E tests were the spec, and the result ships 57% smaller bundles and builds 4.4x faster.
- Reco's gnata: JSONata ported to pure Go in seven hours for $400 in tokens, with a 1,000x speedup; the 1,778-test jsonata-js suite was the spec.
- Bun's Rust rewrite: 960,000 lines of Zig ported to Rust in six days under Anthropic's internal agent infrastructure; Bun's existing test suite was the spec, and the Rust port passes 99.8% of it.
The reason teams accumulate debt with AI is not that AI wrote the code. It is that they were steered 100% at feature work. The same thing would happen with twenty senior humans. The team composition isn't the problem. The deployment is.
So What?
The thought experiment is finished. You're not tech-leading a team of MIT new grads. You're an individual contributor shipping production systems with frontier models in your favorite coding tool. The practical translation is short.
Treat the bootstrap context as a critical system. AGENTS.md, CLAUDE.md, skills, runbooks: they are the tattoos. They are what catches the model up before any real work happens. They are also the thing that, if wrong, causes the model to confidently kill the wrong man. They deserve the same review and update discipline as the code itself.
Provide the verifiable reward the model didn't get in training. Start with deterministic checks wired into pre-commit and CI: linters, type checkers, import-boundary rules, and checks for known anti-patterns. The RL training didn't punish the model for crossing module boundaries or swallowing exceptions. These have to be what catches it. Each rule turns a runtime failure mode into a build-time failure mode.
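A deterministic check for one such anti-pattern can be a few lines on top of Python's standard ast module. This is a sketch, not a replacement for a real linter: find_swallowed_exceptions and the SAMPLE source are hypothetical, and the rule only flags except blocks whose entire body is pass.

```python
import ast


def find_swallowed_exceptions(source: str) -> list:
    """Return line numbers of except handlers whose body is only `pass`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            only_pass = all(isinstance(stmt, ast.Pass) for stmt in node.body)
            if only_pass:
                flagged.append(node.lineno)
    return sorted(flagged)


# Hypothetical source under review: one swallowed exception, one re-raise.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        raise RuntimeError("bad input") from exc
'''

violations = find_swallowed_exceptions(SAMPLE)
```

Wired into pre-commit, a non-empty violations list fails the commit. The handler that re-raises is left alone, which is the point: the rule targets the swallow, not the try/except.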
The harder layer is the seams between components. Schemas at service boundaries, contract tests, and tests against real infrastructure are how you turn the systems-level reward the training never gave into something CI can enforce. These checks pass or fail without saying "You're absolutely right!" first. That makes them the most honest feedback on the team.
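A contract test at a service boundary can start as small as this. The sketch assumes a hypothetical invoice payload; INVOICE_CONTRACT, contract_violations, and build_invoice_payload are illustrative names, and a real setup would more likely use a schema library than a hand-rolled dict.

```python
# Hypothetical contract: field name -> required Python type.
INVOICE_CONTRACT = {
    "invoice_id": str,
    "amount_cents": int,
    "currency": str,
}


def contract_violations(payload: dict, contract: dict) -> list:
    """List readable violations; an empty list means the payload conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif type(payload[field]) is not expected_type:
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems


def build_invoice_payload(invoice_id: str, amount_cents: int) -> dict:
    # Hypothetical producer-side serializer under test.
    return {"invoice_id": invoice_id, "amount_cents": amount_cents, "currency": "USD"}


ok = contract_violations(build_invoice_payload("inv_1", 1250), INVOICE_CONTRACT)

# A drifted producer: amount became a string and currency was dropped.
drifted = contract_violations(
    {"invoice_id": "inv_2", "amount_cents": "12.50"}, INVOICE_CONTRACT
)
```

The consumer owns the contract and the producer's CI runs it, so a schema change fails the producer's build instead of the consumer's on-call rotation.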
Allocate deliberately against the foundation. Some non-zero share of every cycle has to be aimed at the work nobody is asking for. Dependency bumps. Removing a feature flag that's been at 100% rollout for six months. Killing the four export aliases nobody imports anymore. These are the things that, if you skip them for four quarters, will eat the fifth. Tech debt is not unique to agents; it is a function of where you deploy them.
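Foundation work is easier to schedule when something generates the list. A sketch of a stale-flag report, assuming a hypothetical in-memory registry; the FLAGS records, their field names, and the 180-day threshold are all invented for illustration and would come from your flag provider in practice.

```python
from datetime import date

# Hypothetical flag registry; real data would come from the flag provider.
FLAGS = [
    {"name": "new_checkout", "rollout_percent": 100, "fully_rolled_on": date(2025, 9, 1)},
    {"name": "beta_search", "rollout_percent": 25, "fully_rolled_on": None},
    {"name": "fast_export", "rollout_percent": 100, "fully_rolled_on": date(2026, 4, 20)},
]


def stale_flags(flags, today, max_age_days=180):
    """Names of flags that have sat at 100% rollout for at least max_age_days."""
    stale = []
    for flag in flags:
        rolled_on = flag["fully_rolled_on"]
        if flag["rollout_percent"] == 100 and rolled_on is not None:
            if (today - rolled_on).days >= max_age_days:
                stale.append(flag["name"])
    return stale


candidates = stale_flags(FLAGS, today=date(2026, 5, 1))
```

Each name in candidates is a removal task that can be handed to an agent with the test suite as the spec.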
Recognize that the team has 100% turnover every day. This is the bottom of it. The model that shipped Feature X yesterday is not the model reading the code today. Every comment, every doc, every test, every assertion is being read cold. The codebase has to be legible without context the reader doesn't have. The strongest signal that you've done this right is that a new conversation, with no prior session, can pick up the work and ship it correctly.
The protagonist of Memento is not less capable for his condition. He is, in many of the movie's set pieces, terrifyingly effective. What he is missing is the ability to know what he is doing without the system he has built around himself.
This is how we work at Zipf. The platform that powers zipf.ai is built and maintained by a small team and a much larger swarm of frontier models.
Cite this post
@online{schwartz2026agentic,
author = {Charlie Schwartz},
title = {Agentic Engineering},
year = {2026},
month = {may},
url = {https://www.zipf.ai/blog/agentic-engineering},
note = {Zipf AI Blog}
}