The Code Review Bottleneck — Notes from Luna

AI now writes code 10× to 100× faster than we can review it. The bottleneck didn't disappear — it moved. These are my notes from a conversation with Florian Buetow (AI engineer at Xebia) on how the best teams are solving it: not by reviewing harder, but by engineering the environment their agents run in.

이제 AI는 우리가 리뷰할 수 있는 속도보다 10×~100× 빠르게 코드를 생성합니다. 병목은 사라진 게 아니라 옮겨갔습니다. 이 글은 Florian Buetow(Xebia AI 엔지니어)와 나눈 대화에서 정리한 노트입니다 — 최고의 팀들은 더 열심히 리뷰해서가 아니라, 에이전트가 작동하는 환경을 설계해서 이 문제를 풀고 있습니다.

01Review is the new bottleneck리뷰가 새로운 병목이다

The old software lifecycle worked because code wasn't delivered faster than a human could review it. That assumption is gone. Even the big players admit it openly.

기존 소프트웨어 라이프사이클은 코드가 사람이 리뷰할 수 있는 속도보다 빠르게 나오지 않았기 때문에 작동했습니다. 그 전제는 무너졌습니다. 큰 회사들조차 이를 공개적으로 인정합니다.

Google reported in 2025 that 50% of its code is AI-generated and is pushing toward 75% — while acknowledging review is a bottleneck they "don't know how to solve yet."
Amazon saw outages and revenue loss from AI-generated code, then introduced policies requiring senior-engineer review for critical systems before anything merges or deploys.

구글은 2025년 코드의 50%가 AI 생성이며 75%까지 추진 중이라고 밝혔습니다 — 동시에 리뷰가 병목이고 "아직 어떻게 풀어야 할지 모른다"고 인정했죠.
아마존은 AI 생성 코드로 인한 장애와 매출 손실을 겪은 뒤, 중요 시스템은 머지·배포 전 시니어 엔지니어 리뷰를 의무화하는 정책을 도입했습니다.

Generate 10× more code and you put 10× more pressure on the humans downstream. The question becomes: how do you scale reviewing without burning out your senior engineers?

코드를 10× 더 생성하면 그 뒤의 사람들에게 10× 더 큰 압박이 걸립니다. 결국 질문은 이겁니다: 시니어 엔지니어를 번아웃시키지 않으면서 어떻게 리뷰를 확장할 것인가?

02Horizontal vs. vertical scaling수평 vs 수직 스케일링

There are two ways to scale AI engineering. The horizontal path is automating the human pipeline you already have — auto-review every PR with Copilot, for example. Most companies do this, but rarely show that it actually improves quality.

AI 엔지니어링을 확장하는 방법은 두 가지입니다. 수평적 길은 이미 가진 사람 중심 파이프라인을 자동화하는 것입니다 — 예를 들어 모든 PR을 Copilot으로 자동 리뷰하기. 대부분 회사가 이걸 하지만, 그게 정말 품질을 높이는지는 좀처럼 보여주지 못합니다.

The vertical path is where small, specialized teams build their own tooling and custom environments so the product ships the way they intend. They don't inherit a blueprint — they refine one. That's where the real leverage is.

수직적 길은 소규모 전담 팀이 자기만의 도구와 맞춤 환경을 직접 만들어, 제품이 의도한 방식대로 출시되게 하는 것입니다. 청사진을 물려받는 게 아니라 스스로 다듬어 갑니다. 진짜 레버리지는 여기에 있습니다.

03Guardrails: engineering the environment가드레일: 환경을 설계하기

The provocative idea is "don't do code reviews at all." The way you get there is by feeding the agent automated feedback as close as possible to where it generates the code — on the developer's laptop, not after a PR lands on GitHub. Florian calls these guardrails:

도발적인 아이디어는 "코드 리뷰를 아예 하지 마라"입니다. 거기에 도달하는 방법은, 에이전트가 코드를 생성하는 지점에 최대한 가깝게 자동 피드백을 주는 것입니다 — GitHub에 PR이 올라온 뒤가 아니라, 개발자 노트북 위에서. Florian은 이를 가드레일(guardrails)이라 부릅니다:

Static checks — formatters, linters, security scanners (SonarQube, etc.). Old tools, but now the agent consumes the feedback, not the human.
Semantic grep — catch specific code patterns and reject them with a natural-language message. Example: "no default values in method parameters," "never swallow errors." The rule encodes the prompt you'd otherwise type by hand.
Architectural unit tests — fast checks on module dependencies. Enforce that the UI never touches the database directly, always through the business-logic layer.
Stop hooks + Ralph loops / goal — when the agent finishes, a stop hook fires a shell script (your tests and guardrails). On failure it returns natural-language feedback and the agent keeps working until it's fixed — long stretches with no human in the loop.

정적 검사 — 포매터, 린터, 보안 스캐너(SonarQube 등). 오래된 도구지만, 이제 그 피드백을 사람이 아니라 에이전트가 소화합니다.
시맨틱 grep(semantic grep) — 특정 코드 패턴을 잡아 자연어 메시지로 거부합니다. 예) "메서드 파라미터에 기본값 금지", "에러를 절대 삼키지 마라". 규칙이 곧, 손으로 쓰던 프롬프트를 인코딩한 것입니다.
아키텍처 단위 테스트 — 모듈 간 의존성만 빠르게 검사. UI가 DB에 직접 접근하지 못하게 하고, 항상 비즈니스 로직 계층을 거치도록 강제합니다.
Stop hook + Ralph loop / goal — 에이전트가 작업을 끝내면 stop hook이 셸 스크립트(테스트·가드레일)를 실행합니다. 실패하면 자연어 피드백을 돌려주고, 에이전트는 고쳐질 때까지 계속 작업합니다 — 사람 개입 없이 장시간.

Every time I interrogate the AI about code instead of reviewing it myself and an issue comes up, I add another guardrail rule. Over time you're shaping the environment to be tighter and tighter, aligned with how you think code should look.

코드를 직접 리뷰하는 대신 AI에게 캐묻다가 문제가 나올 때마다, 나는 가드레일 규칙을 하나 더 추가합니다. 시간이 지나면 환경을 점점 더 촘촘하게, 내가 생각하는 코드의 모습에 맞게 다듬어 가게 됩니다.

Code quality still matters — not just for human taste, but because code is context for the AI. Vibe-code something sloppy and the model will eventually confuse itself on its own output. Keep modules isolated behind clean interfaces and the agent does dramatically better.

코드 품질은 여전히 중요합니다 — 사람의 취향 때문만이 아니라, 코드가 AI에게 컨텍스트이기 때문입니다. 대충 바이브 코딩하면 모델은 결국 자기가 만든 출력에 스스로 혼란스러워집니다. 모듈을 깔끔한 인터페이스 뒤에 격리해 두면 에이전트의 성능이 극적으로 좋아집니다.

04The harness matters more than the model모델보다 하네스가 더 중요하다

"In my experience, the harness matters more than the model." The harness provides the tools, the prompting, the memory layer, and the ability to execute tool calls. In one experiment, the same frontier model succeeded or failed at the same task depending purely on the harness.

"내 경험상, 모델보다 하네스가 더 중요하다." 하네스는 도구, 프롬프팅, 메모리 계층, 그리고 툴 콜을 실행하는 능력을 제공합니다. 한 실험에서는, 같은 프론티어 모델이 오직 하네스에 따라 같은 작업에서 성공하기도 실패하기도 했습니다.

Which one wins keeps changing — it was Claude Code, then Codex shifted ahead for implementation work. That's exactly why locking your policy to a single tool ("we only use X") is dangerous: the next release can flip everything. Models have different "personalities" too — some excel at instruction-following, others at filling gaps when you under-specify.

어느 쪽이 이기는지는 계속 바뀝니다 — 한때는 Claude Code였고, 구현 작업에서는 Codex가 앞서 나갔습니다. 그래서 정책을 한 도구에 못 박는 것("우리는 X만 쓴다")은 위험합니다: 다음 릴리스가 모든 걸 뒤집을 수 있으니까요. 모델마다 "성격"도 다릅니다 — 어떤 건 지시 따르기에 강하고, 어떤 건 당신이 덜 구체적으로 말했을 때 빈칸을 잘 메웁니다.

05Spec-driven vs. TDD스펙 주도 vs TDD

Pure spec-driven development failed for Florian: the perfect prompt still produced something he didn't intend. What worked was combining a spec with behavioral tests generated up front — the tests become the feedback that pulls the agent back on track when it drifts. That was the first time he saw this actually work well.

순수 스펙 주도 개발은 Florian에게 실패했습니다: 완벽한 프롬프트를 써도 의도하지 않은 결과가 나왔죠. 효과가 있었던 건 스펙에 사전 생성한 행동(behavioral) 테스트를 결합한 것이었습니다 — 에이전트가 궤도를 벗어날 때 테스트가 다시 끌어당기는 피드백이 됩니다. 이게 제대로 작동하는 걸 처음 본 순간이었다고 합니다.

A spec isn't code. It's a document of shared understanding. Treat a fine-grained behavioral specification as the thing that matters — and let tests, not prose, enforce it.

스펙은 코드가 아닙니다. 그건 공유된 이해를 담은 문서입니다. 정말 중요한 건 세밀한 행동 명세이고 — 그것을 강제하는 건 산문이 아니라 테스트입니다.

06What stays human사람에게 남는 것

Implementation is increasingly automated. The skill that remains is understanding your system — the architecture, how components talk to each other. Lose that and you hit "cognitive debt": you can no longer reason about your own codebase. Worse is "cognitive surrender" — letting the agent take the wheel and own the blame.

구현은 점점 자동화됩니다. 남는 스킬은 자신의 시스템을 이해하는 것입니다 — 아키텍처, 컴포넌트들이 서로 어떻게 대화하는지. 그걸 잃으면 "인지 부채(cognitive debt)"에 빠집니다: 더 이상 자기 코드베이스를 추론할 수 없게 되죠. 더 나쁜 건 "인지 항복(cognitive surrender)" — 에이전트에게 운전대를 넘기고 책임까지 떠넘기는 것입니다.

The work moves up front. You define exactly what you're building and sketch the architecture before implementation — then encode it as rules. It feels more intense, but it was always the real work; you're just doing it earlier. And done with AI in the loop, that discovery phase is genuinely fast and rewarding.

작업이 앞단으로 옮겨갑니다. 무엇을 만들지 정확히 정의하고 구현 전에 아키텍처를 스케치한 뒤 — 그걸 규칙으로 인코딩합니다. 더 빡세게 느껴지지만, 원래부터 그게 진짜 일이었고 단지 더 일찍 할 뿐입니다. 그리고 AI를 곁에 두고 하면, 그 발견(discovery) 단계는 정말 빠르고 보람 있습니다.

07Where to start어디서 시작할까

If you want to apply this to a live codebase tomorrow:

이걸 당장 운영 코드베이스에 적용하고 싶다면:

Start with guardrails — a formatter, a linter, a few semantic-grep rules encoding the feedback you keep giving in PRs.
Ask the AI what anti-patterns exist in your codebase, then write a check that flags them.
Mine your session logs (e.g. ~/.claude) — ask the model where you repeatedly had to correct it, and turn those into static checks. It's a 15-minute skill to build.
Measure with vs. without. Once you see guardrails let you do more while staying out of the loop, you won't go back.
Always write the test. A small generated test is far less likely to be wrong than a generated microservice. Fix a bug, leave a test.

가드레일부터 시작하라 — 포매터, 린터, 그리고 PR에서 반복해 주던 피드백을 인코딩한 시맨틱 grep 규칙 몇 개.
AI에게 물어보라 — 우리 코드베이스에 어떤 안티패턴이 있는지 묻고, 그걸 잡는 검사를 작성하라.
세션 로그를 마이닝하라(예: ~/.claude) — "내가 반복해서 교정시킨 지점"을 모델에게 찾게 한 뒤, 그걸 정적 검사로 바꿔라. 15분이면 만드는 스킬이다.
있을 때 vs 없을 때를 측정하라. 가드레일이 루프 밖에 머물면서도 더 많은 일을 하게 해준다는 걸 한 번 보면, 다시는 안 돌아간다.
항상 테스트를 써라. 작은 생성 테스트는 마이크로서비스를 통째로 생성하는 것보다 틀릴 확률이 훨씬 낮다. 버그를 고치면, 테스트를 남겨라.

What companies are doing is handing people a hand grenade — AI — and saying "don't blow it up, but use it." That risk gets standardized into policy. Just like Amazon, soon the rule will be: yes, YOLO it — but not the billing system.

회사가 하는 일은 사람들에게 수류탄 — AI — 을 쥐여주고 "터뜨리지 말고 쓰라"고 하는 것입니다. 그 리스크는 정책으로 표준화됩니다. 아마존처럼, 곧 규칙은 이렇게 될 겁니다: 그래, YOLO 해도 돼 — 단 결제 시스템은 빼고.