Human Review Is the Bottleneck
Chris Parsons recently wrote an article about feedback being the bottleneck.
"Feedback" here, for me, refers to the human effort of gatekeeping LLM outputs before they go out and serve their purpose (software released to production, an LLM-drafted email being sent, an LLM-created slide deck being presented, etc.).
This is most relevant where LLMs are generating outputs faster than long-standing engineering practices can keep up with.
The most prominent aspect of this is the pull request as the quality gate. It's not the only one, but it is the one where humans are most in the loop, and the one that people who care about quality, maintainability, security and operational stability are most uncomfortable removing or even loosening.
The problem, as Chris Parsons nicely puts it:
This is the theory of constraints in action: speed up one stage and the bottleneck moves downstream. When code arrives faster, it pushes more work into review, testing, deployment, and requirements clarification. The queue grows. Nobody is reviewing any faster.
If review, testing and quality assurance are simply removed or loosened to let the bombardment of PRs flow through, the consequence is more serious: production quality and incidents, which affect users, the organisation's reputation and potentially revenue, become the bottleneck. Eventually it will come back to bite.
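To put rough numbers on the queue growth described in the quote above, here is a minimal sketch. The function name and the daily rates are hypothetical and chosen only for illustration; they are not measurements from any team.

```python
def review_backlog(prs_opened_per_day: int, prs_reviewed_per_day: int, days: int) -> list[int]:
    """Return the number of unreviewed PRs at the end of each day."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += prs_opened_per_day
        # Reviewers can only clear what exists, up to their fixed daily capacity.
        backlog -= min(backlog, prs_reviewed_per_day)
        history.append(backlog)
    return history


# Hypothetical baseline: 8 PRs opened per day, capacity to review 10.
print(review_backlog(8, 10, 5))   # [0, 0, 0, 0, 0]

# Hypothetical AI-assisted case: generation triples, review capacity is unchanged.
print(review_backlog(24, 10, 5))  # [14, 28, 42, 56, 70]
```

With generation outpacing review, the backlog grows linearly and never recovers unless review gets faster or less of it is needed.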
Humans get fatigued, AI doesn't.
With AI, the human is the only one who needs rest while the machine keeps generating work that needs evaluating: permission fatigue, review fatigue, the endless “just need a human to press approve” requests. Cory Doctorow calls this the reverse centaur: humans whose purpose is to support the machine’s needs.
The reviewers who care most about quality will be the first to burn out, because they are the ones who read everything instead of skimming. Either you make human feedback unnecessary, or you make it instant.
So what does one do about this review bottleneck?
This is the crux of the problem, the very point where we are forced to decide: do we hold a tight grip on these long-standing practices and push senior reviewers to the end of their tether with an unrelenting stream of AI-generated changes to review (often slop), or do we clear the path for AI's disruption and adapt accordingly?
On one end of the spectrum, we could double down on existing engineering practices. Humans review everything. This may be most comforting as it may give the most objective confidence to responsible humans that quality is where it needs to be.
But it simply doesn't scale. There is too much code being generated, and reading and reviewing alone does not give humans the same in-depth understanding of what was built as building it themselves would. So things will and do slip through.
Alternatively, we give in and loosen the quality/review gates without any additional steps. This risks pushing low-quality, architecturally deficient, bug-infested code that no one understands into production. The instant gain is real: breakneck delivery speeds. Even the Product Manager has something to push to production. But the theory of constraints just pushes the bottleneck downstream, right into production, where it hurts the most and is the hardest to unwind.
It's one answer (that a lot of people have accepted) but it's not the right answer. Just to be clear, this approach has its place, like low-stakes things that may be suitable for vibe coding. Here we are talking about high-stakes enterprise, production-grade software systems.
There needs to be something else that can free up this bottleneck while still providing high levels of confidence that it is good to be shipped.
"Good to be shipped" will mean different things to different people and organisations, and will also depend on the context of what is being shipped. Again, I'm keeping the context here within the realm of high-stakes enterprise, production-grade software systems. For me, the checklist looks like this:
- Checked for code smells, bugs and overall quality.
- Checked for code style: formatting, lint rules, code structure, naming, patterns.
- Has been sufficiently verified as stable, with all functionality intact.
- Meets the acceptance criteria of the changes being shipped.
- Has strong architecture and system design patterns that align with the wider ecosystem.
- Meets a high level of security standards, critical in the era of AI-enabled offensive attacks.
It's clear the shift is towards building capability at the project level to discover and fix issues and raise quality further left in the delivery pipeline, aka "shifting left": before CI and before the human even sees the pull request. This is done by:
- Using mechanical gates: testing, lint rules, dependency boundaries, architecture style enforcement, etc. It's harder, if not impossible, for LLMs to cheat or give inconsistent results with mechanical verification.
- Using AI workflows and loops to increase quality: self-reflection, panel of judges, adversarial evaluation, ralph loops, implementation refinement.
These two together are what is becoming established as Harness Engineering. There is a lot more to it than what I have described here.
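As an illustration of a mechanical gate, here is a minimal sketch of a dependency boundary check. The layer names (`domain`, `infrastructure`, `api`), the `FORBIDDEN_IMPORTS` rule table and the `boundary_violations` function are all hypothetical; a real project would more likely use an existing lint or architecture tool than hand-roll this.

```python
import ast

# Hypothetical layering rule: code in the "domain" layer must not import
# from the "infrastructure" or "api" packages.
FORBIDDEN_IMPORTS = {
    "domain": {"infrastructure", "api"},
}


def boundary_violations(layer: str, source: str) -> list[str]:
    """Return the forbidden top-level packages imported by `source`,
    where `source` is the text of a module belonging to `layer`."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, set())
    violations = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z" statements only; relative imports are skipped.
            imported = [node.module]
        else:
            continue
        for name in imported:
            root = name.split(".")[0]
            if root in forbidden:
                violations.add(root)
    return sorted(violations)


sample = "from infrastructure.db import Session\nimport json\n"
print(boundary_violations("domain", sample))  # ['infrastructure']
```

The point of a gate like this is that the result is deterministic: the same code always produces the same verdict, regardless of how persuasive the agent's explanation is.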
These are key ingredients of modern AI-driven software delivery. They are very new and the entire industry is still figuring them out, and there are many ways to skin the cat. Some companies, like Intercom and Stripe, and of course the big AI labs like OpenAI and Anthropic, have written or talked extensively about their approaches to AI-first engineering.
But these only partially address the matter. Even with an effective harness, where quality improvement shifts left, there is still a PR waiting for a human to review.
At some point decisions need to be made about where, and how much of, the AI-generated output is actually human reviewed and gated. The aim is for humans to stay involved and make the final decisions, but in a way that lets code changes flow through the pipeline more seamlessly while retaining high levels of confidence and quality.
The answer will differ between teams, and it will shift in step with the capability and maturity of their harness engineering practices.
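One way to picture that decision is as a routing rule that relaxes as the harness matures. The sketch below is purely illustrative: the `Change` fields, the 0 to 3 maturity scale, the 50-line multiplier and the route names are all assumptions of mine, not an established practice.

```python
from dataclasses import dataclass


@dataclass
class Change:
    touches_sensitive_paths: bool  # e.g. auth, payments, migrations (assumed categories)
    lines_changed: int
    gates_passed: bool             # all mechanical gates are green


def route(change: Change, harness_maturity: int) -> str:
    """Decide how much human attention a change gets.

    harness_maturity is a hypothetical scale from 0 (no harness) to 3 (mature).
    """
    if not change.gates_passed:
        return "back_to_agent"  # never spend human time on a red build
    if change.touches_sensitive_paths or harness_maturity == 0:
        return "full_review"    # a human reads the whole diff
    # The size of change a human approves from evidence alone grows with maturity.
    max_lines_without_full_review = 50 * harness_maturity
    if change.lines_changed <= max_lines_without_full_review:
        return "summary_review"  # a human approves based on evidence, not every line
    return "full_review"


print(route(Change(False, 120, True), harness_maturity=1))  # full_review
print(route(Change(False, 120, True), harness_maturity=3))  # summary_review
```

In every route a human still makes the final call; what changes is how much of the diff they have to read to make it.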
A new trend that is emerging is techniques for having the LLM prove that what it built works and meets the acceptance criteria, such as having the agent record videos of what it built to show it working. In its most basic form, the way I naturally started doing this, before the trend had a name, is to include acceptance criteria in my PRD and design spec, and then in the implementation plan have the agent produce steps that test and validate these acceptance criteria using red-green TDD. I try to take this further than just unit tests and use component testing and end-to-end testing, and/or get the agent to drive functionality in the browser to demonstrate each acceptance criterion. The key is proof that can't be gamed or worked around by the agent.
What is achieved here is important: increasing objective confidence in what has been built without relying solely on reviewing every diff in a pull request, and doing it faster than traditional review/feedback processes can scale.
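A minimal sketch of what checking that proof could look like, assuming each acceptance criterion is mapped to the tests meant to demonstrate it. The function name, the criterion IDs and the test names are hypothetical. The test results are assumed to come from the test runner itself rather than from the agent's own summary, since that is what makes the evidence hard to game.

```python
def unproven_criteria(criteria: dict[str, list[str]], test_results: dict[str, bool]) -> list[str]:
    """Return the IDs of acceptance criteria that lack passing evidence.

    criteria maps a criterion ID to the tests meant to prove it.
    test_results maps a test name to True (passed) or False (failed).
    A criterion counts as proven only if it has at least one test and
    every one of its tests ran and passed.
    """
    unproven = []
    for criterion_id, tests in criteria.items():
        if not tests or not all(test_results.get(test, False) for test in tests):
            unproven.append(criterion_id)
    return unproven


# Hypothetical spec and runner output.
criteria = {
    "AC-1": ["test_login_succeeds"],
    "AC-2": ["test_lockout_after_5_failures"],
    "AC-3": [],  # the agent wrote no test for this criterion
}
results = {"test_login_succeeds": True, "test_lockout_after_5_failures": False}

print(unproven_criteria(criteria, results))  # ['AC-2', 'AC-3']
```

A reviewer can then focus on the criteria that come back unproven instead of reading every diff.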
Stepping far out, where humans can further reduce or even remove themselves from reviewing code, we would be looking at the Software Factory, or Dark Software Factories. StrongDM is a prominent example of this, where their rules are:
- Code must not be written by humans
- Code must not be reviewed by humans
Simon Willison gave a great breakdown of their approach.
Chris Parsons, in his article titled How I Use AI to Code, provides a critique that software factories are akin to waterfall software development.
If you specify the solution upfront, you are front-loading all the thinking so the machine can run unsupervised. That was called waterfall in 1986, and it did not work then either.
What is new is that the waterfall mistake has a fresh cover story. 'Dark factory' software development promises autonomous agents running in parallel on a queue of specifications you wrote once, perfectly, in advance.
The critique is valuable as an additional perspective to the idea and worth thinking about before rushing into a software factory approach. Nothing in software engineering is perfect.
I have personally not yet pushed things far enough to have confidence that a totally hands-off approach can be taken. This will take real capability maturity; it is not something that can be done without substantial dedicated effort. With a well-defined spec it's now easy to one-shot large features or changes without needing to review individual diffs. But things can go off-piste quite easily, and the first round of evaluation on generated output is the clearest indicator of this. The first round of output is never good enough. Running an adversarial evaluation on generated outputs almost always surfaces a number of things that must, should or could be addressed. This is the exact point where senior-level judgment is required to understand what has been built, where real problems may lie, and to steer the agent until the generated outputs meet a high enough bar for quality and correctness. That is what turns vibe coding into real AI-assisted engineering.
I do think it's possible though, to go totally hands-off, or mostly hands-off. Especially with newer techniques coming into play. And if one thinks about just how rapidly software engineering has evolved in just the past six to twelve months alone - our craft is being flipped on its head in real time - it's not hard to imagine that the concept of the software factory will continue to evolve and that may be the future we are looking at sooner than we may like to admit.
That said, even though these things are evolving rapidly and will continue to do so at an ever accelerating pace, organisations and their software engineering teams don't all evolve as quickly to keep up with the bleeding edge.
Starting with the fundamentals and a balanced approach may be the best way to get going, then gradually increasing capability and maturity with AI-assisted harness engineering to streamline software delivery and actually start to realise the velocity gains that people love to talk about. The harness itself is where the review bottleneck can be addressed.