Loosely Coupled Thoughts

AI Subscription plans are more generous than API pricing

2026-06-12T21:10:19.104437+12:00

SemiAnalysis did a test on Anthropic/Claude subscription plans, running long horizon tasks until they maxed out.

The subscriptions are far more generous than API pricing.

When I first started using these tools I went the API route, as my personal usage fluctuated a lot. But after a couple of months it was clear this worked out a lot more expensive, so I switched to the paid plans, even if I don't use all the quota.

Go with the paid plan.

Implementing a Full-Stack Feature Using Claude Code's Dynamic Workflows

2026-06-07T11:52:52.865346+12:00

This is a continuation from the post I wrote recently titled Designing a Feature with Claude Design – Then Handing It to Claude Code.

In that post, I stepped through how I used Claude Design to create a UX design as well as a set of initial requirements for new tag selection and suggestion functionality in my blog's custom CMS.

This post steps through using those artefacts to drive the actual implementation in the frontend and backend apps. Specifically the goal here is to build this end-to-end using Claude Code's new Dynamic Workflows.

The outputs of the Claude Design exercise were a fully functional interactive prototype of the design and a markdown file with a set of requirements and acceptance criteria.

I already have a structured process for defining per-feature design specs, requirements and implementation plans in the project's repo. It consists of several things, but most relevant here is a folder structure and set of markdown files which I would consider the "spec" as part of the planning phase of this AI-driven engineering workflow. These are used as context for the implementation and verification, which often run across several separate sessions.

Usually when starting a feature I would be starting from scratch, or from just a high-level set of requirements from the initial blog design and feature roadmap. For this piece of work, Claude Design gave me a working prototype that was already using my frontend style guide and a detailed set of requirements, which I put into the code base for the duration of the build. So I had a solid starting point. However, those didn't fit into the structured process I already have in place, and I still needed a design spec for the full end-to-end solution.

For larger pieces of work I like to use Jesse Vincent's superpowers plugin. He has done a fantastic job of baking in real software development methodologies and workflows into a set of agent skills that drastically increase the quality and coherence of what's being built.

The first two skills I normally reach for are the brainstorming skill, which is for creating a design spec, and the writing-plans skill, which is for writing an implementation plan for the spec. The design spec is the most important part. For any given feature this is where I spend most of my time. The more clearly this is defined upfront, the more seamless the rest of the process will be and the higher the chance of it building what I actually want.

Once I'm happy with the design spec and implementation plan, after rounds of refinement, I would then hand it off to Claude Code to start building, usually with subagent development or agent teams.

I have already found a general workflow that I follow when building software with AI. In its simplest form it's the Plan-Generate-Evaluate methodology. But that's a gross oversimplification of the entire underlying process. In fact, for each of the Plan-Generate-Evaluate phases I will run inner Evaluate-Regenerate flows, iterating over this at different phases until outputs are where they need to be. The first outputs are rarely good enough.

For example, when in the planning phase, for the design spec planning I will start with some initial context and then have a back and forth conversation with Claude to flesh out the requirements, acceptance criteria and architecture. Once it generates the initial design spec I then go through usually multiple rounds of adversarial evaluation and refinement of the spec until it's at a place that I am happy with. When I do the implementation plan I do the exact same thing, and then during the build I will do the same thing again. The build often gets broken down into separate chunks of work, usually to reduce the size of PRs and because Claude's context will start to fill up, at which point the generated outputs can start to deteriorate in quality and things can go off-piste. So, for each build phase I will go through rounds of Generate-Evaluate loops.
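The inner Generate-Evaluate loop described above can be sketched in a few lines. This is only an illustration of the shape of the loop, not how Claude Code or my own setup implements it: the `Evaluation` class, the `generate_evaluate_loop` function, the `max_rounds` cap and the toy generate/evaluate functions are all hypothetical stand-ins for what would really be model calls and human or agent reviews.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Outcome of one evaluation pass (hypothetical structure)."""
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.issues


def generate_evaluate_loop(
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Evaluation],
    context: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Regenerate until the evaluator finds no issues or the rounds run out.

    Returns the last output and the number of rounds used.
    """
    feedback: List[str] = []
    output = ""
    for round_number in range(1, max_rounds + 1):
        output = generate(context, feedback)
        result = evaluate(output)
        if result.passed:
            return output, round_number
        # Feed the evaluator's findings into the next generation round.
        feedback = result.issues
    return output, max_rounds


# Toy stand-ins for the model calls, purely for illustration.
def toy_generate(context: str, feedback: List[str]) -> str:
    draft = f"spec for {context}"
    if feedback:
        draft += " with acceptance criteria"
    return draft


def toy_evaluate(output: str) -> Evaluation:
    if "acceptance criteria" not in output:
        return Evaluation(issues=["missing acceptance criteria"])
    return Evaluation()


final_output, rounds_used = generate_evaluate_loop(
    toy_generate, toy_evaluate, "tag selection"
)
print(rounds_used, final_output)
```

In this toy run the first draft fails evaluation and the second passes. The same loop shape applies to the design spec, the implementation plan and each build chunk, just with different generators and evaluators plugged in.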

This has worked well for me so far, in the sense that I can effectively build substantially sized chunks of work without writing a single line of code myself, and at a quality level that meets my own pedantic standards.

However, it can be incredibly time-consuming and painfully arduous. It's making full use of AI to generate the design spec and implementation plan, do the build and then do the evaluation and verifications, all with their own inner loops of Generate-Evaluate.

Some might say this is over the top, and it can take a lot of time to go through this for any given piece of work. But so far this is the level of coercing, so to speak, that I have found to be necessary to achieve high quality outputs, especially those suitable for production-grade software. Anyone can ship some code fast; not everyone can do it well too.

But it's been me manually driving this workflow.

This is where I took a different route than usual as Opus 4.8 had just been released and along with that their new Dynamic Workflows capability, which I was super keen to understand and try out.

Here is how Thariq Shihipar and Sid Bidasaria from Anthropic describe Dynamic Workflows:

While the default Claude Code harness is built for coding, it is also useful for many other types of tasks because, as it turns out, many tasks resemble coding tasks. But there are certain classes of tasks where we have had to build custom harnesses on top of Claude Code to achieve peak performance such as Research, security analysis, agent teams, or Code Review.

Workflows allow you to dynamically create harnesses built on top of Claude Code that enable Claude to solve all of those problems more natively. You can also share and reuse these workflows with others.

And this is where I wanted to see what the new Dynamic Workflows could take off of my hands.

Creating and Using a Workflow

A while ago Anthropic published Building effective agents. In that they talk about what agents are, and more relevant here, what workflows are, which they define as follows:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths.

And the Claude Agent SDK can be used to create these workflows in code.

Dynamic Workflows is the combination of these two built natively into Claude Code, with Claude being able to intelligently create these based on the task at hand. You can create them using natural language to describe a workflow, or use the new ultracode effort level (which is actually effort level xhigh plus instructing it to use workflows).

Nice touch on the pixelated visuals when selecting ultracode.

Since I have a relatively pre-defined workflow, I used natural language to describe to Claude how I normally work.

What I wanted to see was whether Claude could do everything after the design spec in this workflow. So the implementation planning, the build (or builds), and self-evaluation. And for each of those, also run inner loops of Generate-Evaluate and refinement. Get it to the point of fully completed and verified, leaving me to do final review.

Essentially I want to hand my design spec to Claude, have it build it to a high level of quality and then create a PR for me to review. 100% hands-off the code for the entire process.

The first question I had was: would Claude even be able to recreate my normal workflow? It's quite involved. Secondly, where I am normally involved in all the inner loops, making human judgement calls and steering things in a direction that I want, can Claude do this all on its own? And lastly – most critically – can it actually produce a good result?

Well, I'll be damned. It did all of that. And it felt like a moment of reckoning.

I had a relatively large prompt, explaining in detail the end-to-end process I normally go through. Pointed it to my design spec and requirements and asked it to create a workflow.

It then asked me whether it should just run the workflow or save it to disk for me to run it later. I got it to save it and then kick it off.

Here's what that looked like:

As you can see it incorporated the implementation planning in the "Preflight & Plan" phase with self-evaluation and refinement in the "Plan Review" phase. Excellent, this is usually painful and one part I'm not too concerned about.

This may sound alarming, not being concerned about the implementation plan. However, my prompts already guide a plan in a certain direction, and I have a ton of things in the repo that tell Claude what style, architecture, structure, patterns, quality checks etc. I use in this repo. It has deterministic guardrails in place that Claude can't cheat on. Along with that there is a ton of already implemented code that follows the patterns I set in place. LLMs are really good at following style and patterns, good or bad! With the design spec and plan, and the surrounding context and environment, my confidence has grown that what will be implemented is in line with what I expect, and that gets verified by me at review time. If something is off, I take a close look to see what in my setup needs tweaking and feed that back into the harness, or if I underspecified something somewhere it gets updated. If things go totally sideways, my work is done in an isolated git worktree and branch, so the reimplementation cost is very low if it comes to that.

The workflow broke the work down into three tracks of work: backend, frontend and then integrating those two tracks together. Claude decided this part of the workflow. In each track it ran the inner loops, not just for the track, but also for particular tasks in the track. This was following good engineering practice: breaking the work down into smaller chunks, using red-green TDD, building and then doing evaluation and refinement. All per task, per track of work.

Along with this, when I described the workflow I also instructed it to ensure that all acceptance criteria are verified, with component testing (end-to-end against a locally running service) and testing the functionality in a browser against the running application. It did all of that. This area is critically important to ensuring high quality, and I see it as one of the most valuable aspects of addressing the human review bottleneck, which limits how much of the potential productivity gain from AI-first software engineering flows through to organisational productivity and efficiency.

After the build was complete, it ran a comprehensive "Final review" phase, which is what I normally do from multiple angles myself: implementation to spec alignment, code quality, security, etc. Anything it found there, it sent back to build to apply fixes and then re-ran the final review.

60 agents, 5.2m tokens and ~3.9 hours later (3 of which were while I was sleeping), it gave me the final report below:

Four hours is a long time, relatively speaking. But it's time that can be spent doing other things while Claude cooks. Building other things, sleeping, or going outside to touch grass. I didn't spend any time manually driving this workflow, so that was time saved and a not-so-enjoyable aspect removed from the process. Importantly, instead of me driving the reviewing and refining on the inner loops and becoming fatigued by this, I can focus my energy on scrutinising the final outcome, as well as feeding any process improvements back into the harness and workflow.

It ran successfully, barring being able to test the AI tag suggestions due to an invalid token. On my first test after fixing the API token, it worked perfectly.

For more visual context, this is how the workflow was structured to work its way through the various phases all the way through to completion, with some stats reported at the bottom.

So it worked, and it sure looked good based on the rigorous process it went through – a close approximation of what I drive manually. I was genuinely impressed by how much of my process it was able to autonomously do, and while I was sleeping!

But this being the first time I have run this workflow, and not having been in the loop to apply my human judgement, the next step was to scrutinise everything myself. Firstly, does it actually work? Yes, and damn it looks and feels good in the UI too. Was the code that it generated good quality and in line with what's expected in this code base? Very much so, and Opus 4.8 seems to go the extra mile and pick up and address some things as it goes, like a good engineer continuously improving things as they stumble upon them: tidying up a log line, updating affected docs, or noticing and calling out some preexisting bug. All the quality checks look good.

You may have seen that the very last phase was called Decision Log and Report. Because I was out of the loop and these workflows can't stop to ask for direction, I asked Claude to keep a record of all decisions the agents had to make along the way, like if it had to deviate from the design spec or plan. This happens and it's important to know about them in case it's something that needs to be adjusted. The majority of this was noise in this run, but I can see this proving valuable as in my manual process Claude often stops to ask me how to proceed, where in a workflow it can't do that.

The left side is the working prototype from the Claude Design exercise; the right side is the final implementation.

I also decided to run my own final verifications using my regular multi-angle review prompts. Normally when I run these after going through my usual manual process it will find some critical and high issues, which I evaluate and make a judgement call on. I was very surprised that it only found some medium and low-severity issues. Most of the time these are just noise or stylistic things. So there were a few minor things that I asked it to tidy up post-implementation, including a UI element overflowing into another. But I would say it got about 95% of the way there across the entire full-stack feature. I was not expecting this.

I have since run these workflows a few times, experimenting first on my personal projects, but have now also used them in a professional team environment and so far it has worked remarkably well.

My takeaway here is that these are incredibly powerful and flexible, and I think these kinds of workflows are exactly what is needed to get to fully automated, hands-off software delivery, whether they are Claude's Dynamic Workflows, Codex or some other tool. The methodology and approach are what is important; Claude just seems to have really nailed making these workflows easy to use and available to anyone.

The quality of what these workflows can deliver is going to be highly dependent on the quality of the environment and harness that it's operating in. And I am skeptical that all my workflow runs will go this well on first pass. Workflows alone are not some magical silver bullet that will just start creating production-grade software. I have put considerable effort into my setup for this project (albeit more can be done). I would attribute this as the primary factor for this workflow being able to successfully one-shot build an entire full-stack feature to a high level of quality just from a well-defined design spec. Drop the same workflow into a sparse or legacy code base with weak conventions, lack of system context, insufficient tests and deterministic checks, and I'd expect a very different result.

The workflow isn't the moat; the environment is.

I have only used these workflows a handful of times and the complexity of the task at hand may result in varying results. But from what I have seen so far, I'm cautiously optimistic. And it's most welcome because implementing a workflow like this manually, often with multiple separate features on the go, can be incredibly time-consuming and draining. Workflows carry a lot of the weight and let you get on with the fun part, building.

Maintainability sensors for coding agents

2026-06-06T10:31:00+12:00

This is a great post from Birgitta Böckeler on martinfowler.com talking about the sensors that can be used with generative AI engineering to shift quality assurance left of the delivery pipeline.

One big part of this is the general state of the codebase and repository, even before adding in sensors. I like the definition used to describe this: internal quality.

There are multiple dimensions we usually want to achieve and monitor in our codebases: Functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time - also known as “internal quality”.

And why this is important for engineering with AI.

Internal quality problems affect AI agents in similar ways that they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not noticed a duplicate, or be forced to load more context than a task should require.

I have heard developers say their agents go off track easily and implement things with mixed styles. There are likely a few reasons for this happening, but internal quality is probably at the top of the list.

She talks about a number of different approaches to implementing these sensors, both computational and inferential: linting (standard and custom), static code analysis to control cross-dependencies, coupling data and modularity patterns, and test suites as a regression sensor.

There are a lot of good things in there, but I found the section on test suites especially valuable. LLMs love to write tests and they can be highly verbose, testing things that are not important while totally skipping things that are critical. Or worse, fixing failing tests by making them pass against broken or buggy code. Like most other things with generative AI, these things can be steered to create genuinely meaningful tests that the LLM can largely manage with light oversight. I'm a huge fan of end-to-end tests, or a subset of those: component testing that runs locally against running services, with local or mocked dependencies. These allow for greater control of different scenarios - edge cases, acceptance criteria and user scenarios - that would likely result in heavier, slow-running and flaky tests if run in a deployed environment.

There currently seems to be a trend towards more end-to-end style acceptance tests. As mentioned in the beginning, AI has gotten really good at generating tests, so it has become quite normal for developers to just let AI generate lots of tests, without much review. Reviewing unit tests in particular can be very tedious. I'm not saying it's a good thing not to look at them at all — but I acknowledge the reality that it is unrealistic to think that human review of all tests is sustainable, and it's unrealistic to think that people will actually do it. So while we search for the appropriate testing pyramid/ice cream cone/muffin shape of the AI coding future, techniques like approved scenarios are becoming popular. As demonstrated above, acceptance tests increase coverage, but are often not very assertion-heavy, giving us a false sense of security in test effectiveness — mutation testing helps us monitor that gap.

Designing a Feature with Claude Design — Then Handing It to Claude Code

2026-05-31T17:48:59.372831+12:00

I've been wanting to give Claude Design a try. In this post I'll walk through my first use of it for designing some new functionality I wanted for my blog CMS. My use here is probably very basic, but it was an interesting exercise, and in particular I wanted to see how I could hand the design off for building.

Just a small disclaimer: I'm not a UX designer, or a creative person in general. But that's exactly why this exercise was interesting — it did a far better job than I ever could.

Getting into it...

I've recently built this blog site where this article is being read (I will write about this separately). It's very new and the feature/capability set is minimal, just enough functionality to manage, publish and serve blog content.

It has a custom-built CMS for managing the content, and supports slug-like tags that I can selectively apply to any of the content. These tags are visible to the right side of this article (or at the bottom on mobile).

When writing or editing content, it has a section where I can add new tags or use existing ones. Below shows editing an article in the CMS I wrote recently about human reviews being a bottleneck.

The behaviour of the tags input field is: On post creation, any never-before-seen tags will be created in the tags table in the database, existing tags will be referenced.

At first sight it looks ok, but:

It has no option to select existing tags.
It doesn't indicate if a tag being added was one that already existed or would be created.
It doesn't help me avoid creating near-duplicate tags with a similar name or even typos.

One wouldn't want different blog articles using tags that mean the same thing but with slightly different names. For example ai-engineering and engineering-with-ai, or even a typo in the tag like ai-eginering. So with no way to select existing tags, and with new tags being silently created on save, the UX was painful (I had to open the tag management section in a different tab to remember what tags were available) and tags were susceptible to becoming a mess across all the content.
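To make the near-duplicate problem concrete, here is a minimal sketch of how a typo like ai-eginering could be flagged against existing tags before a new one is created. It uses Python's standard `difflib` module; the `similar_existing_tags` function name, the 0.8 cutoff and the sample tag list are assumptions for illustration, and this is not how the CMS actually handles it.

```python
from difflib import get_close_matches
from typing import List


def similar_existing_tags(
    new_tag: str, existing_tags: List[str], cutoff: float = 0.8
) -> List[str]:
    """Return existing tags that look like near-duplicates of new_tag.

    An exact match returns an empty list, since the existing tag is
    simply reused and nothing new gets created.
    """
    normalised = new_tag.strip().lower()
    if normalised in existing_tags:
        return []
    # get_close_matches ranks candidates by character-level similarity.
    return get_close_matches(normalised, existing_tags, n=3, cutoff=cutoff)


existing = ["ai", "harness-engineering", "ai-engineering"]  # sample data

print(similar_existing_tags("ai-eginering", existing))    # ['ai-engineering']
print(similar_existing_tags("ai-engineering", existing))  # []
```

Character similarity only catches typos and small variations. A reordered name such as engineering-with-ai is much less likely to be caught this way, which is part of why showing the existing tags at selection time matters more than a check at save time.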

Using Claude Design

Claude Design was introduced on 17 April. I had had a bit of a poke around in it before to see what it was about, but this was the first chance I had to try it out on something real.

I started by pasting four screenshots of the CMS and giving it this prompt. (I used the term "upsert" for tags, which wasn't quite correct, but Claude got it).

Attached are four screenshots of my blog CMS. One is where tags are managed, show when two tags existed; ai and harness-engineering. The second one is showing where I can add a post and enter tags. How it currently works is I can enter any tag value and when the post is created/published then the app(or probably the backend for the CMS) will upsert tags.

So you can see in the third image I enter two tags, one existing and the second one (new-tag) is new, so when the blog post got created is used the existing tag and then created a the new-tag, as shown in the fourth image, where I then have a total of three tags. (Two were existing one new one got created on that new post).

I like this functionality. However, right now when I enter tags on the create post page I don't get offered to select from existing tags, so if my new post should use an existing tag I have to carefully remember or go back and look in the tags section to remember the exact name of the tag so I don't end up with near duplicated for what should be the same tag.

Ideally when I start typing for a new tag, it should show existing tags that match (like contain) the text I have entered, with the option of selecting one of those. If none match it should allow me to add a new tag.

Come up with three options for the post create screen for selecting existing or adding tags. Should be simple and seamless and easy to use.

The prompt is relatively verbose (easy to do with voice dictation), but I wanted Claude to really understand how it currently works and what I am looking for.

Importantly at the end I asked for three design options.

After a fair bit of time it presented me with a fully functional interactive prototype, with three separate design concepts which I could toggle between with a selector at the bottom. And the generated designs looked very close to my actual CMS.

Concept A: auto-complete (shows matching tags as you type, with option to create)

Concept B: command-popover (shows all tags, filtering as you type, with option to create)

Concept C: A tag shelf

To be honest, I was really impressed with this. All these options look really slick. How long would it have taken a designer to come up with this before these kinds of AI-assisted design tools existed, especially with a fully functional interactive prototype? I am not a creative person and for personal things like my blog CMS these kinds of designs would never have seen the light of day.

Stretching a bit further with related functionality

I had already also been thinking about adding additional AI-generated tag suggestions functionality as a separate piece of work. I spend a ton of time trying to think about tag names for content. So I thought I would see how Claude Design would go with extending what it had already proposed while also incorporating this new AI tag suggestion functionality.

Here's the prompt I gave it:

These look really great. And actually now while looking at it, it has made me think a bit further about this. I would like you to perhaps create another version of this (leaving the existing on in place so I can still refer back to it), where another killer piece of functionality would be tag suggestions. The suggestions here would be based on the content in the post being written, and would use a combination of matching to existing tags, but also potential new tags. I'm not sure if this is different from the selection of tags that you have already designed for, but I would imagine there would be an endpoint in the server that could get tag suggestions based on the blog content. Not sure if it would be automatic as typing the post (seems heavy on the server, especially as it may be doing outbound AI calls to get suggestions), or just a button to get suggestions, or perhaps something a little smarter like suggestions after a certain about of text, or based on the title + blog text, maybe some initial suggestion and then a small refresh button or AI type button to suggest tags. Would exclude suggestions that are already added to the post. Also need to consider when editing a draft, should also work. What nice clean simple design options would you propose with this, fitting into the design we have already designed for the tag selection.

I tried to give it my thinking while also leaving it open-ended for Claude Design to be creative with. It gave me another interactive prototype with three modes to choose from, and incorporated one of the tag selection design options from the first round.

Mode A: A button with the name "Suggest Tags", or as shown below "Refresh" which would generate new suggestions to choose from.

Mode B: Smart auto, which generated suggestions after enough post content existed, and a small refresh button.

Mode C: Assistant panel. This felt super chaotic.

I decided I liked "Concept A: Auto-complete" for tag selection and "Mode B: Smart auto" for tag suggestions. These two seemed to complement each other.
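As a rough sketch of the "Smart auto" behaviour, the two rules that matter are: only ask for suggestions once there is enough post content, and never suggest a tag that is already on the post. The function names and the 400-character threshold below are hypothetical, chosen just to illustrate the idea; the prototype and the final implementation may well do this differently.

```python
from typing import Iterable, List

MIN_CONTENT_CHARS = 400  # hypothetical threshold, not taken from the design


def should_request_suggestions(
    title: str, body: str, min_chars: int = MIN_CONTENT_CHARS
) -> bool:
    """Only fetch suggestions once the post has enough content to go on."""
    return len(title.strip()) + len(body.strip()) >= min_chars


def filter_suggestions(suggested: Iterable[str], applied: Iterable[str]) -> List[str]:
    """Drop suggestions already applied to the post, plus repeats and blanks."""
    applied_set = {tag.strip().lower() for tag in applied}
    seen = set()
    result: List[str] = []
    for tag in suggested:
        key = tag.strip().lower()
        if key and key not in applied_set and key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(should_request_suggestions("A title", "short body"))  # False
print(filter_suggestions(["AI", "ai-engineering", "ai"], ["ai-engineering"]))  # ['ai']
```

Gating on content length keeps the server from making outbound AI calls on every keystroke, which was the concern raised in the prompt; the small refresh button then covers the case where the content has changed since the last set of suggestions.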

I got it to create a final version of the working prototype with those two options. And this time I also gave it my existing style guide for the CMS (I probably should have done that at the start).

Ok, I like the tag selection Concept A, along with Mode B (smart auto) with tag suggestions. Can you do one more round showing a functional version with these two options baked in. I have also attached the style-guide for the existing application as it's currently implemented, so if we could align as much as possible to that.

This image below shows the two separate pieces of functionality working together. Selection as a slick fly-over style drop down when typing, and option to select tags from the tag suggestion box.

Moving From Design to Implementation

So now I had a flash new design that I was pretty excited about. But now how to feed this to Claude Code so it can build it as accurately as possible to the designs?

The conversation I had with Claude during this design process contained a lot of critical context that the design artefacts alone wouldn't hold. So, not knowing what Claude Design would be able to do here, I just asked it to write up a requirement doc based on the conversation and final design options chosen.

Ok, can you write up a set of requirements and acceptance criteria for this new functionality. Note that the develoers are never have going to have seen the other options, they just need to see the final solution and spec. The exact endpoint name for example is not that important, but what is important is how it's expected to behave and return so the UI functionality can work as expected. How the UX and interface behaves does need to be quite specific. Write this as a markdown document.

And lo and behold, it gave me a detailed requirements document! This is what it produced.

There are some options to hand off to Claude Code. I just downloaded the zip file.

I added the functional design prototype and the requirements document to an in-progress folder in my already created git worktree, ready for the next phase of planning and building.

The prototype has a lot of boilerplate code as it was a fully functional web site. However, having Claude reference it during the planning and implementation phases, along with the requirements document it generated, greatly increased the probability of getting the build as close as possible to the design. And it did.

My code base for the blog application is already very well primed with context, structure and tooling that allows streamlining real engineering with AI. And I have templates and agent skills that help me create design specs and implementation plans for Claude to execute on.

This design exercise above happened the day after Opus 4.8 was released, so I was super keen to do some building with it, and also try out their new Dynamic Workflows.

The design and build was all fully completed on the same day (well to be more accurate, about two-thirds of the build was done while I was sleeping).

I will follow this article with another one running through how I did the build. I was quite happy with how that went.

Higher usage limits for Claude and a compute deal with SpaceX

2026-05-08T09:51:27.357827+12:00

Anthropic is partnering with SpaceX to lease their Colossus 1 data center (over 220,000 NVIDIA GPUs).

While that on its own is interesting, the astonishing side effect is a sharp increase in usage limits in paid Claude plans and their APIs.

The following three changes—all effective today—are aimed at improving the experience of using Claude for our most dedicated customers.

First, we’re doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans.

Second, we’re removing the peak hours limit reduction on Claude Code for Pro and Max accounts.

Third, we’re raising our API rate limits considerably for Claude Opus models.

I think people will generally appreciate this too:

Finally, we recently made a commitment to cover any consumer electricity price increases caused by our data centers in the US. As part of our international expansion, we’re exploring ways to extend that commitment to new jurisdictions, as well as partnering with local leaders to invest back into the communities that host our facilities.

A model that produces code which compiles and passes the tests it was given is not the same as a model that produces correct, secure, maintainable, well-architected software

2026-05-03T09:52:50.518811+12:00

A model that produces code which compiles and passes the tests it was given is not the same as a model that produces correct, secure, maintainable, well-architected software

The title here is a paraphrased quote from Gary Marcus, on TNW today, evaluating a claim from “OpenAI president [who] says AI is now writing 80% of the company’s code”.

Marcus' specific point about coding is structurally important: a model that produces code which compiles and passes the tests it was given is not the same as a model that produces correct, secure, maintainable, well-architected software. The first is verifiable in seconds; the second requires the kind of judgement that has been the historical bottleneck on engineering productivity. Brockman acknowledges the gap, even as he argues it is closing. "The technology we have right now is very jagged," he said in the Big Technology interview. "It is absolutely superhuman at many tasks. When it comes to writing code, those kinds of things, the AI can just do it. But there's some very basic tasks that a human can do that our AI still struggles with."

Realism re AI coding is knowing that next-word prediction gets us a surprisingly long way in writing code, but less far in making sure that code is robust. Coders (especially vibe coders with little experience) beware

As good as these tools are getting — and they are getting really good and helpful — I don't see a non-technical person, say a product manager or marketing person, being able to steer and coerce these LLMs into producing software that any company should be willing to expose to the internet and their user base if they care about robustness, security, reliability and maintainability of the system. Especially if revenue and reputation are on the line.

Human Review Is the Bottleneck

2026-04-24T16:13:00+12:00

Chris Parsons recently wrote an article about feedback being the bottleneck.

"Feedback" here, for me, refers to the human effort of gatekeeping LLM outputs before they go out and serve their purpose (software released to production, an LLM-drafted email being sent, an LLM-created slide deck reviewed before presenting, etc.).

This is most relevant where LLMs are generating outputs faster than long-standing engineering practices can keep up with.

The most prominent aspect of this is the pull request as the quality gate. It's not the only one, but it is the one where humans are most in the loop. And the one that most people who care about quality, maintainability, security and operational stability are most uncomfortable with removing or even loosening up on.

The problem, as Chris Parsons nicely puts it:

This is the theory of constraints in action: speed up one stage and the bottleneck moves downstream. When code arrives faster, it pushes more work into review, testing, deployment, and requirements clarification. The queue grows. Nobody is reviewing any faster.
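The queue growth described in that quote is simple arithmetic, and a few lines of Python make it concrete. This is only an illustrative sketch: the function name `review_backlog` and the daily rates are made-up assumptions, not figures from the article.

```python
def review_backlog(prs_opened_per_day: int, prs_reviewed_per_day: int, days: int) -> list[int]:
    """Return the size of the review queue at the end of each day.

    Hypothetical model: a fixed number of PRs arrive and a fixed number
    get reviewed every day; the queue can never drop below zero.
    """
    backlog = 0
    history = []
    for _ in range(days):
        backlog = max(0, backlog + prs_opened_per_day - prs_reviewed_per_day)
        history.append(backlog)
    return history


# Before AI: code arrives about as fast as it can be reviewed.
print(review_backlog(5, 5, 5))   # [0, 0, 0, 0, 0]

# After AI: three times the PRs, same review capacity.
print(review_backlog(15, 5, 5))  # [10, 20, 30, 40, 50]
```

Review capacity is the constraint in both runs. Only the arrival rate changed, and the backlog grows by the difference every single day.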

If review, testing and quality assurance are simply removed or loosened to let the bombardment of PRs flow through, then, more consequentially, production quality problems and incidents that affect users, the organisation's reputation and potentially revenue will become the bottleneck. Eventually it will come back to bite.

Humans get fatigued, AI doesn't.

With AI, the human is the only one who needs rest while the machine keeps generating work that needs evaluating: permission fatigue, review fatigue, the endless “just need a human to press approve” requests. Cory Doctorow calls this the reverse centaur: humans whose purpose is to support the machine’s needs.

The reviewers who care most about quality will be the first to burn out, because they are the ones who read everything instead of skimming. Either you make human feedback unnecessary, or you make it instant.

So what does one do about this review bottleneck?

This is the crux of the problem, the very point where we are forced to decide: do we hold a tight grip on these long-standing practices and push senior reviewers to the end of their tether with unrelenting reviews of AI output (often slop), or do we clear the way for AI on its path of disruption and adapt accordingly?

On one end of the spectrum, we could double down on existing engineering practices. Humans review everything. This may be most comforting as it may give the most objective confidence to responsible humans that quality is where it needs to be.

It simply doesn't scale. There is too much code being generated, and reading and reviewing alone does not give the same in-depth understanding of what was built as if the humans had built it themselves. So things will and do slip through.

Or alternatively we give in and loosen the quality/review gates without additional steps. This risks pushing low-quality, architecturally deficient, bug-infested code that no one understands into production. The instant gain is real: breakneck delivery speeds. Even the Product Manager has something to push to production. But the theory of constraints just pushes the bottleneck downstream, right into production where it hurts the most and is the hardest to unwind.

It's one answer (that a lot of people have accepted) but it's not the right answer. Just to be clear, this approach has its place, like low-stakes things that may be suitable for vibe coding. Here we are talking about high-stakes enterprise, production-grade software systems.

There needs to be something else that can free up this bottleneck while still providing high levels of confidence that it is good to be shipped.

"Good to be shipped" will mean different things to different people and organisations, and will also depend on the context of what is being shipped. Again, I'm keeping the context here within the realm of high-stakes enterprise, production-grade software systems. In that context, the checklist looks something like this:

Checked for code smells, bugs and overall quality.
Checked for code style; formatting, lint rules, code structure, naming, patterns.
Has been sufficiently verified as stable and all functionality is intact.
Meets the acceptance criteria of the changes being shipped.
Strong architecture and system design patterns that align with the wider ecosystem.
Meets a high level of security standards, critical in the era of AI-enabled offensive attacks.

It's clear the shift is towards building capability at the project level to discover and fix issues and increase quality left of the delivery pipeline. Aka "shifting left". Before CI and before the human even sees the pull request.

Using mechanical gates; testing, lint rules, dependency boundaries, architecture style enforcement, etc. It's harder or impossible for LLMs to cheat or give inconsistent results with mechanical verification.
Using AI workflows and loops to increase quality; self-reflection, panel of judges, adversarial evaluation, ralph loops, implementation refinement.
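To make the first of these concrete, here is a minimal sketch of a mechanical gate runner. It assumes each gate is just a command whose exit code decides pass or fail. The names `GateResult`, `run_gates` and `ready_for_review` are hypothetical, and the example commands (`pytest -q`, `ruff check .`) are stand-ins for whatever tooling a project actually uses.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    output: str


def run_gates(gates: list[tuple[str, list[str]]]) -> list[GateResult]:
    """Run each (name, argv) gate; a gate passes only on exit code 0."""
    results = []
    for name, argv in gates:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            # A missing tool counts as a failed gate, never a silent pass.
            results.append(GateResult(name, False, "command not found"))
            continue
        output = (proc.stdout + proc.stderr).strip()
        results.append(GateResult(name, proc.returncode == 0, output))
    return results


def ready_for_review(results: list[GateResult]) -> bool:
    """Only when every mechanical gate is green should a human be asked to look."""
    return all(r.passed for r in results)


# Assumed example gates; swap in the project's own test and lint commands.
EXAMPLE_GATES = [
    ("tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]
```

An agent loop could call `run_gates(EXAMPLE_GATES)` after each change and feed the output of failing gates back to the model, only opening a pull request once `ready_for_review` returns `True`. The point of the design is that the verdict comes from exit codes, not from the model's own opinion of its work.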

These two together are becoming well established as Harness Engineering. There is a ton more to it than I just described above.

These are key ingredients of modern AI-driven software delivery. They are very new and the entire industry is still figuring it out. And there are many ways to skin the cat. Some companies, like Intercom and Stripe, and of course the big AI labs like OpenAI and Anthropic, have written or talked extensively about their approaches to AI-first engineering.

But these only partially address the matter. An effective harness shifts quality improvement left, before the PR reaches a human for review.

At some point decisions need to be made about where, and how much of, the AI-generated output is actually human reviewed and gated. Humans stay involved and make the final decisions, but in a way that lets code changes flow through the pipeline more seamlessly while retaining high levels of confidence and quality.

The answer will differ from team to team and will shift in direct correlation with the capability and maturity of their harness engineering practices.

A new trend that is emerging is techniques to have the LLM prove that what it built works and meets the acceptance criteria, like having the agent record videos of what it built to show it working. In its most basic form, the way I naturally started doing this, before the trend had a name, is to include acceptance criteria in my PRD and design spec, and then have the agent produce steps in the implementation plan that test and validate these acceptance criteria using red-green TDD. I try to take this further than just unit tests and use component testing and end-to-end testing, and/or get the agent to drive functionality in the browser to demonstrate each acceptance criterion. The key is proof that can't be gamed or worked around by the agent.
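One way to picture this is a traceability check: every acceptance criterion must be backed by at least one piece of passing evidence before the work counts as done. The sketch below is only an illustration of that idea. `Criterion`, `Evidence`, `unproven` and the `AC-1` style ids are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    id: str           # hypothetical id scheme, e.g. "AC-1"
    description: str


@dataclass(frozen=True)
class Evidence:
    criterion_id: str
    kind: str         # e.g. "unit", "component", "e2e", "browser-recording"
    passed: bool


def unproven(criteria: list[Criterion], evidence: list[Evidence]) -> list[str]:
    """Return the ids of criteria that have no passing evidence at all."""
    proven = {e.criterion_id for e in evidence if e.passed}
    return [c.id for c in criteria if c.id not in proven]


criteria = [
    Criterion("AC-1", "Selecting a suggested tag adds it to the post"),
    Criterion("AC-2", "Duplicate tags are rejected"),
]
evidence = [
    Evidence("AC-1", "e2e", passed=True),
    Evidence("AC-2", "unit", passed=False),
]
print(unproven(criteria, evidence))  # ['AC-2']
```

The evidence records would need to come from real test runs rather than from the agent's own summary. Otherwise the check can be gamed in exactly the way the proof is meant to prevent.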

What is achieved here is important: increasing objective confidence in what has been built that is not simply reviewing every diff in a pull request, and doing it at a rate that is faster than traditional review/feedback processes can scale to.

Stepping far out, where humans can further reduce or even remove themselves from reviewing code, we would be looking at the Software Factory, or Dark Software Factories. StrongDM is a prominent example of this, where the rules are:

Code must not be written by humans
Code must not be reviewed by humans

Simon Willison gave a great breakdown of their approach.

Chris Parsons, in his article titled How I Use AI to Code, provides a critique that software factories are akin to waterfall software development.

If you specify the solution upfront, you are front-loading all the thinking so the machine can run unsupervised. That was called waterfall in 1986, and it did not work then either.

What is new is that the waterfall mistake has a fresh cover story. 'Dark factory' software development promises autonomous agents running in parallel on a queue of specifications you wrote once, perfectly, in advance.

The critique is valuable as an additional perspective to the idea and worth thinking about before rushing into a software factory approach. Nothing in software engineering is perfect.

I have personally not yet pushed things far enough to have confidence that a totally hands-off approach can be taken. This will take real capability maturity, which cannot be built without substantial dedicated effort. With a well-defined spec it's now easy to one-shot large features or changes without needing to review individual diffs. But things can go off-piste quite easily, and the first round of evaluation on generated output is the clearest indicator of this. The first round of output is never good enough. Running an adversarial evaluation on generated outputs almost always surfaces a number of things that must, should or could be addressed. This is the exact point where senior-level judgment is required to understand what has been built, where real problems may lie, and to steer the agent until generated outputs meet a high enough quality and correctness bar. That is what turns vibe-coding into real AI-assisted engineering.

I do think it's possible though, to go totally hands-off, or mostly hands-off. Especially with newer techniques coming into play. And if one thinks about just how rapidly software engineering has evolved in just the past six to twelve months alone - our craft is being flipped on its head in real time - it's not hard to imagine that the concept of the software factory will continue to evolve and that may be the future we are looking at sooner than we may like to admit.

That said, even though these things are evolving rapidly and will continue to do so at an ever accelerating pace, organisations and their software engineering teams don't all evolve as quickly to keep up with the bleeding edge.

Starting with the fundamentals and a balanced approach may be the best way to get going, then gradually increasing capability and maturity with AI assisted harness engineering to streamline software delivery and actually start to realise the velocity gains that people love to talk about. The harness itself is where the review bottleneck can be addressed.

Mozilla Used Anthropic's Mythos to Fix 271 Bugs In Firefox

2026-04-22T21:13:32.816976+12:00

Mozilla Used Anthropic's Mythos to Fix 271 Bugs In Firefox (via Simon Willison)

Mozilla is one of the companies that got access to Anthropic's new Mythos Preview model, and it has put it to good use.

As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week's release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.

Our experience is a hopeful one for teams who shake off the vertigo and get to work. You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We are extremely proud of how our team rose to meet this challenge, and others will too. Our work isn't finished, but we've turned the corner and can glimpse a future much better than just keeping up. Defenders finally have a chance to win, decisively.

This seems to validate a lot of Anthropic's claims about the new model's capabilities. And it's encouraging to know that it will likely be available to the general public at some point, where it can be used to strengthen the security posture of existing and new systems.

When that eventually happens, people are going to need to act fast to find and resolve security vulnerabilities before bad actors get a chance to exploit them.

There will be casualties.

Building Has Become Really Fun (again)

2026-04-22T20:44:28.955632+12:00

It's becoming apparent that people are really... really enjoying using AI to build things.

Not just developers, non-technical people too, from product managers to CEOs to business owners. Pretty much anyone.

People are excited to build. Building at night, on the weekends and during their holidays. I can relate.

It has almost become addictive. "Please, just one more feature...".

Quoting Ryan Lopopolo

2026-04-21T21:51:00+12:00

Every company should have a full stack team of 5 building a product who are banned from directly writing code; they must force the agents to do it. In 2 months they will be your most productive team.

— Ryan Lopopolo

Quoting Devansh

2026-04-20T16:48:23.277771+12:00

This is kind of a big deal. The performance difference between Cursor and Claude Code? Or Claude Code and OpenCode? All that comes from the system around the model. That’s why Google can have one of the best models in the market, and still produce the Shakespearean tragedy that is the Gemini CLI. When people say “Mythos found a zero-day,” the truth is more “Mythos, orchestrated by a purpose-built vulnerability research pipeline, found a zero-day.”

— Devansh

Anthropic Opus 4.7 Released Today

2026-04-17T16:07:52.60723+12:00

The much-anticipated Opus 4.7 was released today. It's the only 4.7 in the model family, with Sonnet and Haiku still at 4.6.

According to the claimed benchmarks, it shows a substantial jump in capability across the board, with notable improvements in:

Visual reasoning — It can now "see" higher resolution pictures up to 2,576 pixels on the long edge, 3x more than Opus 4.6.
Instruction following — It takes instructions more literally than the previous version. The called-out side effect is that users may need to re-tune any prompts and harnesses.
Memory — It's better at using file-system-based memory, remembering important notes across long-running, multi-session work.
Real-world work — Areas like financial analysis, legal, and professional slide presentations.

I like that they have also included benchmark numbers for the new unreleased Mythos Preview model. The jump in SWE-bench agentic coding from Opus 4.6 to Opus 4.7 is already substantial, and then there's a further leap to 93.9% for Mythos!

Interesting that the cybersecurity vulnerability reproduction score is actually slightly lower on 4.7 than it was on 4.6 — although it appears this may have been intentional.

We stated that we would keep Claude Mythos Preview's release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

And...

Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

This seems to imply the model was intentionally dialled back to reduce the risk of misuse, with access to the full capabilities gated behind the new Cyber Verification Program.

Other Notable Changes

Along with the model, they are releasing these notable controls:

New xhigh reasoning effort — an effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, the default effort level has been raised to xhigh for all plans.
New /ultrareview slash command in Claude Code — produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch.

Opus 4.7 does use more tokens, and it's easy to see why they introduced this new effort level. xhigh scores substantially higher than high at the cost of more tokens burned — but max uses double the tokens of xhigh for a smaller score jump than high to xhigh.

Opus 4.7 Preparing For Release

2026-04-15T07:47:10.094599+12:00

Opus 4.7 Preparing For Release (via Alberto Romero)

While all the talk is about Anthropic’s “terrifying” new Mythos model, and it being too dangerous to release to the public, it looks like we may be able to get our hands on an upgraded Opus 4.7 in the meantime, possibly as soon as this week.

While 4.5 and then 4.6 were total game changers, there is still often frustration working with them, as with all other models. If 4.7 is anything like the upgrade from 4.5 to 4.6, then we should see a notable and much-welcomed increase in capability.

Feedback Flywheel

2026-04-13T21:35:33.966412+12:00

Feedback Flywheel

Rahul Garg at Thoughtworks talks about the Feedback Flywheel, a practice that encourages paying attention to signals in AI engineering that can be used to continuously improve the AI engineering setup and workflows to get better outcomes. Not at a personal level but at the team level.

Every AI interaction generates signal: prompts that worked, context that was missing, patterns that succeeded, failures worth preventing. Most teams discard this signal. I propose a structured feedback practice that harvests learnings from AI sessions and feeds them back into the team's shared artifacts, turning individual experience into collective improvement.

Anyone who has used AI when building software has felt all kinds of frustration when it's not doing what you expect, and what seemingly should be obvious.

Every interaction like this is a signal that the LLM hasn't been primed sufficiently with the right information or instruction for the task or workflow. At this point one could start instructing and steering the LLM right there and then with prompts to get a better result. But it will likely need to be repeated again later.

The idea with the Feedback Flywheel is to pay attention to these signals and ask what went wrong and what can be improved so it works better next time. Not just for yourself but for anyone on the team working on the project. Then take steps to bake the improvements in, so the next encounter with that situation goes better for everyone.

Adopting AI practices can plateau once everyone gets comfortable.

With AI coding assistants, most teams reach a plateau. They adopt the tools, develop some fluency, and then stay there. The same prompting habits, the same frustrations, the same results month after month. Not because the tools stop improving, but because the team's practices around the tools stop improving.

Worse than this, when every developer on the team is encountering their own flavours of these issues, each with their own level of skill, prompting style, local setup and even different AI tools, what you will likely end up with very quickly is snowflake pull requests of wildly varying style and quality.

All of this is signal that should be used to continuously adjust the context and knowledge the LLM is primed with. This can be done in various ways, like adding specific content to the agent-specific files (AGENTS.md, CLAUDE.md etc), well-structured and discoverable context in additional markdown files, agent skills and custom commands.
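As a small illustration of baking a learning in, the sketch below appends a one-line lesson to a shared AGENTS.md so every teammate's agent picks it up next session. It is an assumption-laden example rather than Rahul Garg's method: the `record_learning` helper and the `## Learnings` heading are hypothetical, and it simply appends to the end of the file, assuming that section is kept last.

```python
from pathlib import Path

SECTION = "## Learnings"  # hypothetical heading; use whatever the team agrees on


def record_learning(agents_file: Path, learning: str) -> bool:
    """Append `learning` as a bullet at the end of the file.

    Creates the SECTION heading if it is missing. Returns False (and
    writes nothing) when the exact same bullet is already recorded.
    """
    text = agents_file.read_text(encoding="utf-8") if agents_file.exists() else ""
    lines = text.splitlines()
    bullet = f"- {learning.strip()}"
    if bullet in lines:
        return False
    if SECTION not in lines:
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + SECTION + "\n"
    agents_file.write_text(text.rstrip("\n") + "\n" + bullet + "\n", encoding="utf-8")
    return True
```

Calling `record_learning(Path("AGENTS.md"), "Run the linter before committing")` at the end of a frustrating session turns a one-off correction into shared context, and the duplicate check keeps the file from filling up with repeats.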

These benefits will quickly start to compound and result in better outcomes across the team and improve the quality of the codebase and the pace at which things can be delivered with a high level of quality.

LLM Knowledge Bases

2026-04-06T20:51:56.111754+12:00

LLM Knowledge Bases

Andrej Karpathy recently shared his approach to building LLM-managed knowledge bases, and it resonates with me.

I've never been great at organizing notes. I suspect most people are the same — they want things organized, they just don't want to be the one doing it. Which makes this a perfect job for an LLM.

The core idea:

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM.

You provide raw information, the LLM organizes and makes it discoverable, and then it's accessible by you directly, and also made available to the LLM that you work with for looking up information, Q&A and outputting it in different formats.

In a way it's similar to RAG but less complicated, and probably a bit more like the progressive disclosure approach with agent skills where information is discoverable when needed in the context of the conversation but without blowing out the context with wiki information that's not relevant to the conversation.
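The progressive disclosure idea can be sketched as a lightweight index: the agent first reads a short list of page titles and paths, and only opens the full articles it decides are relevant. This is a rough illustration under my own assumptions, not Karpathy's actual tooling. `build_index` is a hypothetical helper, and it assumes each wiki page is a markdown file whose title is its first `# ` heading.

```python
from pathlib import Path


def build_index(wiki_dir: Path) -> str:
    """Return one line per wiki page: '- <title> (<relative path>)'.

    The title is the first '# ' heading in the file, falling back to the
    file name when a page has no heading.
    """
    lines = []
    for page in sorted(wiki_dir.rglob("*.md")):
        title = page.stem
        for line in page.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                title = line[2:].strip()
                break
        lines.append(f"- {title} ({page.relative_to(wiki_dir).as_posix()})")
    return "\n".join(lines)
```

The index stays small even when the wiki grows to hundreds of articles, so it can sit in the agent's context permanently while the articles themselves are loaded on demand.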

The real payoff comes with scale:

Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc.

I can see this extending further — multiple wikis, all accessible to the LLM, with controlled access say by different agents. A personal wiki, a business wiki, and a general one for collected articles and research interests. Different agents, different scopes, same underlying approach. Or a single agent with access to all of them.

This feels like something we are going to see more of as a new product or enhancement to existing products.