Implementing a Full-Stack Feature Using Claude Code's Dynamic Workflows

2026-06-07T11:52:52.865346+12:00

This is a continuation from the post I wrote recently titled Designing a Feature with Claude Design – Then Handing It to Claude Code.

In that post, I stepped through how I used Claude Design to create a UX design as well as a set of initial requirements for new tag selection and suggestion functionality in my blog's custom CMS.

This post steps through using those artefacts to drive the actual implementation in the frontend and backend apps. Specifically the goal here is to build this end-to-end using Claude Code's new Dynamic Workflows.

The outputs of the Claude Design exercise were a fully functional interactive prototype of the design and a markdown file with a set of requirements and acceptance criteria.

I already have a structured process for defining per-feature design specs, requirements and implementation plans in the project's repo. It consists of several things, but most relevant here is a folder structure and set of markdown files which I would consider the "spec" as part of the planning phase of this AI-driven engineering workflow. These are used as context for the implementation and verification, which often run across several separate sessions.

Usually when starting a feature I would be starting from scratch, or from just a high-level set of requirements taken from the initial blog design and feature roadmap. For this piece of work, Claude Design gave me a working prototype that was already using my frontend style guide, plus a detailed set of requirements, which I put into the code base for the duration of the build. So I had a solid starting point. However, those artefacts didn't fit into the structured process I already have in place, and I still needed a design spec for the full end-to-end solution.

For larger pieces of work I like to use Jesse Vincent's superpowers plugin. He has done a fantastic job of baking in real software development methodologies and workflows into a set of agent skills that drastically increase the quality and coherence of what's being built.

The first two skills I normally reach for are the brainstorming skill, which is for creating a design spec, and the writing-plans skill, which is for writing an implementation plan for the spec. The design spec is the most important part. For any given feature this is where I spend most of my time. The more clearly this is defined upfront, the more seamless the rest of the process will be and the higher the chance of it building what I actually want.

Once I'm happy with the design spec and implementation plan, after rounds of refinement, I would then hand it off to Claude Code to start building, usually with subagent development or agent teams.

I have already settled on a general workflow that I follow when building software with AI. In its simplest form it's the Plan-Generate-Evaluate methodology. But that's a gross oversimplification of the underlying process. In fact, within each of the Plan, Generate and Evaluate phases I run inner Evaluate-Regenerate flows, iterating until the outputs are where they need to be. The first outputs are rarely good enough.

For example, in the planning phase I start the design spec with some initial context and then have a back-and-forth conversation with Claude to flesh out the requirements, acceptance criteria and architecture. Once it generates the initial design spec I usually go through multiple rounds of adversarial evaluation and refinement until the spec is at a place I am happy with. I do exactly the same thing for the implementation plan, and again during the build. The build often gets broken down into separate chunks of work, partly to reduce the size of PRs and partly because Claude's context starts to fill up, at which point the generated outputs can deteriorate in quality and things can go off-piste. So, for each build phase I go through rounds of Generate-Evaluate loops.
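To make the shape of that inner loop concrete, here is a minimal Python sketch. It is an illustration under assumptions, not how Claude Code or my prompts implement it: `refine_until_accepted`, `Evaluation`, `toy_generate` and `toy_evaluate` are hypothetical names, and in practice the generate and evaluate steps are agent sessions rather than functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    passed: bool
    findings: List[str]


def refine_until_accepted(
    generate: Callable[[List[str]], str],
    evaluate: Callable[[str], Evaluation],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run an inner Generate-Evaluate loop.

    Each round regenerates the artefact using the previous round's
    findings. Returns the last artefact and the number of rounds used.
    """
    findings: List[str] = []
    artefact = ""
    for round_number in range(1, max_rounds + 1):
        artefact = generate(findings)
        result = evaluate(artefact)
        if result.passed:
            return artefact, round_number
        findings = result.findings
    # Out of rounds: hand back the latest attempt for a human to judge.
    return artefact, max_rounds


# Toy stand-ins (hypothetical): the generator addresses every finding it is
# given, and the evaluator insists the spec mentions acceptance criteria.
def toy_generate(findings: List[str]) -> str:
    return "spec" + "".join(f" +{finding}" for finding in findings)


def toy_evaluate(artefact: str) -> Evaluation:
    missing = [c for c in ("acceptance-criteria",) if c not in artefact]
    return Evaluation(passed=not missing, findings=missing)


final_spec, rounds = refine_until_accepted(toy_generate, toy_evaluate)
```

The same loop applies to the design spec, the implementation plan and each build chunk; only the generator and evaluator change.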

This has worked well for me so far, in the sense that I can effectively build substantially sized chunks of work without writing a single line of code myself, and at a quality level that meets my own pedantic standards.

However, it can be incredibly time-consuming and painfully arduous. It's making full use of AI to generate the design spec and implementation plan, do the build and then do the evaluation and verifications, all with their own inner loops of Generate-Evaluate.

Some might say this is over the top, and it can take a lot of time to go through this for any given piece of work. But so far this is the level of coercing, so to speak, that I have found to be necessary to achieve high quality outputs, especially those suitable for production-grade software. Anyone can ship some code fast; not everyone can do it well too.

But it's been me manually driving this workflow.

This is where I took a different route than usual, as Opus 4.8 had just been released, and along with it Anthropic's new Dynamic Workflows capability, which I was super keen to understand and try out.

Here is how Thariq Shihipar and Sid Bidasaria from Anthropic describe Dynamic Workflows:

While the default Claude Code harness is built for coding, it is also useful for many other types of tasks because, as it turns out, many tasks resemble coding tasks. But there are certain classes of tasks where we have had to build custom harnesses on top of Claude Code to achieve peak performance such as Research, security analysis, agent teams, or Code Review.

Workflows allow you to dynamically create harnesses built on top of Claude Code that enable Claude to solve all of those problems more natively. You can also share and reuse these workflows with others.

And this is where I wanted to see what the new Dynamic Workflows could take off my hands.

Creating and Using a Workflow

A while ago Anthropic published Building effective agents. In it they describe what agents are and, more relevant here, what workflows are, which they define as follows:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths.

And the Claude Agent SDK can be used to create these workflows in code.

Dynamic Workflows is the combination of these two built natively into Claude Code, with Claude being able to intelligently create these based on the task at hand. You can create them using natural language to describe a workflow, or use the new ultracode effort level (which is actually effort level xhigh plus instructing it to use workflows).

Nice touch on the pixelated visuals when selecting ultracode.

Since I have a relatively pre-defined workflow, I used natural language to describe to Claude how I normally work.

What I wanted to see was whether Claude could do everything after the design spec in this workflow. So the implementation planning, the build (or builds), and self-evaluation. And for each of those, also run inner loops of Generate-Evaluate and refinement. Get it to the point of fully completed and verified, leaving me to do final review.

Essentially I want to hand my design spec to Claude, have it build it to a high level of quality and then create a PR for me to review. 100% hands-off the code for the entire process.

The first question I had was: would Claude even be able to recreate my normal workflow? It's quite involved. Secondly, where I am normally involved in all the inner loops, making human judgement calls and steering things in a direction that I want, can Claude do this all on its own? And lastly – most critically – can it actually produce a good result?

Well, I'll be damned. It did all of that. And it felt like a moment of reckoning.

I had a relatively large prompt, explaining in detail the end-to-end process I normally go through. I pointed it to my design spec and requirements and asked it to create a workflow.

It then asked me whether it should just run the workflow or save it to disk for me to run it later. I got it to save it and then kick it off.

Here's what that looked like:

As you can see it incorporated the implementation planning in the "Preflight & Plan" phase with self-evaluation and refinement in the "Plan Review" phase. Excellent, this is usually painful and one part I'm not too concerned about.

This may sound alarming, not being concerned about the implementation plan. However, my prompts already guide a plan in a certain direction, and I have a ton of things in the repo that tell Claude what style, architecture, structure, patterns, quality checks etc. I use in this repo. It has deterministic guardrails in place that Claude can't cheat on. Along with that there is a ton of already implemented code that follows the patterns I set in place. LLMs are really good at following style and patterns, good or bad! With the design spec and plan, and the surrounding context and environment, my confidence has grown that what gets implemented will be in line with what I expect, and that gets verified by me at review time. If something is off, I take a close look to see which part of my setup needs tweaking and feed that back into the harness, or if I underspecified something somewhere it gets updated. And if things do go totally sideways, the work is done in an isolated git worktree and branch, so the cost of reimplementing is very low.

The workflow broke the work down into three tracks of work: backend, frontend and then integrating those two tracks together. Claude decided this part of the workflow. In each track it ran the inner loops, not just for the track, but also for particular tasks in the track. This was following good engineering practice: breaking the work down into smaller chunks, using red-green TDD, building and then doing evaluation and refinement. All per task, per track of work.
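As a rough mental model of that structure, the sketch below lays out three tracks, each with tasks, and expands every task into the same inner loop. This is only an assumption about the shape, not the workflow definition Claude actually generated: the `Track` type, the task names and the `TASK_STEPS` labels are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List, Tuple

# Inner loop applied to every task (assumed labels for red-green TDD
# followed by evaluation and refinement).
TASK_STEPS = ["red", "green", "evaluate", "refine"]


@dataclass
class Track:
    name: str
    tasks: List[str]
    depends_on: List[str] = field(default_factory=list)


def build_plan(tracks: List[Track]) -> List[Tuple[str, str, str]]:
    """Expand tracks into (track, task, step) entries, dependencies first."""
    by_name = {track.name: track for track in tracks}
    graph = {track.name: set(track.depends_on) for track in tracks}
    ordered = TopologicalSorter(graph).static_order()
    return [
        (name, task, step)
        for name in ordered
        for task in by_name[name].tasks
        for step in TASK_STEPS
    ]


# Hypothetical task names for the tag selection and suggestion feature.
tracks = [
    Track("backend", ["tag suggestion endpoint"]),
    Track("frontend", ["tag selector component"]),
    Track(
        "integration",
        ["wire selector to endpoint"],
        depends_on=["backend", "frontend"],
    ),
]
plan = build_plan(tracks)
```

The only ordering constraint here is that integration waits for both other tracks; backend and frontend have no dependency on each other.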

Along with this, when I described the workflow I also instructed it to verify every acceptance criterion, using component testing (end-to-end against a locally running service) and by testing the functionality in a browser against the running application. It did all of that. This area is critically important for ensuring high quality. I also see it as one of the most valuable ways of addressing the human review bottleneck, which limits how far the productivity gains from AI-first software engineering can flow through to organisational productivity and efficiency.

After the build was complete, it ran a comprehensive "Final review" phase, which covers what I normally do myself from multiple angles: implementation to spec alignment, code quality, security, etc. Anything it found there, it sent back to the build to apply fixes and then re-ran the final review.

60 agents, 5.2M tokens and ~3.9 hours later (3 of them while I was sleeping), it gave me the final report below:

Four hours is a long time, relatively speaking. But it's time that can be spent doing other things while Claude cooks. Building other things, sleeping, or going outside to touch grass. I didn't spend any time manually driving this workflow, so that was time saved and a not-so-enjoyable aspect removed from the process. Importantly, instead of driving the reviewing and refining on the inner loops and becoming fatigued by it, I can focus my energy on scrutinising the final outcome, as well as feeding any process improvements back into the harness and workflow.

It ran successfully, except that it couldn't test the AI tag suggestions because of an invalid API token. On my first test after fixing the token, it worked perfectly.

For more visual context, this is how the workflow was structured to work its way through the various phases all the way through to completion, with some stats reported at the bottom.

So it worked, and it sure looked good based on the rigorous process it went through – a close approximation of what I drive manually. I was genuinely impressed by how much of my process it was able to autonomously do, and while I was sleeping!

But with this being the first time I had run this workflow, and with me not in the loop to apply my human judgement, the next step was to scrutinise everything myself. Firstly, does it actually work? Yes, and damn it looks and feels good in the UI too. Was the code that it generated good quality and in line with what's expected in this code base? Very much so, and Opus 4.8 seems to go the extra mile and pick up and address things as it goes, like a good engineer continuously improving what they stumble upon: tidying up a log line, updating affected docs, or noticing and calling out a preexisting bug. All the quality checks look good.

You may have seen that the very last phase was called Decision Log and Report. Because I was out of the loop and these workflows can't stop to ask for direction, I asked Claude to keep a record of all decisions the agents had to make along the way, such as any deviation from the design spec or plan. Deviations happen, and it's important to know about them in case something needs to be adjusted. The majority of the log was noise in this run, but I can see it proving valuable: in my manual process Claude often stops to ask me how to proceed, whereas in a workflow it can't do that.
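I didn't prescribe a format for the log, but a structured record makes it much easier to separate the noise from the decisions that need a human look. The sketch below shows one way it could be shaped; the field names and example entries are hypothetical, not what Claude actually produced.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Decision:
    phase: str
    summary: str
    reason: str
    deviates_from_spec: bool = False


def to_jsonl(decisions: List[Decision]) -> str:
    """Serialise decisions as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(d)) for d in decisions)


def needs_review(decisions: List[Decision]) -> List[Decision]:
    """Keep only the decisions that departed from the spec or plan."""
    return [d for d in decisions if d.deviates_from_spec]


# Hypothetical entries, for illustration only.
log = [
    Decision("backend", "Reused existing validation helper", "Matches repo pattern"),
    Decision(
        "frontend",
        "Capped suggestions shown at five",
        "Spec did not state a limit",
        deviates_from_spec=True,
    ),
]
flagged = needs_review(log)
```

Filtering on a single flag like `deviates_from_spec` is the simplest way to surface only what the final review needs to look at.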

The left side is the working prototype from the Claude Design exercise; the right side is the final implementation.

I also decided to run my own final verifications using my regular multi-angle review prompts. Normally when I run these after going through my usual manual process they find some critical and high-severity issues, which I evaluate and make a judgement call on. I was very surprised that this time they only found some medium and low-severity issues. Most of the time those are just noise or stylistic things. So there were a few minor things that I asked it to tidy up post-implementation, including a UI element overflowing into another. But I would say it got about 95% of the way there across the entire full-stack feature. I was not expecting this.

I have since run these workflows a few times, experimenting first on my personal projects, but have now also used them in a professional team environment and so far it has worked remarkably well.

My takeaway here is that these workflows are incredibly powerful and flexible, and I think they are exactly what is needed to get to fully automated, hands-off software delivery, whether that is through Claude's Dynamic Workflows, Codex or some other tool. The methodology and approach are what matter; Claude just seems to have really nailed making these workflows easy to use and available to anyone.

The quality of what these workflows can deliver is going to be highly dependent on the quality of the environment and harness they operate in. And I am skeptical that all my workflow runs will go this well on first pass. Workflows alone are not some magical silver bullet that will just start creating production-grade software. I have put considerable effort into my setup for this project (albeit more can be done). I would credit that setup as the primary reason this workflow was able to one-shot an entire full-stack feature to a high level of quality from just a well-defined design spec. Drop the same workflow into a sparse or legacy code base with weak conventions, a lack of system context, and insufficient tests and deterministic checks, and I'd expect a very different result.

The workflow isn't the moat; the environment is.

I have only used these workflows a handful of times, and results may vary with the complexity of the task at hand. But from what I have seen so far, I'm cautiously optimistic. And it's most welcome, because driving a workflow like this manually, often with multiple separate features on the go, can be incredibly time-consuming and draining. Workflows carry a lot of the weight and let you get on with the fun part, building.

Maintainability sensors for coding agents

2026-06-06T10:31:00+12:00

This is a great post from Birgitta Böckeler on martinfowler.com about the sensors that can be used in generative AI engineering to shift quality assurance left in the delivery pipeline.

One big part of this is the general state of the codebase and repository, even before adding in sensors. I like the definition used to describe this: internal quality.

There are multiple dimensions we usually want to achieve and monitor in our codebases: Functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time - also known as “internal quality”.

She also explains why this is important for engineering with AI:

Internal quality problems affect AI agents in similar ways that they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not noticed a duplicate, or be forced to load more context than a task should require.

I have heard developers say their agents go off track easily and implement things with mixed styles. There are likely a few reasons for this happening, but internal quality is probably at the top of the list.

She talks about a number of different approaches to implementing these sensors, both computational and inferential: linting (standard and custom), static code analysis to control cross-dependencies, coupling data and modularity patterns, and test suites as a regression sensor.

There are a lot of good things in there, but I found the section on test suites particularly valuable. LLMs love to write tests, and those tests can be highly verbose, testing things that are not important while totally skipping things that are critical. Or worse, an LLM will "fix" failing tests by making them pass against broken or buggy code. Like most other things with generative AI, this can be steered so that the LLM creates genuinely meaningful tests that it can largely manage with light oversight. I'm a huge fan of end-to-end tests, or a subset of those: component tests that run locally against running services, with local or mocked dependencies. These allow for greater control of different scenarios (edge cases, acceptance criteria and user scenarios) that would likely result in heavier, slow-running and flaky tests if run in a deployed environment.

There currently seems to be a trend towards more end-to-end style acceptance tests. As mentioned in the beginning, AI has gotten really good at generating tests, so it has become quite normal for developers to just let AI generate lots of tests, without much review. Reviewing unit tests in particular can be very tedious. I'm not saying it's a good thing not to look at them at all — but I acknowledge the reality that it is unrealistic to think that human review of all tests is sustainable, and it's unrealistic to think that people will actually do it. So while we search for the appropriate testing pyramid/ice cream cone/muffin shape of the AI coding future, techniques like approved scenarios are becoming popular. As demonstrated above, acceptance tests increase coverage, but are often not very assertion-heavy, giving us a false sense of security in test effectiveness — mutation testing helps us monitor that gap.
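To see the gap that mutation testing monitors, here is a toy Python sketch. It uses a hand-written mutant rather than a real mutation testing tool, and `normalise_tag` is a hypothetical function, not code from my CMS. The assertion-light test passes against both the original and the broken mutant, so the mutant "survives"; the assertion-heavy test kills it.

```python
from typing import Callable


def normalise_tag(raw: str) -> str:
    """Original behaviour: trim, lower-case and hyphenate a tag."""
    return raw.strip().lower().replace(" ", "-")


def mutant_normalise_tag(raw: str) -> str:
    """Hand-written mutant: the lower() call has been removed."""
    return raw.strip().replace(" ", "-")


def weak_test(fn: Callable[[str], str]) -> bool:
    # Executes the code (so coverage looks fine) but asserts almost nothing.
    return isinstance(fn("  Agentic Workflows "), str)


def strong_test(fn: Callable[[str], str]) -> bool:
    # Asserts on the actual value, so behavioural changes are caught.
    return fn("  Agentic Workflows ") == "agentic-workflows"


def mutant_survives(test: Callable[[Callable[[str], str]], bool]) -> bool:
    """A mutant survives when the test still passes against broken code."""
    return test(mutant_normalise_tag)


weak_result = mutant_survives(weak_test)      # the weak test misses the bug
strong_result = mutant_survives(strong_test)  # the strong test catches it
```

Real mutation testing tools generate many such mutants automatically and report the proportion that survive, which gives a rough signal of how much the test suite actually asserts.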

Loosely Coupled Thoughts — tagged agentic-workflows
