generative-ai

AI Subscription plans are more generous than API pricing

SemiAnalysis tested the Anthropic/Claude subscription plans by running long-horizon tasks until the plans hit their usage limits.

The subscriptions are far more generous than API pricing.

AI Subscriptions vs Pricing comparison

When I first started using these tools I went the API route, as my personal usage fluctuated a lot. But after a couple of months it was clear this worked out a lot more expensive, so I switched to the paid plans, even though I don't always use the full quota.

Go with the paid plan.

Implementing a Full-Stack Feature Using Claude Code's Dynamic Workflows

This is a continuation from the post I wrote recently titled Designing a Feature with Claude Design – Then Handing It to Claude Code.

In that post, I stepped through how I used Claude Design to create a UX design as well as a set of initial requirements for new tag selection and suggestion functionality in my blog's custom CMS.

This post steps through using those artefacts to drive the actual implementation in the frontend and backend apps. Specifically, the goal here is to build this end-to-end using Claude Code's new Dynamic Workflows.

The outputs of the Claude Design exercise were a fully functional interactive prototype of the design and a markdown file with a set of requirements and acceptance criteria.

I already have a structured process for defining per-feature design specs, requirements and implementation plans in the project's repo. It consists of several things, but most relevant here is a folder structure and a set of markdown files which I consider the "spec" in the planning phase of this AI-driven engineering workflow. These are used as context for the implementation and verification, which often run across several separate sessions.

Usually when starting a feature I would be starting from scratch, or from just a high-level set of requirements taken from the initial blog design and feature roadmap. For this piece of work, Claude Design gave me a working prototype that was already using my frontend style guide, plus a detailed set of requirements, both of which I put into the codebase for the duration of the build. So I had a solid starting point. However, those artefacts didn't fit into the structured process I already have in place, and I still needed a design spec for the full end-to-end solution.

For larger pieces of work I like to use Jesse Vincent's superpowers plugin. He has done a fantastic job of baking real software development methodologies and workflows into a set of agent skills that drastically increase the quality and coherence of what's being built.

The first two skills I normally reach for are the brainstorming skill, which is for creating a design spec, and the writing-plans skill, which is for writing an implementation plan for that spec. The design spec is the most important part. For any given feature this is where I spend most of my time. The more clearly this is defined upfront, the more seamless the rest of the process will be and the higher the chance of building what I actually want.

Once I'm happy with the design spec and implementation plan, after rounds of refinement, I would then hand it off to Claude Code to start building, usually with subagent development or agent teams.

I have already settled on a general workflow that I follow when building software with AI. In its simplest form it's the Plan-Generate-Evaluate methodology. But that's a gross oversimplification of the underlying process. In fact, within each of the Plan, Generate and Evaluate phases I run inner Evaluate-Regenerate flows, iterating until the outputs are where they need to be. The first outputs are rarely good enough.
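As a rough illustration of that inner Evaluate-Regenerate flow, here is a minimal Python sketch. Everything in it is hypothetical: `refine`, `draft_spec` and `review_spec` are names made up for this example, and the stubs stand in for what would really be an agent session and a review step. It is not how Claude Code or the superpowers plugin implement anything, just the shape of the loop.

```python
from typing import Callable, List


def refine(
    generate: Callable[[List[str]], str],
    evaluate: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> str:
    """Run one phase: generate an output, then evaluate and regenerate
    until the evaluator reports no issues or the round budget runs out."""
    feedback: List[str] = []
    output = generate(feedback)
    for _ in range(max_rounds):
        issues = evaluate(output)
        if not issues:
            return output  # good enough, stop iterating
        feedback = feedback + issues  # carry all feedback into the next attempt
        output = generate(feedback)
    return output  # best effort after max_rounds


# Hypothetical stubs standing in for an agent and a reviewer.
def draft_spec(feedback: List[str]) -> str:
    spec = "tag selection spec"
    if "missing acceptance criteria" in feedback:
        spec += " + acceptance criteria"
    return spec


def review_spec(spec: str) -> List[str]:
    if "acceptance criteria" in spec:
        return []
    return ["missing acceptance criteria"]


final = refine(draft_spec, review_spec)
print(final)
```

The same `refine` function could wrap the plan, the generated code, or the evaluation report, which is the sense in which the loop repeats inside each outer phase.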

[... 2746 words]

Maintainability sensors for coding agents

This is a great post from Birgitta Böckeler on martinfowler.com about the sensors that can be used in generative AI engineering to shift quality assurance left in the delivery pipeline.

One big part of this is the general state of the codebase and repository, even before adding in sensors. I like the definition used to describe this: internal quality.

There are multiple dimensions we usually want to achieve and monitor in our codebases: Functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time - also known as “internal quality”.

She also explains why this is important for engineering with AI:

Internal quality problems affect AI agents in similar ways that they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not noticed a duplicate, or be forced to load more context than a task should require.

I have heard developers say their agents go off track easily and implement things with mixed styles. There are likely a few reasons for this, but poor internal quality is probably at the top of the list.

She talks about a number of different approaches to implementing these sensors, both computational and inferential: linting (standard and custom), static code analysis to control cross-dependencies, coupling data and modularity patterns, and test suites as a regression sensor.
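To make the cross-dependency idea concrete, here is a small sketch of a computational sensor written with Python's standard `ast` module. It is my own illustration, not something taken from the post: the rule (nothing may import from a `cms.db` package) and the package names are hypothetical, and a real project would more likely reach for an existing linter or architecture-testing tool.

```python
import ast
from typing import List, Tuple

# Hypothetical rule: this layer must not reach into the database package directly.
FORBIDDEN_PREFIXES: Tuple[str, ...] = ("cms.db",)


def forbidden_imports(
    source: str, forbidden: Tuple[str, ...] = FORBIDDEN_PREFIXES
) -> List[str]:
    """Return the imported module names in `source` that break the rule."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as `from . import x`
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found


# Hypothetical module contents to check.
SAMPLE = "import os\nfrom cms.db.models import Tag\nimport cms.api\n"

print(forbidden_imports(SAMPLE))
```

Run as a pre-commit hook or as a check the agent executes after each change, a sensor like this gives fast, deterministic feedback without needing a model to judge the result.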

There are a lot of good things in there, but I found the section on test suites especially valuable. LLMs love to write tests, and those tests can be highly verbose, covering things that are not important while totally skipping things that are critical. Or worse, they "fix" failing tests by changing them to pass against broken or buggy code. Like most other things with generative AI, this can be steered so that the LLM creates genuinely meaningful tests that it can largely manage with light oversight. I'm a huge fan of end-to-end tests, or a subset of those: component tests that run locally against running services, with local or mocked dependencies. These allow for greater control over different scenarios - edge cases, acceptance criteria and user scenarios - that would likely result in heavier, slow-running and flaky tests if run in a deployed environment.

There currently seems to be a trend towards more end-to-end style acceptance tests. As mentioned in the beginning, AI has gotten really good at generating tests, so it has become quite normal for developers to just let AI generate lots of tests, without much review. Reviewing unit tests in particular can be very tedious. I'm not saying it's a good thing not to look at them at all — but I acknowledge the reality that it is unrealistic to think that human review of all tests is sustainable, and it's unrealistic to think that people will actually do it. So while we search for the appropriate testing pyramid/ice cream cone/muffin shape of the AI coding future, techniques like approved scenarios are becoming popular. As demonstrated above, acceptance tests increase coverage, but are often not very assertion-heavy, giving us a false sense of security in test effectiveness — mutation testing helps us monitor that gap.
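The gap that mutation testing exposes can be shown with a hand-rolled sketch. This is my own toy example, not code from the post: `is_valid_tag`, its 20-character limit and both tests are hypothetical, and the mutant is written by hand here, whereas a mutation testing tool would generate and run such mutants automatically.

```python
from typing import Callable


def is_valid_tag(tag: str) -> bool:
    """Hypothetical rule: a tag is 1 to 20 characters long."""
    return 0 < len(tag) <= 20


def mutant_is_valid_tag(tag: str) -> bool:
    """Hand-written mutant: `<=` changed to `<`."""
    return 0 < len(tag) < 20


def weak_test(fn: Callable[[str], bool]) -> bool:
    """Exercises the code (so coverage looks good) but asserts almost nothing."""
    try:
        fn("python")
        fn("x" * 20)
        return True
    except Exception:
        return False


def strong_test(fn: Callable[[str], bool]) -> bool:
    """Checks the actual results, including both boundaries."""
    return (
        fn("python") is True
        and fn("x" * 20) is True
        and fn("x" * 21) is False
        and fn("") is False
    )


# A mutant "survives" when the test still passes against the mutated code.
weak_survives = weak_test(mutant_is_valid_tag)
strong_survives = strong_test(mutant_is_valid_tag)
print(f"weak test lets the mutant survive: {weak_survives}")
print(f"strong test lets the mutant survive: {strong_survives}")
```

Both tests pass against the original function, but only the assertion-heavy one fails against the mutant. A surviving mutant is the signal that a test executes the code without really checking it.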