mutation-testing

Maintainability sensors for coding agents

This is a great post from Birgitta Böckeler on martinfowler.com about the sensors that can be used in generative AI engineering to shift quality assurance left in the delivery pipeline.

One big part of this is the general state of the codebase and repository, even before any sensors are added. I like the term she uses to describe this: internal quality.

There are multiple dimensions we usually want to achieve and monitor in our codebases: Functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time - also known as “internal quality”.

She also explains why this is important for engineering with AI:

Internal quality problems affect AI agents in similar ways that they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not noticed a duplicate, or be forced to load more context than a task should require.

I have heard developers say their agents go off track easily and implement things in mixed styles. There are likely a few reasons for this, but poor internal quality is probably at the top of the list.

She talks about a number of different approaches to implementing these sensors, both computational and inferential: linting (standard and custom), static code analysis to control cross-dependencies, coupling data and modularity patterns, and test suites as a regression sensor.

There are a lot of good things in there, but I found the section on test suites especially valuable. LLMs love to write tests, and those tests can be highly verbose, covering things that are not important while totally skipping things that are critical. Worse, an LLM may "fix" a failing test by changing it to pass against broken or buggy code. Like most other things with generative AI, this can be steered, so that the LLM creates genuinely meaningful tests and largely manages them with light oversight. I'm a huge fan of end-to-end tests, or a subset of them: component tests that run locally against running services, with local or mocked dependencies. These allow greater control over different scenarios (edge cases, acceptance criteria and user scenarios) that would likely result in heavier, slow-running and flaky tests if run in a deployed environment.

There currently seems to be a trend towards more end-to-end style acceptance tests. As mentioned in the beginning, AI has gotten really good at generating tests, so it has become quite normal for developers to just let AI generate lots of tests, without much review. Reviewing unit tests in particular can be very tedious. I'm not saying it's a good thing not to look at them at all — but I acknowledge the reality that it is unrealistic to think that human review of all tests is sustainable, and it's unrealistic to think that people will actually do it. So while we search for the appropriate testing pyramid/ice cream cone/muffin shape of the AI coding future, techniques like approved scenarios are becoming popular. As demonstrated above, acceptance tests increase coverage, but are often not very assertion-heavy, giving us a false sense of security in test effectiveness — mutation testing helps us monitor that gap.
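To make that last point concrete for myself, here is a minimal sketch of the gap that mutation testing exposes. Everything in it is hypothetical and not from the original post: `apply_discount`, the hand-written mutant, and the two tests are my own illustration. A real mutation testing tool (for example mutmut for Python, PIT for Java, or Stryker for JavaScript) would generate mutants automatically; here I write a single one by hand to show the idea.

```python
# Hypothetical function under test (an assumption for illustration only).
def apply_discount(total_cents: int, is_member: bool) -> int:
    """Members get 10% off orders of 10,000 cents or more."""
    if is_member and total_cents >= 10_000:
        return total_cents * 90 // 100
    return total_cents


# A hand-written mutant: the kind of small change a mutation testing
# tool would generate automatically (here `>=` becomes `>`).
def apply_discount_mutant(total_cents: int, is_member: bool) -> int:
    if is_member and total_cents > 10_000:
        return total_cents * 90 // 100
    return total_cents


def weak_test(fn) -> bool:
    """Runs both branches of the original (full line coverage),
    but only checks that something integer-like comes back."""
    results = [fn(10_000, True), fn(5_000, False)]
    return all(isinstance(r, int) for r in results)


def strong_test(fn) -> bool:
    """Asserts on actual values, including the boundary at 10,000."""
    return fn(10_000, True) == 9_000 and fn(5_000, False) == 5_000


def mutant_survives(test) -> bool:
    """A mutant 'survives' if a test that passes on the original
    code also passes on the mutated code."""
    assert test(apply_discount), "test must pass on the original code"
    return test(apply_discount_mutant)


if __name__ == "__main__":
    print("weak test, mutant survives:", mutant_survives(weak_test))
    print("strong test, mutant survives:", mutant_survives(strong_test))
```

Both tests exercise the same lines of the original function, so a coverage report would not tell them apart. Only the test that asserts on the boundary value notices the mutated comparison, which is the kind of signal a mutation score gives you on top of coverage.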