Designing a Feature with Claude Design — Then Handing It to Claude Code

I've been wanting to give Claude Design a try. In this post I'll walk through my first use of it for designing some new functionality I wanted for my blog CMS. My use here is probably very basic, but it was an interesting exercise, and in particular I wanted to see how I could hand the design off for building.

Just a small disclaimer: I'm not a UX designer, or a creative person in general. But that's exactly why this exercise was interesting — it did a far better job than I ever could.

Getting into it...

I've recently built this blog site where this article is being read (I will write about this separately). It's very new and the feature/capability set is minimal, just enough functionality to manage, publish and serve blog content.

It has a custom-built CMS for managing the content, and supports slug-like tags that I can selectively apply to any of the content. These tags are visible to the right side of this article (or at the bottom on mobile).

When writing or editing content, it has a section where I can add new tags or use existing ones. Below shows editing an article in the CMS I wrote recently about human reviews being a bottleneck.

Blog CMS Edit Post

The behaviour of the tags input field is: On post creation, any never-before-seen tags will be created in the tags table in the database, existing tags will be referenced.

At first sight it looks ok, but:

  • It has no option to select existing tags.
  • It doesn't indicate if a tag being added was one that already existed or would be created.
  • It doesn't help me avoid creating near-duplicate tags with a similar name or even typos.

One wouldn't want different blog articles using tags that mean the same thing but with slightly different names. For example ai-engineering and engineering-with-ai, or even a typo in the tag like ai-eginering. So without any kind of tag selection, and the behaviour of the existing tag selection and creation, this made the UX painful (I had to open the tag management section in a different tab to remember what tags were available) and susceptible to tags becoming a mess across all the content.

Using Claude Design

Claude Design was introduced on 17 April. I had had a bit of a poke around in it before to see what it was about, but this was the first chance I had to try it out on something real.

I started by pasting four screenshots of the CMS and giving it this prompt. (I used the term "upsert" for tags, which wasn't quite correct, but Claude got it).

Attached are four screenshots of my blog CMS. One is where tags are managed, show when two tags existed; ai and harness-engineering. The second one is showing where I can add a post and enter tags. How it currently works is I can enter any tag value and when the post is created/published then the app(or probably the backend for the CMS) will upsert tags.

So you can see in the third image I enter two tags, one existing and the second one (new-tag) is new, so when the blog post got created is used the existing tag and then created a the new-tag, as shown in the fourth image, where I then have a total of three tags. (Two were existing one new one got created on that new post).

I like this functionality. However, right now when I enter tags on the create post page I don't get offered to select from existing tags, so if my new post should use an existing tag I have to carefully remember or go back and look in the tags section to remember the exact name of the tag so I don't end up with near duplicated for what should be the same tag.

Ideally when I start typing for a new tag, it should show existing tags that match (like contain) the text I have entered, with the option of selecting one of those. If none match it should allow me to add a new tag.

Come up with three options for the post create screen for selecting existing or adding tags. Should be simple and seamless and easy to use.

[... 1879 words]

Human Review Is the Bottleneck

Chris Parsons recently wrote up an article about feedback being the bottleneck.

"Feedback" here, for me, refers to the human effort of gatekeeping LLM outputs before they go out out and serving their purpose (software released to production, an LLM-drafted email being sent, LLM-created slide deck reviewed before presenting, etc.).

This is most relevant where LLMs are generating outputs at greater speeds than long-standing engineering practices keep up with.

The most prominent aspect of this is the pull request as the quality gate. It's not the only one, but it is the one where humans are most in the loop. And the one that most people who care about quality, maintainability, security and operational stability are most uncomfortable with removing or even loosening up on.

The problem, as Chris Parsons nicely puts it:

This is the theory of constraints in action: speed up one stage and the bottleneck moves downstream. When code arrives faster, it pushes more work into review, testing, deployment, and requirements clarification. The queue grows. Nobody is reviewing any faster.

If review, testing and quality assurance are simply removed or loosened up on to allow the bombardment of PRs to flow through, it means - more consequentially - that user, organisation reputation and potentially revenue affecting production quality and incidents will become the bottleneck. Eventually it will come back to bite.

Humans get fatigued, AI doesn't.

With AI, the human is the only one who needs rest while the machine keeps generating work that needs evaluating: permission fatigue, review fatigue, the endless “just need a human to press approve” requests. Cory Doctorow calls this the reverse centaur: humans whose purpose is to support the machine’s needs.4

The reviewers who care most about quality will be the first to burn out, because they are the ones who read everything instead of skimming. Either you make human feedback unnecessary, or you make it instant.

So what does one do about this review bottleneck?

This is the crux of the problem, the very point where we are forced to decide whether we will hold a tight grip on these long-standing practices and push those senior reviewers to their tethers' ends with unrelenting AI reviews (often slop). Or clear the AI on its path of disruption and we adapt accordingly.

On one end of the spectrum, we could double down on existing engineering practices. Humans review everything. This may be most comforting as it may give the most objective confidence to responsible humans that quality is where it needs to be.

It simply doesn't scale. There is too much code being generated, and reading/reviewing only does not give the same in-depth understanding of what was built if the humans were building it too. So things will and do slip through.

Or alternatively we give in and loosen the quality/review gates without additional steps. This risks pushing low quality, architecturally deficient, bug infested code that no one understands into production. The instant gain was real, breakneck delivery speeds. Even the Product Manager has something to push to production. But the theory of constraints just pushes the bottleneck downstream, right into production where it hurts the most and is the hardest to unwind.

[... 1672 words]

Anthropic Opus 4.7 Released Today

The much-anticipated Opus 4.7 was released today. It's the only 4.7 in the model family, with Sonnet and Haiku still at 4.6.

According to the claimed benchmarks, it shows a substantial jump in capability across the board, with notable improvements in:

  • Visual reasoning — It can now "see" higher resolution pictures up to 2,576 pixels on the long edge, 3x more than Opus 4.6.
  • Instruction following — It takes instructions more literally than the previous version. The called-out side effect is that users may need to re-tune any prompts and harnesses.
  • Memory — It's better at using file-system-based memory, remembering important notes across long-running, multi-session work.
  • Real-world work — Areas like financial analysis, legal, and professional slide presentations.

Anthropic Opus 4.7 Benchmark

I like that they have also included benchmark numbers for the new unreleased Mythos Preview model. The jump in SWE-bench agentic coding from Opus 4.6 to Opus 4.7 is already substantial, and then there's a further leap to 93.9% for Mythos!

Interesting that the cybersecurity vulnerability reproduction score is actually slightly lower on 4.7 than it was on 4.6 — although it appears this may have been intentional.

Anthropic Opus 4.7 Benchmark Cyber

We stated that we would keep Claude Mythos Preview's release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

And...

Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

This seems to imply the model was intentionally dialled back to reduce the risk of misuse, with access to the full capabilities gated behind the new Cyber Verification Program.

Other Notable Changes

Along with the model, they are releasing these notable controls:

  • New xhigh reasoning effortan effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, the default effort level has been raised to xhigh for all plans.
  • New /ultrareview slash command in Claude Codeproduces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch.

Opus 4.7 does use more tokens, and it's easy to see why they introduced this new effort level. xhigh scores substantially higher than high at the cost of more tokens burned — but max uses double the tokens of xhigh for a smaller score jump than high to xhigh.

Anthropic Opus 4.7 Agentic Reasoning Effort Token Usage