Designing a Feature with Claude Design — Then Handing It to Claude Code

I've been wanting to give Claude Design a try. In this post I'll walk through my first use of it for designing some new functionality I wanted for my blog CMS. My use here is probably very basic, but it was an interesting exercise, and in particular I wanted to see how I could hand the design off for building.

Just a small disclaimer: I'm not a UX designer, or a creative person in general. But that's exactly why this exercise was interesting — it did a far better job than I ever could.

Getting into it...

I recently built the blog site where you're reading this article (I will write about that separately). It's very new and the feature set is minimal: just enough functionality to manage, publish and serve blog content.

It has a custom-built CMS for managing the content, and supports slug-like tags that I can selectively apply to any of the content. These tags are visible on the right side of this article (or at the bottom on mobile).

When writing or editing content, there is a section where I can add new tags or use existing ones. The image below shows the CMS editing an article I wrote recently about human review being a bottleneck.

Blog CMS Edit Post

The behaviour of the tags input field is: on post creation, any never-before-seen tags are created in the tags table in the database, and existing tags are referenced.

At first sight it looks ok, but:

It has no option to select existing tags.
It doesn't indicate if a tag being added was one that already existed or would be created.
It doesn't help me avoid creating near-duplicate tags with a similar name or even typos.

One wouldn't want different blog articles using tags that mean the same thing but with slightly different names, for example ai-engineering and engineering-with-ai, or even a typo in a tag like ai-eginering. Without any way to select existing tags, and with new tags created automatically on post creation, the UX was painful (I had to open the tag management section in a different tab to remember what tags were available) and the tags were at risk of becoming a mess across all the content.
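To make the missing behaviour concrete, here is a minimal Python sketch of the kind of check I had in mind. It is not the CMS's actual code: the function name `suggest_tags`, the returned keys and the `cutoff` value are all assumptions for illustration. It uses the standard library's `difflib.get_close_matches` to flag near-duplicates.

```python
from difflib import get_close_matches


def suggest_tags(typed: str, existing: list[str], cutoff: float = 0.8) -> dict:
    """Classify a typed tag against existing tags (hypothetical helper)."""
    typed = typed.strip().lower()
    # Existing tags that contain the typed text, for a pick-from-list dropdown.
    containing = [tag for tag in existing if typed in tag]
    # Existing tags spelled almost the same as the typed text, to catch typos.
    similar = [
        tag
        for tag in get_close_matches(typed, existing, n=3, cutoff=cutoff)
        if tag != typed
    ]
    return {
        "exists": typed in existing,  # False means the tag would be created
        "containing": containing,
        "similar": similar,
    }


existing_tags = ["ai", "ai-engineering", "harness-engineering"]
print(suggest_tags("eng", existing_tags)["containing"])
print(suggest_tags("ai-eginering", existing_tags)["similar"])
```

With the example tags above, typing `eng` offers both engineering tags, and the typo `ai-eginering` is flagged as similar to `ai-engineering` before a new tag gets created.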

Using Claude Design

Claude Design was introduced on 17 April. I had had a bit of a poke around in it before to see what it was about, but this was the first chance I had to try it out on something real.

I started by pasting four screenshots of the CMS and giving it this prompt (I used the term "upsert" for tags, which wasn't quite correct, but Claude got it):

Attached are four screenshots of my blog CMS. One is where tags are managed, show when two tags existed; ai and harness-engineering. The second one is showing where I can add a post and enter tags. How it currently works is I can enter any tag value and when the post is created/published then the app(or probably the backend for the CMS) will upsert tags.

So you can see in the third image I enter two tags, one existing and the second one (new-tag) is new, so when the blog post got created is used the existing tag and then created a the new-tag, as shown in the fourth image, where I then have a total of three tags. (Two were existing one new one got created on that new post).

I like this functionality. However, right now when I enter tags on the create post page I don't get offered to select from existing tags, so if my new post should use an existing tag I have to carefully remember or go back and look in the tags section to remember the exact name of the tag so I don't end up with near duplicated for what should be the same tag.

Ideally when I start typing for a new tag, it should show existing tags that match (like contain) the text I have entered, with the option of selecting one of those. If none match it should allow me to add a new tag.

Come up with three options for the post create screen for selecting existing or adding tags. Should be simple and seamless and easy to use.

[... 1879 words]

Higher usage limits for Claude and a compute deal with SpaceX

Anthropic is partnering with SpaceX to lease its Colossus 1 data center (over 220,000 NVIDIA GPUs).

While that on its own is interesting, the astonishing side effect is a sharp increase in usage limits for paid Claude plans and the API.

The following three changes—all effective today—are aimed at improving the experience of using Claude for our most dedicated customers.

First, we’re doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans.

Second, we’re removing the peak hours limit reduction on Claude Code for Pro and Max accounts.

Third, we’re raising our API rate limits considerably for Claude Opus models.

I think people will generally appreciate this too:

Finally, we recently made a commitment to cover any consumer electricity price increases caused by our data centers in the US. As part of our international expansion, we’re exploring ways to extend that commitment to new jurisdictions, as well as partnering with local leaders to invest back into the communities that host our facilities.

A model that produces code which compiles and passes the tests it was given is not the same as a model that produces correct, secure, maintainable, well-architected software

The title here is a paraphrased quote from [Gary Marcus], on TNW today, evaluating a claim from “OpenAI president [who] says AI is now writing 80% of the company’s code”.

Marcus' specific point about coding is structurally important: a model that produces code which compiles and passes the tests it was given is not the same as a model that produces correct, secure, maintainable, well-architected software. The first is verifiable in seconds; the second requires the kind of judgement that has been the historical bottleneck on engineering productivity. Brockman acknowledges the gap, even as he argues it is closing. "The technology we have right now is very jagged," he said in the Big Technology interview. "It is absolutely superhuman at many tasks. When it comes to writing code, those kinds of things, the AI can just do it. But there's some very basic tasks that a human can do that our AI still struggles with."

Realism re AI coding is knowing that next-word prediction gets us a surprisingly long way in writing code, but less far in making sure that code is robust. Coders (especially vibe coders with little experience) beware

As good as these tools are getting — and they are getting really good and helpful — I don't see a non-technical person, say a product manager or marketing person, being able to steer and coerce these LLMs into producing software that any company should be willing to expose to the internet and their user base if they care about robustness, security, reliability and maintainability of the system. Especially if revenue and reputation are on the line.

Human Review Is the Bottleneck

Chris Parsons recently wrote up an article about feedback being the bottleneck.

"Feedback" here, for me, refers to the human effort of gatekeeping LLM outputs before they go out and serve their purpose (software released to production, an LLM-drafted email being sent, an LLM-created slide deck reviewed before presenting, etc.).

This is most relevant where LLMs are generating outputs faster than long-standing engineering practices can keep up with.

The most prominent aspect of this is the pull request as the quality gate. It's not the only one, but it is the one where humans are most in the loop, and the one that people who care about quality, maintainability, security and operational stability are most uncomfortable removing or even loosening.

The problem, as Chris Parsons nicely puts it:

This is the theory of constraints in action: speed up one stage and the bottleneck moves downstream. When code arrives faster, it pushes more work into review, testing, deployment, and requirements clarification. The queue grows. Nobody is reviewing any faster.
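To put rough numbers on the growing queue described in that quote, here is a tiny Python sketch. The function `review_backlog` and the daily rates are made-up illustrations, not measurements from any team.

```python
def review_backlog(opened_per_day: int, reviewed_per_day: int, days: int) -> list[int]:
    """Return the number of unreviewed PRs at the end of each day."""
    backlog = 0
    history = []
    for _ in range(days):
        # New PRs arrive, reviewers clear what they can, the rest carries over.
        backlog = max(0, backlog + opened_per_day - reviewed_per_day)
        history.append(backlog)
    return history


# Hypothetical rates: before AI, 10 PRs opened and 10 reviewed per day.
print(review_backlog(10, 10, 5))  # [0, 0, 0, 0, 0]
# Code now arrives 2.5x faster, but nobody is reviewing any faster.
print(review_backlog(25, 10, 5))  # [15, 30, 45, 60, 75]
```

Under these assumed rates the backlog grows by 15 PRs every day, and it keeps growing for as long as review capacity stays where it is.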

If review, testing and quality assurance are simply removed or loosened to let the bombardment of PRs flow through, the more consequential result is that production quality problems and incidents, the kind that affect users, the organisation's reputation and potentially revenue, become the bottleneck. Eventually it will come back to bite.

Humans get fatigued, AI doesn't.

With AI, the human is the only one who needs rest while the machine keeps generating work that needs evaluating: permission fatigue, review fatigue, the endless “just need a human to press approve” requests. Cory Doctorow calls this the reverse centaur: humans whose purpose is to support the machine’s needs.

The reviewers who care most about quality will be the first to burn out, because they are the ones who read everything instead of skimming. Either you make human feedback unnecessary, or you make it instant.

So what does one do about this review bottleneck?

This is the crux of the problem, the very point where we are forced to decide: do we hold a tight grip on these long-standing practices and push senior reviewers to the end of their tether with unrelenting reviews of AI output (often slop), or do we clear the path for AI's disruption and adapt accordingly?

On one end of the spectrum, we could double down on existing engineering practices. Humans review everything. This may be most comforting as it may give the most objective confidence to responsible humans that quality is where it needs to be.

But it simply doesn't scale. There is too much code being generated, and reading and reviewing alone does not give humans the same in-depth understanding of what was built as building it themselves would. So things will and do slip through.

Alternatively, we give in and loosen the quality and review gates without additional steps. This risks pushing low-quality, architecturally deficient, bug-infested code that no one understands into production. The instant gain is real: breakneck delivery speeds. Even the Product Manager has something to push to production. But the theory of constraints just pushes the bottleneck downstream, right into production, where it hurts the most and is the hardest to unwind.

[... 1672 words]

Mozilla Used Anthropic's Mythos to Fix 271 Bugs In Firefox (via Simon Willison)

Mozilla is one of the companies that got access to Anthropic's new Mythos Preview model, and it has put it to good use.

As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week's release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.

Our experience is a hopeful one for teams who shake off the vertigo and get to work. You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We are extremely proud of how our team rose to meet this challenge, and others will too. Our work isn't finished, but we've turned the corner and can glimpse a future much better than just keeping up. Defenders finally have a chance to win, decisively.

This seems to validate a lot of Anthropic's claims about the new model's capabilities. And it's encouraging to know that the model will likely become available to the general public at some point, where it can be used to strengthen the security posture of existing and new systems.

When that eventually happens, people are going to need to act fast to find and resolve security vulnerabilities before bad actors get a chance to exploit them.

There will be casualties.

Building Has Become Really Fun (again)

It's becoming apparent that people are really... really enjoying using AI to build things.

Not just developers, non-technical people too, from product managers to CEOs to business owners. Pretty much anyone.

People are excited to build. Building at night, on the weekends and during their holidays. I can relate.

It has almost become addictive. "Please, just one more feature...".

This is kind of a big deal. The performance difference between Cursor and Claude Code? Or Claude Code and OpenCode? All that comes from the system around the model. That’s why Google can have one of the best models in the market, and still produce the Shakespearean tragedy that is the Gemini CLI. When people say “Mythos found a zero-day,” the truth is more “Mythos, orchestrated by a purpose-built vulnerability research pipeline, found a zero-day.”

— Devansh

Anthropic Opus 4.7 Released Today

The much-anticipated Opus 4.7 was released today. It's the only 4.7 in the model family, with Sonnet and Haiku still at 4.6.

According to the claimed benchmarks, it shows a substantial jump in capability across the board, with notable improvements in:

Visual reasoning — It can now "see" higher resolution pictures up to 2,576 pixels on the long edge, 3x more than Opus 4.6.
Instruction following — It takes instructions more literally than the previous version. The called-out side effect is that users may need to re-tune any prompts and harnesses.
Memory — It's better at using file-system-based memory, remembering important notes across long-running, multi-session work.
Real-world work — Areas like financial analysis, legal, and professional slide presentations.

Anthropic Opus 4.7 Benchmark

I like that they have also included benchmark numbers for the new unreleased Mythos Preview model. The jump in SWE-bench agentic coding from Opus 4.6 to Opus 4.7 is already substantial, and then there's a further leap to 93.9% for Mythos!

Interesting that the cybersecurity vulnerability reproduction score is actually slightly lower on 4.7 than it was on 4.6 — although it appears this may have been intentional.

Anthropic Opus 4.7 Benchmark Cyber

We stated that we would keep Claude Mythos Preview's release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

And...

Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

This seems to imply the model was intentionally dialled back to reduce the risk of misuse, with access to the full capabilities gated behind the new Cyber Verification Program.

Other Notable Changes

Along with the model, they are releasing these notable controls:

New xhigh reasoning effort — an effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, the default effort level has been raised to xhigh for all plans.
New /ultrareview slash command in Claude Code — produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch.

Opus 4.7 does use more tokens, and it's easy to see why they introduced this new effort level. xhigh scores substantially higher than high at the cost of more tokens burned — but max uses double the tokens of xhigh for a smaller score jump than high to xhigh.

Anthropic Opus 4.7 Agentic Reasoning Effort Token Usage

Opus 4.7 Preparing For Release (via Alberto Romero)

While all the talk is about Anthropic’s “terrifying” new Mythos model, and it being too dangerous to release to the public, it looks like we may be able to get our hands on an upgraded Opus 4.7 in the meantime, possibly as soon as this week.

While 4.5 and then 4.6 were total game changers, there is still often frustration working with them, as with all other models. If 4.7 is anything like the upgrade from 4.5 to 4.6, then we should see a notable and much-welcomed increase in capability.

Opus 4.7 X post

Feedback Flywheel

Rahul Garg at Thoughtworks talks about the Feedback Flywheel, a practice that encourages paying attention to signals in AI engineering that can be used to continuously improve the AI engineering setup and workflows to get better outcomes, not at a personal level but at the team level.

Every AI interaction generates signal: prompts that worked, context that was missing, patterns that succeeded, failures worth preventing. Most teams discard this signal. I propose a structured feedback practice that harvests learnings from AI sessions and feeds them back into the team's shared artifacts, turning individual experience into collective improvement.

Anyone who has used AI when building software has felt the frustration of it not doing what you expect, even when the right thing seems like it should be obvious.

Every interaction like this is a signal that the LLM hasn't been primed sufficiently with the right information or instruction for the task or workflow. At this point one could start instructing and steering the LLM right there and then with prompts to get a better result. But the same steering will likely need to be repeated later.

The idea with the Feedback Flywheel is to pay attention to these moments and ask what went wrong and what can be improved so that it works better next time, not just for yourself but for anyone on the team working on the project. Then take steps to bake those improvements in, so the next encounter with that situation goes better for everyone.

Adopting AI practices can plateau once everyone gets comfortable.

With AI coding assistants, most teams reach a plateau. They adopt the tools, develop some fluency, and then stay there. The same prompting habits, the same frustrations, the same results month after month. Not because the tools stop improving, but because the team's practices around the tools stop improving

Worse than this, when every developer on the team is encountering their own flavours of these issues, and each has their own level of skills, prompting style, local setup and even different AI tools, what you can and likely will quickly end up with is snowflake pull requests of wildly varying style and quality.

All of this is signal that should be used to continuously adjust the context and knowledge the LLM is primed with. This can be done in various ways, such as adding specific content to the agent-specific files (AGENTS.md, CLAUDE.md, etc.), well-structured and discoverable context in additional markdown files, agent skills and custom commands.
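As a rough illustration of baking a learning into a shared file, here is a small Python sketch that adds a dated note under a section of an AGENTS.md-style document. The function name `add_learning`, the `## Heading` layout and the note format are my own assumptions, not a convention prescribed by the article or by any tool.

```python
from datetime import date


def add_learning(agents_md: str, heading: str, note: str, when: date) -> str:
    """Return the document text with a dated note added under '## heading'."""
    entry = f"- {when.isoformat()}: {note}"
    section = f"## {heading}"
    lines = agents_md.splitlines()
    if section in lines:
        # Insert directly below the heading so the newest note comes first.
        lines.insert(lines.index(section) + 1, entry)
    else:
        # Start a new section at the end, separated by a blank line.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [section, entry]
    return "\n".join(lines) + "\n"


doc = "# Project notes\n\n## Testing\n- Run the unit tests before committing\n"
doc = add_learning(doc, "Testing", "Mock the clock in date tests", date(2026, 1, 2))
print(doc)
```

The resulting text would be written back to the shared file and committed, so the learning reaches everyone on the team rather than staying in one person's prompt history.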

These benefits quickly start to compound, resulting in better outcomes across the team, a higher-quality codebase, and a faster pace of delivery without sacrificing quality.

LLM Knowledge Bases

Andrej Karpathy recently shared his approach to building LLM-managed knowledge bases, and it resonates with me.

I've never been great at organizing notes. I suspect most people are the same — they want things organized, they just don't want to be the one doing it. Which makes this a perfect job for an LLM.

The core idea:

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM.

You provide raw information, the LLM organizes and makes it discoverable, and then it's accessible by you directly, and also made available to the LLM that you work with for looking up information, Q&A and outputting it in different formats.

In a way it's similar to RAG but less complicated. It's probably a bit more like the progressive disclosure approach used with agent skills, where information is discoverable when needed in the conversation, without blowing out the context with wiki content that isn't relevant.
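The progressive disclosure idea can be sketched in a few lines of Python. This is my own illustration, not Karpathy's setup: the wiki is modelled as a plain dict of file name to markdown text, and `build_index` and `load_article` are hypothetical helpers. The agent would be given only the compact index up front, and would pull in a full article only when the conversation calls for it.

```python
def build_index(articles: dict[str, str]) -> str:
    """One line per article: file name plus its first heading or first line."""
    rows = []
    for name, body in sorted(articles.items()):
        # Use the first non-blank line as a cheap summary of the article.
        first = next((ln for ln in body.splitlines() if ln.strip()), "")
        rows.append(f"{name}: {first.lstrip('# ').strip()}")
    return "\n".join(rows)


def load_article(articles: dict[str, str], name: str) -> str:
    """Return the full text of one article, loaded only when needed."""
    return articles[name]


wiki = {
    "tags.md": "# Tag design notes\nLong discussion of tag UX...",
    "review.md": "# Review bottlenecks\nLong discussion of PR queues...",
}
print(build_index(wiki))
print(load_article(wiki, "tags.md"))
```

The index stays small no matter how long the individual articles get, which is what keeps irrelevant wiki content out of the context.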

The real payoff comes with scale:

Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc.

I can see this extending further — multiple wikis, all accessible to the LLM, with controlled access say by different agents. A personal wiki, a business wiki, and a general one for collected articles and research interests. Different agents, different scopes, same underlying approach. Or a single agent with access to all of them.

This feels like something we are going to see more of as a new product or enhancement to existing products.