Claude Fable 5 vs GPT-5.5: which AI actually builds a better feature?

Most Fable 5 vs GPT-5.5 comparisons you'll find are benchmark tables. SWE-Bench this, Terminal-Bench that. Useful, but they don't tell me the one thing I care about as a designer: when I hand a real feature to each model and let it run, which one ships something I'd actually want in my product?

So I ran the test. Same product, same PRD, same prompt, two agents. Claude Fable 5 vs GPT-5.5 (running through Codex), building a real feature for a tool my team uses every day. One of them clearly understood what "make it feel like Google Docs" meant. The other one was just faster.

Want the PRD skill I used to set up this test? It's free in our skills library.

In this post I'll walk through the exact setup, how each model planned and built the feature, who won and why, and how I'm actually deciding which model to reach for going forward. No synthetic benchmarks, just one real build.

Table of contents:

The test setup
Fable 5 vs GPT-5.5: how they think differently
The build, step by step
How to choose between Fable 5 and GPT-5.5
Tools I used

The test setup

The product is an internal tool we're building called carrot. It's a commenting layer for live products and prototypes, so you can leave feedback directly on the thing, then send those comments straight to an agent to fix. Picture it: you leave a comment on a button, mark it ready, and Claude goes in to make the change.

The feature I wanted to build: Google Docs-style inline feedback. Right now you pin a comment to an element. What I wanted was to highlight a specific piece of copy, leave a note on exactly that selection, and have it persist visually, the way a comment thread anchors to highlighted text in a doc.
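If you're curious what "anchored to a selection" involves under the hood, here's a minimal sketch. This is not the code either model wrote. It's one common approach: store the quoted text plus a little surrounding context, similar in spirit to the TextQuoteSelector in the W3C Web Annotation model, so the highlight can be found again after the page copy shifts. The names TextAnchor, createAnchor, and resolveAnchor, and the 20-character context window, are made up for illustration.

```typescript
// Hypothetical anchor shape: the quoted copy plus nearby context.
interface TextAnchor {
  exact: string;  // the highlighted copy
  prefix: string; // a few characters before the selection
  suffix: string; // a few characters after the selection
}

const CONTEXT_CHARS = 20; // arbitrary context window for this sketch

// Build an anchor from a selection's character offsets in the page text.
function createAnchor(text: string, start: number, end: number): TextAnchor {
  return {
    exact: text.slice(start, end),
    prefix: text.slice(Math.max(0, start - CONTEXT_CHARS), start),
    suffix: text.slice(end, end + CONTEXT_CHARS),
  };
}

// Find the anchor again in (possibly edited) text.
// Returns the start offset of the quote, or -1 if it is gone.
function resolveAnchor(text: string, anchor: TextAnchor): number {
  // First try: the quote together with its original context.
  const full = anchor.prefix + anchor.exact + anchor.suffix;
  const fullAt = text.indexOf(full);
  if (fullAt !== -1) return fullAt + anchor.prefix.length;
  // Fallback: first occurrence of the quote on its own.
  return text.indexOf(anchor.exact);
}

// Usage: anchor "free trial", then prepend text and find it again.
const page = "Start your free trial today. No credit card required.";
const start = page.indexOf("free trial");
const anchor = createAnchor(page, start, start + "free trial".length);
console.log(resolveAnchor("New! " + page, anchor)); // 16
```

A real implementation would also need to map DOM selections to text offsets and handle duplicate phrases more carefully. The point of the sketch is that a persistent highlight needs stored data, not just a UI flourish.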

To keep it fair, I gave both models the same inputs:

One detailed PRD, generated from a project transcript using our PRD skill, so each agent had full context on the codebase and the product
Plan mode first, then build, for both
Each model's own recommended scope option (no cherry-picking on my part)

Then I let them run. Claude Fable 5 in one window, GPT-5.5 through Codex in the other, both pointed at the same repo.

Fable 5 vs GPT-5.5 vs the benchmarks

Quick context, because the leaderboards do matter. On repository-scale work like SWE-Bench Pro, Fable 5 tends to lead; on terminal-native tasks, GPT-5.5 running in the Codex harness is right there with it. On price, public roundups list GPT-5.5 as the cheaper model: $5 in / $30 out per million tokens, versus $10 in / $50 out for Fable 5. So this isn't a blowout on paper. The interesting question is what those differences feel like on a real, design-heavy feature.
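To make those prices concrete, here's a rough sketch of the arithmetic. The per-million-token rates are the ones quoted above; the token counts are hypothetical, since I didn't record exact usage for either run, and runCost is just a name I picked for the helper.

```typescript
// Per-million-token prices in USD, as quoted above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const GPT_55: Pricing = { inputPerM: 5, outputPerM: 30 };
const FABLE_5: Pricing = { inputPerM: 10, outputPerM: 50 };

// Cost of one run in USD for the given token counts.
function runCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}

// Hypothetical usage: 2M input tokens and 200k output tokens for both models.
const inputTokens = 2000000;
const outputTokens = 200000;

console.log("GPT-5.5:", runCost(GPT_55, inputTokens, outputTokens).toFixed(2)); // 16.00
console.log("Fable 5:", runCost(FABLE_5, inputTokens, outputTokens).toFixed(2)); // 30.00
```

Holding token counts equal only compares the rates. A model that explores more before building will likely use more tokens too, so the real gap on a given task can be wider than the rate card suggests.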

Fable 5 vs GPT-5.5: how they think differently

The personalities showed up before either model wrote a line of code, in plan mode.

GPT-5.5 was fast and legible. It came back quickly with three clean options: A (fast, but "not very Google Docs visually"), B (its recommendation, with text anchors and persistence), and C (the full build: margin markers, voting, everything). It was easy to read and easy to choose from. It also narrated what it was doing the whole way through, which honestly is fun to follow.

Fable 5 took its time and went deeper. It ran three explorations, confirmed the file map against the actual codebase, and asked me a genuinely good scoping question: did I want this shipped as one big change or broken into seven smaller PRs? That's the question a real engineer asks. If I were actually shipping this with my team, I'd want PRs under 500 lines so they're reviewable. Fable was thinking about how this lands in a real workflow, not just whether it works.

The tradeoff was real, though, and worth being honest about:

GPT-5.5 felt more transparent. It walked me through its process so I always knew what was happening. Fable 5 mostly went quiet and worked, which is unusual for Claude and a little nerve-wracking when you're 12 minutes in with no updates.
Fable 5 felt more thorough. It treated the feature as a full-stack problem (schema, API, dashboard, agent prompt), not just a UI task.
Speed was not close. GPT-5.5 finished in about 10 minutes. Fable 5 took nearly 39.

The build, step by step

Here's what each model actually produced. I'll keep the labels blind for a second, the way I did in the video, because the reveal is the fun part.

Option one: fast, but it missed the point

The problem: I wanted to select a line of copy and have it highlight, Google Docs style, then leave a comment anchored to that selection.
The solution it shipped: I could leave a comment, but the selected text didn't actually highlight. It dropped a small marker (labeled "KR," which I had to guess meant copy) and I clicked into it to see the note. Functional. Just not the experience I described.
The result: It worked, but it didn't feel like the thing I asked for. No persistent highlight, no sense that feedback was living on that specific piece of text.

Option two: it actually felt like Docs

The problem: Same ask.
The solution it shipped: I selected the copy, got a clean notification banner to leave feedback, hit send, and the text highlighted afterward to show feedback lived there. Hovering over it triggered a small animation surfacing the comment. It felt like the Docs interaction I had in my head, with a little extra delight on top.
The result: Clearly the better build. Not because it had more features, but because it understood the intent behind "make it feel like Google Docs" and designed the interaction around that.

The reveal

Option two was Fable 5. Option one was GPT-5.5.

Fable's final summary tells the story: it started with the schema and migration, built the store layer, the API, the dashboard, and the agent prompt, then wrote and verified tests. It correctly read the feature as more than a UI challenge. It was a question of how the whole thing works end to end. GPT-5.5, from a single prompt, just didn't go that deep. It gave me a working surface, not a finished experience.

The catch, again: Fable took nearly four times as long. For this build, the depth was worth it. For a tiny tweak, it absolutely wouldn't be.

How to choose between Fable 5 and GPT-5.5

Here's the honest takeaway: I'm not picking one and deleting the other. I bounce between models depending on the job and how many tokens I've already burned that day. After this test, here's the rule of thumb I'm using.

1. Match the model to the depth of the task

For small, well-defined design changes (move this, recolor that, tweak copy), reach for the faster, cheaper option. GPT-5.5 turns those around quickly and shows its work. You don't need a 39-minute deep dive to change a button.

2. Use the deeper model when intent matters

When the work is creative, UX-heavy, or you're pushing on what's actually possible, that's where Fable 5 earns its time. If a feature requires interpreting a fuzzy goal ("make it feel like Docs") into the right interaction, the extra exploration pays off.

3. Give both models the same strong context

The reason this test was fair is that both got the same detailed PRD. A model can only be as good as the context you hand it. Investing in a tight PRD up front (I used a skill to generate mine from a transcript) raised the floor for both agents.

4. Let plan mode tell you who understood the brief

Before you commit to a build, read the plan. The scoping questions a model asks are a tell. Fable asking whether to split into reviewable PRs signaled it was thinking like a teammate. That's a green flag worth more than raw speed.

5. Watch the cost, not just the clock

Fast modes and deep agents both burn tokens, just in different ways. Fable's depth costs minutes and tokens; GPT-5.5's speed costs you a shallower first draft you may have to iterate on. Factor in the real cost of the rework, not just the wall-clock time.
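As a back-of-the-envelope check on the clock side, here's a tiny sketch using the rough timings from this test. It leans on two big assumptions that won't always hold: every rework round with the fast model takes as long as its first pass, and the deep model gets it right in one pass. breakEvenPasses is a made-up helper name.

```typescript
// Approximate wall-clock minutes per pass from this test.
const FAST_PASS_MIN = 10; // GPT-5.5
const DEEP_PASS_MIN = 39; // Fable 5

// Smallest number of fast passes whose total time meets or exceeds
// one deep pass. Assumes every fast pass takes the same time.
function breakEvenPasses(fastMin: number, deepMin: number): number {
  return Math.ceil(deepMin / fastMin);
}

console.log(breakEvenPasses(FAST_PASS_MIN, DEEP_PASS_MIN)); // 4
```

In other words, under those assumptions the fast model stays ahead on time as long as it gets to an acceptable result within three passes. By the fourth, the deep model's single run would have been just as quick.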

The meta-point: these tools change weekly. The real skill is no longer knowing "which model is best"; it's building a process flexible enough to swap models in and out as they leapfrog each other.

Tools I used

A few things that made this test possible:

Conductor CLI for running both agents side by side. As a designer I like it because it's visual: I can see branches, files, and what each agent is doing, and it's friendlier for non-engineers than a bare terminal.
Our PRD skill to generate the detailed product requirements doc from a project transcript. This is what gave both models real codebase context, and it's free in our skills library.
carrot, the commenting tool the feature was built into. If you want to test leaving feedback and shipping it straight to an agent, there's an early signup.

This is exactly the kind of work my team does every day: figuring out how AI fits into a real product cycle without dropping the UX quality bar. We're a design engineering team helping B2B SaaS teams design and ship faster with these tools, and the hardest part is rarely the code. It's the process around it.

If your team is staring at a new model every week wondering how to actually fit it into your workflow, that's the problem I run workshops on. Fill in this form to sign up!

https://youtu.be/lcQgcf-17uA