Claude Code vs Codex: Same prompt, same repo, very different results

# Claude Code vs Codex: Same prompt, same repo, very different results

I gave Claude Code and Codex the exact same job. Same repo. Same prompt. Build a quick design system from what's already in the codebase, then ship a small feature on top of it.

I've been a Claude Code lover for a while now, so going in I had a clear favorite. By the end of this test, my brain was a little exploding. The agent I expected to win didn't win cleanly, and the one I was newer to surprised me in ways I didn't see coming.

If you're trying to decide which AI coding agent to lean on for product work, here's what actually happened when I ran both side by side.

The setup
Pricing and speed
Round 1: Build a design system from the codebase
Round 2: Ship a bulk-confirm feature
Round 3: Debugging a button size mismatch
The verdict
How I'd pick between them

The setup

Two terminal windows. Codex on the left, Claude Code on the right. Both pointed at the same repo, both running the same prompts in the same order.

The first prompt: "Can you look into the current codebase and create a landing page with all the branding elements I'd need for a design system. Include colors, fonts, tokens, everything you see in the current codebase."

The second prompt asked them to ship a feature on top of that codebase, a bulk "confirm all" action on a bookings page.

That's it. No special instructions, no tuning, no babysitting. The point was to see what each one does when you let it run.

Pricing and speed

A quick note on the two plans I'm running.

I'm on the Codex $20 plan and haven't hit the limit yet. That's wild for the volume I've been throwing at it. I'm also on the Claude Code $100 plan and have never hit my limit on that either. Token-wise, Codex is the clear winner on price-to-output.

On speed, my first impression was that Codex felt quicker. It returned questions and plans faster. But once you account for the back-and-forth across a full task, they end up roughly even. On the design system task, Claude Code finished in 7 minutes 45 seconds. Codex took about 8 minutes 45 seconds, mostly because it ran its checks twice.

A minute one way or the other doesn't change my day. The output quality does.

Round 1: Build a design system from the codebase

Both agents inspected the repo, pulled the token files, the global CSS, the Tailwind config, and started generating. Within a minute you could see the difference in how they communicate.

Claude Code shows a live checklist. Every step it plans to do, every step it's currently working on, every step it's checked off. You always know where it is in the task. Codex just narrates. It tells you what it's doing in plain text. Less visual, but the information is there if you read it.

Both produced a landing page. Both shipped light and dark modes, color tokens, typography, components. The outputs were more similar than I expected, but the differences were telling.

Where Claude Code's output was stronger

Type scale. Claude Code laid out the full type scale with weights and ratios. Codex showed typography but not the scale.
Token honesty. Claude Code's tokens felt closer to what's actually in the codebase.

Where Codex's output was stronger

Layout. It felt more like a real landing page. Cleaner opening, better section pacing.
Implementation notes. At the bottom of the page, Codex documented where it pulled each piece from. That kind of citation is useful when you want to verify what's real and what's invented.

Both also hallucinated a little. Claude Code added a "motion" section I don't think exists in the actual product. Codex made up some patterns. Worth flagging: neither output is something you'd ship to production without a review pass.

A bigger test on a real client repo

I ran the same exercise on a client codebase to see if the gap held up at scale. Different result this time.

On the client repo, Codex went deeper. It built a richer component catalog with "when to use" and "when not to use" sections for each one, exactly the kind of usage documentation that makes a design system actually useful for an agent (or a human) to consume. Claude Code's version was nicely organized too but didn't pull out the same level of semantic detail.

So: on the small test, Claude Code felt slightly more grounded. On the bigger codebase, Codex went further on the documentation side. Different strengths.

Round 2: Ship a bulk-confirm feature

This is where it got interesting.

The task: add a "confirm all" action to a bookings page. My instinct was that it belongs in the top bar, next to the saved filters, and that it should not be a primary action. I asked both agents to plan it out first before writing any code.

Codex asked more questions and asked them faster. It came back with a clear plan: place a secondary minimal button in the toolbar, implement it as "confirm all matching the current filter," skip row selection, use the existing mutation. It even proposed a confirmation dialog, which I pushed back on because confirming bookings isn't a destructive action.

Claude Code asked fewer questions but flagged a bigger one. Buried in its plan was this: "confirming will trigger email notifications to all customers." That's the kind of thing I'd want my team to weigh in on before I shipped it. Codex didn't surface it.

On the back-and-forth, Codex was faster. The conversation moved quickly, the plan iterated quickly. Claude Code took about three minutes to come back with its plan, which felt slow in the moment, but the plan it returned was more careful.

Both shipped the button. Both reused the existing mutation. Same general approach.

The thing only one of them got right

When I went to check the UI, here's what happened:

Claude Code put the button in the right place. It showed up on the upcoming bookings view, the unconfirmed view, everywhere it logically should.
Codex put the button in unconfirmed only. It wasn't accessible from upcoming, which is where I actually wanted it. I didn't catch that constraint in the conversation, but Claude Code inferred it from the task itself.

Logically, Claude Code understood what I was trying to do. Codex asked better clarifying questions but missed the intent.

Round 3: Debugging a button size mismatch

After shipping, I noticed the new "confirm all" button was visually smaller than the surrounding filter buttons. Same prompt to both: "Why is this button smaller than the filter button next to it?"

This is where the agents diverged the most.

Codex said the new button was using size="small" while the filter button used the default. It changed the new button to match. Done. Claude Code said something more interesting: "Both buttons are size small. The filter button has a custom full-height override. The actual problem is that your filter buttons aren't following the standard component sizing."

In other words, Codex fixed the symptom. Claude Code questioned the system.

Claude Code was right, and the implication mattered. If I'd taken Codex's fix, I would have made the new button match an already-broken pattern. Claude Code's read made me rethink the component sizing across the toolbar, not just the new button.

This is the moment that flipped my brain. Codex was faster and asked better questions. Claude Code understood the codebase more deeply.

The verdict

I went into this expecting Claude Code to win on every dimension. It didn't.

Dimension	Winner
Price	Codex
Speed (raw response time)	Codex
Speed (full task)	Roughly even
Live progress visibility	Claude Code
Clarifying questions	Codex
Surfacing hidden risk	Claude Code
Understanding intent	Claude Code
Codebase-level reasoning	Claude Code

If I had to pick one to stop using, I couldn't. Right now, I'm leaning into Codex more because I'm less familiar with it and I want to know where its ceiling is. That said, when I need an agent to reason about a codebase, not just produce code in it, Claude Code is still where I go.

What I will not do is keep treating this like a one-or-the-other decision. They think differently, and that difference is the value.

How I'd pick between them

A quick playbook based on what I've seen so far.

1. For exploration and planning, use Codex

Codex asks more questions and iterates on plans faster. If I'm in the early phase of a feature and I want to think out loud with an agent, Codex is the better thinking partner. The lower price also makes it less precious to throw away.

2. For execution inside a real codebase, use Claude Code

When the work is "don't break the design system, don't create inconsistencies, don't ship a fix that masks a deeper bug," Claude Code reasons about the surrounding code in a way Codex hasn't yet matched for me. The checklist UI also makes it easier to catch hallucinations before they ship.

3. For high-stakes changes, get a plan from both

This is the move I didn't expect to land on. For anything that touches shared infrastructure, sends user-facing emails, or modifies a core component, ask both agents to plan it. The points of agreement are usually safe. The points of disagreement are where the real conversation lives.

In my bookings test, only Claude Code flagged the email notifications. Only Codex caught the dialog UX question. Run them both, and you get the union.

4. Treat clarifying questions as a signal, not a verdict

Codex asked more clarifying questions in my test. That made it feel more thoughtful. But asking good questions and understanding the codebase are different skills. The agent that asks fewer questions but reads the code more deeply may still be the one that ships the right thing.

5. Always do a manual review pass

Both agents hallucinated. Both invented components or patterns that didn't exist. AI coding agents are accelerators, not replacements for the read-through. If you're shipping fast with either of these, build the review pass into your flow.

So which one am I moving forward with?

Both. For now. I want to push Codex harder before I commit to a primary, and I want to see if Claude Code's codebase-level reasoning gets even better as it learns more about the projects I'm working in.

If you're a design engineer or a product team trying to ship faster, the honest answer is that you should be testing both. The mental model that one of these wins outright is the wrong frame. The interesting question is: which agent matches which part of your workflow?

That's the one I'm still answering.