I Replaced a Sprint Task With OpenAI's Codex Agent — Here's Exactly What Happened
OpenAI launched Codex as a cloud-based coding agent. Instead of writing about the announcement, I gave it a real task from my codebase and compared the results to Claude Code, Copilot agent mode, and doing it myself.
Another week, another AI product launch. OpenAI just dropped Codex, a cloud-based software engineering agent that spins up its own sandboxed environment, reads your repo, writes code, and opens PRs. According to TechCrunch, it's available now to ChatGPT Pro, Team, and Enterprise users and runs on a new model called codex-1, which is a fine-tuned version of o3.
Cool. But I don't care about announcements. I care about whether this thing can actually do work. So I gave it a real task from TechByBrewski's codebase and timed the whole thing.
The Task
I had a genuine item sitting in my backlog: refactor the blog post rendering pipeline to support a new excerpt field that gets pulled into listing cards. It's not trivial — it touches the data layer, the markdown processing, and the front-end components — but it's not a moonshot either. Solid mid-sprint task. The kind of thing that usually takes me 60-90 minutes when I factor in testing and edge cases.
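To make the scope concrete, here's roughly the shape of the change in TypeScript. The names (PostMeta, cardExcerpt) are illustrative, not my actual code, and the real pipeline has a markdown processing step in the middle:

```typescript
// Illustrative sketch of the task, not my actual codebase.
// The new excerpt field flows from the data layer into listing cards.
interface PostMeta {
  title: string;
  date: string;
  excerpt?: string; // new: short summary surfaced on listing cards
}

// Listing cards prefer the explicit excerpt and fall back to a
// truncated slice of the post body when none is set.
function cardExcerpt(meta: PostMeta, body: string): string {
  if (meta.excerpt) return meta.excerpt;
  const cut = body.slice(0, 160).trimEnd();
  return body.length > 160 ? cut + "…" : cut;
}
```

The type is the easy part; the work is threading that field through frontmatter parsing and the components without breaking anything.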
Perfect candidate.
What Codex Delivered
I connected my GitHub repo, described the task in natural language (with some pointed context about the file structure), and let Codex run. The Verge's coverage describes Codex as working "in the background" on tasks that take minutes — and that tracks. It spun for about 8 minutes before presenting me with a diff.
The good: it correctly identified the three files that needed changes. It added the excerpt field to the data schema, threaded it through the markdown parser, and updated the listing component. The code was clean, idiomatic, and honestly better-formatted than what I write at 11 PM.
The bad: it made a wrong assumption about my markdown frontmatter format. It expected a YAML-style excerpt: key, but my posts keep frontmatter in a JSON block, and that mismatch broke the parser. It also didn't touch my tests: no updates, no new ones. When I asked it to fix both issues in a follow-up prompt, it nailed the frontmatter fix but wrote a test that imported from a path that doesn't exist.
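If the mismatch isn't obvious, here's a minimal sketch of the two formats and a JSON-block reader. The names are hypothetical, and the "frontmatter ends at the first blank line" rule is a simplification of what my parser actually does:

```typescript
// Sketch of the format mismatch, with hypothetical names.
//
// What Codex assumed (YAML):      What my posts actually use (JSON):
//   ---                             {
//   excerpt: A short summary          "excerpt": "A short summary"
//   ---                             }

function readExcerpt(source: string): string | undefined {
  // Capture a leading { ... } block terminated by a blank line
  // (a simplifying assumption for this sketch).
  const match = source.match(/^\s*(\{[\s\S]*?\})\s*\r?\n\s*\r?\n/);
  if (!match) return undefined;
  try {
    const meta = JSON.parse(match[1]) as { excerpt?: string };
    return meta.excerpt;
  } catch {
    return undefined; // malformed block: treat as no excerpt
  }
}
```

A YAML key lookup against that JSON block finds nothing, which is exactly the kind of repo-specific quirk an agent can't guess without being told.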
Total wall-clock time to a working PR: about 35 minutes, counting my review and manual fixes. For context, AI helped me research and draft parts of this write-up, which made the comparison easier to frame, but the coding results are exactly what I saw.
How It Stacks Up
Me, manually: ~75 minutes. Clean. Tests included. No surprises.
Codex: ~35 minutes (including my cleanup). 80% correct on first pass. Tests were a miss.
Claude Code (via terminal): I ran the same task description. Claude Code took a more conversational approach — asked me two clarifying questions before writing anything, which was annoying but also meant it got the frontmatter format right. Output was solid. Needed less cleanup. No PR automation though; I had to commit and push myself. Total: ~40 minutes.
GitHub Copilot agent mode (VS Code): This felt the most integrated but the least autonomous. It basically gave me really good autocomplete in the right files. I was still driving. Total: ~55 minutes, which is faster than manual but not "agent" territory.
My Actual Take
Codex is impressive as infrastructure. The sandboxed cloud environment, the GitHub integration, the fact that it reads your repo and produces real diffs — that's genuinely new. Most coding AI still operates at the file level. Codex operates at the project level, and that matters.
But the gap between "project-level awareness" and "production-ready output" is where all the interesting problems live. Codex doesn't know my conventions. It doesn't know that I'm neurotic about test coverage. It doesn't know that my frontmatter is JSON because I migrated from a different static site generator two years ago and never cleaned it up. (Nobody's perfect.)
The real unlock isn't replacing the sprint task. It's replacing the first 80% of the sprint task. If I treat Codex output like a junior dev's PR — review carefully, expect to send it back once — it's a net time savings. If I trust it blindly, I'm shipping bugs.
Where This Goes
I think the competitive landscape shakes out like this: Codex wins on autonomy and integration. Claude Code wins on reasoning quality and asking smart questions. Copilot wins on low-friction, in-editor flow. None of them replace a developer. All of them make a solo dev meaningfully faster.
For a one-person shop like TechByBrewski, that's not nothing. That's an extra feature per week, maybe. That's the margin between "I'll get to it" and "it's live."
Just don't skip the code review.
This post was researched and drafted with AI assistance. If something's wrong — and it might be — I'll update the post and call it out explicitly at the top.