For the past several months, my development workflow looked like this: I’d run Claude Code and Codex side by side. Same prompt, same goal, both investigating the codebase independently. When it was time to plan, Claude would draft the plan and I’d copy it over to Codex for review. Codex would spot things Claude had missed — edge cases, wrong assumptions, overlooked files. I’d relay the findings back, let Claude revise, copy the revision back to Codex, iterate until both agreed. Sometimes I’d stage changes from a previous round so Codex could see the diffs more clearly. Then I’d look over the result myself.
It worked really well. The two models catch different things. Claude is better at synthesis and communication. Codex reads more carefully — it traces code paths, checks edge cases, doesn’t declare done too early. They investigate independently, and that’s what makes the results better.
But the relay was miserable.
Links got stripped and formatting degraded, but those weren't the real problem. The real problem was that every handoff still depended on me. Copy the output, paste it into the other session, wait, copy back, paste again. My attention stayed trapped inside the loop when it could've been somewhere more useful. That was the actual pain: a workflow that wouldn't move without me.
I didn’t want to change the workflow itself. I just wanted to stop being the human message bus.
The Harness Article That Confirmed Everything
Then I came across Anthropic's Harness design for long-running application development, which I first heard about in a YouTube video. I read the whole thing and it did two things at once.
First, it confirmed what I’d already been experiencing: having a separate session verify work independently — without the anchoring bias of having built it — produces better results. That was exactly what I’d been seeing with Claude and Codex.
Second, it turned that experience into a clearer system. I’d been focused mostly on the “different model” part. The article made the “separate evaluator” part much more explicit. Even another Claude session helps if it’s independent enough. The separation itself is load-bearing.
And honestly, it makes sense: if the original session thought the code was wrong, it wouldn’t have written it that way. It’s the same benefit pair programming has always had — a second pair of eyes catches what the author’s brain filters out. These LLMs aren’t that different from us here.
That combination — confirmed by experience, then clarified by research — was the moment it clicked. This wasn’t just some personal quirk in how I like to work. It seemed generally useful. So I turned the workflow into a proper plugin and called it TandemKit.

The Coordination Problem That Actually Needed Solving
My first design had four top-level sessions: Planner, Generator, and two Evaluators — one Claude, one Codex. They coordinated through plain text files. When one session was done, it would write a sync file, and the other would wait in the background until that file changed. It worked perfectly in Claude. But Codex just wouldn’t wait reliably. No matter what I tried — different file-watching approaches, different signaling mechanisms — Codex would either miss that it was its turn or skip the wait entirely.
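The sync-file handoff I was attempting looked roughly like the sketch below. The function name, marker format, and polling interval are illustrative, not TandemKit's actual implementation; the point is that each session was supposed to block until a shared file said it was its turn, and Codex simply wouldn't sit in a loop like this reliably.

```python
import pathlib
import time

def wait_for_turn(sync_file: str, my_marker: str,
                  poll: float = 2.0, timeout: float = 600.0) -> bool:
    """Block until sync_file contains my_marker, or give up after timeout.

    A hypothetical sketch of file-based session signaling: the other
    session writes its successor's marker into the sync file when done.
    """
    path = pathlib.Path(sync_file)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if path.exists() and path.read_text().strip() == my_marker:
            return True
        time.sleep(poll)
    return False
```

Claude would run a loop like this faithfully; Codex would either poll once and move on, or never check at all.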
I was on my fifth or sixth workaround for this when I discovered that OpenAI had just released an official Codex plugin for Claude Code — codex-plugin-cc. Claude could now invoke Codex internally, see its response, and resume the same Codex session later. Exactly what I needed.
That didn’t make the Claude-Codex exchange invisible. Everything is still written to markdown files under TandemKit/, one file per round. The full mission history stays on disk — every investigation, every convergence exchange, every evaluation verdict.
That archive is useful, too. When something weird surfaces weeks later, git history doesn’t just give you a commit message — it gives you the whole conversation behind the commit, so you can actually dig into why a decision was made. It’s also great for improving the workflow itself: reading old missions is often the fastest way to notice that a rule belongs in AGENTS.md or a local skill.
What changed was the plumbing. Instead of juggling a fragile fourth terminal, TandemKit now calls one persistent Codex subagent on demand through codex-plugin-cc and resumes it whenever it needs another independent pass.
What a Mission Actually Is
I’ve been using AI agents every day for close to a year now, and a mission is the size of work I keep landing on: big enough to benefit from separate planning, generation, and evaluation, but small enough that the whole thing can still be implemented and fully verified within one set of sessions.
If the work is smaller than that — a quick one-file fix, a tiny refactor, a straightforward rename — I just use Claude Code directly. TandemKit’s multi-session loop uses more tokens and more ceremony than that kind of change deserves.
If the work is larger than that, I split it before starting. That’s where PlanKit fits in naturally: it takes ideas through Ideas → Roadmap → Features → Missions. By the time something reaches “mission,” it’s not a vague feature bucket anymore — it’s already shaped into a session-sized piece of work.
A Recent Mission in Practice
I wanted to add App Store Connect localization support to TranslateKit, my AI-powered app localization tool. The idea was simple: instead of manually managing app metadata in App Store Connect, developers could edit names, subtitles, descriptions, and keywords for all their localizations directly inside TranslateKit — synced via Apple’s API.
That was too big for one mission. So I split it in two: one mission for the App Store Connect API connection — JWT authentication, credential handling, fetching metadata, updating metadata — and one mission for the UI changes needed to surface all of that in the app. Data layer first, UX layer second.
During planning, Claude and Codex surfaced edge cases I hadn’t thought through. For example: what happens if the user wants to add a new language, but there’s no new App Store Connect version yet? Already-released versions can’t be edited, so the system has to decide — cache the changes locally, prompt the user to create a version first, or offer to auto-create one across platforms.
And if TandemKit creates a new version, what number should it use? Look at the history and guess? Ask the user to type one in? Offer common next-version options to pick from? Those are product decisions, not implementation details, and they belong in the spec before any code exists.
During evaluation of the API mission, Claude initially marked the feature as passing because the happy path worked and the existing-version tests were green. Codex traced the less obvious branch and found the real gap: when no editable App Store Connect version existed yet, the code created the new version record but never retried the localization write in the same flow. So the “new language” path looked successful while silently doing nothing until the user ran sync again — exactly the kind of code-path bug a second model catches.
The fix was small: retry the write after version creation, add tests for that branch. But without the second pass, it would’ve slipped through.
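In control-flow terms, the bug and its fix look something like this sketch. The client object and its method names are hypothetical stand-ins for the real App Store Connect wrapper (TranslateKit itself is a Swift app); only the shape of the fix is taken from the mission.

```python
def sync_new_language(client, locale, metadata):
    """Write localized metadata, creating an editable version first if needed.

    `client` and its methods are illustrative placeholders, not the real API.
    """
    version = client.find_editable_version()
    if version is None:
        version = client.create_version()
        # The bug: the original flow effectively stopped here, reporting
        # success without writing the localization to the new version.
    # The fix: always perform the localization write in the same pass,
    # including immediately after creating the version.
    client.write_localization(version, locale, metadata)
    return version
```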
Three Sessions, Autonomous Loop
YOU -- planning
|
`--> [1] Planner Session
Claude ---------> Codex (background)
| <- findings - |
`---- converge ----'
|
Spec.md <-- you approve before continuing
YOU -- start both sessions, then step away
|
|--> [2] Generator Session
| implements against Spec.md
| commits at milestones
|
`--> [3] Evaluator Session
Claude ---------> Codex (background)
| <- findings - |
`---- converge ----'
|
FAIL -> Generator fixes -> loop
PASS -> Review Briefing -> you

The Planner is the only interactive step. Once you approve Spec.md, the Generator and Evaluator run on their own until you get either a FAIL to fix or a PASS with a review briefing.
How Claude and Codex Actually Agree
The obvious move here is scoring: both models rate the work, average the scores, pass if it’s above some threshold. But scores hide failures. An 8/10 can still mean two critical criteria completely failed and everything else was fine.
So TandemKit evaluates criterion by criterion. Each finding gets two dimensions attached to it:
Agreement level: agreed, partially agreed, or disputed
Severity level: HIGH, MEDIUM, or LOW
Claude and Codex investigate independently and write their findings to files. Then Claude reads what Codex found, creates a merged evaluation — keeping what it agrees with, explaining where it differs — and Codex reviews that merge. Any disagreement means re-reading the actual source files, not arguing from memory.
The convergence rule is simple: the loop continues until no HIGH or MEDIUM findings remain in the partially agreed or disputed buckets. If the same disagreement survives three rounds, TandemKit stops iterating and presents both positions to you.
In practice, convergence usually takes 2-4 rounds.
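The convergence rule can be sketched as a small decision function. This is a minimal illustration of the rule as described above, with made-up names and data shapes; TandemKit tracks this state in markdown files, not Python objects.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    criterion: str
    severity: str    # "HIGH" | "MEDIUM" | "LOW"
    agreement: str   # "agreed" | "partially agreed" | "disputed"

def convergence_status(findings, round_num, max_rounds=3):
    """PASS when no HIGH or MEDIUM finding is still contested;
    otherwise keep iterating, escalating to the user after max_rounds."""
    blocking = [
        f for f in findings
        if f.severity in ("HIGH", "MEDIUM")
        and f.agreement in ("partially agreed", "disputed")
    ]
    if not blocking:
        return "PASS"
    if round_num >= max_rounds:
        return "ESCALATE"   # present both positions to the user
    return "CONTINUE"
```

Note that a disputed LOW finding never blocks a PASS, while a single disputed HIGH finding does, which is exactly why this beats an averaged score.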
Why Not Agent Teams
You might wonder: doesn’t Claude Code have Agent Teams for exactly this? It does, but Agent Teams requires API billing — not included in Claude Max — and can’t integrate with Codex.
If Claude Max already costs you $100+/month, adding $20/month for ChatGPT Plus is a reasonable add — you get an independent second model, plus image generation for UI mockups, icons, and other visuals Claude doesn’t produce.
Everything Is Plain Text
Every investigation, every convergence exchange, every evaluation round is stored as readable files in your project:
TandemKit/001-ConnectAPIClient/
├── Spec.md
├── Planner-Discussion/
│   ├── Claude-01.md   ← Claude's investigation
│   ├── Codex-01.md    ← Codex's independent findings
│   └── Claude-02.md   ← converged plan
├── Generator/
│   └── Round-01.md
└── Evaluator/
    ├── Round-01.md    ← FAIL: version created, localization write not retried
    └── Round-02.md    ← PASS

Open the files and you can trace the full reasoning trail.
Getting Started
The README on GitHub has the full setup flow and shows how the sessions hand off to each other. If this sounds like the workflow you’ve been stitching together by hand, that’s the place to start:
FlineDev/TandemKit on GitHub — pair programming for AI agents: a Claude Code plugin that coordinates Claude and Codex across planning, generation, and evaluation.
