workflow-optimizer is an open-source framework of five composable markdown skills for measuring and systematically improving agent workflows. It is not a product. It is a methodology, written for agents to follow.
The repository is at github.com/yihan2099/workflow-optimizer — open source, MIT licensed.
Why I Built This
I run a small bot service that posts content to Twitter, LinkedIn, and Substack via browser automation. Each posting workflow is an agent skill. Each skill can fail — authentication expires, UI elements move, rate limits kick in, page loads time out. In any given week, some subset of these workflows breaks silently.
The pattern I kept running into: I would notice a workflow was broken only when checking manually. No structured failure data. No sense of whether the failure was a one-time glitch or a recurring fault. No way to tell whether a fix I applied had actually worked or just postponed the next failure.
I wanted a simple, agent-native way to ask: "Is this workflow reliable? How reliable? When it fails, why?" And then a way to act on the answers systematically, not just once.
The Approach: Five Skills, One Loop
workflow-optimizer is five composable markdown skills. Each is a SKILL.md file — an instruction set that any Claude agent can load and follow. The design is deliberately minimal: no new infrastructure to install, no framework to lock into, no SDK. Skills are the unit. The agent is the runtime.
The five skills form a closed loop:
measure — runs a workflow N times, records success/failure, turn count, duration, efficiency, and cost per run. Computes aggregate stats and writes them to .workflow-optimizer/workflows/{name}/results.json. This is the ground truth.
categorize — reads the failure cases from a measure run and buckets them into eight actionable categories: AUTH, RATE_LIMIT, TIMEOUT, UI_CHANGE, TIMING, CONTENT, BROWSER, UNKNOWN. Each category has a different fix strategy. Categorization turns a list of failures into a repair roadmap.
baseline — snapshots current metrics to .workflow-optimizer/workflows/{name}/baseline.json. Your before-state. Without it, you cannot know whether your fix actually moved the needle.
compare — diffs two metric snapshots and outputs a structured comparison: success rate delta, turn count delta, cost delta, efficiency delta. This is how you prove — to yourself, to a team, to a PR reviewer — that the change worked.
optimize — the full closed loop. Runs measure, categorizes the failures, applies fixes, re-measures, compares to baseline, and repeats. Includes stagnation detection: if three consecutive rounds fail to improve the target metric by more than a threshold, it stops and reports.
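The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the skill's actual implementation: `measure` and `apply_fixes` are hypothetical callbacks standing in for the real skills, and the parameter names and defaults are assumptions.

```python
def optimize(measure, apply_fixes, target_rate=0.8, threshold=0.05,
             stagnation_limit=3, max_rounds=10):
    """Sketch of the optimize loop: measure, fix, re-measure,
    stop on target reached or on stagnation."""
    history = [measure()]  # one success rate per round
    stagnant = 0
    for _ in range(max_rounds):
        if history[-1] >= target_rate:
            return {"status": "target_reached", "history": history}
        apply_fixes()
        history.append(measure())
        # Stagnation: the round improved by no more than the threshold.
        if history[-1] - history[-2] <= threshold:
            stagnant += 1
        else:
            stagnant = 0
        if stagnant >= stagnation_limit:
            return {"status": "stagnated", "history": history}
    return {"status": "max_rounds", "history": history}
```

The key design point is that every exit path reports its history, so even a stagnated run leaves an auditable trail.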
The entry point is a single invocation: /optimize my-workflow --runs 5 --target-rate 0.8. The agent handles the rest.
What Worked
The categorization step turned out to be the highest-leverage part of the system.
Before building workflow-optimizer, when a posting workflow failed, I would look at the error, guess the cause, and apply a fix. The guess was often wrong. I would fix a timing issue when the real problem was an expired cookie. I would update a CSS selector when the real problem was a rate limit I had hit on the previous run.
Forcing failures into typed categories changes the repair process. AUTH failures have a known fix: call sync-cookies and re-run. RATE_LIMIT failures have a known fix: add a backoff or spread runs over more time. UI_CHANGE failures require inspection. UNKNOWN failures require investigation. When you know the category, you know the repair class. When you know the distribution of categories across a run, you know where to invest.
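As a sketch of what typed categorization looks like mechanically, here is a keyword-based classifier. The patterns and the helper name are illustrative assumptions — the actual skill has the agent reason over the full failure context rather than match regexes.

```python
import re

# Illustrative keyword heuristics per category; a real run would also
# consider screenshots, page state, and the transcript of the failed run.
CATEGORY_PATTERNS = [
    ("AUTH", r"login|session expired|401|unauthorized|cookie"),
    ("RATE_LIMIT", r"rate limit|429|too many requests"),
    ("TIMEOUT", r"timed? ?out|deadline exceeded"),
    ("UI_CHANGE", r"selector not found|element not found"),
    ("TIMING", r"not clickable|still uploading|stale element"),
    ("CONTENT", r"character limit|rejected post|invalid content"),
    ("BROWSER", r"browser crashed|tab closed|net::"),
]

def categorize(error_message: str) -> str:
    """Return the first matching category, else UNKNOWN."""
    msg = error_message.lower()
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, msg):
            return category
    return "UNKNOWN"
```

Once an error is bucketed, the category name alone tells you the repair class to reach for.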
The baseline-and-compare pattern also proved its value immediately. My Twitter posting skill had a success rate I estimated at "usually fine." After running measure five times, the actual rate was 60%. After categorizing: 3 of the 5 failures were TIMING issues — the skill was clicking post before the media upload had finished. One targeted fix — adding a 2-second wait after upload before clicking — brought the rate to 100% across the next five runs. The compare output showed: success rate +40%, avg turns -1.2, avg cost -$0.003 per run. Concrete. Auditable.
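The compare output is, at heart, a field-wise diff of two metric snapshots. A minimal sketch, assuming the snapshot field names mirror what measure writes to results.json:

```python
def compare_snapshots(baseline: dict, current: dict) -> dict:
    """Delta (current minus baseline) for every numeric metric both share."""
    return {
        key: round(current[key] - baseline[key], 6)
        for key in baseline
        if key in current and isinstance(baseline[key], (int, float))
    }
```

Feeding it the Twitter before/after numbers reproduces the deltas reported above: success rate +0.4, turns −1.2, cost −$0.003 per run.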
What Broke
The hardest part of building this was defining "success" in a way an agent could reliably evaluate.
For browser automation workflows, success is obvious: did the post appear on the platform? But this requires the agent to verify — which means navigating to the posted content, checking it exists, and returning a boolean. That verification step is itself a workflow that can fail. I ended up with agents that would fail to verify a successful post and record it as a failure, artificially inflating the failure rate.
The fix was to separate the task outcome from the verification step: log both independently, and count a run as successful only if the task completed AND the verification confirmed it. This added complexity but made the measurements trustworthy.
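A sketch of that separation, as a hypothetical record type — the real skills log these as independent JSON fields per run, not a Python class, and the property names here are my own:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunRecord:
    task_completed: bool      # the skill reported the post was made
    verified: Optional[bool]  # None means verification itself errored out

    @property
    def success(self) -> bool:
        # A run counts as successful only when the task completed AND an
        # independent check confirmed the posted content exists.
        return self.task_completed and self.verified is True

    @property
    def needs_recheck(self) -> bool:
        # Task looked fine but the verifier failed: re-verify instead of
        # recording a failure and inflating the failure rate.
        return self.task_completed and self.verified is None
```

Logging both booleans separately is what makes the "verifier broke, not the task" case distinguishable after the fact.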
Stagnation detection also required more calibration than I expected. My first implementation stopped the optimize loop if three rounds produced no improvement. But "no improvement" with five runs per round has high variance — a skill with 80% true reliability will sometimes hit 100% for three consecutive rounds and then 60%, not because anything changed, but because of stochastic variation. I switched to a ten-run floor before stagnation can trigger, which makes the detection slower but more reliable.
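The variance claim is easy to check with a binomial calculation (`run_rate_prob` is just an illustrative helper, not part of the skills):

```python
from math import comb

def run_rate_prob(p: float, n: int, k: int) -> float:
    """P(exactly k successes in n independent runs) at true reliability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A skill with 80% true reliability, measured five runs at a time:
p_perfect_round = run_rate_prob(0.8, 5, 5)   # a round that looks like 100%
p_three_in_a_row = p_perfect_round ** 3      # three such rounds back to back
```

A five-run round looks perfect about 33% of the time, and three perfect rounds in a row happen roughly 3.5% of the time by pure chance — often enough that a five-run window routinely misleads, which is what the ten-run floor guards against.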
The Numbers
Across my four posting workflows, before optimization:
- Twitter: 60% success rate, avg 8.3 turns, avg $0.012 per run
- LinkedIn: 80% success rate, avg 6.1 turns, avg $0.009 per run
- Substack: 90% success rate, avg 5.8 turns, avg $0.008 per run
- sync-cookies (prerequisite): 95% success rate, avg 3.2 turns, avg $0.004 per run
After one optimize loop per workflow:
- Twitter: 100% (5/5 runs), -1.2 turns, -$0.003 per run
- LinkedIn: 100% (5/5 runs), -0.8 turns, -$0.001 per run
- Substack: 100% (5/5 runs), -0.6 turns, no cost change
- sync-cookies: 100% (5/5 runs), no change
Total API cost for all measurement and optimization runs: approximately $0.30.
What's Next
I want to add a lightweight logging interface so workflows can emit structured events that categorize can consume directly, rather than relying on the agent to infer failure types from error messages. I also want a compare-across-skills view — a leaderboard of workflow reliability across a full skill library, not just per-workflow snapshots. And I want to parallelize the re-measure step to cut optimization wall-clock time.