workflow-optimizer

Markdown · Agent Skills · Optimization

workflow-optimizer is an open-source framework of five composable markdown skills for measuring and systematically improving agent workflows. It is not a product. It is a methodology, written for agents to follow.

The repository is at github.com/yihan2099/workflow-optimizer — open source, MIT licensed.

Why I Built This

I run a small bot service that posts content to Twitter, LinkedIn, and Substack via browser automation. Each posting workflow is an agent skill. Each skill can fail — authentication expires, UI elements move, rate limits hit, page loads time out. In any given week, some subset of these workflows breaks silently.

The pattern I kept running into: I would notice a workflow was broken only when checking manually. No structured failure data. No sense of whether the failure was a one-time glitch or a recurring fault. No way to tell whether a fix I applied had actually worked or just postponed the next failure.

I wanted a simple, agent-native way to ask: "Is this workflow reliable? How reliable? When it fails, why?" And then a way to act on the answers systematically, not just once.

The Approach: Five Skills, One Loop

workflow-optimizer is five composable markdown skills. Each is a SKILL.md file — an instruction set that any Claude agent can load and follow. The design is deliberately minimal: no new infrastructure to install, no framework to lock into, no SDK. Skills are the unit. The agent is the runtime.

The five skills form a closed loop:

measure — runs a workflow N times, records success/failure, turn count, duration, efficiency, and cost per run. Computes aggregate stats and writes them to .workflow-optimizer/workflows/{name}/results.json. This is the ground truth.
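As a sketch, the aggregation measure performs might look like this — the field names here are illustrative, not the actual results.json schema:

```python
import json
import statistics
from pathlib import Path

# Hypothetical per-run records; in practice the agent records these
# after each of the N runs.
runs = [
    {"success": True,  "turns": 7, "duration_s": 41.2, "cost_usd": 0.011},
    {"success": False, "turns": 9, "duration_s": 63.8, "cost_usd": 0.014},
    {"success": True,  "turns": 8, "duration_s": 45.0, "cost_usd": 0.012},
]

# Aggregate stats over the run set.
results = {
    "runs": runs,
    "success_rate": sum(r["success"] for r in runs) / len(runs),
    "avg_turns": statistics.mean(r["turns"] for r in runs),
    "avg_cost_usd": statistics.mean(r["cost_usd"] for r in runs),
}

# Write the ground truth where the other skills expect to find it.
out = Path(".workflow-optimizer/workflows/my-workflow")
out.mkdir(parents=True, exist_ok=True)
(out / "results.json").write_text(json.dumps(results, indent=2))
```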

categorize — reads the failure cases from a measure run and buckets them into eight actionable categories: AUTH, RATE_LIMIT, TIMEOUT, UI_CHANGE, TIMING, CONTENT, BROWSER, UNKNOWN. Each category has a different fix strategy. Categorization turns a list of failures into a repair roadmap.
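A rough sketch of how failures could be bucketed by keyword. The real skill lets the agent reason over the full failure context, so these patterns are illustrative, not the skill's actual heuristics:

```python
import re

# Ordered (category, pattern) pairs; first match wins.
CATEGORY_PATTERNS = [
    ("AUTH",       r"login|session expired|401|unauthorized|cookie"),
    ("RATE_LIMIT", r"rate limit|429|too many requests"),
    ("TIMEOUT",    r"timed? ?out|deadline exceeded"),
    ("UI_CHANGE",  r"selector not found|element not found|no such element"),
    ("TIMING",     r"not clickable|still uploading|stale element"),
    ("CONTENT",    r"invalid content|character limit|unsupported media"),
    ("BROWSER",    r"crashed|disconnected|net::"),
]

def categorize(error_message: str) -> str:
    """Bucket a failure's error message into one of the eight categories."""
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, error_message, re.IGNORECASE):
            return category
    return "UNKNOWN"
```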

baseline — snapshots current metrics to .workflow-optimizer/workflows/{name}/baseline.json. Your before-state. Without it, you cannot know whether your fix actually moved the needle.

compare — diffs two metric snapshots and outputs a structured comparison: success rate delta, turn count delta, cost delta, efficiency delta. This is how you prove — to yourself, to a team, to a PR reviewer — that the change worked.
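The diff itself is simple. A minimal sketch, assuming both snapshots expose the metric keys shown here (the real schema may differ):

```python
def compare(baseline: dict, current: dict) -> dict:
    """Return per-metric deltas between two snapshots (current - baseline)."""
    keys = ["success_rate", "avg_turns", "avg_cost_usd"]
    return {k: round(current[k] - baseline[k], 6) for k in keys}

# Illustrative before/after snapshots.
baseline = {"success_rate": 0.6, "avg_turns": 8.3, "avg_cost_usd": 0.012}
current  = {"success_rate": 1.0, "avg_turns": 7.1, "avg_cost_usd": 0.009}
delta = compare(baseline, current)
```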

optimize — the full closed loop. It calls measure, categorize, applies fixes, re-measures, compares to baseline, and repeats. Includes stagnation detection: if three consecutive rounds fail to improve the target metric by more than a threshold, it stops and reports.
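The control flow of the loop can be sketched like this. The skill steps (measure, categorize-and-fix) are stand-ins wired up as injected callables so the skeleton runs on its own; the 0.05 threshold is illustrative:

```python
def optimize(measure, improve, runs=5, target_rate=0.8,
             threshold=0.05, patience=3):
    baseline = measure(runs)              # snapshot the before-state
    best = baseline
    stagnant = 0
    rounds = 0
    while best < target_rate and stagnant < patience:
        improve()                         # stand-in for categorize + apply fixes
        current = measure(runs)           # re-measure after the fix
        rounds += 1
        if current > best + threshold:    # meaningful improvement resets the counter
            best, stagnant = current, 0
        else:
            stagnant += 1                 # another flat round toward stagnation
    return {"baseline": baseline, "final": best, "rounds": rounds}

# Simulate a workflow whose true rate improves by 0.2 per applied fix.
state = {"rate": 0.6}

def measure(runs):
    return state["rate"]

def improve():
    state["rate"] = min(1.0, state["rate"] + 0.2)

report = optimize(measure, improve)
```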

The entry point is a single invocation: /optimize my-workflow --runs 5 --target-rate 0.8. The agent handles the rest.

What Worked

The categorization step turned out to be the highest-leverage part of the system.

Before building workflow-optimizer, when a posting workflow failed, I would look at the error, guess the cause, and apply a fix. The guess was often wrong. I would fix a timing issue when the real problem was an expired cookie. I would update a CSS selector when the real problem was a rate limit I had hit on the previous run.

Forcing failures into typed categories changes the repair process. AUTH failures have a known fix: call sync-cookies and re-run. RATE_LIMIT failures have a known fix: add a backoff or spread runs over more time. UI_CHANGE failures require inspection. UNKNOWN failures require investigation. When you know the category, you know the repair class. When you know the distribution of categories across a run, you know where to invest.
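That category-to-repair mapping can be written down as a table. The AUTH, RATE_LIMIT, UI_CHANGE, and UNKNOWN entries paraphrase the strategies above; the others are plausible stand-ins, not the skill's actual instructions:

```python
# One repair class per failure category.
REPAIR_CLASS = {
    "AUTH":       "call sync-cookies and re-run",
    "RATE_LIMIT": "add a backoff or spread runs over more time",
    "TIMEOUT":    "raise the wait or retry with a longer deadline",
    "UI_CHANGE":  "inspect the page and update the interaction steps",
    "TIMING":     "add an explicit wait before the dependent step",
    "CONTENT":    "validate or adapt the content before posting",
    "BROWSER":    "restart the browser session and retry",
    "UNKNOWN":    "investigate manually before any automated fix",
}
```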

The baseline-and-compare pattern also proved its value immediately. My Twitter posting skill had a success rate I estimated at "usually fine." After running measure five times, the actual rate was 60%. Categorization showed the dominant failure type was TIMING — the skill was clicking post before the media upload had finished. One targeted fix — adding a 2-second wait after upload before clicking — brought the rate to 100% across the next five runs. The compare output showed: success rate +40 points, avg turns -1.2, avg cost -$0.003 per run. Concrete. Auditable.

What Broke

The hardest part of building this was defining "success" in a way an agent could reliably evaluate.

For browser automation workflows, success sounds obvious: did the post appear on the platform? But this requires the agent to verify — which means navigating to the posted content, checking it exists, and returning a boolean. That verification step is itself a workflow that can fail. I ended up with agents that would fail to verify a successful post and record it as a failure, artificially inflating the failure rate.

The fix was to separate the task outcome from the verification step: log both independently, and count a run as successful only if the task completed AND the verification confirmed it. This added complexity but made the measurements trustworthy.
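A minimal sketch of that split, with illustrative field names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunRecord:
    task_completed: bool      # did the posting steps themselves finish?
    verified: Optional[bool]  # verification verdict; None if verification itself failed

    @property
    def success(self) -> bool:
        # A run counts as successful only when the task completed AND
        # verification positively confirmed the post exists.
        return self.task_completed and self.verified is True
```

Logging both fields independently also makes verification failures visible as their own problem, instead of silently contaminating the success rate.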

Stagnation detection also required more calibration than I expected. My first implementation stopped the optimize loop if three rounds produced no improvement. But "no improvement" with five runs per round has high variance — a skill with 80% true reliability will sometimes hit 100% for three consecutive rounds and then 60%, not because anything changed, but because of stochastic variation. I switched to a ten-run floor before stagnation can trigger, which makes the detection slower but reliable.
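The recalibrated rule can be sketched as follows — the function name and the 0.05 threshold are illustrative:

```python
def is_stagnant(round_deltas, total_runs, min_runs=10,
                patience=3, threshold=0.05):
    """round_deltas: per-round change in the target metric, most recent last."""
    if total_runs < min_runs:      # ten-run floor: too few runs to judge
        return False
    recent = round_deltas[-patience:]
    # Stagnant only if the last `patience` rounds all failed to clear the threshold.
    return len(recent) == patience and all(d <= threshold for d in recent)
```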

The Numbers

Across my four posting workflows, before optimization:

  • Twitter: 60% success rate, avg 8.3 turns, avg $0.012 per run
  • LinkedIn: 80% success rate, avg 6.1 turns, avg $0.009 per run
  • Substack: 90% success rate, avg 5.8 turns, avg $0.008 per run
  • sync-cookies (prerequisite): 95% success rate, avg 3.2 turns, avg $0.004 per run

After one optimize loop per workflow:

  • Twitter: 100% (5/5 runs), -1.2 turns, -$0.003 per run
  • LinkedIn: 100% (5/5 runs), -0.8 turns, -$0.001 per run
  • Substack: 100% (5/5 runs), -0.6 turns, no cost change
  • sync-cookies: 100% (5/5 runs), no change

Total API cost for all measurement and optimization runs: approximately $0.30.

What's Next

I want to add a lightweight logging interface so workflows can emit structured events that categorize can consume directly, rather than relying on the agent to infer failure type from error messages. I also want a compare-across-skills view — a leaderboard of workflow reliability across a full skill library, not just per-workflow snapshots. And I want to parallelize the re-measure step to cut optimization wall-clock time.