awesome-verifying-agentic-work is a curated list of research papers, open-source tools, and industry practices focused on one question: when an AI agent does work on your behalf, how do you know the result is correct?
The repository is at github.com/yihan2099/awesome-verifying-agentic-work.
Why I Built It
There are awesome lists for building agents, evaluating models, and AI safety. None of them focus on the practical layer: an agent just refactored 15 files and opened a PR — is this actually correct? An agent wrote a research summary — are the claims real? An agent analyzed a dataset — are the conclusions valid?
This is the verification gap. Generating is easy. Verifying is hard. For many tasks, verifying agent output is nearly as expensive as doing it yourself. As agents take on more complex, multi-step work, verification becomes the bottleneck — not generation.
Model evaluation (benchmarks, leaderboards, MMLU scores) does not answer this question. Nobody loses money because a model scores 2% lower on a benchmark. People lose money — or ship bugs, or publish wrong analysis — because they merged agent work they could not properly verify.
What It Covers
The list bridges several academic fields that don't usually talk to each other:
Scalable oversight — how weaker supervisors can verify stronger systems. Work from Anthropic, DeepMind, and OpenAI on debate, recursive reward modeling, and weak-to-strong generalization.
Runtime verification — checking agent behavior during execution against formal specifications. Recent work like AgentSpec (ICSE '26) applies temporal logic from hardware verification to LLM agent traces.
Process vs. outcome supervision — verifying each step of an agent's reasoning versus only checking the final result. OpenAI's process reward models showed that step-level verification catches errors that outcome-only checking misses.
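To make the distinction concrete, here is a toy sketch (my own, not from the list; `check_step`, `verify_outcome`, and `verify_process` are hypothetical names) in which two compensating arithmetic errors produce a correct final answer, so outcome-only checking passes while step-level checking does not:

```python
def check_step(claim: str, value: float) -> bool:
    # Re-derive the claimed value; a step is valid iff its arithmetic holds.
    # eval() is acceptable here only because the demo input is trusted.
    return eval(claim) == value

def verify_outcome(trace, expected) -> bool:
    # Outcome supervision: compare only the final value.
    return trace[-1][1] == expected

def verify_process(trace) -> bool:
    # Process supervision: every intermediate step must check out.
    return all(check_step(claim, value) for claim, value in trace)

# An agent "solves" 6 * 2 / 3 = 4 with two compensating mistakes:
# 6 * 2 is not 18, and 18 / 3 is not 4, yet the final answer is right.
bad_trace = [("6 * 2", 18), ("18 / 3", 4)]
```

Here `verify_outcome(bad_trace, 4)` returns True while `verify_process(bad_trace)` returns False, which is exactly the failure mode step-level verification is meant to catch.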
Verification-by-metric — the simplest case, where success is a single number. Karpathy's autoresearch is the cleanest example: modify the code, train for five minutes, check whether val_bpb improved, and keep or revert. This works because the objective is unambiguous; most real-world agentic work does not have that property.
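The keep-or-revert loop can be sketched in a few lines. This is not autoresearch's actual code; `keep_or_revert` and the candidate callables are hypothetical stand-ins for "apply a change, train briefly, measure val_bpb":

```python
def keep_or_revert(baseline_bpb, candidates):
    """Apply candidate changes one at a time, keeping each change only
    if the validation metric improves (lower val_bpb is better).

    `candidates` are callables that apply a change, run a short training
    job, and return the new val_bpb.
    """
    best = baseline_bpb
    kept = []
    for i, train_and_eval in enumerate(candidates):
        new_bpb = train_and_eval()
        if new_bpb < best:
            best = new_bpb       # keep the change
            kept.append(i)
        # else: revert -- the change is discarded and `best` stands
    return best, kept
```

The whole scheme hinges on the comparison `new_bpb < best` being a trustworthy proxy for "the work is correct"; once the objective is fuzzier than a single scalar, this loop no longer verifies anything.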
Practical verification patterns — how teams actually verify agent PRs, agent-generated analyses, and agent research output in production. Test generation, mutation testing, human-in-the-loop workflows, audit trails, and the economics of verification cost versus just doing it yourself.
The Core Tension
The fundamental question the list explores: can a weaker verifier reliably catch errors from a stronger generator? If verification requires equal capability to generation, you have gained nothing — you have just moved the work from "do the task" to "check the task."
Related questions that shape the taxonomy:
- Do generator and verifier fail on the same inputs? If failures are correlated, verification is unreliable exactly when you need it most.
- What is the cost-accuracy tradeoff for different verification strategies? Given a fixed budget, what mix of automated checks, cross-model verification, and human review maximizes confidence?
- Which domains are easy to verify (code has tests and compilers, math has proofs) versus hard (research, writing, open-ended analysis)?
Connection to My Other Work
Several of my projects attack this problem from different angles:
Pact verifies agent work through economic incentives and multi-party consensus — N workers submit independently, M judges rank independently, and algorithmic consensus (Borda Count + Kendall Tau) determines the result. No single trusted party needed.
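The description names Borda Count and Kendall Tau but not the exact protocol, so here is a minimal sketch of the two building blocks, assuming each judge submits a full ranking of submissions from best to worst:

```python
from itertools import combinations

def borda_scores(rankings, n_items):
    """Aggregate judges' rankings with a Borda count: an item in
    position p of a ranking earns (n_items - 1 - p) points."""
    scores = [0] * n_items
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] += n_items - 1 - pos
    return scores

def kendall_tau_distance(r1, r2):
    """Number of item pairs the two rankings order differently
    (0 = identical order, max = completely reversed)."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
        for a, b in combinations(r1, 2)
    )
```

The Borda winner is the consensus result; the pairwise Kendall Tau distances can flag a judge whose ranking disagrees sharply with the others. How Pact actually combines the two is not specified here.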
OpenProcurement verifies through decision auditability — agents document what tier they chose, why, and what alternatives they tried. You verify the reasoning process, not just the output.
Workflow Optimizer verifies through empirical measurement — run the workflow N times, measure pass/fail rates, categorize failure modes, apply targeted fixes, and re-measure until the success rate meets a target.
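The measurement step can be sketched as follows; `measure_workflow` and the `run_once` interface are hypothetical illustrations, not Workflow Optimizer's API:

```python
from collections import Counter

def measure_workflow(run_once, n_runs):
    """Run a workflow repeatedly; return the pass rate and a tally of
    failure modes. `run_once` executes the workflow once and returns
    (passed: bool, failure_mode: str | None)."""
    failures = Counter()
    passes = 0
    for _ in range(n_runs):
        passed, mode = run_once()
        if passed:
            passes += 1
        else:
            failures[mode] += 1
    return passes / n_runs, failures
```

An outer loop would apply a fix targeting the most common failure mode and call `measure_workflow` again, repeating until the pass rate clears the target.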
These three approaches — consensus, audit trails, empirical measurement — cover complementary parts of the verification landscape. The awesome list is the connective tissue that situates them alongside the broader research and tooling ecosystem.