AI Coding Agent Leaderboard
PinchBench success rates for AI coding agents, cross-referenced with ModelPriceLab live pricing to help you find the best value.
Success Rate Leaderboard
Success rate = percentage of standardized OpenClaw agent tasks completed successfully. Tasks are graded via automated checks and an LLM judge.
Benchmark data from PinchBench (pinchbench.com), last updated 2026-03-08. Price data from ModelPriceLab. This leaderboard is for informational purposes only.
About PinchBench
PinchBench is an open-source benchmarking system that evaluates LLMs as OpenClaw coding agents across 23 standardized real-world tasks. Unlike traditional LLM benchmarks, PinchBench focuses on tool usage, multi-step reasoning, handling ambiguous instructions, and practical outcomes.
23 Benchmark Tasks
Covering calendar creation, code writing, document summarization, email triage, market research, and more. Each task is graded via automated checks, LLM judge, or a hybrid approach.
Key Takeaways
Google Leads
Gemini 3 Flash Preview leads with 95.1% success rate, showing exceptional tool usage and multi-step reasoning.
Anthropic Full Lineup
All 5 Claude models rank in the top 17, from Haiku (90.8%) to Sonnet 4.5 (92.7%), showing consistent quality.
Value Matters
The highest success rate is not always the best value. Combining benchmark scores with ModelPriceLab pricing makes the cost per point of performance clear.
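As a rough sketch of the value comparison, cost per point can be computed as price divided by success rate. The model names, prices, and success rates below are made-up placeholders for illustration, not actual PinchBench or ModelPriceLab figures:

```python
# Hypothetical illustration: rank models by cost per point of success rate.
# All prices and success rates are invented placeholders.

models = {
    # name: (success_rate_percent, blended_price_usd_per_1m_tokens)
    "model-a": (95.1, 2.50),
    "model-b": (92.7, 6.00),
    "model-c": (90.8, 1.00),
}

def cost_per_point(success_rate: float, price: float) -> float:
    """USD paid per percentage point of benchmark success."""
    return price / success_rate

# Cheapest-per-point first: a slightly lower score can still win on value.
ranked = sorted(models.items(), key=lambda kv: cost_per_point(*kv[1]))

for name, (rate, price) in ranked:
    print(f"{name}: {cost_per_point(rate, price):.4f} USD per point")
```

With these placeholder numbers the lowest-priced model ranks first on value despite the lowest raw score, which is exactly the trade-off the takeaway describes.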
Better model decisions come from seeing benchmark scores and API pricing together.
Explore the full price matrix, scenario leaderboards, and solution library on ModelPriceLab to find the optimal combination for your use case.