
AI Coding Agent Leaderboard

PinchBench success rates for AI coding agents, cross-referenced with ModelPriceLab live pricing to help you find the best value.

📊 24 models · 🧪 23 tasks · 💰 Price overlay

Success Rate Leaderboard

Success rate = percentage of standardized OpenClaw agent tasks completed successfully, graded via automated checks and an LLM judge. A toy sketch of this calculation appears after the table.

| # | Model | Model ID | Success | Input $/1M | Output $/1M |
|---:|---|---|---:|---:|---:|
| 1 | 🦞 Gemini 3 Flash Preview | google/gemini-3-flash-preview | 95.1% | - | - |
| 2 | 🦀 MiniMax M2.1 | minimax/minimax-m2.1 | 93.6% | - | - |
| 3 | Kimi K2.5 | moonshotai/kimi-k2.5 | 93.4% | - | - |
| 4 | Claude Sonnet 4.5 | anthropic/claude-sonnet-4.5 | 92.7% | $3.00 | $15.00 |
| 5 | Gemini 3 Pro Preview | google/gemini-3-pro-preview | 91.7% | - | - |
| 6 | Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 90.8% | $1.00 | $5.00 |
| 7 | Claude Opus 4.6 | anthropic/claude-opus-4.6 | 90.6% | - | - |
| 8 | Claude Opus 4.5 | anthropic/claude-opus-4.5 | 88.9% | $5.00 | $25.00 |
| 9 | GPT-5 Nano | openai/gpt-5-nano | 85.8% | $0.05 | $0.40 |
| 10 | Qwen3 Coder Next | qwen/qwen3-coder-next | 85.4% | - | - |
| 11 | GLM 4.5 Air | z-ai/glm-4.5-air | 85.4% | - | - |
| 12 | GPT-4o | openai/gpt-4o | 85.2% | $2.50 | $10.00 |
| 13 | GPT-4o Mini | openai/gpt-4o-mini | 83.4% | $0.15 | $0.60 |
| 14 | Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 83.2% | $0.10 | $0.40 |
| 15 | 🦐 DeepSeek V3.2 | deepseek/deepseek-v3.2 | 82.1% | - | - |
| 16 | Devstral 2512 | mistralai/devstral-2512 | 81.7% | - | - |
| 17 | Claude Sonnet 4 | anthropic/claude-sonnet-4 | 77.5% | $3.00 | $15.00 |
| 18 | DeepSeek Chat | deepseek/deepseek-chat | 77.3% | $0.28 | $0.42 |
| 19 | Gemini 2.5 Flash | google/gemini-2.5-flash | 76.6% | $0.30 | $2.50 |
| 20 | Grok 4.1 Fast | x-ai/grok-4.1-fast | 70.0% | - | - |
| 21 | GPT-5.2 | openai/gpt-5.2 | 65.6% | $2.50 | $10.00 |
| 22 | Trinity Large Preview | arcee-ai/trinity-large-preview:free | 65.5% | - | - |
| 23 | Step 3.5 Flash | stepfun/step-3.5-flash | 40.9% | - | - |
| 24 | Qwen3 Max Thinking | qwen/qwen3-max-thinking | 40.9% | - | - |

Prices in USD per 1M tokens; "-" indicates no price listed.

Benchmark data from PinchBench (pinchbench.com), last updated 2026-03-08. Price data from ModelPriceLab. This leaderboard is for informational purposes only.
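
For concreteness, here is a minimal sketch of how a success rate like those above could be computed from per-task results. The TaskResult shape and field names are assumptions for illustration, not PinchBench's actual schema.

```python
# Minimal sketch: success rate as the share of tasks passed.
# TaskResult is a hypothetical shape, not PinchBench's real schema.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str      # e.g. "calendar-event"
    passed: bool   # final verdict from the automated check and/or LLM judge


def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks completed successfully."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


# Example: 22 of 23 tasks passed -> 95.7%
demo = [TaskResult(f"task-{i}", passed=(i != 0)) for i in range(23)]
print(f"{success_rate(demo):.1f}%")
```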

About PinchBench

PinchBench is an open-source benchmarking system that evaluates LLMs as OpenClaw coding agents across 23 standardized real-world tasks. Unlike traditional LLM benchmarks, PinchBench focuses on tool usage, multi-step reasoning, handling ambiguous instructions, and practical outcomes.

🔧 Tool Usage: Can the model call tools with the correct parameters?
🔗 Multi-step Reasoning: Can it chain actions to complete complex tasks?
🌊 Real-world Messiness: Can it handle ambiguous and incomplete information?
Practical Outcomes: Did it actually create the file or send the email?

23 Benchmark Tasks

Tasks cover calendar creation, code writing, document summarization, email triage, market research, and more. Each task is graded via automated checks, an LLM judge, or a hybrid approach; a sketch of these grading modes follows the list.

Sanity Check (Automated)
📅 Calendar Event (Automated)
📈 Stock Research (Automated)
✍️ Blog Writing (LLM Judge)
🌤️ Weather Script (Automated)
📄 Doc Summary (LLM Judge)
🎤 Conference Research (LLM Judge)
✉️ Email Drafting (LLM Judge)
🧠 Memory Retrieval (Automated)
📁 File Structure (Automated)
🔄 API Workflow (Hybrid)
🔌 Skill Install (Automated)
🔍 Skill Search (Automated)
🎨 Image Generation (Hybrid)
🤖 Humanize AI Text (LLM Judge)
📊 Research Summary (LLM Judge)
📬 Email Triage (Hybrid)
🔎 Email Search (Hybrid)
🏢 Market Research (Hybrid)
📑 CSV/Excel Analysis (Hybrid)
👶 ELI5 PDF Summary (LLM Judge)
📖 Report Comprehension (Automated)
💾 Knowledge Persistence (Hybrid)

Key Takeaways

🏆 Google Leads

Gemini 3 Flash Preview leads with a 95.1% success rate, showing exceptional tool usage and multi-step reasoning.

💡 Anthropic Full Lineup

All five Claude models rank in the top 17, four of them above 88%, from Opus 4.5 (88.9%) to Sonnet 4.5 (92.7%), showing consistent quality.

💰 Value Matters

The highest success rate is not always the best value. Overlaying ModelPriceLab pricing makes the cost per point of performance clear; a rough calculation is sketched below.
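
As an illustration, the snippet below ranks a few of the priced models by a blended cost per success point. The 3:1 input-to-output token mix and the cost_per_point helper are assumptions made for this sketch, not ModelPriceLab's methodology; prices and success rates come from the table above.

```python
# Illustrative value metric: blended price per 1M tokens divided by success rate.
# The 3:1 input:output blend is an assumption, not ModelPriceLab's methodology.
models = {  # model id: (success %, input $/1M, output $/1M), from the table above
    "anthropic/claude-sonnet-4.5": (92.7, 3.00, 15.00),
    "anthropic/claude-haiku-4.5":  (90.8, 1.00, 5.00),
    "openai/gpt-5-nano":           (85.8, 0.05, 0.40),
    "openai/gpt-4o-mini":          (83.4, 0.15, 0.60),
}


def cost_per_point(success: float, in_price: float, out_price: float) -> float:
    blended = (3 * in_price + out_price) / 4  # assumed 3:1 token mix
    return blended / success                  # $/1M tokens per success point


for name, stats in sorted(models.items(), key=lambda kv: cost_per_point(*kv[1])):
    print(f"{name:30s} {cost_per_point(*stats):.4f}")
# GPT-5 Nano comes out cheapest per point despite ranking 9th on raw success.
```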

Better model decisions come from seeing benchmark scores and API pricing together.

Explore the full price matrix, scenario leaderboards, and solution library on ModelPriceLab to find the optimal combination for your use case.