
AI Coding Agent Leaderboard

PinchBench success rates for AI coding agents, cross-referenced with ModelPriceLab live pricing to help you find the best value.

📊 24 models · 🧪 23 tasks · 💰 Price overlay

Success Rate Leaderboard

Success rate = percentage of standardized OpenClaw agent tasks completed successfully, graded via automated checks and an LLM judge. A toy sketch of this calculation appears after the table.

| # | Model | Model ID | Success | Input $/1M | Output $/1M |
|---:|---|---|---:|---:|---:|
| 1 | 🦞 Gemini 3 Flash Preview | google/gemini-3-flash-preview | 95.1% | - | - |
| 2 | 🦀 MiniMax M2.1 | minimax/minimax-m2.1 | 93.6% | - | - |
| 3 | Kimi K2.5 | moonshotai/kimi-k2.5 | 93.4% | - | - |
| 4 | Claude Sonnet 4.5 | anthropic/claude-sonnet-4.5 | 92.7% | $3.00 | $15.00 |
| 5 | Gemini 3 Pro Preview | google/gemini-3-pro-preview | 91.7% | - | - |
| 6 | Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 90.8% | $1.00 | $5.00 |
| 7 | Claude Opus 4.6 | anthropic/claude-opus-4.6 | 90.6% | - | - |
| 8 | Claude Opus 4.5 | anthropic/claude-opus-4.5 | 88.9% | $5.00 | $25.00 |
| 9 | GPT-5 Nano | openai/gpt-5-nano | 85.8% | $0.05 | $0.40 |
| 10 | Qwen3 Coder Next | qwen/qwen3-coder-next | 85.4% | - | - |
| 11 | GLM 4.5 Air | z-ai/glm-4.5-air | 85.4% | - | - |
| 12 | GPT-4o | openai/gpt-4o | 85.2% | $2.50 | $10.00 |
| 13 | GPT-4o Mini | openai/gpt-4o-mini | 83.4% | $0.15 | $0.60 |
| 14 | Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 83.2% | $0.10 | $0.40 |
| 15 | 🦐 DeepSeek V3.2 | deepseek/deepseek-v3.2 | 82.1% | - | - |
| 16 | Devstral 2512 | mistralai/devstral-2512 | 81.7% | - | - |
| 17 | Claude Sonnet 4 | anthropic/claude-sonnet-4 | 77.5% | $3.00 | $15.00 |
| 18 | DeepSeek Chat | deepseek/deepseek-chat | 77.3% | $0.28 | $0.42 |
| 19 | Gemini 2.5 Flash | google/gemini-2.5-flash | 76.6% | $0.30 | $2.50 |
| 20 | Grok 4.1 Fast | x-ai/grok-4.1-fast | 70.0% | - | - |
| 21 | GPT-5.2 | openai/gpt-5.2 | 65.6% | $2.50 | $10.00 |
| 22 | Trinity Large Preview | arcee-ai/trinity-large-preview:free | 65.5% | - | - |
| 23 | Step 3.5 Flash | stepfun/step-3.5-flash | 40.9% | - | - |
| 24 | Qwen3 Max Thinking | qwen/qwen3-max-thinking | 40.9% | - | - |

Prices in USD per 1M tokens; "-" indicates no price listed.

Benchmark data from PinchBench (pinchbench.com), last updated 2026-03-08. Price data from ModelPriceLab. This leaderboard is for informational purposes only.
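
For concreteness, here is a minimal sketch of how a success rate like those above could be computed from per-task results. The TaskResult shape and field names are assumptions for illustration, not PinchBench's actual schema.

```python
# Minimal sketch: success rate as the share of tasks passed.
# TaskResult is a hypothetical shape, not PinchBench's real schema.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str      # e.g. "calendar-event"
    passed: bool   # final verdict from the automated check and/or LLM judge


def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks completed successfully."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


# Example: 22 of 23 tasks passed -> 95.7%
demo = [TaskResult(f"task-{i}", passed=(i != 0)) for i in range(23)]
print(f"{success_rate(demo):.1f}%")
```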

About PinchBench

PinchBench is an open-source benchmarking system that evaluates LLMs as OpenClaw coding agents across 23 standardized real-world tasks. Unlike traditional LLM benchmarks, PinchBench focuses on tool usage, multi-step reasoning, handling ambiguous instructions, and practical outcomes.

🔧 Tool Usage: Can the model call tools with the correct parameters?
🔗 Multi-step Reasoning: Can it chain actions to complete complex tasks?
🌊 Real-world Messiness: Can it handle ambiguous and incomplete information?
Practical Outcomes: Did it actually create the file or send the email?

23 Benchmark Tasks

Tasks cover calendar creation, code writing, document summarization, email triage, market research, and more. Each task is graded via automated checks, an LLM judge, or a hybrid approach; a sketch of these grading modes follows the list.

Sanity Check (Automated)
📅 Calendar Event (Automated)
📈 Stock Research (Automated)
✍️ Blog Writing (LLM Judge)
🌤️ Weather Script (Automated)
📄 Doc Summary (LLM Judge)
🎤 Conference Research (LLM Judge)
✉️ Email Drafting (LLM Judge)
🧠 Memory Retrieval (Automated)
📁 File Structure (Automated)
🔄 API Workflow (Hybrid)
🔌 Skill Install (Automated)
🔍 Skill Search (Automated)
🎨 Image Generation (Hybrid)
🤖 Humanize AI Text (LLM Judge)
📊 Research Summary (LLM Judge)
📬 Email Triage (Hybrid)
🔎 Email Search (Hybrid)
🏢 Market Research (Hybrid)
📑 CSV/Excel Analysis (Hybrid)
👶 ELI5 PDF Summary (LLM Judge)
📖 Report Comprehension (Automated)
💾 Knowledge Persistence (Hybrid)

Key Takeaways

🏆 Google Leads

Gemini 3 Flash Preview leads with a 95.1% success rate, showing exceptional tool usage and multi-step reasoning.

💡 Anthropic Full Lineup

All five Claude models rank in the top 17, four of them above 88%, from Opus 4.5 (88.9%) to Sonnet 4.5 (92.7%), showing consistent quality.

💰 Value Matters

The highest success rate is not always the best value. Overlaying ModelPriceLab pricing makes the cost per point of performance clear; a rough calculation is sketched below.
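
As an illustration, the snippet below ranks a few of the priced models by a blended cost per success point. The 3:1 input-to-output token mix and the cost_per_point helper are assumptions made for this sketch, not ModelPriceLab's methodology; prices and success rates come from the table above.

```python
# Illustrative value metric: blended price per 1M tokens divided by success rate.
# The 3:1 input:output blend is an assumption, not ModelPriceLab's methodology.
models = {  # model id: (success %, input $/1M, output $/1M), from the table above
    "anthropic/claude-sonnet-4.5": (92.7, 3.00, 15.00),
    "anthropic/claude-haiku-4.5":  (90.8, 1.00, 5.00),
    "openai/gpt-5-nano":           (85.8, 0.05, 0.40),
    "openai/gpt-4o-mini":          (83.4, 0.15, 0.60),
}


def cost_per_point(success: float, in_price: float, out_price: float) -> float:
    blended = (3 * in_price + out_price) / 4  # assumed 3:1 token mix
    return blended / success                  # $/1M tokens per success point


for name, stats in sorted(models.items(), key=lambda kv: cost_per_point(*kv[1])):
    print(f"{name:30s} {cost_per_point(*stats):.4f}")
# GPT-5 Nano comes out cheapest per point despite ranking 9th on raw success.
```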

Better model decisions come from seeing benchmark scores and API pricing together.

Explore the full price matrix, scenario leaderboards, and solution library on ModelPriceLab to find the optimal combination for your use case.