LLMSnare

A behavior-based benchmark tool for LLMs and AI agents

A quantifiable benchmark built around tool-calling behavior.

[ BENCHMARK // LIVE ARENA ]

Results live in Arena

Open Arena

[ BENCHMARK // WHAT IT MEASURES ]

What it measures, what it does not

ONGOING

Tracks regressions over time, not one-off rankings

It reruns comparable cases over time, so it shows whether a model gets lazier or regresses in how it reads context, not just where a single score lands.

SCOPE

Not a direct test of capability level

It does not directly judge whether coding output is correct, whether a math answer is right, or whether generated text matches a product requirement.

TOOLS

Biased toward tool-calling behavior

It mostly watches read, write, search, and similar tool calls. A low score here usually means the model is a poor fit for agent work.

LAZY

Checks for lazy early writes

If a model starts writing before it has read enough context, it gets penalized. Frontier models also get nerfed at times, and one goal here is to notice that shift early.
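As a rough illustration, a lazy-early-write check can be expressed as a rule over the tool-call trace. This is a hypothetical sketch: the trace format (a list of dicts with a `tool` key) and the penalty value are assumptions for illustration, not LLMSnare's actual schema.

```python
# Hypothetical sketch: penalize a write that happens before any read.
# Trace format and penalty value are illustrative, not LLMSnare's real schema.

def lazy_write_penalty(trace, penalty=-2):
    """Return a deduction if the first write precedes the first read."""
    for call in trace:
        if call["tool"] == "read":
            return 0          # context was read before any write
        if call["tool"] == "write":
            return penalty    # wrote without reading first
    return 0                  # no writes at all: nothing to penalize
```

Scanning the trace in order means only the relative position of the first read and first write matters, which is exactly the "did it look before it wrote" question.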

CASE

Case-driven and customizable

The standard comes from each case's prompt, rootfs, and scoring rule. You can run LLMSnare yourself and change cases for different scenarios and standards.
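A custom case could look something like the sketch below, assuming a case bundles a prompt, a rootfs path, and a scoring rule over the tool-call trace. All field names and the example scoring rule are hypothetical, chosen only to mirror the three parts named above.

```python
# Hypothetical sketch of a case definition: prompt + rootfs + scoring rule.
# Field names and the example rule are assumptions, not LLMSnare's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str                     # task given to the model
    rootfs: str                     # filesystem the agent sees during the run
    score: Callable[[list], int]    # scoring rule over the tool-call trace

style_case = Case(
    prompt="Edit docs/intro.md following the style guide.",
    rootfs="cases/style-guide/rootfs",
    # Reward the model only if it read something before finishing.
    score=lambda trace: 1 if any(c["tool"] == "read" for c in trace) else 0,
)
```

Swapping the prompt, rootfs contents, or scoring rule is what lets the same harness encode different scenarios and standards.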

LIMITS

These dimensions are not scored directly

Latency, cost, throughput, long-context limits, multi-tool orchestration stability, and side-effect control in real repositories are not part of the score today.

EXPLAIN

Every bonus and deduction is explainable

Each case ties bonuses and deductions to explicit behaviors. You can trace the rules to see why a score moved.
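The idea of tying every score change to an explicit behavior can be sketched as a rule that returns reasons alongside the score. This is a minimal illustration under assumed rule names and weights, not LLMSnare's real rule set.

```python
# Minimal sketch of explainable scoring: every bonus or deduction carries
# a reason string, so a score change traces back to a concrete behavior.
# Rule wording and weights are assumptions, not LLMSnare's real rules.

def score_trace(trace):
    total, reasons = 0, []
    tools = [c["tool"] for c in trace]
    if "write" in tools:
        if "read" in tools[:tools.index("write")]:
            total += 1
            reasons.append("+1: read context before first write")
        else:
            total -= 1
            reasons.append("-1: wrote before reading any context")
    return total, reasons
```

Because the reasons come back with the score, "why did the score move" is answered by the trace itself rather than by an opaque aggregate number.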

[ CASES // FAILURE PATTERNS ]

Failure behaviors it identifies

Common behavioral failures agents make in real tasks

P-01

PATTERN 01

Acts before it understands the situation

For example, a task requires writing under a style guide, but the model starts writing without reading the style guide file first. That shows weak instruction following.

P-02

PATTERN 02

Knows it is on the wrong path, but keeps going

For example, an instruction may contain a small mistake, but tool calls are enough to recover the correct context. If the model still stays on the old wrong path after seeing the correct information, its recovery is weak.

P-03

PATTERN 03

Gets stuck in ambiguity

For example, the model notices ambiguity, but the surrounding context is enough to reach the correct conclusion. If it keeps making low-value explorations and still gets it wrong, that points to behavioral problems.

[ UPDATES // RECENT CHANGES ]

Update log

Recent changes that affect how this benchmark should be read.

2026-04-19 23:00 UTC

Changed the update cadence to every 3 hours and removed Claude Sonnet 4.5 and Claude Opus 4.5.

2026-04-17 02:00 UTC

Added Claude Opus 4.7, removed OpenAI GPT 4.1 and Claude Haiku 4.5, added the new search_text tool, and raised the difficulty.

2026-04-12

Added two models: Google’s Gemma 4 31B and Xiaomi’s Mimo v2 Pro.

2026-04-11

Raised the difficulty because too many models were hitting full score.

2026-04-10

Launched the live LLMSnare benchmark.