Results

Leaderboard

All models were evaluated on the hidden 125-question test set, with each model run five times using the same standard zero-shot prompt. Avg@5 is the mean accuracy across the five runs; Maj@5 scores each question by the model's majority answer across the five runs.
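The two scoring rules can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness; the function names and the tie-breaking behavior for Maj@5 are assumptions.

```python
from collections import Counter

def avg_at_k(runs, answer_key):
    """Avg@k: mean accuracy over k independent runs.

    `runs` is a list of k dicts mapping question id -> model answer;
    `answer_key` maps question id -> gold answer.
    """
    per_run = [
        sum(ans[q] == gold for q, gold in answer_key.items()) / len(answer_key)
        for ans in runs
    ]
    return sum(per_run) / len(per_run)

def maj_at_k(runs, answer_key):
    """Maj@k: score each question by the majority answer across the k runs.

    Ties are broken arbitrarily here (Counter.most_common order); the
    benchmark's real tie-breaking rule is not specified in this section.
    """
    correct = 0
    for q, gold in answer_key.items():
        votes = Counter(ans[q] for ans in runs)
        majority_answer, _ = votes.most_common(1)[0]
        correct += majority_answer == gold
    return correct / len(answer_key)
```

Note that Maj@k can exceed Avg@k: a model that answers a question correctly in 3 of 5 runs scores 0.6 on it under Avg@5 but full credit under Maj@5.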

The full leaderboard also breaks accuracy down by question type: intended meaning, target identification, sentiment reversal, sincere control, and context dependence.

| #   | Model                  | Avg@5  |
|-----|------------------------|--------|
| T-1 | Claude Opus 4.7        | 100.0% |
| T-1 | Claude Sonnet 4.6      | 100.0% |
| T-1 | Gemini 3 Flash Preview | 100.0% |
| T-1 | GPT-5.5                | 100.0% |
| T-1 | Gemini 3.1 Pro Preview | 100.0% |
| 6   | DeepSeek V4 Flash      | 99.7%  |
| 7   | DeepSeek V4 Pro        | 98.6%  |
| 8   | Grok 4.1 Fast          | 97.9%  |
| 9   | Kimi K2.6              | 89.3%  |
| —   | Random Chance          | 16.7%  |
