Results

Leaderboard

All models were evaluated on the hidden 125-question test set, with each model run five times using the same standard zero-shot prompt. Avg@5 is the mean accuracy across the five runs; Maj@5 scores each question by the model's majority answer across the five runs.
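The two scoring rules can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness; the function names and the tie-breaking behavior for Maj@5 are assumptions.

```python
from collections import Counter

def avg_at_k(runs, answer_key):
    """Avg@k: mean accuracy over k independent runs.

    `runs` is a list of k dicts mapping question id -> model answer;
    `answer_key` maps question id -> gold answer.
    """
    per_run = [
        sum(ans[q] == gold for q, gold in answer_key.items()) / len(answer_key)
        for ans in runs
    ]
    return sum(per_run) / len(per_run)

def maj_at_k(runs, answer_key):
    """Maj@k: score each question by the majority answer across the k runs.

    Ties are broken arbitrarily here (Counter.most_common order); the
    benchmark's real tie-breaking rule is not specified in this section.
    """
    correct = 0
    for q, gold in answer_key.items():
        votes = Counter(ans[q] for ans in runs)
        majority_answer, _ = votes.most_common(1)[0]
        correct += majority_answer == gold
    return correct / len(answer_key)
```

Note that Maj@k can exceed Avg@k: a model that answers a question correctly in 3 of 5 runs scores 0.6 on it under Avg@5 but full credit under Maj@5.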

The full leaderboard also breaks accuracy down by question type: intended meaning, target identification, sentiment reversal, sincere control, and context dependence.

| #   | Model                  | Avg@5  |
|-----|------------------------|--------|
| T-1 | Claude Opus 4.7        | 100.0% |
| T-1 | Claude Sonnet 4.6      | 100.0% |
| T-1 | Gemini 3 Flash Preview | 100.0% |
| T-1 | GPT-5.5                | 100.0% |
| T-1 | Gemini 3.1 Pro Preview | 100.0% |
| 6   | DeepSeek V4 Flash      | 99.7%  |
| 7   | DeepSeek V4 Pro        | 98.6%  |
| 8   | Grok 4.1 Fast          | 97.9%  |
| 9   | Kimi K2.6              | 89.3%  |
| —   | Random Chance          | 16.7%  |
