Results
Leaderboard
All models were evaluated on the hidden 125-question private set. Each model was run five times with the same standard zero-shot prompt. Avg@5 is the mean accuracy over the five runs. Maj@5 scores each question by the model's majority answer across the five runs.
Category columns show accuracy by question type: intended meaning, target identification, sentiment reversal, sincere control, and context dependence.
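The two aggregate metrics can be sketched as follows. This is an illustrative implementation, not the evaluation harness's actual code; the data shapes (one answer dict per run, plus an answer key) are assumptions.

```python
from collections import Counter

def avg_at_k(runs, answer_key):
    """Mean accuracy over k independent runs (Avg@k).

    runs: list of k dicts mapping question id -> predicted answer.
    answer_key: dict mapping question id -> correct answer.
    """
    per_run_acc = [
        sum(run[q] == ans for q, ans in answer_key.items()) / len(answer_key)
        for run in runs
    ]
    return sum(per_run_acc) / len(per_run_acc)

def maj_at_k(runs, answer_key):
    """Accuracy of the per-question majority answer across k runs (Maj@k)."""
    correct = 0
    for q, ans in answer_key.items():
        votes = Counter(run[q] for run in runs)
        # Ties break by first-seen answer; a real harness would
        # need an explicit tie-breaking rule.
        majority, _ = votes.most_common(1)[0]
        correct += majority == ans
    return correct / len(answer_key)
```

Note that Maj@5 can exceed Avg@5: a model that answers a question correctly in three of five runs gets full credit under majority voting but only 60% credit in the per-run average.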
| # | Model | Avg@5 |
|---|---|---|
| T-1 | Claude Opus 4.7 | 100.0% |
| T-1 | Claude Sonnet 4.6 | 100.0% |
| T-1 | Gemini 3 Flash Preview | 100.0% |
| T-1 | GPT-5.5 | 100.0% |
| T-1 | Gemini 3.1 Pro Preview | 100.0% |
| 6 | DeepSeek V4 Flash | 99.7% |
| 7 | DeepSeek V4 Pro | 98.6% |
| 8 | Grok 4.1 Fast | 97.9% |
| 9 | Kimi K2.6 | 89.3% |
| - | Random Chance | 16.7% |