Can AI models understand sarcasm?
SarcBench is a six-option multiple-choice benchmark that tests whether language models can figure out what a speaker actually means when the literal words and context point in different directions.
Leaderboard
Full leaderboard

| # | Model | Avg@5 |
|---|---|---|
| 1 | Claude Opus 4.7 | 100.0% |
| 1 | Claude Sonnet 4.6 | 100.0% |
| 1 | Gemini 3 Flash Preview | 100.0% |
| 1 | GPT-5.5 | 100.0% |
| 1 | Gemini 3.1 Pro Preview | 100.0% |
| 6 | DeepSeek V4 Flash | 99.7% |
| 7 | DeepSeek V4 Pro | 98.6% |
| 8 | Grok 4.1 Fast | 97.9% |
| 9 | Kimi K2.6 | 89.3% |
| - | Random Chance | 16.7% |
Scores updated manually. Avg@5 shown.
Sample Question
Try one yourself
The apartment package locker sent Mia a code that was supposed to open her delivery box. The code opened an empty locker, while her package sat on the counter with no label.
“The locker system is really earning its keep.”
What does the speaker actually mean?
Coverage
What SarcBench tests
Intended Meaning
Does the model understand what the speaker actually means? Questions ask for the pragmatic reading, not the literal one.
Target Identification
Who or what is being mocked? Sarcasm often targets a specific person, system, or idea. The model must identify the target.
Sentiment Reversal
Positive words can carry negative meaning. Models must identify the true emotional valence, not just the surface tone.
Sincere Control
Not every question contains sarcasm. Control items catch models that treat every exaggeration or unusual phrasing as sarcastic.
Context Dependence
Each question includes a context passage. Models that ignore context fail systematically, because sarcasm depends on situation.
How It Works
Methodology summary
SarcBench is a multiple-choice benchmark for testing whether AI models can understand sarcasm, indirect meaning, sincere lookalikes, and context-dependent language. Each question includes a short context, an utterance, and six possible answers. The model must choose the answer that best captures what the speaker most likely means.
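To make the question structure concrete, here is a minimal sketch of what one item might look like as data. The field names (`context`, `utterance`, `options`, `answer`) and the answer key are illustrative assumptions, not SarcBench's actual schema; the context and utterance are taken from the sample question above, and the six option texts are placeholders.

```python
# Hypothetical question record; field names and answer key are
# assumptions for illustration, not the benchmark's real schema.
question = {
    "context": (
        "The apartment package locker sent Mia a code that was supposed "
        "to open her delivery box. The code opened an empty locker, "
        "while her package sat on the counter with no label."
    ),
    "utterance": "The locker system is really earning its keep.",
    # Six answer options, keyed A-F; texts here are placeholders.
    "options": {letter: f"option {letter} text" for letter in "ABCDEF"},
    # Hypothetical answer key, used only to score the model's choice.
    "answer": "A",
}
```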
The benchmark is organized into five categories: intended meaning, target identification, sentiment reversal, sincere control, and context dependence. These categories are designed to test more than simple sarcasm detection. A model has to reason about the situation, the speaker’s intent, and whether the literal wording matches what is actually happening.
Models are evaluated zero-shot with the same prompt on the same private, held-out question set. Each model is run five times. We report Avg@5, the average accuracy across the five runs, and Maj@5, which scores a question as correct when the model chooses the right answer in a majority of its five attempts.
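The two metrics can be sketched in a few lines. This is a minimal illustration of the scoring logic as described above, not the benchmark's actual harness; it assumes each run is recorded as a list of chosen option letters, aligned with a list of correct answers.

```python
def avg_at_k(choices, key):
    """Average per-run accuracy across k runs.

    choices: list of k runs, each a list of chosen option letters.
    key: list of correct option letters, one per question.
    """
    per_run = [
        sum(c == ans for c, ans in zip(run, key)) / len(key)
        for run in choices
    ]
    return sum(per_run) / len(per_run)


def maj_at_k(choices, key):
    """Fraction of questions answered correctly in a majority of runs."""
    correct = 0
    for q, ans in enumerate(key):
        hits = sum(run[q] == ans for run in choices)
        if hits > len(choices) / 2:  # strict majority of the k attempts
            correct += 1
    return correct / len(key)
```

Note that Maj@5 can exceed Avg@5 when a model is right more often than not on each question, since occasional wrong runs are outvoted.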