Can AI models understand sarcasm?

SarcBench is a six-option multiple-choice benchmark that tests whether language models can figure out what a speaker actually means when the literal words and context point in different directions.

Leaderboard

Full leaderboard
#    Model                     Avg@5
1    Claude Opus 4.7           100.0%
1    Claude Sonnet 4.6         100.0%
1    Gemini 3 Flash Preview    100.0%
1    GPT-5.5                   100.0%
1    Gemini 3.1 Pro Preview    100.0%
6    DeepSeek V4 Flash          99.7%
7    DeepSeek V4 Pro            98.6%
8    Grok 4.1 Fast              97.9%
9    Kimi K2.6                  89.3%
-    Random Chance              16.7%

Scores updated manually. Avg@5 shown.

Sample Question

Try one yourself

The apartment package locker sent Mia a code that was supposed to open her delivery box. The code opened an empty locker, while her package sat on the counter with no label.

"The locker system is really earning its keep."

What does the speaker actually mean?

Coverage

What SarcBench tests

01

Intended Meaning

Does the model understand what the speaker actually means? Questions ask for the pragmatic reading, not the literal one.

02

Target Identification

Who or what is being mocked? Sarcasm often targets a specific person, system, or idea. The model must identify the target.

03

Sentiment Reversal

Positive words can carry negative meaning. Models must identify the true emotional valence, not just the surface tone.

04

Sincere Control

Not every question contains sarcasm. Control items catch models that treat every exaggeration or unusual phrasing as sarcastic.

05

Context Dependence

Each question includes a context passage. Models that ignore the context fail systematically, because whether an utterance is sarcastic depends on the situation.

How It Works

Methodology summary

SarcBench is a multiple-choice benchmark for testing whether AI models can understand sarcasm, indirect meaning, sincere lookalikes, and context-dependent language. Each question includes a short context, an utterance, and six possible answers. The model must choose the answer that best captures what the speaker most likely means.

The benchmark is organized into five categories: intended meaning, target identification, sentiment reversal, sincere control, and context dependence. These categories are designed to test more than simple sarcasm detection. A model has to reason about the situation, the speaker’s intent, and whether the literal wording matches what is actually happening.

Models are evaluated zero-shot using the same prompt and the same private, held-out question set. Each model is run five times. We report Avg@5, the average accuracy across the five runs, and Maj@5, which scores a question as correct when the model chooses the right answer in a majority of its five attempts.

Read the full paper