Can AI models understand sarcasm?
SarcBench is a six-option multiple-choice benchmark that tests whether language models can figure out what a speaker actually means when the literal words and context point in different directions.
Leaderboard
Full leaderboard

| # | Model | Avg@5 |
|---|---|---|
| 1 | Claude Opus 4.7 | 100.0% |
| 1 | Claude Sonnet 4.6 | 100.0% |
| 1 | Gemini 3 Flash Preview | 100.0% |
| 1 | GPT-5.5 | 100.0% |
| 1 | Gemini 3.1 Pro Preview | 100.0% |
| 6 | DeepSeek V4 Flash | 99.7% |
| 7 | DeepSeek V4 Pro | 98.6% |
| 8 | Grok 4.1 Fast | 97.9% |
| 9 | Kimi K2.6 | 89.3% |
| - | Random Chance | 16.7% |
Scores updated manually. Avg@5 shown.
Sample Question
Try one yourself
The apartment package locker sent Mia a code that was supposed to open her delivery box. The code opened an empty locker, while her package sat on the counter with no label.
“The locker system is really earning its keep.”
What does the speaker actually mean?
Coverage
What SarcBench tests
Intended Meaning
Does the model understand what the speaker actually means? Questions ask for the pragmatic reading, not the literal one.
Target Identification
Who or what is being mocked? Sarcasm often targets a specific person, system, or idea. The model must identify the target.
Sentiment Reversal
Positive words can carry negative meaning. Models must identify the true emotional valence, not just the surface tone.
Sincere Control
Not every question contains sarcasm. Control items catch models that treat every exaggeration or unusual phrasing as sarcastic.
Context Dependence
Each question includes a context passage. Models that ignore context fail systematically, because sarcasm depends on situation.
How It Works
Methodology summary
SarcBench is a multiple-choice benchmark for testing whether AI models can understand sarcasm, indirect meaning, sincere lookalikes, and context-dependent language. Each question includes a short context, an utterance, and six possible answers. The model must choose the answer that best captures what the speaker most likely means.
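To make the question structure concrete, here is a minimal sketch of what one item might look like as data. The field names (`context`, `utterance`, `options`, `answer`) and the answer key are illustrative assumptions, not SarcBench's actual schema; the context and utterance are taken from the sample question above, and the six option texts are placeholders.

```python
# Hypothetical question record; field names and answer key are
# assumptions for illustration, not the benchmark's real schema.
question = {
    "context": (
        "The apartment package locker sent Mia a code that was supposed "
        "to open her delivery box. The code opened an empty locker, "
        "while her package sat on the counter with no label."
    ),
    "utterance": "The locker system is really earning its keep.",
    # Six answer options, keyed A-F; texts here are placeholders.
    "options": {letter: f"option {letter} text" for letter in "ABCDEF"},
    # Hypothetical answer key, used only to score the model's choice.
    "answer": "A",
}
```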
The benchmark is organized into five categories: intended meaning, target identification, sentiment reversal, sincere control, and context dependence. These categories are designed to test more than simple sarcasm detection. A model has to reason about the situation, the speaker’s intent, and whether the literal wording matches what is actually happening.
Models are evaluated zero-shot with the same prompt on the same private, held-out question set. Each model is run five times. We report Avg@5, the average accuracy across the five runs, and Maj@5, which scores a question as correct when the model chooses the right answer in a majority of its five attempts.
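The two metrics can be sketched in a few lines. This is a minimal illustration of the scoring logic as described above, not the benchmark's actual harness; it assumes each run is recorded as a list of chosen option letters, aligned with a list of correct answers.

```python
def avg_at_k(choices, key):
    """Average per-run accuracy across k runs.

    choices: list of k runs, each a list of chosen option letters.
    key: list of correct option letters, one per question.
    """
    per_run = [
        sum(c == ans for c, ans in zip(run, key)) / len(key)
        for run in choices
    ]
    return sum(per_run) / len(per_run)


def maj_at_k(choices, key):
    """Fraction of questions answered correctly in a majority of runs."""
    correct = 0
    for q, ans in enumerate(key):
        hits = sum(run[q] == ans for run in choices)
        if hits > len(choices) / 2:  # strict majority of the k attempts
            correct += 1
    return correct / len(key)
```

Note that Maj@5 can exceed Avg@5 when a model is right more often than not on each question, since occasional wrong runs are outvoted.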