HumorGen — Open-Weight Computational Humor Generation

Overview

The whole story, in one paragraph.

What we do

Six comedic personalities, one small model.

We start with large AI models acting as teachers. For every headline, each teacher writes jokes from six different comedic angles — the worrier, the cynic, the absurdist, and so on. A strong AI judge then compares the drafts head-to-head and keeps the funniest. A small, efficient student model learns from those winning jokes, so it can be funny on its own.

What we find

Good data beats big models — and fancy training tricks hit a wall.

Our small model ranks among the strongest open humor models available, beating systems many times its size and holding its own against frontier models. Adding popular preference-tuning methods on top didn't help — once the training jokes are diverse and well-chosen, you've already captured most of the gain. We call this the data quality ceiling.

The problem

Funny lives at the edges of the distribution.

Language models struggle to be funny — not because they lack knowledge of humor, but because of how they are trained.

The objective mismatch

Next-token prediction rewards the median.

Standard training maximizes the likelihood of the next word. The most probable continuation tends to be the safest, most generic one, so the model is pulled toward the center of the distribution, away from anything surprising.
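A toy illustration of that pull toward the center, under invented numbers: the three candidate continuations and their logits below are hypothetical, not taken from any model. Greedy decoding picks the generic option, and raising the sampling temperature only shifts some probability toward the tail.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical continuations and logits, invented for illustration only.
candidates = ["generic", "mild twist", "absurd turn"]
logits = [3.0, 1.5, 0.5]

# Greedy decoding always takes the highest-logit continuation.
greedy_pick = candidates[max(range(len(logits)), key=logits.__getitem__)]

p_default = softmax(logits, temperature=1.0)
p_hot = softmax(logits, temperature=2.0)

# Probability of the least likely ("tail") continuation at each temperature.
tail_default = p_default[-1]
tail_hot = p_hot[-1]
print(greedy_pick, round(tail_default, 3), round(tail_hot, 3))
```

Temperature alone is a blunt tool here: it spreads probability over everything unlikely, funny or not, which is one reason to shape the training data instead.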

What comedy demands

Incongruity lives in the tail.

Humor lives at the edges: the unexpected turn of phrase, the precise word that collapses two meanings at once, the observation that makes a situation suddenly absurd. The funniest line is rarely the most likely one.

Training a model to "be funny" with a single instruction does not work — it produces outputs that scan as jokes but fail to land. HumorGen's premise: a single headline supports multiple valid comedic interpretations. Teach the model that diversity, and the laughs follow.

Superiority theory

Laughter at another's expense

Humor arises from perceived superiority or social critique. Grounds The Cynic.

Incongruity theory

Mismatched frames collide

Humor arises when incompatible interpretations meet. Grounds The Observer (social mapping) and The Absurdist (surrealism).

Linguistic · Benign violation

Wordplay & safe rule-breaking

Phonological ambiguity grounds The Wordsmith; harmless norm violations ground The Optimist.

For each headline, every personality produces several joke drafts. No personality is hand-picked as the best — an AI judge compares all the drafts and lets the funniest ones earn their place in training.

Mixture-of-Thought

Many jokes, not one.

Instead of a single attempt, the framework generates many joke drafts per headline, each shaped by a different comedic personality. This steers generation toward the surprising territory where humor lives — away from the safe, generic punchline a model would default to.
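As a rough sketch of what "many drafts per headline, one per personality" could look like in code: the persona names follow the project's six personalities, but the one-line style descriptions, the prompt wording, and the `build_persona_prompts` helper are hypothetical placeholders, not the authors' actual prompts.

```python
# Persona names come from the project; the style descriptions are
# hypothetical stand-ins written for this sketch.
PERSONAS = {
    "The Neurotic": "anxious, spirals into worst-case scenarios",
    "The Cynic": "dry social critique",
    "The Observer": "maps the news onto everyday social situations",
    "The Wordsmith": "puns and double meanings",
    "The Optimist": "cheerful, harmless rule-breaking",
    "The Absurdist": "surreal escalation",
}

def build_persona_prompts(headline, drafts_per_persona=2):
    """Return one generation request per (persona, draft) combination."""
    requests = []
    for persona, style in PERSONAS.items():
        for draft_index in range(drafts_per_persona):
            requests.append({
                "persona": persona,
                "draft": draft_index,
                "prompt": (
                    f"You are {persona}, a comedy writer ({style}). "
                    f"Write one joke about this headline: {headline}"
                ),
            })
    return requests

requests = build_persona_prompts("Local man wins argument with GPS", drafts_per_persona=2)
print(len(requests))
```

Each request would then go to a teacher model; the pool of returned drafts is what the judge later compares.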

Silver teacher distillation

Judge, then teach.

A strong AI judge compares the drafts two at a time and ranks them. Only the top-ranked jokes are passed down to a smaller, efficient student model, which learns to produce that quality on its own.
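A minimal sketch of the "compare two at a time, then rank" step, assuming a simple round robin scored by win count. The `judge` callable is hypothetical: in practice it would be a call to a strong LLM, and the project's real ranking procedure may differ.

```python
from itertools import combinations

def rank_by_pairwise_wins(drafts, judge):
    """Rank drafts by head-to-head wins.

    judge(a, b) must return True if draft `a` is funnier than draft `b`.
    """
    wins = [0] * len(drafts)
    for i, j in combinations(range(len(drafts)), 2):
        if judge(drafts[i], drafts[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(drafts)), key=lambda idx: wins[idx], reverse=True)
    return [drafts[idx] for idx in order]

def toy_judge(a, b):
    # Stand-in for an LLM judge: prefers the shorter draft.
    return len(a) <= len(b)

drafts = ["a long rambling joke", "short", "medium joke", "tiny bit longer"]
ranked = rank_by_pairwise_wins(drafts, toy_judge)
top_two = ranked[:2]  # only the top-ranked drafts would be passed to the student
print(top_two)
```

With an LLM judge it is common to also compare each pair in both orders to reduce position bias; that refinement is left out here for brevity.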

Training pipeline

From many joke drafts to one funny model.

01 · Teacher generation

Large teacher models write many joke drafts for each headline, each from a different comedic angle. This builds a broad, varied pool of candidates instead of one safe attempt.

02 · Rank & teach

A strong AI judge compares the drafts head-to-head and ranks them. The funniest jokes become training material for a smaller, efficient student model that learns to be funny on its own. Some variants also learn the reasoning behind each joke.
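One way the ranked drafts might be turned into training material, sketched under assumptions: the `top_k` cutoff and the record field names (`prompt`, `completion`, `chosen`, `rejected`) are illustrative choices, not the project's documented data format.

```python
def build_training_data(headline, ranked_jokes, top_k=2):
    """Split a funniest-first ranking into SFT records and preference pairs."""
    winners = ranked_jokes[:top_k]
    losers = ranked_jokes[top_k:]

    # Supervised fine-tuning: the student imitates only the winning jokes.
    sft_records = [{"prompt": headline, "completion": joke} for joke in winners]

    # Preference pairs: each winner is paired against each lower-ranked draft.
    preference_pairs = [
        {"prompt": headline, "chosen": winner, "rejected": loser}
        for winner in winners
        for loser in losers
    ]
    return sft_records, preference_pairs

sft_records, preference_pairs = build_training_data(
    "Local man wins argument with GPS",
    ["joke A", "joke B", "joke C", "joke D"],
)
print(len(sft_records), len(preference_pairs))
```

The SFT records cover the simple "learn from the winners" step; the preference pairs are the kind of input a later preference-tuning stage would need.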

03 · Does fancy tuning help?

We test two popular preference-tuning methods on top: DPO (direct preference learning) and O-GRPO (a group-relative variant). Neither beats the simpler approach — once the training jokes are well-chosen, the extra tuning adds little.

DPO

Direct Preference Optimization

Learns from pairs of "funnier" vs "less funny" jokes chosen by the AI judge.
Builds on top of the student model from step two.
Performs about the same as the simpler approach — the difference isn't statistically meaningful.
Our top two models sit right next to each other near the top of the leaderboard.
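For reference, the standard DPO objective for a single "funnier vs less funny" pair can be written in a few lines. This is the textbook per-pair loss, not the project's training code; `beta=0.1` matches the value listed for `HumorGen_DPO_7B`, and the log-probabilities in the example are made-up numbers.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full joke under either
    the policy being trained or the frozen reference (here, the SFT student).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Made-up log-probabilities for illustration.
no_preference = dpo_loss(-40.0, -42.0, -40.0, -42.0)   # policy equals reference
learned_preference = dpo_loss(-38.0, -45.0, -40.0, -42.0)
print(round(no_preference, 4), round(learned_preference, 4))
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen joke's likelihood relative to the rejected one.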

O-GRPO · offline

Offline Group Relative Policy Optimization

Looks at the full pool of joke drafts per headline, scored in advance.
An offline variant — no new jokes generated during training.
Consistently underperforms the simpler approaches.
The training signal ends up punishing weak jokes more than rewarding strong ones.
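The group-relative part can be sketched as a per-headline normalisation of pre-computed judge scores. This follows the usual GRPO advantage formula; the project's exact offline variant may differ, and the scores below are invented for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(scores, eps=1e-8):
    """Normalise each draft's score against its own group (one headline)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

# Invented scores for a small group: three decent drafts and one weak one.
scores = [0.8, 0.8, 0.8, 0.0]
advantages = group_relative_advantages(scores)
print([round(a, 3) for a in advantages])
```

In a group like this one, where most drafts are already decent, the single weak draft receives the largest-magnitude advantage. That is only an illustration of how an asymmetric signal can arise from this normalisation, not a reproduction of the paper's analysis.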

key finding · data quality ceiling

The takeaway: choosing the right training jokes matters more than scaling up the model. Our small model beats systems many times its size and holds its own against frontier models. Adding preference tuning on top brings no meaningful gain once the data is good — a data quality ceiling. Forcing the model to "think out loud" before each joke can even make it less funny — the explainer trap.

Evaluation

How does it actually compare?

We ranked our models against 13 others — including frontier systems from OpenAI, Google, and others — on two humor benchmarks. Each model's jokes are compared head-to-head by an AI judge, and the results are turned into a leaderboard. Higher scores mean the model's jokes were judged funnier, more often. Full tables are in the paper.
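The leaderboards report a "BT rating" per model. One common way to obtain such a rating is to fit a Bradley-Terry model to the pairwise outcomes; the sketch below uses the classic minorization-maximization updates. The 1000-point centre and the 400-point log10 scale are assumptions borrowed from Elo conventions, and the match list is a toy example, so this should be read as an illustration rather than the paper's exact procedure.

```python
import math
from collections import defaultdict

def fit_bradley_terry(comparisons, iterations=200):
    """Fit strengths from (winner, loser) pairs.

    Assumes every model wins at least once; otherwise its strength
    collapses to zero and the update is ill-defined.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                pair_counts[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

def to_rating(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale centred on `base` (assumed scale)."""
    log_mean = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: base + scale * (math.log10(s) - log_mean) for m, s in strength.items()}

# Toy outcomes: A usually beats B and C, B usually beats C.
matches = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
ratings = to_rating(fit_bradley_terry(matches))
print({m: round(r, 1) for m, r in sorted(ratings.items())})
```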

400 prompts across 8 different styles of input · 42,000 head-to-head comparisons. Tests whether a model can be funny beyond news headlines; no model was trained on this set.

Rank	Model	BT rating	95% CI
1	GPT-5	1336.18	1323.3 – 1348.3
2	Kimi-K2	1259.98	1249.7 – 1268.5
3	HumorGen SFT-7B	1128.14	1118.3 – 1138.1
4	HumorGen DPO-7B	1123.72	1115.7 – 1134.9
5	HumorGen DPO-Think-7B	1116.65	1107.9 – 1127.1
6	HumorGen SFT-Think-7B	1085.31	1075.8 – 1096.5
7	HumorGen GRPO-7B	1071.13	1060.8 – 1080.1
8	Gemini-2.5-Pro	1059.07	1049.3 – 1068.4
9	HumorGen GRPO-Think-7B	1055.94	1043.8 – 1066.8
10	GPT-OSS-120B	1048.19	1039.7 – 1057.1
11	Qwen3-32B	990.44	981.4 – 999.4
12	phi2-Humor	803.72	794.5 – 818.2
13	HumorGen-Com-7B	665.93	645.5 – 680.0
14	Base Qwen-7B	643.01	628.3 – 658.0
15	JokeGPT	612.58	597.6 – 627.4

50 news headlines · 5,250 head-to-head comparisons. The established benchmark for humor generation from headlines.

Rank	Model	BT rating	95% CI
1	GPT-5	1378.73	1346.1 – 1421.5
2	Kimi-K2	1279.63	1245.0 – 1322.1
3	Gemini-2.5-Pro	1247.80	1212.4 – 1279.7
4	HumorGen SFT-7B	1140.37	1107.9 – 1173.2
5	HumorGen DPO-7B	1135.25	1101.9 – 1160.2
6	HumorGen GRPO-7B	1089.84	1060.7 – 1114.1
7	GPT-OSS-120B	1049.99	1019.8 – 1081.5
8	HumorGen SFT-Think-7B	1049.99	1016.3 – 1084.5
9	HumorGen DPO-Think-7B	1031.30	1002.1 – 1058.0
10	Qwen3-32B	1023.18	997.0 – 1046.0
11	HumorGen GRPO-Think-7B	948.51	914.7 – 982.6
12	phi2-Humor	791.32	751.8 – 826.1
13	HumorGen-Com-7B	721.97	682.3 – 750.8
14	Base Qwen-7B	673.16	643.1 – 718.0
15	JokeGPT	438.97	384.3 – 500.2
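The comparison counts quoted with each leaderboard are consistent with a full round robin over the 15 listed models, assuming one judged comparison per model pair per prompt. That assumption is ours; the quick check below only confirms that the arithmetic lines up.

```python
from math import comb

num_models = 15                          # rows in each leaderboard
pairs_per_prompt = comb(num_models, 2)   # unordered model pairs

broad_total = pairs_per_prompt * 400     # 400-prompt, 8-style benchmark
headline_total = pairs_per_prompt * 50   # 50-headline benchmark

print(pairs_per_prompt, broad_total, headline_total)
```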

We verified the rankings with a second, independent AI judge — the orderings barely changed. Our top two models land near the top of both leaderboards, ahead of models many times their size.
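A quick way to read the 95% intervals in the two tables is to check whether they overlap. Overlap is only a rough heuristic, not a formal significance test, but it matches the claim that SFT and DPO are hard to separate. The intervals are copied from the leaderboards; the helper name is ours.

```python
def intervals_overlap(a, b):
    """Return True if two (low, high) intervals share any point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi

# 95% intervals copied from the two leaderboards.
broad = {
    "HumorGen SFT-7B": (1118.3, 1138.1),
    "HumorGen DPO-7B": (1115.7, 1134.9),
    "Kimi-K2": (1249.7, 1268.5),
}
headlines = {
    "HumorGen SFT-7B": (1107.9, 1173.2),
    "HumorGen DPO-7B": (1101.9, 1160.2),
}

sft_vs_dpo_broad = intervals_overlap(broad["HumorGen SFT-7B"], broad["HumorGen DPO-7B"])
sft_vs_dpo_headlines = intervals_overlap(headlines["HumorGen SFT-7B"], headlines["HumorGen DPO-7B"])
sft_vs_kimi_broad = intervals_overlap(broad["HumorGen SFT-7B"], broad["Kimi-K2"])
print(sft_vs_dpo_broad, sft_vs_dpo_headlines, sft_vs_kimi_broad)
```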

CLEF 2026 JOKER · Task 4

Extending to constrained humor.

The framework isn't limited to open-ended jokes. JOKER Task 4 is stricter: you're given a pun word and two meanings it has to carry, and you must write a short funny line that hits both senses at once. The model has to satisfy a hard constraint without losing the laugh.

Stage 1 — Humor prior

The model first learns general humor from a large set of headline jokes. This gives it a baseline sense of what's funny before it ever sees a pun-brief.

Stage 2a — English JOKER

The English model then specializes on pun-briefs — learning to hit both required senses while staying funny.

Stage 2b — Multilingual warm-up

French and Spanish start from the same humor baseline and warm up on pun-briefs across all three languages together.

Stage 3 — FR / ES specialization

Each language then gets its own focused pass to specialize. Six open-weight adapters are released — English, French, and Spanish, in two sizes each.
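The staged curriculum can be summarised as a small lookup from adapter to training path. The adapter names follow the naming used in the model collection; the stage labels are paraphrased from the stage descriptions, and the `adapter_lineage` helper is a hypothetical convenience, not part of any released code.

```python
# Training path per language, paraphrased from the stage descriptions.
STAGE_PATHS = {
    "EN": ["Stage 1: humor prior", "Stage 2a: English JOKER"],
    "FR": ["Stage 1: humor prior", "Stage 2b: multilingual warm-up", "Stage 3: FR specialization"],
    "ES": ["Stage 1: humor prior", "Stage 2b: multilingual warm-up", "Stage 3: ES specialization"],
}
SIZES = ["14B", "32B"]

def adapter_lineage():
    """Map each released JOKER adapter name to the stages it went through."""
    return {
        f"HumorGen_JOKER_{lang}_{size}": path
        for lang, path in STAGE_PATHS.items()
        for size in SIZES
    }

for name, path in adapter_lineage().items():
    print(name, "<-", " > ".join(path))
```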

Task 4 pun brief — example from the JOKER test set (English, het_1261)

Pun word

vein

Sense A

blood vessel toward heart

Sense B

to no avail (in vain)

Generated joke (Kimi-CSF system)

“He paid the hospital twelve grand for an I.V., watched the bag drip empty into his arm, then flatlined anyway. Turns out the whole transfusion was in vein.”

A valid response to a pun brief has to use the pun word, hit both senses, read naturally in its language, and still be funny. Our experiments show that running the full framework at generation time beats baking it into the model's weights: the in-the-moment search is what carries it. The distilled student models inherit the structure of good puns, but not the breadth of the search.
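A minimal sketch of a generation-time filter for the hard constraint, assuming a simple surface check: it only verifies that the pun word appears as a whole word. Whether both senses land, and whether the line is funny, would still need a judge; the `score` callable below is a hypothetical stand-in for that judge.

```python
import re

def uses_pun_word(text, pun_word):
    """True if the pun word appears as a whole word (case-insensitive)."""
    pattern = rf"\b{re.escape(pun_word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def pick_best(candidates, pun_word, score):
    """Drop candidates that miss the pun word, then keep the top-scoring one."""
    valid = [c for c in candidates if uses_pun_word(c, pun_word)]
    return max(valid, key=score) if valid else None

candidates = [
    "Turns out the whole transfusion was in vein.",
    "The nurse tried, but it was all in vain.",   # wrong spelling of the pun word
    "Vein.",
]
# len is a placeholder scorer; a real system would call a judge model here.
best = pick_best(candidates, "vein", score=len)
print(best)
```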

Model collection

Open weights, all of them.

14 open-weight models, released as lightweight adapters you load onto the matching base model. Collection Jayi2424/HumorGen · landing repo Jayi2424/HumorGen. Apache-2.0.

Core HumorGen — 7B

Open-ended headline humor · full comparison across training methods · our top performers

Model	Training	CSD	Hugging Face
`HumorGen_SFT_7B`	Supervised Fine-Tuning	—	link ↗
`HumorGen_SFT_Think_7B`	SFT + CSD traces	yes	link ↗
`HumorGen_DPO_7B`	DPO (β=0.1, from SFT)	—	link ↗
`HumorGen_DPO_Think_7B`	DPO (from SFT-Think)	yes	link ↗
`HumorGen_GRPO_7B`	O-GRPO (G=24, from SFT)	—	link ↗
`HumorGen_GRPO_Think_7B`	O-GRPO + CSD traces	yes	link ↗

Multilingual Base — 14B & 32B

General humor baseline trained on English headlines · the starting point for the pun-brief models

Model	Scale	Base model	Hugging Face
`HumorGen_SFT_14B`	14B	Qwen3-14B	link ↗
`HumorGen_SFT_32B`	32B	Qwen3-32B	link ↗

CLEF 2026 JOKER Task 4 — Constrained Pun Generation

Constrained pun-brief generation · English, French & Spanish · two sizes each

Model	Language	Scale	Hugging Face
`HumorGen_JOKER_EN_14B`	English	14B	link ↗
`HumorGen_JOKER_EN_32B`	English	32B	link ↗
`HumorGen_JOKER_FR_14B`	French	14B	link ↗
`HumorGen_JOKER_FR_32B`	French	32B	link ↗
`HumorGen_JOKER_ES_14B`	Spanish	14B	link ↗
`HumorGen_JOKER_ES_32B`	Spanish	32B	link ↗

Usage

Load an adapter in four lines.

Every model is a PEFT LoRA adapter — load the base, apply the adapter, generate.

# Core HumorGen 7B: open-ended joke for a news headline.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model, then apply the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "Jayi2424/HumorGen_SFT_7B")

headline = "Scientists discover that staring at spreadsheets increases sadness by 94%"
# ChatML-formatted prompt: system instruction, then the headline as the user turn.
prompt = (
    "<|im_start|>system\nYou are a comedy writer. Write one sharp, witty joke for the headline.\n<|im_end|>\n"
    f"<|im_start|>user\n{headline}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is set explicitly so temperature and top_p take effect.
outputs = model.generate(
    **inputs, max_new_tokens=120, do_sample=True, temperature=0.9, top_p=0.95)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

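# JOKER Task 4 (English, 32B): constrained pun generation from a pun brief.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model, then apply the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "Jayi2424/HumorGen_JOKER_EN_32B")

# The pun brief: one pun word and the two senses the joke must carry.
pun_word = "vein"
sense_a = "blood vessel toward heart"
sense_b = "to no avail (in vain)"
prompt = (
    "Write a humorous text for this pun brief.\n\n"
    f"Language: English\nPun word: {pun_word}\n"
    f"Sense A: {sense_a}\nSense B: {sense_b}\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is set explicitly so temperature and top_p take effect.
outputs = model.generate(
    **inputs, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.95)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))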
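If you call the models repeatedly, the two prompt formats used above can be wrapped in small helpers. The function names are ours, and the `language` parameter is an assumption: the snippets above only show the English brief, so whether the French and Spanish adapters expect `Language: French` or `Language: Spanish` in the same slot is not confirmed here.

```python
def build_headline_prompt(headline):
    """ChatML prompt for the open-ended headline models, as in the 7B snippet."""
    return (
        "<|im_start|>system\nYou are a comedy writer. Write one sharp, witty joke for the headline.\n<|im_end|>\n"
        f"<|im_start|>user\n{headline}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def build_pun_brief_prompt(pun_word, sense_a, sense_b, language="English"):
    """Plain-text pun brief, as in the JOKER snippet (language slot assumed)."""
    return (
        "Write a humorous text for this pun brief.\n\n"
        f"Language: {language}\nPun word: {pun_word}\n"
        f"Sense A: {sense_a}\nSense B: {sense_b}\n"
    )

print(build_pun_brief_prompt("vein", "blood vessel toward heart", "to no avail (in vain)"))
```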

Papers

Read the work.

arXiv 2026

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Ajayi, E. & Mitra, P.

View PDF ↗

CLEF 2026 · Working Notes

Cross-Lingual Cognitive Synergy for Constrained Humor Generation in LLMs

Ajayi, E. & Mitra, P. · CLEF 2026 JOKER Track

View PDF ↗

Citation

Cite HumorGen.

arXiv

@misc{ajayi2026humorgen,
  title         = {HumorGen: Cognitive Synergy for Humor Generation in Large Language
                   Models via Persona-Based Distillation},
  author        = {Ajayi, Edward and Mitra, Prasenjit},
  year          = {2026},
  eprint        = {2604.09629},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.09629}
}

CLEF 2026

@inproceedings{ajayi2026joker,
  title     = {Cross-Lingual Cognitive Synergy for Constrained Humor Generation
               in LLMs: SaLT Lab at the CLEF 2026 JOKER Track},
  author    = {Ajayi, Edward and Mitra, Prasenjit},
  booktitle = {Working Notes of CLEF 2026},
  year      = {2026},
  url       = {https://edwardajayi.github.io/assets/papers/HumorGen-JOKER.pdf}
}

An open-weight ecosystem for computational humor.

The Cognitive Synergy Framework: six personas (The Neurotic, The Cynic, The Observer, The Wordsmith, The Optimist, The Absurdist).

Tension release
