HumorGen teaches language models to be genuinely funny — not just to produce text that sounds like a joke. Instead of asking one model to be funny, it runs six different comedic personalities in parallel and keeps the jokes that actually land.
We start with large AI models acting as teachers. For every headline, each teacher writes jokes from six different comedic angles — the worrier, the cynic, the absurdist, and so on. A strong AI judge then compares the drafts head-to-head and keeps the funniest. A small, efficient student model learns from those winning jokes, so it can be funny on its own.
Our small model ranks among the strongest open humor models available, beating systems many times its size and holding its own against frontier models. Adding popular preference-tuning methods on top didn't help — once the training jokes are diverse and well-chosen, you've already captured most of the gain. We call this the data quality ceiling.
Language models struggle to be funny — not because they lack knowledge of humor, but because of how they are trained.
Standard training maximizes the likelihood of the next word. The most probable continuation tends to be the safest, most generic one, so the model is pulled toward the center of the distribution, away from anything surprising.
Humor lives at the edges: the unexpected turn of phrase, the precise word that collapses two meanings at once, the observation that makes a situation suddenly absurd. The funniest line is rarely the most likely one.
Training a model to "be funny" with a single instruction does not work — it produces outputs that scan as jokes but fail to land. HumorGen's premise: a single headline supports multiple valid comedic interpretations. Teach the model that diversity, and the laughs follow.
A Mixture-of-Thought method that structures humor generation as an ensemble of six cognitive personas, each grounded in psychological theory.
Instead of generating one joke and hoping it lands, the Cognitive Synergy Framework writes many jokes at once — each through a different comedic lens. This pushes the model toward surprising, offbeat territory where humor actually lives, rather than the safe, obvious punchline. An AI judge then runs a tournament across all the drafts to decide which ones are worth teaching to the student model.
| Persona | Theory | Mechanism | Typical material |
|---|---|---|---|
| The Neurotic | Relief Theory | Tension release | Internal anxiety, overthinking, social insecurity |
| The Cynic | Superiority Theory | Social critique | Hypocrisy, biting sarcasm, moral contradictions |
| The Observer | Incongruity Theory | Social mapping | Mundane minutiae and unwritten awkward social norms |
| The Wordsmith | Linguistic Theory | Ambiguity | Puns, double entendres, phonological play |
| The Optimist | Benign Violation Theory | Recontextualization | Wholesome misinterpretations of negative traits |
| The Absurdist | Incongruity Theory | Surrealism | Non-sequiturs, dream logic, fractured causality |
Humor discharges built-up psychological tension. Grounds The Neurotic.
Humor arises from perceived superiority or social critique. Grounds The Cynic.
Humor arises when incompatible interpretations meet. Grounds The Observer (social mapping) and The Absurdist (surrealism).
Phonological ambiguity grounds The Wordsmith; harmless norm violations ground The Optimist.
For each headline, every personality produces several joke drafts. No personality is hand-picked as the best — an AI judge compares all the drafts and lets the funniest ones earn their place in training.
Instead of a single attempt, the framework generates many joke drafts per headline, each shaped by a different comedic personality. This steers generation toward the surprising territory where humor lives — away from the safe, generic punchline a model would default to.
A strong AI judge compares the drafts two at a time and ranks them. Only the top-ranked jokes are passed down to a smaller, efficient student model, which learns to produce that quality on its own.
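The generate-then-rank loop described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the persona names come from this page, but `collect_drafts`, `tournament`, the round-robin scheme, and the toy writer and judge are all hypothetical stand-ins for the teacher models and the AI judge.

```python
from itertools import combinations
from typing import Callable

# Persona names are taken from the page; everything else is hypothetical.
PERSONAS = ["Neurotic", "Cynic", "Observer", "Wordsmith", "Optimist", "Absurdist"]


def collect_drafts(headline: str, write: Callable[[str, str], list[str]]) -> list[tuple[str, str]]:
    """Ask every persona for drafts and pool them as (persona, joke) pairs."""
    return [(p, joke) for p in PERSONAS for joke in write(p, headline)]


def tournament(drafts, prefers_first: Callable[[str, str], bool], keep: int):
    """Round-robin: compare every pair once, count wins, keep the top `keep`."""
    wins = [0] * len(drafts)
    for i, j in combinations(range(len(drafts)), 2):
        if prefers_first(drafts[i][1], drafts[j][1]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(drafts)), key=lambda k: wins[k], reverse=True)
    return [drafts[k] for k in order[:keep]]


# Stand-ins so the sketch runs without any model: a template "teacher"
# and a toy "judge" that simply prefers the shorter draft.
def toy_writer(persona: str, headline: str) -> list[str]:
    return [f"[{persona}] take {n} on: {headline}" + "!" * n for n in range(2)]


def toy_judge(a: str, b: str) -> bool:
    return len(a) <= len(b)


pool = collect_drafts("Local man discovers inbox zero", toy_writer)
winners = tournament(pool, toy_judge, keep=3)
```

No persona is privileged in the pool: a draft survives only by winning comparisons, which matches the idea that the judge, not the designer, decides what reaches the student. A real judge would be an LLM call, and a full round-robin grows quadratically with the number of drafts, so a production setup might sample pairs instead.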
Large teacher models write many joke drafts for each headline, each from a different comedic angle. This builds a broad, varied pool of candidates instead of one safe attempt.
A strong AI judge compares the drafts head-to-head and ranks them. The funniest jokes become training material for a smaller, efficient student model that learns to be funny on its own. Some variants also learn the reasoning behind each joke.
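To make "the funniest jokes become training material" concrete, here is a hedged sketch of turning one winning joke into a chat-style fine-tuning record. The system prompt is copied from the usage snippet on this page; reusing it for training, the `to_sft_record` helper, the JSONL layout, and the way a reasoning trace is prepended for the "Think" variants are all assumptions, not details confirmed by the source.

```python
import json
from typing import Optional

# Copied from the usage snippet on this page; its use in training is assumed.
SYSTEM_PROMPT = "You are a comedy writer. Write one sharp, witty joke for the headline."


def to_sft_record(headline: str, joke: str, reasoning: Optional[str] = None) -> dict:
    """Build one chat-style training example from a tournament winner."""
    # Hypothetical layout for reasoning variants: trace first, joke last.
    answer = joke if reasoning is None else f"{reasoning}\n\n{joke}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": headline},
            {"role": "assistant", "content": answer},
        ]
    }


# Made-up headline and joke, only to show the record shape.
record = to_sft_record("Local man discovers inbox zero", "He celebrated by emailing everyone.")
line = json.dumps(record, ensure_ascii=False)  # one line of a JSONL training file
```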
We test two popular preference-tuning methods on top: DPO (Direct Preference Optimization) and O-GRPO (a group-relative variant). Neither beats the simpler approach of supervised fine-tuning alone: once the training jokes are well-chosen, the extra tuning adds little.
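For readers unfamiliar with the two objectives, the sketch below shows the standard per-pair DPO loss and the plain GRPO-style group normalization. It is illustrative only: the function names are hypothetical, `beta=0.1` mirrors the value listed in the model table on this page, and the exact changes that make the variant "O-GRPO" are not described here, so the second function shows vanilla group-relative advantages rather than the authors' method.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from summed log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward standardized within its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The DPO loss falls below log 2 as soon as the policy favors the chosen joke more strongly than the reference model does. In the group-relative case, a joke is rewarded only for beating the other samples drawn for the same headline; implementations differ on details such as population versus sample standard deviation.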
The takeaway: choosing the right training jokes matters more than scaling up the model. Our small model beats systems many times its size and holds its own against frontier models. Adding preference tuning on top brings no meaningful gain once the data is good — a data quality ceiling. Forcing the model to "think out loud" before each joke can even make it less funny — the explainer trap.
We ranked our models against 13 others — including frontier systems from OpenAI, Google, and others — on two humor benchmarks. Each model's jokes are compared head-to-head by an AI judge, and the results are turned into a leaderboard. Higher scores mean the model's jokes were judged funnier, more often. Full tables are in the paper.
| Rank | Model | BT rating | 95% CI |
|---|---|---|---|
| 1 | GPT-5 | 1336.18 | 1323.3 – 1348.3 |
| 2 | Kimi-K2 | 1259.98 | 1249.7 – 1268.5 |
| 3 | HumorGen SFT-7B | 1128.14 | 1118.3 – 1138.1 |
| 4 | HumorGen DPO-7B | 1123.72 | 1115.7 – 1134.9 |
| 5 | HumorGen DPO-Think-7B | 1116.65 | 1107.9 – 1127.1 |
| 6 | HumorGen SFT-Think-7B | 1085.31 | 1075.8 – 1096.5 |
| 7 | HumorGen GRPO-7B | 1071.13 | 1060.8 – 1080.1 |
| 8 | Gemini-2.5-Pro | 1059.07 | 1049.3 – 1068.4 |
| 9 | HumorGen GRPO-Think-7B | 1055.94 | 1043.8 – 1066.8 |
| 10 | GPT-OSS-120B | 1048.19 | 1039.7 – 1057.1 |
| 11 | Qwen3-32B | 990.44 | 981.4 – 999.4 |
| 12 | phi2-Humor | 803.72 | 794.5 – 818.2 |
| 13 | HumorGen-Com-7B | 665.93 | 645.5 – 680.0 |
| 14 | Base Qwen-7B | 643.01 | 628.3 – 658.0 |
| 15 | JokeGPT | 612.58 | 597.6 – 627.4 |

| Rank | Model | BT rating | 95% CI |
|---|---|---|---|
| 1 | GPT-5 | 1378.73 | 1346.1 – 1421.5 |
| 2 | Kimi-K2 | 1279.63 | 1245.0 – 1322.1 |
| 3 | Gemini-2.5-Pro | 1247.80 | 1212.4 – 1279.7 |
| 4 | HumorGen SFT-7B | 1140.37 | 1107.9 – 1173.2 |
| 5 | HumorGen DPO-7B | 1135.25 | 1101.9 – 1160.2 |
| 6 | HumorGen GRPO-7B | 1089.84 | 1060.7 – 1114.1 |
| 7 | GPT-OSS-120B | 1049.99 | 1019.8 – 1081.5 |
| 8 | HumorGen SFT-Think-7B | 1049.99 | 1016.3 – 1084.5 |
| 9 | HumorGen DPO-Think-7B | 1031.30 | 1002.1 – 1058.0 |
| 10 | Qwen3-32B | 1023.18 | 997.0 – 1046.0 |
| 11 | HumorGen GRPO-Think-7B | 948.51 | 914.7 – 982.6 |
| 12 | phi2-Humor | 791.32 | 751.8 – 826.1 |
| 13 | HumorGen-Com-7B | 721.97 | 682.3 – 750.8 |
| 14 | Base Qwen-7B | 673.16 | 643.1 – 718.0 |
| 15 | JokeGPT | 438.97 | 384.3 – 500.2 |
We verified the rankings with a second, independent AI judge — the orderings barely changed. Our top two models land near the top of both leaderboards, ahead of models many times their size.
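The "BT rating" column presumably refers to a Bradley-Terry model fitted to the judge's pairwise verdicts. The sketch below shows one common way to do that, using the classic minorize-maximize update; it is an assumption-laden illustration, not the paper's procedure. In particular, `fit_bradley_terry`, `to_rating`, the Elo-style base of 1000 and scale of 400, and the toy win counts are all hypothetical.

```python
import math


def fit_bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with the minorize-maximize update.

    wins[(a, b)] counts how often a's joke was preferred over b's.
    Assumes every model wins at least once, so no strength collapses to zero.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == m)
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                games = wins.get((m, other), 0) + wins.get((other, m), 0)
                denom += games / (strength[m] + strength[other])
            updated[m] = total_wins / denom if denom else strength[m]
        # Strengths are only defined up to a common factor, so pin the
        # geometric mean to 1 after every sweep.
        norm = math.exp(sum(math.log(s) for s in updated.values()) / len(updated))
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def to_rating(strength: float, base: float = 1000.0, scale: float = 400.0) -> float:
    """Map a strength to an Elo-style number (base and scale are assumed)."""
    return base + scale * math.log10(strength)


# Toy counts, not data from the paper.
toy_wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
ratings = {m: to_rating(s) for m, s in fit_bradley_terry(toy_wins).items()}
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because only rating differences are identified, the absolute numbers depend entirely on the chosen base and scale. The confidence intervals in the tables would need an extra step, such as resampling the comparisons, which this sketch does not attempt.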
The framework isn't limited to open-ended jokes. JOKER Task 4 is stricter: you're given a pun word and two meanings it has to carry, and you must write a short funny line that hits both senses at once. The model has to satisfy a hard constraint without losing the laugh.
The model first learns general humor from a large set of headline jokes. This gives it a baseline sense of what's funny before it ever sees a pun-brief.
The English model then specializes on pun-briefs — learning to hit both required senses while staying funny.
French and Spanish start from the same humor baseline and warm up on pun-briefs across all three languages together.
Each language then gets its own focused pass to specialize. Six open-weight adapters are released — English, French, and Spanish, in two sizes each.
A valid pun-brief has to use the pun word, hit both senses, read naturally in its language, and still be funny. Our experiments show that running the full framework at generation time beats baking it into the model's weights: the in-the-moment search is what carries it. The distilled student models inherit the structure of good pun-briefs, but not the breadth of the search.
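Of the four validity criteria, only the first (the line uses the pun word) can be checked mechanically. The sketch below shows that check under stated assumptions: the `PunBrief` class and `uses_pun_word` helper are hypothetical, the field values reuse the "vein" example from this page, and the candidate line is made up. Whether a line hits both senses, reads naturally, and is funny still needs a judge.

```python
import re
from dataclasses import dataclass


@dataclass
class PunBrief:
    language: str
    pun_word: str
    sense_a: str
    sense_b: str


def uses_pun_word(brief: PunBrief, line: str) -> bool:
    """True if the pun word appears in the line as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(brief.pun_word) + r"\b"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


brief = PunBrief("English", "vein", "blood vessel toward heart", "to no avail (in vain)")
# A made-up candidate line, only to exercise the check.
print(uses_pun_word(brief, "The phlebotomist quit: every search ended in vein."))  # True
```

The whole-word match is deliberately strict, so inflected forms such as "veins" are rejected; a multilingual setup would likely need lemmatization for French and Spanish.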
14 open-weight models, released as lightweight adapters you load onto the matching base model. Collection Jayi2424/HumorGen · landing repo Jayi2424/HumorGen. Apache-2.0.
| Model | Training | CSD | Hugging Face |
|---|---|---|---|
| HumorGen_SFT_7B | Supervised Fine-Tuning | — | link ↗ |
| HumorGen_SFT_Think_7B | SFT + CSD traces | yes | link ↗ |
| HumorGen_DPO_7B | DPO (β=0.1, from SFT) | — | link ↗ |
| HumorGen_DPO_Think_7B | DPO (from SFT-Think) | yes | link ↗ |
| HumorGen_GRPO_7B | O-GRPO (G=24, from SFT) | — | link ↗ |
| HumorGen_GRPO_Think_7B | O-GRPO + CSD traces | yes | link ↗ |
Every model is a PEFT LoRA adapter — load the base, apply the adapter, generate.
```python
# Headline jokes: load the Qwen2.5-7B-Instruct base, then apply the SFT adapter.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "Jayi2424/HumorGen_SFT_7B")

headline = "Scientists discover that staring at spreadsheets increases sadness by 94%"
# ChatML prompt written out by hand: system instruction, headline, open assistant turn.
prompt = (
    "<|im_start|>system\nYou are a comedy writer. Write one sharp, witty joke for the headline.\n<|im_end|>\n"
    f"<|im_start|>user\n{headline}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True makes sampling explicit so temperature and top_p are not silently ignored.
outputs = model.generate(
    **inputs, max_new_tokens=120, do_sample=True, temperature=0.9, top_p=0.95)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
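```python
# Pun briefs (JOKER Task 4): load the Qwen3-32B base, then apply the English adapter.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "Jayi2424/HumorGen_JOKER_EN_32B")

# The brief: one pun word and the two senses the line has to carry.
pun_word = "vein"
sense_a = "blood vessel toward heart"
sense_b = "to no avail (in vain)"
prompt = (
    "Write a humorous text for this pun brief.\n\n"
    f"Language: English\nPun word: {pun_word}\n"
    f"Sense A: {sense_a}\nSense B: {sense_b}\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True makes sampling explicit so temperature and top_p are not silently ignored.
outputs = model.generate(
    **inputs, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.95)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```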
Ajayi, E. & Mitra, P.
Ajayi, E. & Mitra, P. · CLEF 2026 JOKER Track
```bibtex
@misc{ajayi2026humorgen,
  title = {HumorGen: Cognitive Synergy for Humor Generation in Large Language
           Models via Persona-Based Distillation},
  author = {Ajayi, Edward and Mitra, Prasenjit},
  year = {2026},
  eprint = {2604.09629},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2604.09629}
}

@inproceedings{ajayi2026joker,
  title = {Cross-Lingual Cognitive Synergy for Constrained Humor Generation
           in LLMs: SaLT Lab at the CLEF 2026 JOKER Track},
  author = {Ajayi, Edward and Mitra, Prasenjit},
  booktitle = {Working Notes of CLEF 2026},
  year = {2026},
  url = {https://edwardajayi.github.io/assets/papers/HumorGen-JOKER.pdf}
}
```