Open weights · Apache-2.0

An open-weight ecosystem for computational humor.

HumorGen teaches language models to be genuinely funny — not just to produce text that sounds like a joke. Instead of asking one model to be funny, it runs six different comedic personalities in parallel and keeps the jokes that actually land.

Open-weight · 6 comedic personalities · 14 open models · English · French · Spanish · Apache-2.0 · Beats models 18× its size
Overview

The whole story, in one paragraph.

What we do

Six comedic personalities, one small model.

We start with large AI models acting as teachers. For every headline, each teacher writes jokes from six different comedic angles — the worrier, the cynic, the absurdist, and so on. A strong AI judge then compares the drafts head-to-head and keeps the funniest. A small, efficient student model learns from those winning jokes, so it can be funny on its own.

What we find

Good data beats big models — and fancy training tricks hit a wall.

Our small model ranks among the strongest open humor models available, beating systems many times its size and holding its own against frontier models. Adding popular preference-tuning methods on top didn't help — once the training jokes are diverse and well-chosen, you've already captured most of the gain. We call this the data quality ceiling.

The problem

Funny lives at the edges of the distribution.

Language models struggle to be funny — not because they lack knowledge of humor, but because of how they are trained.

The objective mismatch

Next-token prediction rewards the median.

Standard training maximizes the likelihood of the next token. The most probable continuation tends to be the safest, most generic one, so the model is pulled toward the center of the distribution, away from anything surprising.
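For reference, the standard next-token objective can be written as below. This is the textbook formulation rather than anything specific to HumorGen; the symbols $\theta$ (model parameters), $x_t$ (the token at position $t$), and $T$ (sequence length) are introduced here only for illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Minimizing this loss rewards putting probability mass on whatever most often follows a given context in the training data, which is why decoding toward the mode tends to favor generic continuations.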

What comedy demands

Incongruity lives in the tail.

Humor lives at the edges: the unexpected turn of phrase, the precise word that collapses two meanings at once, the observation that makes a situation suddenly absurd. The funniest line is rarely the most likely one.

Training a model to "be funny" with a single instruction does not work — it produces outputs that scan as jokes but fail to land. HumorGen's premise: a single headline supports multiple valid comedic interpretations. Teach the model that diversity, and the laughs follow.

Our approach

The Cognitive Synergy Framework.

A Mixture-of-Thought method that structures humor generation as an ensemble of six cognitive personas, each grounded in psychological theory.

Instead of generating one joke and hoping it lands, the Cognitive Synergy Framework writes many jokes at once — each through a different comedic lens. This pushes the model toward surprising, offbeat territory where humor actually lives, rather than the safe, obvious punchline. An AI judge then runs a tournament across all the drafts to decide which ones are worth teaching to the student model.

Our cognitive personas · 6 personalities

| Persona | Theory | Approach | In practice |
| --- | --- | --- | --- |
| 😟 The Neurotic | Relief Theory | Tension release | Internal anxiety, overthinking, social insecurity |
| 😏 The Cynic | Superiority Theory | Social critique | Hypocrisy, biting sarcasm, moral contradictions |
| 👀 The Observer | Incongruity Theory | Social mapping | Mundane minutiae and unwritten awkward social norms |
| 🔤 The Wordsmith | Linguistic Theory | Ambiguity | Puns, double entendres, phonological play |
| 😊 The Optimist | Benign Violation Theory | Recontextualization | Wholesome misinterpretations of negative traits |
| 🌀 The Absurdist | Incongruity Theory | Surrealism | Non-sequiturs, dream logic, fractured causality |

The humor theories behind them · 4 theories

Relief theory

Tension release

Humor discharges built-up psychological tension. Grounds The Neurotic.

Superiority theory

Laughter at another's expense

Humor arises from perceived superiority or social critique. Grounds The Cynic.

Incongruity theory

Mismatched frames collide

Humor arises when incompatible interpretations meet. Grounds The Observer (social mapping) and The Absurdist (surrealism).

Linguistic · Benign violation

Wordplay & safe rule-breaking

Phonological ambiguity grounds The Wordsmith; harmless norm violations ground The Optimist.

For each headline, every personality produces several joke drafts. No personality is hand-picked as the best — an AI judge compares all the drafts and lets the funniest ones earn their place in training.
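The fan-out described above, one headline expanded into several drafts per persona, might be sketched as follows. The persona names and focus areas come from the persona cards; the prompt wording, the `Persona` class, the `build_persona_prompts` helper, and the default of three drafts per persona are illustrative assumptions, not the authors' actual prompts or settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    theory: str
    focus: str


# Names, theories, and focus areas as listed on this page.
PERSONAS = [
    Persona("The Neurotic", "Relief Theory", "internal anxiety, overthinking, social insecurity"),
    Persona("The Cynic", "Superiority Theory", "hypocrisy, biting sarcasm, moral contradictions"),
    Persona("The Observer", "Incongruity Theory", "mundane minutiae and unwritten awkward social norms"),
    Persona("The Wordsmith", "Linguistic Theory", "puns, double entendres, phonological play"),
    Persona("The Optimist", "Benign Violation Theory", "wholesome misinterpretations of negative traits"),
    Persona("The Absurdist", "Incongruity Theory", "non-sequiturs, dream logic, fractured causality"),
]


def build_persona_prompts(headline: str, drafts_per_persona: int = 3) -> list[dict]:
    """Return one generation request per (persona, draft) pair for a headline."""
    requests = []
    for persona in PERSONAS:
        for draft_index in range(drafts_per_persona):
            requests.append({
                "persona": persona.name,
                "draft": draft_index,
                # Hypothetical prompt wording, for illustration only.
                "prompt": (
                    f"You are {persona.name}, a comedy writer working in the "
                    f"tradition of {persona.theory}. Lean on: {persona.focus}.\n"
                    f"Write one joke about this headline: {headline}"
                ),
            })
    return requests


requests = build_persona_prompts("Local man wins argument with smart fridge")
print(len(requests), "draft requests across", len(PERSONAS), "personas")
```

Each request would then be sent to a teacher model; no persona is privileged at this stage, since selection happens later in the judging step.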

How the framework works · 2 techniques

Mixture-of-Thought

Many jokes, not one.

Instead of a single attempt, the framework generates many joke drafts per headline, each shaped by a different comedic personality. This steers generation toward the surprising territory where humor lives — away from the safe, generic punchline a model would default to.

Silver teacher distillation

Judge, then teach.

A strong AI judge compares the drafts two at a time and ranks them. Only the top-ranked jokes are passed down to a smaller, efficient student model, which learns to produce that quality on its own.
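A minimal sketch of the judge-then-teach selection, assuming a simple round-robin in which every pair of drafts is compared once and drafts are ranked by win count. The `judge` callable is a stand-in for the AI judge; `toy_length_judge` is a placeholder with no comedic merit, and the tournament format and the `keep` cutoff are assumptions rather than the authors' exact procedure.

```python
from itertools import combinations
from typing import Callable

# judge(a, b) returns 0 if draft a is funnier, 1 if draft b is funnier.
Judge = Callable[[str, str], int]


def rank_by_pairwise_wins(drafts: list[str], judge: Judge) -> list[str]:
    """Compare every pair of drafts once and sort by number of wins."""
    wins = [0] * len(drafts)
    for i, j in combinations(range(len(drafts)), 2):
        winner = i if judge(drafts[i], drafts[j]) == 0 else j
        wins[winner] += 1
    order = sorted(range(len(drafts)), key=lambda k: wins[k], reverse=True)
    return [drafts[k] for k in order]


def select_training_jokes(drafts: list[str], judge: Judge, keep: int = 2) -> list[str]:
    """Keep only the top-ranked drafts as training material for the student."""
    return rank_by_pairwise_wins(drafts, judge)[:keep]


def toy_length_judge(a: str, b: str) -> int:
    """Placeholder judge: prefers the longer draft. A real judge is an LLM call."""
    return 0 if len(a) >= len(b) else 1


print(select_training_jokes(["a", "bbb", "cc"], toy_length_judge, keep=2))
```

A round-robin needs a number of comparisons quadratic in the pool size, so a larger pool would likely need sampling or a bracket; that trade-off is not addressed here.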

Training pipeline

From many joke drafts to one funny model.

01

Teacher generation

Large teacher models write many joke drafts for each headline, each from a different comedic angle. This builds a broad, varied pool of candidates instead of one safe attempt.

02

Rank & teach

A strong AI judge compares the drafts head-to-head and ranks them. The funniest jokes become training material for a smaller, efficient student model that learns to be funny on its own. Some variants also learn the reasoning behind each joke.

03

Does fancy tuning help?

We test two popular preference-tuning methods on top: DPO (direct preference learning) and O-GRPO (a group-relative variant). Neither beats the simpler approach — once the training jokes are well-chosen, the extra tuning adds little.

The alignment techniques we tested · 2 methods

DPO

Direct Preference Optimization

  • Learns from pairs of "funnier" vs "less funny" jokes chosen by the AI judge.
  • Builds on top of the student model from step two.
  • Performs about the same as the simpler approach — the difference isn't statistically meaningful.
  • Our top two models sit right next to each other near the top of the leaderboard.
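For readers unfamiliar with DPO, the per-pair loss from the standard DPO formulation is sketched below. The inputs are summed log-probabilities of the "funnier" (chosen) and "less funny" (rejected) joke under the model being trained and under a frozen reference model. `beta=0.1` matches the value listed for HumorGen_DPO_7B in the model table; the function name and argument names are illustrative, and this is not the authors' training code.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen joke's likelihood relative to the rejected one lowers it.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```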
O-GRPO · offline

Offline Group Relative Policy Optimization

  • Looks at the full pool of joke drafts per headline, scored in advance.
  • An offline variant — no new jokes generated during training.
  • Consistently underperforms the simpler approaches.
  • The training signal ends up punishing weak jokes more than rewarding strong ones.
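The group-relative idea can be illustrated with the usual GRPO-style normalization: each draft's precomputed score is centered on its group mean and divided by the group standard deviation. This is a sketch of that general technique, assumed here for illustration; the page does not spell out the exact O-GRPO objective, and the scores below are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize precomputed scores within one headline's group of drafts."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical judge scores for four drafts of one headline.
scores = [0.9, 0.8, 0.85, 0.1]
for score, advantage in zip(scores, group_relative_advantages(scores)):
    print(f"score={score:.2f} advantage={advantage:+.2f}")
```

In a group like this one, where most drafts are decent and one is poor, the largest-magnitude advantage is the negative one. That is one way a signal can end up dominated by penalties; whether it is the mechanism behind the result reported above is not stated here.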
key finding · data quality ceiling

The takeaway: choosing the right training jokes matters more than scaling up the model. Our small model beats systems many times its size and holds its own against frontier models. Adding preference tuning on top brings no meaningful gain once the data is good — a data quality ceiling. Forcing the model to "think out loud" before each joke can even make it less funny — the explainer trap.

Evaluation

How does it actually compare?

We ranked our models against 13 others — including frontier systems from OpenAI, Google, and others — on two humor benchmarks: the Humor Transfer Bench (our new 400-prompt set) and SemEval MWAHAHA (the established headline benchmark). Each model's jokes are compared head-to-head by an AI judge, and the results are turned into a leaderboard. Higher scores mean the model's jokes were judged funnier, more often. Full tables are in the paper.
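The step from head-to-head outcomes to a leaderboard can be sketched with a Bradley-Terry fit, which is what the "BT rating" column suggests. The version below uses the classic minorization-maximization updates on a list of `(winner, loser)` pairs. The Elo-like conversion (`base=1000`, `scale=400`) is an assumption chosen for readability, not necessarily the scale used for the published numbers, and the comparison data is made up.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                games[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m and frozenset((m, other)) in games
            )
            # Clamp so a model with zero wins keeps a tiny positive strength.
            updated[m] = max(wins[m] / denom, 1e-9) if denom else strength[m]
        # Strengths are only defined up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_rating_scale(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (assumed constants)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical outcomes: A usually beats B, B usually beats C.
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
ratings = to_rating_scale(fit_bradley_terry(comparisons))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(model, round(rating, 1))
```

Confidence intervals like those in the table are commonly obtained by refitting on bootstrap resamples of the comparisons; that step is omitted from this sketch.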

Humor Transfer Bench: 400 prompts across 8 different styles of input · 42,000 head-to-head comparisons. Tests whether a model can be funny beyond news headlines; no model was trained on this set.
| Rank | Model | BT rating | 95% CI |
| --- | --- | --- | --- |
| 1 | GPT-5 | 1336.18 | 1323.3 – 1348.3 |
| 2 | Kimi-K2 | 1259.98 | 1249.7 – 1268.5 |
| 3 | HumorGen SFT-7B | 1128.14 | 1118.3 – 1138.1 |
| 4 | HumorGen DPO-7B | 1123.72 | 1115.7 – 1134.9 |
| 5 | HumorGen DPO-Think-7B | 1116.65 | 1107.9 – 1127.1 |
| 6 | HumorGen SFT-Think-7B | 1085.31 | 1075.8 – 1096.5 |
| 7 | HumorGen GRPO-7B | 1071.13 | 1060.8 – 1080.1 |
| 8 | Gemini-2.5-Pro | 1059.07 | 1049.3 – 1068.4 |
| 9 | HumorGen GRPO-Think-7B | 1055.94 | 1043.8 – 1066.8 |
| 10 | GPT-OSS-120B | 1048.19 | 1039.7 – 1057.1 |
| 11 | Qwen3-32B | 990.44 | 981.4 – 999.4 |
| 12 | phi2-Humor | 803.72 | 794.5 – 818.2 |
| 13 | HumorGen-Com-7B | 665.93 | 645.5 – 680.0 |
| 14 | Base Qwen-7B | 643.01 | 628.3 – 658.0 |
| 15 | JokeGPT | 612.58 | 597.6 – 627.4 |

We verified the rankings with a second, independent AI judge — the orderings barely changed. Our top two models land near the top of both leaderboards, ahead of models many times their size.
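One common way to quantify "the orderings barely changed" is a rank correlation between the two judges' leaderboards. The sketch below computes Kendall's tau for two rankings of the same models with no ties. The first ordering is the top five from the table above; the second judge's ordering is hypothetical, since this page does not list it, and the choice of Kendall's tau is an assumption about how such agreement could be measured.

```python
from itertools import combinations


def kendall_tau(ranking_a: list[str], ranking_b: list[str]) -> float:
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {model: i for i, model in enumerate(ranking_a)}
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both judges order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


judge_one = ["GPT-5", "Kimi-K2", "HumorGen SFT-7B", "HumorGen DPO-7B", "HumorGen DPO-Think-7B"]
# Hypothetical second-judge ordering with one adjacent swap.
judge_two = ["GPT-5", "Kimi-K2", "HumorGen DPO-7B", "HumorGen SFT-7B", "HumorGen DPO-Think-7B"]
print(round(kendall_tau(judge_one, judge_two), 3))
```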

CLEF 2026 JOKER · Task 4

Extending to constrained humor.

The framework isn't limited to open-ended jokes. JOKER Task 4 is stricter: you're given a pun word and two meanings it has to carry, and you must write a short funny line that hits both senses at once. The model has to satisfy a hard constraint without losing the laugh.

The four-stage curriculum · 4 stages

Stage 1 — Humor prior

The model first learns general humor from a large set of headline jokes. This gives it a baseline sense of what's funny before it ever sees a pun-brief.

Stage 2a — English JOKER

The English model then specializes on pun-briefs — learning to hit both required senses while staying funny.

Stage 2b — Multilingual warm-up

French and Spanish start from the same humor baseline and warm up on pun-briefs across all three languages together.

Stage 3 — FR / ES specialization

Each language then gets its own focused pass to specialize. Six open-weight adapters are released — English, French, and Spanish, in two sizes each.
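The curriculum can be summarized as a small lookup: which stages each language passes through, and which adapters come out the other end. The adapter names follow the naming in the model tables on this page; the stage identifiers and helper functions are hypothetical labels for illustration, not configuration keys from the authors' code.

```python
from itertools import product

LANGUAGES = {"EN": "English", "FR": "French", "ES": "Spanish"}
SCALES = ["14B", "32B"]


def curriculum_for(language_code: str) -> list[str]:
    """Return the ordered stages a language goes through (hypothetical labels)."""
    if language_code == "EN":
        return ["stage1_humor_prior", "stage2a_english_joker"]
    return [
        "stage1_humor_prior",
        "stage2b_multilingual_warmup",
        f"stage3_{language_code.lower()}_specialization",
    ]


def joker_adapter_names() -> list[str]:
    """Enumerate the released JOKER adapters: three languages, two sizes each."""
    return [f"HumorGen_JOKER_{code}_{scale}" for code, scale in product(LANGUAGES, SCALES)]


for code in LANGUAGES:
    print(code, "->", " > ".join(curriculum_for(code)))
print(joker_adapter_names())
```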

A pun-brief, in action

Task 4 pun brief: example from the JOKER test set (English, het_1261)
Pun word: vein
Sense A: blood vessel toward heart
Sense B: to no avail (in vain)
Generated joke (Kimi-CSF system): “He paid the hospital twelve grand for an I.V., watched the bag drip empty into his arm, then flatlined anyway. Turns out the whole transfusion was in vein.”

A valid pun-brief has to use the pun word, hit both senses, read naturally in its language, and still be funny. Our experiments show that running the full framework at generation time beats baking it into the model's weights — the in-the-moment search is what carries it. The distilled student models inherit the structure of good pun-briefs, but not the breadth of the search.
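Of the four requirements above, only the first, using the pun word, can be checked mechanically; hitting both senses, reading naturally, and being funny need a judge. A minimal sketch of that surface check, using the example from this page, is below. The `PunBrief` class and `uses_pun_word` function are illustrative names, not part of any released code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PunBrief:
    pun_word: str
    sense_a: str
    sense_b: str


def uses_pun_word(brief: PunBrief, joke: str) -> bool:
    """Check that the pun word appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(brief.pun_word)}\b"
    return re.search(pattern, joke, flags=re.IGNORECASE) is not None


brief = PunBrief("vein", "blood vessel toward heart", "to no avail (in vain)")
joke = "Turns out the whole transfusion was in vein."
print(uses_pun_word(brief, joke))
```

A check like this could serve as a cheap filter before the more expensive judged comparison, so that drafts missing the pun word are never scored.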

Model collections

14 open-weight LoRA adapters on Hugging Face.

Collection Jayi2424/HumorGen · landing repo Jayi2424/HumorGen. Apache-2.0.

Core HumorGen — 7B

Open-ended headline humor · full comparison across training methods · our top performers
| Model | Training | CSD | Hugging Face |
| --- | --- | --- | --- |
| HumorGen_SFT_7B | Supervised Fine-Tuning | | link ↗ |
| HumorGen_SFT_Think_7B | SFT + CSD traces | yes | link ↗ |
| HumorGen_DPO_7B | DPO (β=0.1, from SFT) | | link ↗ |
| HumorGen_DPO_Think_7B | DPO (from SFT-Think) | yes | link ↗ |
| HumorGen_GRPO_7B | O-GRPO (G=24, from SFT) | | link ↗ |
| HumorGen_GRPO_Think_7B | O-GRPO + CSD traces | yes | link ↗ |

Multilingual Base — 14B & 32B

General humor baseline trained on English headlines · the starting point for the pun-brief models
| Model | Scale | Base model | Hugging Face |
| --- | --- | --- | --- |
| HumorGen_SFT_14B | 14B | Qwen3-14B | link ↗ |
| HumorGen_SFT_32B | 32B | Qwen3-32B | link ↗ |

CLEF 2026 JOKER Task 4 — Constrained Pun Generation

Constrained pun-brief generation · English, French & Spanish · two sizes each
| Model | Language | Scale | Hugging Face |
| --- | --- | --- | --- |
| HumorGen_JOKER_EN_14B | English | 14B | link ↗ |
| HumorGen_JOKER_EN_32B | English | 32B | link ↗ |
| HumorGen_JOKER_FR_14B | French | 14B | link ↗ |
| HumorGen_JOKER_FR_32B | French | 32B | link ↗ |
| HumorGen_JOKER_ES_14B | Spanish | 14B | link ↗ |
| HumorGen_JOKER_ES_32B | Spanish | 32B | link ↗ |
Usage

Load an adapter in four lines.

Every model is a PEFT LoRA adapter — load the base, apply the adapter, generate.

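from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model and tokenizer, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "Jayi2424/HumorGen_SFT_7B")

# ChatML-style prompt: system instruction, the headline as the user turn,
# then an open assistant turn for the model to complete.
headline = "Scientists discover that staring at spreadsheets increases sadness by 94%"
prompt = (
    "<|im_start|>system\nYou are a comedy writer. Write one sharp, witty joke for the headline.\n<|im_end|>\n"
    f"<|im_start|>user\n{headline}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature and top_p only take effect when sampling, so request it explicitly.
outputs = model.generate(
    **inputs, max_new_tokens=120, do_sample=True, temperature=0.9, top_p=0.95)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))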
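To run several headlines through the same adapter, the prompt string from the snippet above can be wrapped in a small helper. The format is copied from that snippet as-is; the `build_prompt` name and the second headline are illustrative additions, and whether this exact template matches the one used in training is an assumption.

```python
SYSTEM_PROMPT = "You are a comedy writer. Write one sharp, witty joke for the headline."


def build_prompt(headline: str, system: str = SYSTEM_PROMPT) -> str:
    """Build the ChatML-style prompt used in the usage snippet."""
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{headline}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


headlines = [
    "Scientists discover that staring at spreadsheets increases sadness by 94%",
    "City council votes to rename Monday",  # hypothetical headline
]
prompts = [build_prompt(h) for h in headlines]
print(prompts[0])
```

Each prompt can then be tokenized and passed to `model.generate` as in the snippet above.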
Papers

Read the work.

HumorGen Paper

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Ajayi, E. & Mitra, P.

View PDF ↗
HumorGen-JOKER Paper

Cross-Lingual Cognitive Synergy for Constrained Humor Generation in LLMs

Ajayi, E. & Mitra, P. · CLEF 2026 JOKER Track

View PDF ↗
Citation

Cite HumorGen.

HumorGen Paper
@misc{ajayi2026humorgen,
  title         = {HumorGen: Cognitive Synergy for Humor Generation in Large Language
                   Models via Persona-Based Distillation},
  author        = {Ajayi, Edward and Mitra, Prasenjit},
  year          = {2026},
  eprint        = {2604.09629},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.09629}
}
HumorGen-JOKER Paper
@inproceedings{ajayi2026joker,
  title     = {Cross-Lingual Cognitive Synergy for Constrained Humor Generation
               in LLMs: SaLT Lab at the CLEF 2026 JOKER Track},
  author    = {Ajayi, Edward and Mitra, Prasenjit},
  booktitle = {Working Notes of CLEF 2026},
  year      = {2026},
  url       = {https://edwardajayi.github.io/assets/papers/HumorGen-JOKER.pdf}
}