The Goblins Were a Release Note
GPT started talking about goblins because of reward pressure. Funny, yes. Also a useful public autopsy of the kind of model drift we usually only get to feel.
OpenAI published a post explaining why GPT started talking about goblins.
That sentence should not be load-bearing.
It is.
The story is stupid in the way useful engineering stories often are. Starting around GPT-5.1, the model began reaching for creature metaphors more often. Goblins. Gremlins. Raccoons. Trolls. Little weirdos crawling through otherwise normal answers.
By GPT-5.4, users were noticing. One Hacker News post complained that ChatGPT had used "goblin" three times in four messages: "legal goblins," "hidden exclusions like little goblins," "the important goblin." Beautifully cursed. OpenAI later wrote that after GPT-5.1, "goblin" usage rose 175% and "gremlin" rose 52%.
By GPT-5.5, Codex had enough of the stink on it that OpenAI added a developer instruction to suppress the creature language.
The cause was not magic. It was incentives.
A personality preset called Nerdy rewarded a playful style, and that reward happened to score creature metaphors unusually well. Nerdy was only 2.5% of ChatGPT responses but accounted for 66.7% of goblin mentions. During the audit, the reward signal favored goblin or gremlin outputs in 76.2% of datasets.
Then the behavior escaped the box.
The rewarded examples fed later training. The model saw more of its own creature-flavored outputs. The tic became more available. Eventually, the goblins were no longer neatly contained inside the personality that summoned them.
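You can watch the mechanism in miniature. Here is a toy simulation, with numbers I made up, of what happens when a reward signal mildly prefers one flavor of output and the selected outputs feed the next round of training. This is not OpenAI's pipeline; it is the shape of the loop:

```python
import random

# Toy numbers, invented for illustration. This only shows how a
# mild reward preference compounds across rounds of training on
# the model's own selected outputs.
p_creature = 0.02        # baseline rate of creature metaphors
reward_boost = 1.3       # a playful-style reward mildly favors them
samples = 10_000

for round_num in range(8):
    outputs = ["creature" if random.random() < p_creature else "plain"
               for _ in range(samples)]
    # Reward-weighted selection: creature outputs always survive,
    # plain outputs survive with probability 1 / reward_boost.
    kept = [o for o in outputs
            if random.random() < (1.0 if o == "creature" else 1.0 / reward_boost)]
    # The next round's model inherits the habit at the rate it
    # appears in its own selected outputs.
    p_creature = kept.count("creature") / len(kept)
    print(f"round {round_num}: creature-metaphor rate = {p_creature:.1%}")
```

With these invented numbers, the odds of the tic multiply by roughly the boost factor every round: a 2% habit passes 10% within eight passes. Nothing in the loop is large. The compounding is.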
This is funny.
It is also the kind of public model-behavior autopsy we almost never get.
I wrote in "The Model Does Not Come With a Manual" that LLMs are weird dependencies. New model, new temperament, no real manual, good luck.
The goblin post is the follow-up I did not expect: not a manual, exactly, but a visible cross-section. For once we get to see the tissue under the bruise.
A tiny style preference entered the system. Training rewarded it. The model generalized it. Later data reinforced it. The quirk spread. Users felt it before anyone had a clean explanation.
That is the important part.
Not goblins.
Propagation.
The creature words are just the dye in the water. They make the current visible.
Most of the time, when model behavior shifts, there is no dye. The model feels warmer. Or flatter. Or more eager. Or less willing to make assumptions. Or strangely addicted to caveats. You notice the texture changed, but you cannot point at the exact moving part.
With the goblins, you can count the bodies.
That makes the silly thing serious.
The public conversation around this will probably land in the wrong place, because of course it will.
One side will do the usual "lol AI slop" dance. The other will say it is harmless, charming, a tiny personality artifact, nothing to see here. Both are missing the useful lesson.
The word is harmless.
The mechanism is not.
If a reward signal can accidentally teach a model to overproduce goblins outside the context where goblins were rewarded, then style is not just surface. Style is learned behavior. Learned behavior transfers. Transfer is where the bodies are buried.
That matters for agents.
A chatbot with a goblin habit is annoying. An agent with a rewarded habit is operationally different.
Maybe the habit is over-explaining. Maybe it is apologizing instead of acting. Maybe it is asking permission because "safe" outputs scored better. Maybe it is charging ahead because decisive outputs scored better. Maybe it is smoothing uncertainty into something that sounds more useful than the truth.
None of those failures need to look broken.
That is why they are dangerous.
The answer reads well. The task sort of completes. The summary sounds competent. The user feels a tiny wrongness and cannot immediately prove it.
A goblin is easy to screenshot.
Bad judgment wearing a nice voice is not.
This is where personality stops being decoration.
People talk about model personality like it is a theme setting. Friendly. Efficient. Nerdy. Cynical. Pick your wallpaper.
But once the model is doing work, personality becomes part of the interface.
A deferential model asks before acting. A punchy model may skip nuance. A playful model reaches for metaphor. A careful model may turn a five-minute task into a committee meeting. A confident model may walk past a boundary because the next step seems obvious.
That does not mean personality is bad.
I do not want sterile models. I have no interest in every assistant sounding like an HR policy trapped in a drywall showroom. I like sharp edges. I like taste. I like a system that can say "that plan is stupid" and then fix the plan.
But style needs scope.
Warmth belongs in some places and not others. Playfulness belongs in some places and not others. Initiative belongs in some places and absolutely not others. Money, credentials, deletion, medical advice, legal advice, user trust: different room, different rules.
The goblin failure was not that GPT had style.
The failure was that a style reward leaked.
That distinction is the whole essay.
So the demand should not be "remove personality."
The demand should be: show us the behavioral diff.
Do not tell me a model is "more natural." Tell me where it became more metaphorical.
Do not tell me it has "better instruction following." Tell me whether it asks fewer clarification questions and makes more assumptions.
Do not tell me it has "reduced sycophancy." Tell me whether it pushes back in planning, in emotional conversations, in code review, and under user pressure.
Do not tell me it is "better for agents." Tell me whether it stops when done.
That last one sounds small until you have lived with agents. Stopping when done is a personality trait, a reliability property, and a cost-control mechanism wearing the same coat.
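If you want to test for it, the check can be blunt: hand the agent a task that is already done and see whether the first thing it does is stop. A minimal sketch, assuming you have some harness that runs an agent and records its steps; `run_agent`, `Step`, and the action names are stand-ins for whatever you actually use:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str   # e.g. "tool_call" or "finish"
    detail: str = ""

def assert_stops_when_done(run_agent, finished_task: str, max_steps: int = 5) -> None:
    """Hand the agent a task whose goal state is already satisfied
    and fail loudly if it does anything other than stop.

    run_agent is your own harness: (task, max_steps) -> list[Step].
    """
    steps = run_agent(finished_task, max_steps)
    actions = [s.action for s in steps]
    assert actions and actions[0] == "finish", (
        f"agent kept working on a finished task: {actions}"
    )
```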
The model card I want is not just benchmark scores. It is a behavioral changelog.
More literal with tool schemas.
More likely to continue after partial success.
More likely to ask before file edits.
Less likely to preserve uncertainty in final summaries.
More likely to use analogy when the system prompt permits warmth.
That is not marketing copy. That is operator information.
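Here is a sketch of what a couple of those entries could look like as data. Every field name and figure is invented; the point is the shape, something diffable that an operator can pin a regression test to:

```python
# A sketch of behavioral-changelog entries as structured data.
# All fields and numbers are hypothetical, not any vendor's format.
BEHAVIORAL_CHANGELOG = [
    {
        "behavior": "asks before file edits",
        "direction": "more likely",
        "delta": "+12% vs previous model",           # hypothetical figure
        "measured_on": "500 agent file-edit tasks",  # hypothetical eval
        "operator_note": "expect extra confirmation round-trips in pipelines",
    },
    {
        "behavior": "preserves uncertainty in final summaries",
        "direction": "less likely",
        "delta": "-8% vs previous model",            # hypothetical figure
        "measured_on": "summaries of known-ambiguous inputs",
        "operator_note": "re-check summaries that feed decisions",
    },
]
```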
If AI companies want these systems inside workflows, they need to stop treating behavioral changes as vibes.
Behavioral changes are breaking changes when someone built on the old behavior.
The OpenAI post is valuable because it almost does this.
It says: here is the weird behavior. Here is when we first saw it. Here is how much it moved. Here is where it clustered. Here is the reward signal. Here is how it transferred. Here is what we changed.
Good.
More of that.
Earlier than that.
For things less memeable than goblins.
Sycophancy drift. Refusal drift. Verbosity drift. Tool-use drift. Over-planning drift. Autonomy drift. The model suddenly deciding every answer needs a tiny therapist living inside it.
Those are not aesthetic changes. They change what kind of system you are running.
And if the first reliable detector is "users started complaining on Reddit," the observability story is still too primitive.
Reddit is not monitoring.
Hacker News is not a canary deployment.
Annoyed users are not a metric, even when they are correct.
For builders, the practical conclusion is not glamorous.
Keep your own weirdness ledger.
When a model changes, do not just ask whether it got smarter. Ask what new habits appeared. Search for them. Count them. Keep prompts that catch them. Run the annoying cases: dirty context, half-failed tools, ambiguous permissions, stale memory, user pressure, tasks where the correct move is to stop.
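A minimal version of that ledger can be embarrassingly literal. The sketch below assumes you already log model outputs somewhere you can query; `load_outputs` and the tic list are stand-ins for whatever you actually have:

```python
import re
from collections import Counter

# Patterns your team has previously flagged as tics. The entries
# are examples; the list is the ledger, and it is yours to maintain.
TIC_PATTERNS = [r"\bgoblins?\b", r"\bgremlins?\b", r"\bdelve\b"]

def tic_rates(outputs: list[str]) -> Counter:
    """Occurrences of each tic per 1,000 outputs."""
    counts = Counter()
    for text in outputs:
        lowered = text.lower()
        for pattern in TIC_PATTERNS:
            counts[pattern] += len(re.findall(pattern, lowered))
    per_thousand = 1000 / max(len(outputs), 1)
    return Counter({p: c * per_thousand for p, c in counts.items()})

def drift_report(old_outputs: list[str], new_outputs: list[str],
                 threshold: float = 2.0) -> None:
    """Print any tic whose rate grew more than threshold-x between versions."""
    before, after = tic_rates(old_outputs), tic_rates(new_outputs)
    for pattern in TIC_PATTERNS:
        baseline = before[pattern] or 0.01   # avoid dividing by zero
        if after[pattern] / baseline >= threshold:
            print(f"DRIFT {pattern!r}: {before[pattern]:.2f} -> "
                  f"{after[pattern]:.2f} per 1k outputs")

# drift_report(load_outputs("old-model"), load_outputs("new-model"))
# load_outputs is whatever you already have that returns logged responses.
```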
Treat voice as part of regression testing.
Not because voice is precious. Because voice is often the first place behavior leaks.
When the system starts sounding wrong, something moved. Maybe harmless. Maybe not. But do not wave it away just because the output is still fluent.
Fluency is cheap now.
Behavior is the product.
That is the statement I want this to land on.
If a model is part of a workflow, its habits are part of the workflow.
Not the benchmark score. Not the branding. The habits.
When those habits change, the system changed.
You would not silently change a database isolation level and call it "more natural transactions." You would not tweak a scheduler and describe it as "improved task vibes." You would write the damn release note because operators need to know what moved.
Models deserve the same treatment.
Maybe more, because at least Postgres does not wake up one morning and decide exclusions are hiding like little goblins.
So yes, laugh at the goblins.
I did.
Then write the lesson down.
Small incentives become habits. Habits leak. Leaked habits become product behavior. Product behavior becomes user trust or user damage, depending on where it lands.
The goblins were not the disaster.
The disaster would be needing a visible goblin before admitting the model changed.
The goblins were a release note.
For once, the release note crawled out where we could see it.