LLMs are weird dependencies.

Not because they're magical. Not because they're conscious. Not because of whatever metaphysical slapfight people are having this week.

They're weird because they look like software dependencies, right up until they stop obeying the rules software dependencies trained us to expect.

You call an API. You pass input. You get output. Normal enough.

Then you run the same input through the same thing again, and get something different.

Sometimes it's small. A phrasing change. A slightly different priority. A different assumption about what you meant. Sometimes it follows an instruction perfectly on Monday and needs three reminders on Tuesday. Sometimes it gets cautious where it used to be decisive, verbose where it used to be sharp, literal where it used to infer the obvious.

Same harness. Same prompt. Same task.

Different animal.


In normal software, this would be a bug report.

If a function returns a different result for the same input, you go hunting. Hidden state? Time dependency? Race condition? Random seed? Config drift? Did someone deploy something while you were making coffee?

With LLMs the answer is often: yes, sort of, maybe, and good luck proving which one.

Temperature, sampling, system prompts, context, tool outputs, and conversation history all matter. And even temperature zero is not normal program determinism: provider-side batching and floating-point nondeterminism can still move the output between calls.
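
If you want to see this for yourself, the check is cheap. Here is a minimal sketch using the OpenAI Python SDK's v1.x `chat.completions.create`; the model name is illustrative, and the same experiment works against any provider:

```python
# Same prompt, temperature 0, ten calls: count the distinct outputs.
import hashlib
from collections import Counter

from openai import OpenAI

client = OpenAI()

def distinct_outputs(prompt: str, runs: int = 10) -> Counter:
    """Hash each completion so drift shows up as more than one key."""
    counts: Counter = Counter()
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative; use whatever you actually run
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # as deterministic as the API lets you ask for
        )
        text = resp.choices[0].message.content or ""
        counts[hashlib.sha256(text.encode()).hexdigest()[:8]] += 1
    return counts

# More than one key means the output moved with zero changes on your side.
print(distinct_outputs("Summarize RFC 2119 in one sentence."))
```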

And then the provider ships a new model.

Now your dependency did not just update. It changed temperament.

Sonnet does not feel like Opus. Opus does not feel like GPT. Codex does not feel like ChatGPT, even when the family name looks familiar enough to trick you into false confidence. GPT-5.4 and GPT-5.5 can sit in the same mental drawer and still make different choices in the same harness.

Some models reason better. Some edit code better. Some infer intent better. Some are fast and slightly reckless. Some are careful and maddeningly over-explanatory. Some are brilliant in chat and then start chewing the furniture the moment you put them inside a tool harness.

If you've swapped models inside the same workflow, you know this is not imaginary.

The annoying part is how little of it comes with a manual.


When Postgres changes behavior, there is documentation. Maybe too much documentation. Release notes, migration guides, compatibility warnings, configuration flags, deprecation schedules. You can read the boring parts and know what might break.

When Kubernetes changes, there are API removals, feature gates, changelogs, upgrade notes. Painful, yes. But legible pain.

With models, you get vibes.

"Improved reasoning." "Better instruction following." "Reduced sycophancy."

Cool. What does that mean for my agent that has to triage mail, maintain memory, generate meal plans, run subagents, respect quiet hours, and not turn a five-minute task into a doctoral thesis?

No idea. Try it and find out.

That is the current state of the art: run it, squint, and see what got weird.


This gets nastier once you stop writing chat prompts and start wiring agents into actual workflows.

A chat prompt can survive a lot of drift. If the answer comes back a little different, fine. You can steer it manually.

An agent is not one prompt. It is prompts, tools, memory, retry logic, defaults, guardrails, and all the little assumptions you forgot were load-bearing.

You don't just care whether the model can answer the question. You care whether it keeps the right level of initiative. Whether it knows when to stop. Whether it uses tools before guessing. Whether it asks one useful question instead of three decorative ones. Whether it treats a failed command as a blocker or as a puzzle.

Benchmarks do not tell you whether the thing knows when to shut up.

A model can be "better" and still make your harness worse.

I no longer trust "better" without asking: better at what, under which prompt shape, with which tools, across how many turns, after how much context, and at what failure-mode cost?

Because every model has a personality. We can pretend that word is too human, but engineers use worse metaphors every day. Personality is the practical word for the behavioral surface you actually have to integrate with.

One model needs tight rails. Another needs room. One punishes ambiguity. Another handles ambiguity well but gets too creative near boundaries. One follows the letter of an instruction and misses the point. Another infers the point and quietly violates the letter.

Both can be useful. Both can be dangerous in different places.

You find this out by running them.


So you end up building differently.

Prompts stop being prose. They become fuzzy interface definitions: role, tools, boundaries, done-state, and what to do when reality gets weird.
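
Concretely, that can look like a structure instead of prose. A sketch; the field names and tool IDs are mine, not any framework's:

```python
# A prompt as a fuzzy interface definition rather than prose.
from dataclasses import dataclass

@dataclass
class AgentSpec:
    role: str               # who the agent is, in one line
    tools: list[str]        # the only tools it may use
    boundaries: list[str]   # hard "never do this" constraints
    done_state: str         # what verified completion looks like
    on_weirdness: str       # what to do when reality doesn't match the plan

MAIL_TRIAGE = AgentSpec(
    role="Triage inbound mail. Never send replies.",
    tools=["mail.read", "mail.label", "memory.write"],
    boundaries=["No outbound mail.", "No actions during quiet hours."],
    done_state="Every new message labeled and logged, verified by re-query.",
    on_weirdness="Stop and surface the message instead of guessing.",
)

def render(spec: AgentSpec) -> str:
    """Compile the spec into the actual system prompt text."""
    return "\n".join([
        spec.role,
        "Tools available: " + ", ".join(spec.tools),
        "Hard boundaries: " + " ".join(spec.boundaries),
        "Done means: " + spec.done_state,
        "If something unexpected happens: " + spec.on_weirdness,
    ])
```

The point is not the dataclass. The point is that every field is something a model swap can quietly renegotiate.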

Then you need tests.

Not just unit tests around your code. Behavioral tests around the model. Golden tasks. Regression prompts. Tool-use scenarios. Dirty context cases. Long-running coordination cases. The annoying examples where the old model behaved correctly because it had learned the shape of the harness, and the new one cheerfully does something plausible but wrong.
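
A golden task, sketched as a test. `Transcript` and `run_agent` are hypothetical stand-ins for whatever your harness actually records; the pattern is the point:

```python
# Golden-task regression tests with pytest. The harness pieces below
# are hypothetical stand-ins; wire them to your own agent runner.
from dataclasses import dataclass

import pytest

@dataclass
class Transcript:
    tool_calls: list[str]

    def called(self, tool: str) -> bool:
        return tool in self.tool_calls

def run_agent(task: str, model: str) -> Transcript:
    """Hypothetical: run the task through your agent harness."""
    raise NotImplementedError("wire this to your harness")

GOLDEN_TASKS = [
    # (task prompt, invariant the transcript must satisfy)
    ("Archive all newsletters from last week.",
     lambda t: t.called("mail.archive") and not t.called("mail.send")),
    ("What did we decide about quiet hours?",
     lambda t: t.called("memory.read")),  # consult memory, don't guess
]

@pytest.mark.parametrize("task,check", GOLDEN_TASKS)
def test_golden_task(task, check):
    transcript = run_agent(task, model="model-under-test")
    assert check(transcript), f"behavioral regression on: {task}"
```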

You need to know whether the model still respects your invariants.

Does it still avoid polling loops? Does it still use memory before answering questions about prior decisions? Does it still write the hand-off before spawning the swarm? Does it still understand that done means verified, not merely generated?
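
Most of those invariants can be mechanical checks over a recorded transcript. A sketch, assuming each event is a dict like `{"tool": "mail.read", "args": {...}}`; the tool names are mine:

```python
from collections import Counter

def avoids_polling(events: list[dict], limit: int = 3) -> bool:
    """No tool called with identical args more than `limit` times."""
    seen = Counter((e["tool"], repr(e.get("args"))) for e in events)
    return all(count <= limit for count in seen.values())

def handoff_before_swarm(events: list[dict]) -> bool:
    """The hand-off note must exist before any subagent is spawned."""
    names = [e["tool"] for e in events]
    if "subagent.spawn" not in names:
        return True
    return ("handoff.write" in names
            and names.index("handoff.write") < names.index("subagent.spawn"))
```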

None of this looks impressive in a demo. It is the difference between an agent that feels reliable and an agent that feels haunted.


The worst failures are the soft ones.

A normal dependency breaks loudly. Import error. Type mismatch. Timeout. 500. The system falls over and you know where to look.

Model drift often breaks softly.

The output is reasonable. That's the trap. It reads well. It sounds confident. It might even be useful in isolation. But it violates some local expectation your system relies on.

It asks the user to do something the agent should have done. It summarizes instead of executing. It executes before checking the safety boundary. It reports progress when it only did orientation. It treats a generated artifact as proof.

No stack trace. Just degraded judgment.

You notice because the system feels slightly off.

That is a terrible monitoring story.


I don't think the answer is to avoid model upgrades. That would be silly. Sometimes the new model really is better. It edits cleaner. It holds more context. It stops doing some stupid thing that made you want to walk into the sea.

But you don't get that for free.

Every model upgrade is a migration.

Maybe a small one. Maybe just a smoke test and a few prompt edits. Maybe your whole orchestration style needs adjustment because the new model is more autonomous than the previous one, or less patient, or more likely to obey stale context, or less willing to make reasonable assumptions.
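
One way to keep the migration honest is to pin the model ID like any other dependency and gate the swap on the behavioral suite. A sketch; the model names and the suite runner are illustrative:

```python
# Treat the model ID as a pinned dependency; upgrade only on evidence.
from dataclasses import dataclass

PINNED = "provider/model-2025-01-15"     # what production runs today
CANDIDATE = "provider/model-2025-06-01"  # the proposed upgrade

@dataclass
class SuiteResult:
    passed: set[str]  # IDs of golden tasks that passed

def run_suite(model: str) -> SuiteResult:
    """Hypothetical: run every golden task against the given model."""
    raise NotImplementedError("wire this to your test harness")

def migration_regressions() -> set[str]:
    """Tasks the pinned model passes that the candidate fails."""
    return run_suite(PINNED).passed - run_suite(CANDIDATE).passed

# Empty set: swap the pin. Non-empty: you have prompt work to do first.
```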

The mistake is treating it like swapping a library version.

It is closer to hiring a new engineer into the same role and handing them the previous person's notes. They may be smarter. They may be faster. They may still misunderstand the job in new and exciting ways.

You onboard them. You test them. You watch where they drift.


What I want, eventually, is behavioral release notes.

Not marketing claims. Actual integration notes.

This model is more literal with tool schemas. This model is more likely to ask before acting. This model compresses long instructions aggressively after N turns. This model tends to over-plan unless explicitly told to execute. This model needs stronger stop conditions in agent loops.
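
If that ever ships, it looks more like data than press release. Entirely hypothetical values:

```python
# What behavioral release notes might look like as data. Hypothetical.
BEHAVIORAL_NOTES = {
    "tool_schemas": "more literal; stops inferring optional fields",
    "initiative": "asks before acting more often; expect extra turns",
    "long_context": "compresses instructions after many turns; restate invariants",
    "planning": "over-plans unless explicitly told to execute",
    "agent_loops": "needs stronger stop conditions than the previous version",
}
```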

Give me the weird stuff. The edge behavior. The failure modes. The things your benchmark dashboard does not show because benchmarks are designed to be legible and real harnesses are designed to be functional.

Until then, the manual is the scar tissue.

You try the new model. You run the harness. You see what breaks. You patch the prompt. You add a regression test. You write down the lesson so future-you does not rediscover it.

This is engineering. Just with a dependency that argues back in fluent English.


I still like building with LLMs.

That's probably the important part. This is not a complaint from outside the arena. I use these systems every day: agents writing packets, triaging tasks, maintaining memory, meal planning, research, code review. The leverage is real.

So is the maintenance.

You are not just maintaining code. You are maintaining behavior on top of something that keeps changing under your feet.

The model changes. The harness reacts. The prompts adapt. The tests catch what they can. The rest you learn by feel, which is a deeply uncomfortable sentence for anyone who came up through deterministic systems.

Same role, different model, different temperament.

No manual. No complete changelog. No guarantee your carefully tuned agent still behaves the way it did yesterday.

So you write the manual yourself.

Mostly in regression tests, weird notes, and scars.