Doing software wrong

The forklift and stapler

Most developers have become pretty good at using AI, some reluctantly.

We paste in an error and ask for an explanation. We ask for unit tests, a regex we will immediately regret, or a slightly less deranged version of a comment we wrote at 11:47pm while bargaining with a build pipeline.

And it works. The model helps. The developer moves on. Everyone feels faintly productive, like something meaningful has happened.

Recently, though, I have started to feel that all we have really achieved is persuading a very expensive machine to help tie our shoelaces.

The more I look, the more I see it.

Developers are not really giving AI models real work; they are giving them instructions.

Very precise, mostly small instructions. We have been trained by what are now old limits. Prompts shaped by working with models that punished you the moment you got ambitious.

Until very recently, if you gave a model something too large, too messy, or too close to reality, it would start out with a great plan. It would outline the problem with calm authority. It would list the trade-offs. It would appear to understand the system, the context, the intent, the whole little ecosystem.

Then, somewhere around step four or five, it would quietly detach from reality and float away like a lost party balloon.

As developers we adapted, because developers are strange creatures that will build an elaborate ritual around a tool even if it keeps biting them.

We have become prompt bonsai artists, carefully clipping away ambiguity until the model cannot hurt itself or anyone nearby.

Prompt engineering was useful, obviously, and sensible for the models we had, but wrong for the models we are getting.

Models like Claude Fable 5 and Sakana Fugu are not just asking us to write better prompts. They are asking us to engineer better tasks. Not "say the magic words more clearly", but "define a real piece of work, gather the materials, describe what done looks like, and set goals so the model can make judgement calls and leave a trail a human can review afterwards".
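To make that list a little more concrete, here is a minimal sketch of what an engineered task might look like if you wrote it down as data instead of as a clever sentence. Everything in it is my own illustration: the TaskBrief name, its fields, and both methods are hypothetical and do not belong to any model vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical shape for an engineered task. Not any vendor's API."""

    objective: str                                              # the real piece of work
    materials: list[str] = field(default_factory=list)          # gathered context
    definition_of_done: list[str] = field(default_factory=list)
    goals: list[str] = field(default_factory=list)              # guide judgement calls
    review_trail: list[str] = field(default_factory=list)       # what a human reviews later

    def missing_parts(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        parts = {
            "objective": self.objective.strip(),
            "materials": self.materials,
            "definition_of_done": self.definition_of_done,
            "goals": self.goals,
            "review_trail": self.review_trail,
        }
        return [name for name, value in parts.items() if not value]

    def render(self) -> str:
        """Flatten the brief into plain text that could be handed to a model."""
        sections = [
            ("Objective", [self.objective]),
            ("Materials", self.materials),
            ("Done looks like", self.definition_of_done),
            ("Goals for judgement calls", self.goals),
            ("Leave behind for review", self.review_trail),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```

The useful part is probably missing_parts rather than render. A brief that only has an objective is still a prompt, and the empty sections are the work we have not yet learned to do.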

For me that is a different muscle entirely.

And I do not think most developers have trained it yet. Not because we are lazy, or stupid, or insufficiently excited about the future of agentic development. It is because we have been trained by smaller models to think small, be explicit, hover nearby, and prompt for fragments instead of outcomes.

We have become excellent at dealing in fragments.

We are less practised at shaping work.

Prompt engineering was useful. Task engineering is different.

Don't get me wrong — clear instructions matter. Context matters. Examples matter. Telling the model what good looks like matters.

But prompt engineering, as we usually talk about it, belongs to small interactions. It belongs to fragments.

You ask. It responds. You adjust. It responds again. You both continue this little dance until either the problem is solved or you have spent 40 minutes arguing with an agent about bullet formatting and need to go outside to kick something.

Task engineering is different.

Task engineering is not about crafting the perfect prompt. It is about designing a piece of work so that a capable model can actually do it.

You are not prompting the model.

You are commissioning work.

Which feels slightly pompous, so let's immediately ruin it with a forklift.

The forklift and the stapler

Imagine someone buys a warehouse forklift.

A proper big yellow beast. Warning beeper. Rotating light. The kind of machine that gives anyone in a high-vis vest a feeling of authority. It can lift pallets, move heavy loads, rearrange a warehouse, and, if handled poorly, punch a new doorway through a wall.

Now imagine the team discovers it is useful for moving staplers.

Someone stands on one side of the office and says, "Can you bring me that stapler?"

The forklift driver nods, lowers the forks with impressive precision, slides them under the stapler, lifts it with the dignity of a royal procession, reverses slowly, beeping all the way, and delivers the stapler.

Technically, this is a success.

The stapler has moved. The forklift performed the task. No one died, which in software counts as a green build.

But after a while, you would start asking whether perhaps the team had misunderstood the forklift.

This is where a lot of AI use in development is heading. We have these increasingly capable systems, and we are using them to pass staplers.

Rename this. Summarise that. Write a helper. Convert this into YAML, the cursed lasagne of configuration formats.

There is nothing wrong with small tasks. Small tasks exist. Small tasks matter. Sometimes the stapler really does need to be moved, and if the forklift is sitting there, fine, have at it. But if every interaction is a stapler task, the forklift does not change the warehouse. It just makes office stationery more theatrical, and far less economical to share.

Fable 5 makes the forklift problem obvious. With Fable, tiny tasks are not just wasteful, they are eye-wateringly expensive.

Sakana Fugu makes the metaphor worse, because now it is not one forklift. It is a warehouse system that can decide whether the forklift, pallet jack, conveyor belt, barcode scanner, or Dave from dispatch should handle each part of the job.

Fugu is interesting because it treats orchestration as part of the intelligence.

That changes the question from "what work should I ask the model to do?" to "how do I design work for a system that can route, coordinate, verify, and combine different kinds of model effort?"

Which is a much bigger question than "please make this function nicer".

Bigger work, structured work

Fable and Fugu point at two different problems hiding inside the same upgrade.

Fable points at bigger work. What happens when the model can carry more than we know how to hand over?

The old developer habit is to stay close. Watch the output. Correct every drift. Keep the model inside a safe little paddock where it can graze on helper functions and not frighten the horses. That made sense when the model needed constant supervision, but it becomes a bottleneck when the model can work for longer.

Instead of hovering over every step, the developer needs to frame the work properly up front and review the result like an owner.

That is a different kind of discipline.

Fugu points at structured work. What happens when the model is no longer one thing doing one thing?

Serious engineering work is rarely one kind of thing.

Real work to solve real pain needs exploration, planning, code reading, implementation, testing, verification, risk assessment, documentation, and rollback thinking. A single model can attempt all of that, but an orchestrated system can treat the work as a sequence of specialised moves.

This means developers need to stop thinking only in terms of "the answer I want back" and start thinking in terms of "the work process I want the system to run".

Not just "solve this".

More like: inspect this, compare options, identify risk, attempt a solution, verify it against the constraints, tell me what remains uncertain, and produce the smallest next step that would let a human decide.
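As a rough sketch, that list can be written as an ordered process in which each step sees the notes left by the steps before it. The STAGES names come straight from the sentence above. The run_process function and the run_stage callback are hypothetical stand-ins for whatever model or orchestrator you actually use, not a description of how Fable or Fugu work internally.

```python
from typing import Callable

# The moves from the paragraph above, in the order they were listed.
STAGES = [
    "inspect",
    "compare options",
    "identify risk",
    "attempt a solution",
    "verify against constraints",
    "report what remains uncertain",
    "propose smallest next step",
]

# Hypothetical callback: (stage, task, notes so far) -> notes for this stage.
StageRunner = Callable[[str, str, list[str]], str]


def run_process(task: str, run_stage: StageRunner) -> list[tuple[str, str]]:
    """Run every stage in order and keep a trail a human can review afterwards."""
    trail: list[tuple[str, str]] = []
    for stage in STAGES:
        notes_so_far = [note for _, note in trail]
        trail.append((stage, run_stage(stage, task, notes_so_far)))
    return trail
```

The return value is the point. You get back the whole trail, not just the final answer, so the review at the end is a review of the work and not of a single confident paragraph.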

That sounds suspiciously like engineering, which is unfortunate for anyone who hoped the future would be less work.

The new work is specifying work.

A lot of developers are extremely good at implementation and much less practised at saying what the real job is.

It may feel wasteful to spend an hour or three preparing the material to engineer a task for a seriously big model. It can easily feel wrong, because the marketing version of AI productivity is always someone typing a sentence into a glowing box and receiving a miracle before their oat latte gets cold.

But serious work has never worked like that.

If you spend three hours building the right context pack and the model saves you two weeks of grinding investigation, that is not inefficiency. That is as close to a miracle as we can get.
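The arithmetic behind that claim is short enough to write out. This is a back-of-envelope sketch only: the 40-hour week is my assumption, the function names are made up, and review_hours is there because judging the result is not free either.

```python
HOURS_PER_WEEK = 40  # assumption: a standard working week


def net_hours_saved(prep_hours: float, weeks_saved: float, review_hours: float = 0.0) -> float:
    """Hours of investigation avoided, minus the hours spent preparing and reviewing."""
    return weeks_saved * HOURS_PER_WEEK - prep_hours - review_hours


def payoff_ratio(prep_hours: float, weeks_saved: float, review_hours: float = 0.0) -> float:
    """Hours avoided per hour of human effort invested in the task."""
    return (weeks_saved * HOURS_PER_WEEK) / (prep_hours + review_hours)


# Three hours of preparation against two weeks of grinding investigation.
print(net_hours_saved(3, 2))         # 77.0 hours back
print(round(payoff_ratio(3, 2), 1))  # roughly 26.7 hours avoided per hour invested
```

Even if you add five hours of careful review on top, payoff_ratio(3, 2, review_hours=5) still comes out at 10, which is a trade most teams would take.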

This is also where Fable and Fugu should change how we think about cost. If a model is expensive, slow, or powerful, then using it for cheap little tasks is not disciplined. It is like booking a surgical theatre to trim your fingernails.

The value is moving.

As the doing gets cheaper, deciding gets more expensive.

As implementation gets faster, framing matters more.

If models can chew through larger chunks of work, the scarce skill becomes knowing which chunks are worth chewing.

If you were hoping AI would remove the need for judgement, bad news. The judgement bill has arrived, and it has itemised charges.

Doing software wrong at the wrong size

So this is probably the next bit of doing software wrong.

We will get access to bigger and bigger models. We will keep asking tiny questions. We will spend premium reasoning tokens on boilerplate, summaries, and functions our IDE could have written while asleep. We will complain that the results feel underwhelming.

Then someone will hand the model a properly scoped, painful, valuable job with enough data and a clear definition of done, and suddenly it will look like witchcraft.

It will not be witchcraft.

It will be preparation, which is less exciting but much more useful.

The future is not developers sitting back while models do everything. The future is developers learning to see larger work. The work that was too big to assign, too boring to prioritise, and too painful to do manually.

That is the real shift. Not prompt engineering. Not magic. Not job apocalypse. Task engineering.

The ability to see the bigger job, package it properly, give it to the model, walk away long enough for it to work, and come back ready to judge what it did.

We will still get it wrong. The model will still get it wrong.

That is software.

But maybe we can start getting it wrong at the right size.

Because if the models are getting bigger and our questions stay tiny, the failure is no longer entirely in the machine.

The question was too small.

And as usual, annoyingly, expensively, we are doing software wrong.