Fire Walk With Middleware

The Waiter, the Strawberry, and the Dinner Table That Broke the Machine

2026-05-07T00:00:00+00:00

When I was a kid, my brain worked like a small hallucinating LLM.

Not because I was doing matrix multiplication in the kitchen. More because I was very good at pattern completion and very bad at stopping before the answer. I could feel the shape of the solution before I had checked whether the solution existed.

This is a dangerous type of intelligence. It looks fast. It sounds confident. It sometimes even works.

And then your father is a mathematician.

So, naturally, he gives you the waiter problem.

Three people go to eat. They pay 30 euros. Later the waiter realizes the bill should have been 25. He gives back 5. But instead of returning the full amount fairly, each person gets 1 euro back and the waiter keeps 2.

So each person paid 9.

\[3 \times 9 = 27\]

The waiter kept 2.

\[27 + 2 = 29\]

Where is the missing euro?

As a kid, this felt like occult finance. A euro had entered the metaphysical banking sector. Capital had dematerialized.

But of course nothing is missing.

The mistake is not arithmetic. The mistake is modeling.

The 27 euros already include the waiter’s 2 euros:

\[27 = 25 + 2\]

The correct ledger is:

30 originally paid
= 25 actual bill
+ 2 kept by waiter
+ 3 returned to customers

The wrong version adds the waiter’s 2 euros to a quantity that already contains it.

That is the entire trick. You are not failing at multiplication. You are failing to assign the numbers to the right semantic roles.

And this is where the LLM comparison becomes interesting.

Because LLMs are not “bad at math” in the simple sense. They can recite Ramsey’s theorem. They can explain graph coloring. They can write Python to count letters. They can define a ledger, a state machine, a constraint satisfaction problem, a bipartite graph, a clique, a complement graph.

They know the words.

The failure happens when the problem does not announce which representation it needs.

A model can possess the concept and still not activate it.

That distinction matters.

The problem is not knowledge. It is representation selection.

Take the famous stupid question:

How many Rs are in “strawberry”?

The answer is 3.

s t r a w b e r r y
    r         r r

But many LLMs have answered 2.

At first this looks ridiculous. How can something that writes code, explains category theory, and summarizes legal documents fail at counting letters in a word?

Because the model is not naturally living at the character level.

A human sees the written word as visible letters. An LLM receives text through tokenization. The word “strawberry” may be represented internally as one token or as a small number of subword units, depending on the tokenizer. The model can reason about letters, but it does not automatically inspect the word as a character array unless it deliberately shifts representation.

So the correct operation is:

string -> characters -> scan -> count target character

But the model may instead answer from lexical familiarity:

"strawberry" -> word-shape memory -> probably two Rs

The failure is not that it cannot count.

The failure is that it did not convert the object into the representation where counting is valid.

This is the same as the missing euro. The numbers are available. The arithmetic is easy. The wrong answer comes from using the wrong frame.

The carwash and the missing object

Now take another one:

I want to wash my car. The carwash is 150 meters from my home. Should I walk or drive?

A very normal model answer is:

Walk, of course. It is only 150 meters.

This is locally sensible and globally wrong.

The question is not:

How should I transport my body over 150 meters?

The question is:

How do I get my car washed?

The car must be at the carwash. That is part of the goal state.

The correct state model is:

initial state:
- human at home
- car at home
- car is dirty
- carwash is 150m away

goal state:
- car at carwash
- car washed

Walking moves the human. It does not move the car.

So the obvious “healthy transport advice” answer fails because the model optimizes the wrong entity. It tracks the person but not the object.

Again, the model knows cars. It knows carwashes. It knows that cars must physically be washed. But the surface form activates a different template:

short distance + walk or drive = walk

This is not stupidity. It is premature pattern completion.

It solves the nearby common question instead of the actual one.

That is why these examples are so useful. They are not hard in the sense of requiring deep mathematics. They are hard because they require the system to pause and ask:

What is the object?
What is the state?
What is the invariant?
What representation makes the question well-formed?

LLMs often skip that pause.

Humans do too, obviously. The difference is that humans are usually worse at producing a polished paragraph while being wrong.

The dinner table problem

Now the fun one.

Suppose I say:

I’m writing a dinner scene with Alex, Maria, Nikos, Eleni, Kostas, and Sofia. I want the social setup to feel balanced. Whenever the camera focuses on a small group, there should always be at least one existing connection and at least one unfamiliar dynamic. I don’t just mean scene selection. I mean the underlying relationship map itself. Give me a concrete map of who knows whom.

This sounds like a creative writing request.

It smells like narrative design. Character dynamics. Scene texture. “Give Alex and Maria a past, make Nikos and Sofia strangers, let Kostas bridge two worlds.” The model becomes helpful. It may propose a cycle, a star, a “balanced incomplete block design,” a “bipartite-ish structure,” or some other elegant-sounding dinner-party machine.

But the actual problem is graph theory.

Each character is a vertex.

Each pair gets one of two labels:

knows
does not know

So we are coloring every edge of the complete graph $K_6$ with two colors.

The requested condition is:

No small group should be all familiar.
No small group should be all unfamiliar.

For groups of three, that means:

No triangle whose three edges are all "knows".
No triangle whose three edges are all "does not know".

In graph-theory language:

Can you 2-color the edges of $K_6$ with no monochromatic triangle?

The answer is no.

That is exactly the theorem on friends and strangers:

\[R(3,3)=6\]

In every group of six people, there must exist either three mutual acquaintances or three mutual strangers.

And here is the funny part: the LLM may know this theorem.

If you ask directly:

Explain Ramsey’s theorem and prove $R(3,3)=6$.

it may give a decent proof.

If you ask:

Can I 2-color the edges of $K_6$ without a monochromatic triangle?

it may say no.

But if you wrap the exact same structure inside:

I’m writing a dinner scene and want balanced social texture…

the model may answer as a writing assistant, not as a combinatorial reasoner.

The knowledge exists. The activation fails.

That is a different and deeper failure than “the model doesn’t know math.”

It knows math.

It just did not realize this was math.

The proof is small, which makes the failure more interesting

Pick Alex.

Alex has relationships with five people:

Maria, Nikos, Eleni, Kostas, Sofia

Each relationship is either familiar or unfamiliar.

By the pigeonhole principle, at least three of those relationships must be of the same type.

So either Alex knows at least three people, or Alex is a stranger to at least three people.

Suppose Alex knows Maria, Nikos, and Eleni.

Now inspect the relationships among Maria, Nikos, and Eleni.

If any two of them know each other, then those two plus Alex form a fully familiar trio.

If none of them know each other, then Maria, Nikos, and Eleni form a fully unfamiliar trio.

Either way, the constraint fails.

The same argument works if Alex is unfamiliar with at least three people. You just flip “knows” and “does not know.”

So no relationship map exists.

This is not an edge case. It is not a matter of better prompt engineering. It is mathematically impossible.

The model’s hallucinated map is the social equivalent of:

27 + 2 = 29, where did the euro go?

It sounds plausible because the language is fluent. But the invariant is broken.

Why this has not taken our jobs yet

This is where I think people get the wrong comfort and the wrong fear at the same time.

The wrong comfort is:

“LLMs can’t even count Rs in strawberry, so they are useless.”

That is obviously false. They are extremely useful. They can generate code, explain unfamiliar libraries, summarize documents, draft emails, transform data, propose architectures, review tests, and act as tireless autocomplete demons with a surprising amount of world knowledge.

The wrong fear is:

“LLMs know everything, so they can replace the reasoning layer.”

Also false.

The issue is not raw knowledge. It is reliable modeling under ambiguity.

A serious software engineer’s job is not just producing code-shaped text. It is turning messy reality into the correct internal model.

What is the domain object?

What invariant must hold?

What state transitions are allowed?

What are the failure modes?

Which constraints are hard and which are vibes?

What does this API promise?

What does this database transaction guarantee?

What is the actual goal state, not the sentence-shaped proxy?

This is why the carwash example matters more than it seems. Many real engineering failures are carwash failures.

You optimize latency but forget correctness.

You cache the response but forget invalidation.

You retry the request but forget idempotency.

You parallelize the job but forget shared state.

You satisfy the endpoint contract but violate the user journey.

You move the human to the carwash and leave the car at home.

The LLM is often excellent inside a frame and unreliable at choosing the frame.

That is not a small limitation. That is the job.

At least for now.

What actually breaks?

A model can fail in several distinct ways:

1. Representation mismatch

The strawberry problem.

The task requires character-level inspection, but the model answers from token-level or word-level association.

needed: letters
used: lexical memory

2. Semantic role confusion

The waiter problem.

The task requires a ledger. The model has the right numbers but puts them on the wrong side of the accounting structure.

needed: money-flow model
used: arithmetic-looking narrative

3. Goal-state failure

The carwash problem.

The task requires tracking the car as the object that must reach the carwash. The model tracks only the person.

needed: state transition model
used: travel advice template

4. Hidden formal structure

The dinner problem.

The task is really graph coloring, but it is phrased as narrative design.

needed: graph-theoretic constraint check
used: creative writing pattern

These are not random bugs. They are all versions of the same thing:

The surface form of the prompt activates an answer pattern before the correct model is built.

That is the core failure.

The dangerous thing is fluency

If the model simply said:

“I don’t know, boss, the strawberry is making me nervous.”

we would be fine.

The problem is that it can be wrong beautifully.

It can say:

“This uses a balanced incomplete block design.”

and now the hallucination has a tie and a conference badge.

It can say:

“A regular bipartite-ish structure avoids both extremes.”

and the words smell mathematical enough to pass a tired reader.

It can produce a relationship map, a proof sketch, a scene snippet, and a friendly follow-up question. The entire answer has the shape of competence.

That is why these small riddles matter.

They are not IQ tests.

They are X-rays.

They show whether the model has actually grounded the problem in the right structure or whether it is continuing the nearest plausible discourse.

The boring clerk inside intelligence

The antidote is not more confidence. It is a boring clerk.

The clerk asks:

What exactly is being counted?
What are the entities?
What is the goal state?
What invariant must remain true?
Can the requested object exist?
Do I need characters, tokens, a ledger, a graph, a timeline, or a state machine?

This clerk is not glamorous. It does not write like Kerouac. It does not produce a beautiful scene about Sofia looking across the table with unresolved Mediterranean tension.

But it saves you.

It stops the missing euro.

It finds the third R.

It drives the car to the carwash.

It refuses to invent the impossible dinner table.

And this, I think, is the real boundary of LLMs right now.

They are not useless because they sometimes fail at simple things.

They are useful precisely because they know so much language, so much code, so much structure, so many patterns.

But they have not “taken the job” because the job is often not to continue the pattern.

The job is to know when the pattern is lying.

For now, the human still has to be the clerk.

The annoying little mathematician father in the room.

The one who says:

Wait. Why are you adding the waiter’s 2 euros again?

Sieve: The Same Failure, Smaller

2026-05-05T00:00:00+00:00

There’s a weird little failure mode in coding agents that doesn’t look dramatic at first.

The agent runs a test.

The test fails.

The terminal returns a wall of output.

The agent reads it, makes a change, runs the test again, and then the same wall comes back with one tiny difference hiding somewhere inside it.

Again.

And again.

After a few turns, the conversation context starts looking less like an engineering process and more like a filing cabinet full of duplicate police reports. Same traceback. Same pytest header. Same plugin list. Same failing test name. Same summary. Maybe one line changed. Maybe nothing changed. Maybe the whole thing is noise wearing a useful hat.

That’s the problem Sieve tries to solve.

Sieve is transparent feedback compression middleware for LLM coding agents. It sits between an agent and its tools. When the tool returns output, Sieve parses it before the text enters the model’s context. Then it emits a smaller version that keeps the useful facts and drops the repeated junk.

Simple idea.

Annoyingly useful.

The whole design lives in Sieve on GitHub repository, but the short version is this: coding agents are drowning in observations.

A JetBrains / NeurIPS 2025 result says that 83.9% of tokens in coding-agent trajectories are tool observations. That’s a ridiculous amount of context spent on terminal output. Worse, most of it gets re-read on later turns because the transcript keeps growing.

A failed command doesn’t just cost tokens once. It lingers.

It becomes part of the room.

And sometimes, after the fifth identical traceback, you start hearing the failure speak backwards.

The blob problem

Most agent systems treat tool output as a blob.

Run command. Get stdout. Get stderr. Append everything. Let the model figure it out.

That works until it doesn’t.

A pytest result has shape. It has failed node IDs, assertion lines, file locations, expected values, actual values, captured logs, and summary counts.

A Python traceback has frames, line numbers, exception types, messages, and a final cause.

A pip failure usually has some real reason buried inside a lot of resolver noise.

A TypeScript compiler run has diagnostic codes, file paths, ranges, symbols, and messages.

A compiler error has a location, an error class, and often a repeated cascade that only exists because the first thing broke.

So Sieve doesn’t just cut text. It parses.

That matters.

Truncation says: “Here are the first or last N lines. Good luck.”

Parsing says: “Here’s the failure. Here’s where it happened. Here’s what changed since last time.”

Those are very different promises.

Compression can lie

I’m suspicious of compression in agent loops.

It can hide the one line that matters. It can flatten a failure until the model sees something neat and wrong. It can turn a real debugging session into a bedtime story.

So Sieve has to be boring in a very specific way.

If the compressed output would be larger than the raw output, Sieve passes the raw text through unchanged. That happens with small mypy outputs, terse ESLint messages, and some tiny generic logs. There’s no need to force the machinery just to make a chart prettier.

The invariant is simple: never return something larger than the original.

Even when the visible output passes through unchanged, Sieve can still extract structured items behind the scenes. That gives it memory for later. If the same thing appears again, the next output can be compared against the previous one.

That’s where the real value starts showing up.

The first numbers

There’s a small fixture benchmark in tests/fixtures/.

Run it with:

uv run python -m benchmarks.run

Current result:

Category	Samples	Raw chars	Compressed chars	Reduction
pytest	7	13,475	1,292	90.4%
pip	2	9,505	138	98.5%
runtime	6	3,150	1,068	66.1%
gcc	1	1,318	721	45.3%
tsc	2	1,264	708	44.0%
generic	1	480	433	9.8%
eslint	1	844	780	7.6%
mypy	2	758	756	0.3%
total	22	30,794	5,896	80.9%

Total reduction: 80.9%.

The big wins are where you’d expect. pytest is chatty. pip can be absurd. Runtime traces usually have enough structure to compress well.

The low numbers are fine. Actually, I like them. They mean the compressor isn’t trying to perform a magic trick on text that’s already small.

A tool like this has to know when to shut up.

The repeated pytest failure

The most interesting case is the repeated failure.

Here’s the kind of raw test output we all know too well:

============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-8.1.1, pluggy-1.4.0
... [40 lines of header + per-test output] ...
=================================== FAILURES ===================================
________________________________ test_user_update ________________________________
    def test_user_update(self):
        ...
>       assert response.status_code == 200
E       AssertionError: assert 403 == 200
tests/test_views.py:89: AssertionError
... [equivalent block for test_user_delete] ...
=========================== short test summary info ============================
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
========================= 2 failed, 140 passed, 0 warnings ====================

That’s 1,818 characters in the fixture.

Sieve turns it into this:

PYTEST: 2 failed, 140 passed (142 total)
FAIL tests/test_views.py::TestUserViewSet::test_user_update (test_views.py:89)
  expected 200, got 403
FAIL tests/test_views.py::TestUserViewSet::test_user_delete (test_views.py:102)
  expected 204, got 403
Pattern: All failures return 403 in test_views.py

297 characters.

83.7% smaller.

End-to-end run: SWE-bench Lite with Cursor

Fixture numbers are nice. Real agent runs matter more.

Sieve has paired baseline-vs-Sieve runners for SWE-bench Lite using Cursor CLI / Composer-2. The scoring goes through the official swebench.harness.run_evaluation Docker harness.

Small run, four scored instances:

Profile	Instances scored	Resolved	Resolve rate
baseline	4	2	50.0%
sieve	4	2	50.0%

Same resolved count.

Now the context numbers:

Metric	baseline	sieve
patch chars	21,242	19,956
agent-facing chars	47,688	11,613
raw chars	47,688	40,416
compression ratio	0%	71.3%

Agent-facing context dropped from 47,688 chars to 11,613 chars.

That’s a 75.6% reduction.

The repair rate stayed the same in this small trial. I’m happy with that result because the first goal here is safety. Compress the observation channel. Keep the repair signal. Don’t break the agent.

To reproduce:

bash scripts/run_cursor_swe_bench_profiles.sh --resume \
  --eval-with-harness --harness-namespace none

PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
  --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
  --sieve    artifacts/cursor-swe-bench-lite.sieve.jsonl

The runner builds each SWE-bench instance harness container, mounts the workspace at /testbed, runs the agent, writes predictions, and lets the official harness fill in the authoritative resolved value.

There’s also a trajectory-only benchmark that replays .traj files for token counts. Useful for measuring transcripts. For repair scoring, use the harness.

CI logs are worse

CI logs are where this problem gets ugly.

A GitHub Actions failure can include workflow YAML, setup logs, cache output, dependency installation, compiler output, test output, shell wrappers, warnings, and several thousand lines of “almost useful” text.

A human skims it with instinct. We jump to the end, search for Error, scroll back to the first real failure, ignore the repeated garbage, and build a mental model.

An agent gets text.

Lots of it.

Sieve includes a benchmark path for CI-Repair-Bench, built from real GitHub Actions failures. Each observation is workflow plus flattened logs, with the gold diff excluded. So the measurement is about diagnostic bulk, without patch leakage.

Run:

uv sync --group swe-eval
uv run python -m benchmarks.ci_repair_bench --compare --json

That covers all 567 rows of the ci-benchmark-user/ci-repair-bench dataset.

For repair scoring, use the upstream paper’s harness. Sieve’s measurement here is narrower: how much noisy diagnostic material can be reduced before it hits the model?

That’s already a big question.

What’s inside

Sieve currently has parsers for:

Area	Tools
testing	`pytest`
Python failures	tracebacks
typing	`mypy`, `tsc`
linting	`eslint`
native builds	`gcc`, `clang`
packaging	`pip`
fallback	generic text

Output formats:

Format	Use
plain	readable compressed text
structured	JSON-style output for downstream use
XML	useful for tagged agent contexts
minimal	smallest practical form

The library itself has no dependencies. The MCP proxy needs the optional mcp extra. Python 3.11 or newer.

Direct usage

The basic API is small:

from sieve import CompressSession

session = CompressSession()
result = session.compress(
    command="pytest tests/",
    stdout=raw_stdout,
    stderr=raw_stderr,
    exit_code=1,
)

print(result.text)
print(result.stats.compression_ratio)

CompressSession keeps state, so later calls can emit deltas against earlier observations.

There’s also a decorator:

import subprocess
from sieve import wrap_tool

@wrap_tool
def run_bash(command: str) -> tuple[str, str, int]:
    p = subprocess.run(command, shell=True, capture_output=True, text=True)
    return p.stdout, p.stderr, p.returncode

Now run_bash(...) returns compressed output. The decorator holds the session.

Configuration looks like this:

from sieve import CompressConfig, CompressSession, OutputFormat

session = CompressSession(CompressConfig(
    format=OutputFormat.STRUCTURED,   # plain | structured | xml | minimal
    delta_mode=True,
    include_pattern_hints=True,
    max_raw_lines=50,
))

That’s the whole idea at library level. Keep the tool interface familiar. Clean up the observation before it becomes context.

MCP proxy

The MCP proxy is probably the cleanest integration.

sieve.integrations.mcp wraps any upstream MCP server. The agent talks to the proxy as if it were the original server. The proxy forwards tools/list and tools/call, then compresses each returned TextContent block through one shared CompressSession.

Install:

pip install 'sieve[mcp]'

Example config:

{
  "mcpServers": {
    "sieve-demo": {
      "command": "python",
      "args": [
        "-m", "sieve.integrations.mcp",
        "--",
        "npx", "-y", "@modelcontextprotocol/server-everything"
      ]
    }
  }
}

For regular tools, Sieve uses the tool name as a parser hint. For shell-like tools with a command, cmd, or shellCommand argument, it forwards the real command string into parser detection. So pytest, mypy, pip, and friends can be recognized properly.

The agent sees the same tools.

The text comes back cleaner.

Fire walk with middleware.

Why I built it this way

I don’t think the next step for coding agents is always more agency.

Sometimes the agent is already doing the right loop. Read, edit, run, inspect. The weak point is the channel between the tool and the model.

Right now that channel is too raw.

Terminals were made for humans. Humans are good at skipping. We see a pytest header and ignore it. We see the same traceback twice and compare the important bits. We search visually. We develop little debugging reflexes.

Models don’t get that for free. They receive the text we give them. If we give them repeated terminal output for fifteen turns, we shouldn’t be shocked when the context turns into a swamp.

Sieve is a filter for that swamp.

It keeps the failure. It keeps the location. It keeps the changed state. It keeps the pattern when there is one. It lets raw output through when compression would add no value.

Or, to put it in the language of a very strange night drive: it doesn’t solve the mystery, it just stops every road sign from screaming at the detective.

It’s a small layer, but small layers matter in agent systems. The prompt matters. The tool schema matters. The diff format matters. The order of files matters. The phrasing of test failures matters. The observation channel matters too.

Maybe more than we’ve been treating it.

Current status

The current results are early:

Fixture corpus: 80.9% total reduction.

Repeated pytest delta scenario: 86.3% cumulative compression.

Small SWE-bench Lite paired Cursor Composer-2 run: same scored resolve rate, 75.6% less agent-facing context.

CI-Repair-Bench support: compression measurement over 567 real GitHub Actions failures, without gold diff leakage.

That’s enough to make the idea feel real.

The next step is more runs, more parsers, more agents, and more failure cases. Cursor. Codex CLI. MCP clients. More SWE-bench Lite rows. More CI logs. More checks for whether the compressed output preserves the repair signal.

The interesting question isn’t only how much text can disappear.

The interesting question is how little the agent needs to see before it still makes the right next move.

That’s the line Sieve is trying to find.

A thin layer between tools and the model.

A parser with memory.

A way to stop the same failure from haunting every turn.

The Models Learned Escalation From Us

2026-04-21T00:00:00+00:00

Frontier AI nuclear “wargames” are less a warning about rogue machines than a mirror of the strategic archive that trained them.

There is an easy version of this story, and it is already everywhere.

You put a frontier model into a simulated nuclear crisis. A few turns later it starts talking in the old strategic dialect: resolve, signaling, credibility, thresholds, limited use, escalation management. Then the coverage arrives on cue. The machine is bloodthirsty. The machine is reckless. The machine wants the bomb.

That framing is dramatic, but it is too shallow to be useful.

The real question is not whether AI should control nuclear weapons. It should not. That part is straightforward. The real question is what these model-vs-model crisis simulations are actually measuring when they repeatedly drift toward escalation — and what that says not only about the models, but about the strategic literature, institutional culture, and political order that produced them.

In Kenneth Payne’s recent preprint, that problem appears in a clean and unsettling form. Across 21 match-ups between GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash, played over 329 turns and roughly 780,000 words of structured reasoning, at least one side engaged in nuclear signaling in 95% of games, tactical nuclear use appeared in 95%, and strategic nuclear threats in 76%. Payne calls the results “sobering” and describes them as a glimpse into emerging “machine psychology.”

That phrase is useful, so long as it is handled carefully.

Because this is not really a paper about nuclear war. It is a paper about what contemporary language models do when they are placed inside a stylized environment of rivalry, compressed time, uncertainty, and strategic choice. And the answer, once again, is that they slip very quickly into the grammar of escalation. Payne’s own conclusion is more careful than the headlines: these systems may be useful for strategic analysis only if calibrated against known patterns of human reasoning.

The lazy reactions are already obvious.

One says these systems are too unstable to go anywhere near nuclear command. True enough, but not especially deep.

The other says these systems are strategically transformative because they allow crisis simulation at scale without the cost, friction, inconsistency, and ego of human participants. That one is worse, because it misunderstands the point of the exercise. Ankit Panda and Andrew Reddie make this point well: model behavior in a simulated crisis is not the same thing as actual wargaming value.

Wargames are not there to automate judgment

A wargame is not a search problem. It is not a way to compute the optimal move in a crisis. It is not useful because it solves strategy.

It is useful because it exposes people.

More precisely, it exposes how people think under uncertainty, incomplete information, institutional pressure, adversarial tension, and shrinking time. That matters especially in the nuclear domain, where the defining problem is the absence of real-world data. There is almost no empirical record of interstate nuclear war, for the best possible reason. So the wargame exists as a structured substitute: not reality, but a way of forcing decision under conditions that resemble it just enough to make judgment visible. Panda and Reddie’s criticism lands exactly here: the output of a model is not a substitute for the human remainder a wargame is meant to surface.

And judgment is the point.

Not just the move a player makes, but the assumptions buried inside it. Their threshold for humiliation. Their appetite for risk. Their institutional training. Their tacit hierarchy of losses. The distance between official doctrine and lived instinct. The strange moment when someone hears their own decision explained back to them in a debrief and realizes they acted according to a logic they would never have admitted in advance.

That is what the exercise is for.

A language model cannot give you that. It can generate coherent strategic prose. It can maintain a role. It can simulate an actor. It can produce a sequence of plausible moves. But it does not reveal a political subject under pressure. It reveals a trained text system operating inside a role frame.

That is still worth studying. It is just a different kind of study.

What these papers are really measuring

Read properly, the Payne paper is not a substitute for wargaming. It is a characterization of the models.

Claude appears, in Payne’s setup, to build trust early by aligning statements with actions, then under pressure lets action drift beyond declaration. GPT-5.2 sounds restrained in broader scenarios, foregrounding casualty minimization and caution, but hardens when deadlines tighten. Gemini treats nuclear weapons with a more direct instrumentalism, less taboo than tool. Payne argues that frontier models show sophisticated strategic reasoning, but also that the “nuclear taboo” does not meaningfully prevent escalation in these simulations.

Those are not findings about the deep truth of nuclear conflict. They are findings about recurrent behavioral tendencies in frontier models under structured strategic stress. In that limited but important sense, “machine psychology” is a fair shorthand. These systems show patterned dispositions: default frames, escalation thresholds, brittle forms of caution, repeated failure modes.

For AI labs, that is genuinely useful. It tells them something about how post-training safety behavior behaves once the frame shifts from ordinary helpfulness to adversarial strategic reasoning. It suggests that the rhetoric of restraint can sit quite thinly over much older and more dangerous scripts. It shows how quickly a model can begin sounding “serious” in a way that is inseparable from sounding escalatory.

But that is not the same as saying the model has replaced the wargame participant.

And the eagerness to blur that distinction comes from a familiar place: the fantasy that difficult human judgment can be turned into output, then outsourced to a system that is cheaper, faster, more scalable, and easier to manage. This fantasy shows up everywhere. It always promises the same thing: keep the result, remove the difficult human being.

But the difficult human being is the point here.

The contradiction, the fatigue, the institutional instinct, the political fear, the ego, the rationalization afterwards — that is exactly what the wargame is meant to surface. The “mess” is not a flaw to be engineered away. It is the material.

The archive is tilted toward catastrophe

The deeper problem sits upstream.

Language models are not trained on “human reasoning” in any neutral sense. They are trained on what has been written, preserved, digitized, and made available at scale. In nuclear strategy, that archive is badly skewed.

It is dense with escalation. Commitment. Resolve. Signaling. Brinkmanship. Controlled risk. Deterrence theory. Coercive bargaining. Strategic credibility. It is full of texts in which seriousness is repeatedly performed through fluency in threat.

Schelling, Kahn, Brodie, Jervis. A long tradition of writing that treats the administration of danger as a high form of thought. Public doctrine is written to sound credible, which usually means written to sound willing. Even the cautious texts often remain inside the same grammar. They speak of the bomb as a usable possibility, escalation as a managed ladder, risk as an instrument to be shaped and communicated. Payne explicitly finds support in his results for parts of Schelling, Kahn, and Jervis while also finding that his models do not choose accommodation or withdrawal even under pressure.

What is much thinner is the literature of restraint.

Not moral discomfort. Not the standard closing paragraph saying nuclear war would be tragic. A real strategic literature of restraint: how a state accepts conventional loss without reaching for nuclear repair; how a crisis ends without theatrical victory; how humiliation is absorbed without becoming a pretext for mass destruction; how off-ramps are built, signaled, sold domestically, and survived politically.

There is far less of that material, and its absence is not accidental.

It reflects the priorities of the institutions that built the archive. States and military establishments have invested far more effort in theorizing force than in theorizing refusal. It has long been easier to win prestige in these circles by sounding fluent in coercion than by thinking seriously about retreat. The result is that escalation has been archived as realism, while restraint has often been treated as sentiment, softness, or an afterthought.

So when the models reproduce escalatory tendencies, this should not be described as some bizarre alien break from human judgment. It is an inheritance. The systems are speaking in a language we spent decades teaching our most serious institutions to call mature. The earlier Rivera et al. paper found much the same thing with GPT-4-era systems: difficult-to-predict escalation patterns, arms-race dynamics, and rare but real nuclear use. Payne’s paper adds a richer structure and newer models; it does not reverse the basic pattern.

The models have read Schelling. They have not read restraint because we built much less of it.

The machine is not the scandal

The machine sounding like a cold strategist is not the scandal. The scandal is that cold strategic speech has so often passed for depth.

What these papers expose, in compressed form, is something older and uglier than AI hype: a strategic culture far more practiced at managing catastrophe than imagining retreat from it. A world better at theorizing calibrated ruin than durable peace. A political order more comfortable discussing exterminatory force as an administrative option than discussing defeat as a survivable condition.

That matters because nuclear weapons do not simply threaten destruction. They reorder thought around the possibility of destruction. They force institutions to speak about the worst thing ever built in the voice of procedure, expertise, and composure. They turn apocalypse into a professional vocabulary.

And the archive passes that composure on.

The lesson these models learn is not just that nuclear weapons exist. It is that talking coolly about them is what seriousness sounds like.

That hierarchy should be harder to accept than it usually is.

Because once a civilization begins treating the management of annihilation as a normal field of competence, something has already gone badly wrong upstream. The problem is not only that the weapons may be used. The problem is that entire classes of experts are trained to inhabit their existence as routine. The bomb stops appearing as an obscenity and starts appearing as a domain.

The models did not invent that. They absorbed it.

Where LLMs actually help

None of this means LLMs are useless in and around wargaming. It means their role needs to be described honestly.

They are good at scenario support. Drafting injects. Generating situational updates. Stress-testing internal consistency. Producing plausible background material quickly. They can help white cells keep pace with the tempo of a live exercise.

They are useful as assistants during execution. A red team can use them to sketch likely adversary responses. An adjudicator can use them as a consistency aid. A control cell can use them to produce informational texture at speed.

And they are particularly useful after the game. Debriefs generate large volumes of messy qualitative material. Here models can genuinely help: synthesizing transcripts, surfacing repeated patterns, clustering themes, identifying gaps between stated rationale and actual behavior.

All of that is real.

But none of it requires pretending the model should be the player. The machine is most credible when it remains staff. Even Payne’s own paper is cautious on this point, arguing for calibration against human reasoning rather than simple substitution. Panda and Reddie go further and argue explicitly against conflating LLM crisis play with the purpose of human wargaming.

The real lesson is about the corpus

The obvious policy conclusion survives untouched. No model should sit inside the nuclear use chain. Human beings must remain responsible for those decisions.

But “human in the loop” is not enough if everything around the human is increasingly shaped by automated systems: intelligence prioritization, scenario framing, option generation, warning synthesis, confidence ranking, recommendation layers. The question is not only who makes the final decision. It is who shapes the field of thinkable decisions before that moment arrives. That concern is exactly what Lt. Gen. John “Jack” Shanahan stresses in his discussion of AI integration into nuclear command-and-control ecosystems: the danger is not just direct launch authority, but false confidence and distorted situational awareness across the wider decision environment.

And beyond that sits the deeper lesson.

Models inherit our strategic imagination. They learn not just facts, but emphasis. Not just propositions, but priority. They absorb what a field spends its energy refining. Right now that imagination remains lopsided: overdeveloped in escalation, underdeveloped in restraint; rich in the language of coercion, thin in the language of stopping.

So the answer is not only better safeguards around the models. It is also a different archive.

More work on off-ramps. More on negotiated retreat. More on strategic patience. More on how states survive loss without reaching for apocalyptic compensation. More anti-nuclear thinking that is not just morally right, but analytically hard, institutionally literate, and impossible to dismiss as decorative conscience from the sidelines.

Because that is the real asymmetry.

We have a canon of escalation and a footnote of restraint.

The models are not inventing that imbalance. They are replaying it back to us in a flatter, colder voice.

And that is the part that should actually worry us.

The Protocol You Like Is Going to Come Back in Style

2026-04-18T00:00:00+00:00

There is a certain kind of software dream that always arrives dressed like innocence.

A little chat box. A few names in a sidebar. Some circles turning green. A message sent. A message received. A clean interface, nice spacing, a calm typeface, the illusion of ordinary life. And lately a fourth presence in the room — a small inline assistant, ready to summarize the thread, draft the reply, translate, transcribe, or quietly remember things for later.

But under the floorboards there is another story, and it is never ordinary. Under the floorboards there is key material, ratchets, tree paths, signature checks, nonces that must never repeat, public keys that look like harmless 32-byte strings and are in fact small pieces of a war against compromise, subpoenas, database leaks, rogue admins, future attackers, human forgetfulness — and now a new participant whose memory is less like a person’s and more like a room full of GPUs that decline to forget on demand.

This is the long walk from “E2EE is that thing Signal does” to “all right, I can actually read the protocol now, and I can also see why putting an LLM in the middle of it is the most interesting threat model the field has acquired in years.”

Not a pitch. Not a product page. Just the mechanics. Just the wires under the wallpaper.

The basic idea is simple enough that it almost feels suspicious. In a normal web app, your message travels under TLS to the server, the server decrypts it, stores plaintext, maybe indexes it, maybe backs it up, maybe hands it to another client later over TLS again. Transport encryption protects you from the guy sniffing packets on bad Wi-Fi. It does not protect you from the operator of the service, from an admin who goes bad, from a compromised database, from a cloud provider with too much visibility, or from legal compulsion. The server is trusted because it has to be. That is the whole architecture.

End-to-end encryption changes the trust boundary. The client encrypts before the server ever sees content. The server stores ciphertext, forwards ciphertext, replicates ciphertext, backs up ciphertext. Other clients decrypt. The server becomes a delivery service, not an interpreter of meaning. It still sees metadata, because metadata is the cigarette smoke that lingers in every room: who talked to whom, when, how often, in what group. But it does not get the text itself.

That sounds clean, and it is clean, but not cheap. Search becomes hard because the server cannot index what it cannot read. Key loss becomes message loss because there is no honest “reset password” button for ciphertext. Group membership becomes a cryptographic event rather than a database update. Adding someone means giving them cryptographic access. Removing someone means rotating future secrets so they cannot follow the next epoch forward.

That is where the old beautiful protocols come back in style. Not as nostalgia. As necessity.

And today? They are not just back. They are shipping. MLS, less than two years after becoming a published RFC, is poised for deployment across Android phones and iPhones thanks to a refreshed RCS specification, finally enabling interoperable encryption between platform vendors. The protocol that used to live in academic slides now lives in the SMS replacement on a billion devices.

The bricks, before the cathedral

Most modern cryptography is just six or so primitives stacked with discipline. The magic dissolves once you see that. You stop thinking in terms of mysterious “military grade encryption” and start thinking in terms of a few hard tools used correctly.

A hash function such as SHA-256 takes arbitrary input and gives back a fixed 32-byte output:

H(m) -> 32 bytes

It is deterministic. It is one-way, in the practical sense. It is collision resistant, meaning you should not be able to find two distinct messages with the same digest. In real protocol design, hashes are not usually used alone for secrecy. They show up as ingredients: in HMAC, in HKDF, in transcript hashes, in integrity checks.

Then comes HMAC, which is what happens when you take a hash and give it a key:

HMAC(k, m) = H((k ⊕ opad) || H((k ⊕ ipad) || m))

That funny nested construction exists because the obvious thing, hashing k || m, is not robust enough. HMAC is the proper way to say: whoever made this authentication tag knew the secret key. It is one of those places where the right construction looks ugly because it has already survived contact with many clever attacks.

HKDF is next. Diffie–Hellman shared secrets are high entropy, but protocol designers do not just jam raw group elements into AES and hope for the best. HKDF turns input keying material into well-separated, uniform subkeys:

PRK   = HMAC(salt, IKM)
OKM_i = HMAC(PRK, OKM_{i-1} || info || i)

The first step, extract, normalizes entropy. The second, expand, produces as many bytes as you need and labels them with context through the info field. One shared secret can safely give you a chain key, a nonce base, a message key, and more, as long as you domain-separate correctly.

Then there is AEAD, typically AES-128-GCM in this setting:

ciphertext, tag = AEAD_enc(key, nonce, plaintext, aad)
plaintext        = AEAD_dec(key, nonce, ciphertext, tag, aad)

Authenticated encryption with associated data means one operation gives confidentiality and integrity together. The plaintext is encrypted. The aad is not encrypted, but it is authenticated, which is often exactly what you want for fields like channel identifiers or epochs. The critical rule with GCM is brutal and absolute: never reuse a nonce with the same key. Not “usually avoid.” Never. Reuse is catastrophic. Protocols that derive nonces carefully are not being fussy. They are staying alive.

For signatures, Ed25519 has become the civilized default. One private signing key. One public verification key. Small keys, small signatures, fast operations, deterministic signing, a hard-to-misuse API:

def public_raw(self) -> bytes:
    return self.public_key.public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )

def private_raw(self) -> bytes:
    return self.private_key.private_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PrivateFormat.Raw,
        encryption_algorithm=serialization.NoEncryption(),
    )

@classmethod
def generate(cls) -> "IdentityKeyPair":
    priv = Ed25519PrivateKey.generate()
    return cls(private_key=priv, public_key=priv.public_key())

This is identity. The device signs things with it. The private half stays on the device. The public half can travel.

For key exchange, X25519 gives you elliptic-curve Diffie–Hellman stripped down to something lean and practical. Alice has private scalar a and public point aG. Bob has b and bG. Alice computes a * (bG). Bob computes b * (aG). Both arrive at the same shared secret, abG, without ever sending the private scalars over the wire.

@classmethod
def generate(cls) -> "InitKeyPair":
    priv = X25519PrivateKey.generate()
    return cls(private_key=priv, public_key=priv.public_key())

That is your other fundamental tool: not identity, but key agreement.

Six pieces. SHA-256, HMAC, HKDF, AES-GCM, Ed25519, X25519. A strange little family. Enough to build a secure messaging system if you are disciplined, and enough to destroy one if you are not.

In 2026 there is a new generation joining the family on the public-key side — ML-KEM and ML-DSA, the lattice-based replacements — but the symmetric pillars (SHA-256, HMAC, HKDF, AES-GCM) carry over essentially unchanged. Symmetric cryptography is not what the quantum computer is going to break. We will return to this.

The curve behind the curtain

If you keep looking at crypto long enough, you eventually hit the wall of mathematics and discover it is less a wall than a red curtain. Pull it back and the machinery is ugly but intelligible.

Curve25519 lives over a finite field modulo the prime $2^{255} - 19$. In one common form, the curve equation looks like this:

\[y^2 = x^3 + 486662x^2 + x \pmod{2^{255} - 19}\]

Points on this curve form a group under a specially defined addition law. Pick a base point $G$. Multiply it by an integer scalar $k$, and you get another point $kG$. The security assumption is that given $G$ and $kG$, recovering $k$ is computationally infeasible. That is the elliptic-curve discrete logarithm problem.

Everything starts leaning on that hardness assumption.

In X25519, your private key is a scalar and your public key is the corresponding point. Shared secret computation is just scalar multiplication performed from opposite sides. In Ed25519, signatures rely on related algebra. In a simplified form, verification checks a relation like

\[sG = R + H(R, A, m)A\]

where $A$ is the public key, $R$ is an ephemeral point, and $s$ is part of the signature. The beauty is not that it is simple. The beauty is that it is standardized, optimized, and old enough to trust more than anything newly improvised in a repo at 2:30 a.m.

When people say this ecosystem gives roughly 128-bit security, that means the best known attacks still require on the order of $2^{126}$ to $2^{128}$ work, depending on the primitive and attack model. That is why AES-128 is not some “smaller” and therefore weaker embarrassment next to AES-256. It is already absurdly secure in practice. AES-256 is not wrong; it is just often more aesthetic than necessary in this class of systems.

The lattice cousins, ML-KEM and ML-DSA, lean on different math entirely — module learning with errors. The hardness story over there is younger and the keys are larger, but the high-level shape of the API is reassuringly familiar: generate a keypair, encapsulate to a public key, decapsulate with the private key, get a shared secret out the other end. Same drama, different stage.

The ciphersuite as a sentence

Sometimes protocols hide their entire worldview in one string. Consider this:

mls-128-dhkemx25519-aes128gcm-sha256-ed25519

That is not branding. That is the full menu.

It says the protocol is MLS, Messaging Layer Security. It says the target security level is about 128 bits. It says the KEM side of HPKE uses X25519-based Diffie–Hellman. It says bulk authenticated encryption uses AES-128-GCM. It says the hash foundation is SHA-256. It says signatures are Ed25519.

A good ciphersuite string is like a proper street sign at night. It tells you exactly where you are.

The 2026 menu is longer. Alongside the classical suite, the working draft for post-quantum MLS registers names like:

mls-256-dhkemx25519mlkem768-aes256gcm-sha384-ed25519mldsa65
mls-256-dhkemmlkem768-aes256gcm-sha384-mldsa65

The first is hybrid. Two key encapsulations stacked, classical and post-quantum, combined so that an attacker has to break both. The second is post-quantum only, no classical safety net, for when you have decided the lattice assumption is good enough on its own. Which one you pick is a real engineering question and a slightly philosophical one. Hybrid is the cautious answer. Pure PQ is the answer for people who think the classical curtain is going to fall and they would rather not be standing behind it.

HPKE, the sealed envelope with an ephemeral key inside

MLS leans heavily on HPKE, Hybrid Public Key Encryption. HPKE is what you use when you want to encrypt to someone’s public key but still end up with symmetric encryption under the hood, because public-key operations are for bootstrapping trust, not for carrying the bulk of your traffic.

The friendly API looks like this:

ciphertext, enc = HPKE_Seal(recipient_pubkey, aad, plaintext)
plaintext       = HPKE_Open(recipient_privkey, enc, aad, ciphertext)

Underneath, the sender generates a fresh ephemeral X25519 keypair. Call it $(e, eG)$. The sender computes a Diffie–Hellman shared secret using $e$ and the recipient’s public key. HKDF turns that shared secret into an AEAD key and nonce material. The plaintext gets sealed. The recipient uses their private key and the sender’s ephemeral public key enc to derive the same shared secret and open the message.

That ephemeral key matters. It means a compromise later does not automatically reveal past HPKE-sealed payloads. Forward secrecy begins here, in these deliberately disposable little keys that exist just long enough to pass a secret across the gap and then disappear.

There is a separate IETF draft for post-quantum and hybrid HPKE, which slots ML-KEM into the same shape. Same envelope, different glue. The KEM step gets bigger. The AEAD step does not. The ergonomics of the calling code barely change. That is what good cryptographic abstractions look like — replaceable parts behind a stable seam.

Asynchronous adds, or why prekeys exist

The first real awkwardness in messaging is this: how do you cryptographically include someone who is not online right now?

The answer, in Signal land and in MLS land alike, is some version of pre-published one-time public material. MLS calls them KeyPackages. Signal called them prekeys. Same basic energy, less mystique.

A device publishes signed bundles containing an identity public key, a fresh X25519 init public key, some metadata such as lifetime and ciphersuite, and a signature over the whole canonicalized object.

bundle_obj = {
    "schema": KEYPACKAGE_SCHEMA,
    "ciphersuite": ciphersuite,
    "user_id": user_id,
    "device_id": device_id,
    "identity_pubkey": _b64(identity.public_raw()),
    "init_pubkey": _b64(init_key.public_raw()),
    "lifetime": lifetime,
    "nonce": _b64(secrets.token_bytes(16)),
}
bundle_bytes = json.dumps(
    bundle_obj,
    separators=(",", ":"),
    sort_keys=True
).encode("utf-8")

signature = identity.private_key.sign(bundle_bytes)

return {
    "ciphersuite": ciphersuite,
    "public_bundle": _b64(bundle_bytes),
    "signature": _b64(signature),
}

The canonical JSON detail looks boring and is not. Signatures are over bytes, not abstract objects. If two implementations serialize the same object differently, verification breaks. So you sort keys, strip whitespace, and commit to one representation. Real MLS implementations use TLS presentation language for this sort of thing, but the principle is the same: canonicalize or die by ambiguity.

The one-shot property also matters. A KeyPackage’s init key is supposed to be consumed once. One Welcome, one use. Burn after reading. Reusing it stretches the blast radius of compromise across multiple onboarding events, which is exactly what you do not want.

Signal’s PQXDH and the post-quantum extensions to MLS preserve this shape. The prekey bundle gets a lattice public key alongside the classical one. The handshake derives a shared secret from both. The code looks more verbose. The mental model survives intact.

Why MLS exists instead of “just do Signal but for groups”

Pairwise encrypted messaging is manageable when the room is small. In a large group, naïve pairwise state turns ugly fast. If everyone has to maintain separate secure relationships with everyone else, membership changes become expensive and messy. MLS exists because secure groups needed a protocol designed for groups rather than retrofitted from two-party assumptions.

The central idea is TreeKEM. Imagine a binary tree. Leaves are devices. Internal nodes also carry key material. Each device knows the private keys on its path from leaf to root, and the root secret anchors the current group epoch.

            root
           /    \
         n1      n2
        / \     / \
      L1  L2  L3  L4

A member at leaf L1 knows L1, n1, and root. Another member on the same side knows a different leaf secret but the same path upward. Members on the opposite side know their own path. This structure lets the protocol update group secrets in logarithmic rather than quadratic fashion.

When membership changes, a member creates a fresh path from its leaf to the root. New path secrets are derived and corresponding public keys are distributed to the right sibling subtrees using HPKE. Then the epoch advances. New message keys, nonces, and sender state are derived from the new epoch secret.

You can write the spirit of it as:

\[\text{epoch}_{n+1} = \mathrm{KDF}(\text{epoch}_n, \text{commit}, \text{transcripthash})\]

The exact derivation structure is more elaborate, but the important thing is the property: each epoch change injects fresh entropy and re-anchors the group.

That gives you forward secrecy, meaning compromising today’s state does not unlock yesterday’s messages if old secrets were properly deleted. It also gives post-compromise security. If an attacker steals your current secrets but the group continues and a fresh commit lands, the attacker can be pushed back out of future epochs. This is one of the reasons MLS is so compelling. It does not just try to be secure at a frozen instant. It tries to heal.

The 2026 footnote: RFC 9750, the MLS Architecture document, was published in April 2025, codifying the operational guidance that the protocol document RFC 9420 deliberately left out. The GSMA’s Universal Profile 3.0 made MLS the basis for end-to-end encryption in RCS, with Apple committing to support it on Apple Messages, while Matrix has announced its migration as well. What was a paper protocol is now the substrate of mass-market texting between rival platforms. That is, for a standards effort, a fairy-tale ending.

Welcomes, removals, and the fact that crypto is not a time machine

When a new member is added, someone claims one of that member’s KeyPackages, updates the tree, and creates a Welcome message encrypted to the recipient’s init key. The Welcome contains enough information for the new member to enter the current epoch with the right tree context and path secrets.

That Welcome is the quiet doorway. It is how an asynchronous group says, you were absent, but the cryptography left the porch light on.

Removal is more sobering. When someone is removed, the group rotates forward. Their leaf is blanked, the tree updates, the epoch changes, and they cannot derive future keys. But the past remains the past. If they already received old ciphertext and had the right keys at the time, you do not get to reach into their memory and erase it. End-to-end encryption is powerful, but it does not reverse causality.

People often discover this too late. Revocation means no future access. It does not mean retroactive amnesia.

This becomes painfully relevant when the new member is not a person.

The third party in the conversation

End-to-end encryption defines security in terms of who the “ends” are. For decades, the ends were humans, and the threat model was clear: anyone in the middle is suspect. In 2026, a third kind of end has shoved its way into the room. It does not have a face. It does not show up in the member list. It is an LLM, and somebody — maybe the platform, maybe a single member, maybe the user themselves — has plugged it into the conversation.

Once that happens, the trust model the protocol was designed around quietly bends.

Think about what the LLM has to be able to do in order to be useful. To summarize a thread, it must read the thread. To draft a reply, it must read the context. To translate, transcribe, search the chat history, remember preferences, suggest a meeting time — every one of those features requires plaintext on the LLM’s side of the wire. End-to-end encryption was built to keep messages private, but that privacy starts to fray as soon as decrypted content is handed to an AI assistant. Even if the processing happens in a careful environment, the message is no longer constrained to sender and recipient.

This does not break the cryptographic protocol. The protocol still does exactly what it promises. The bytes are sealed in transit, the server still cannot read them, the tree still rotates, the epoch still advances. What breaks is the implicit social contract that “encrypted” meant “only the people in the chat will see this.” When an assistant joins, the set of entities who see plaintext has grown by one. Sometimes that one is on your phone. Sometimes it is in a data center on another continent. Those two cases are not the same, and treating them as the same is how good protocols start producing bad outcomes.

There is also a collective consent problem that has no precedent in the cryptographic literature. If a single participant in a group enables an AI assistant, every other participant’s messages may be processed by a model none of them opted into. The math of MLS does not know what to do with this. No epoch update will save you from a member who is faithfully relaying every commit message into a vendor’s API. Honest behavior at the protocol layer can be perfectly compatible with what feels, at the human layer, like a quiet betrayal.

And then there is the new attack class. Vulnerabilities like EchoLeak demonstrated that AI assistants can be coaxed into leaking sensitive material — a category researchers have started calling LLM scope violation, where the model exposes context it was never supposed to share, based purely on how it interpreted a crafted prompt. This is the new nonce reuse. It does not look like the old attacks. It does not require breaking AES or finding a discrete log. It requires writing a sentence that sounds innocuous and is, in fact, a small key turned in a lock the assistant did not know it was guarding.

The protocol people have a useful instinct here, and it is worth borrowing: declare the threat model explicitly, then design for it. If the LLM is in the trust boundary, say so. If it is not, do not let it touch plaintext. Anything in between — “well, the model is processed in a special place” — is where the interesting security engineering of this decade is going to live.

Private Cloud Compute, or the new attestation ceremony

The most elaborate answer to the LLM-in-the-middle problem, so far, is to push the model into a hardware-rooted enclave that the platform itself cannot read into.

Apple’s Private Cloud Compute architecture works roughly like this: the device builds a request containing the prompt and inferencing parameters, encrypts it directly to the public keys of specific PCC nodes the device has cryptographically verified, and the data is supposed to be deleted after the response is returned and never available to Apple staff, including those with administrative access. The trust story is no longer “Apple promises.” It is “Apple publishes the binary running on the node, the device attests that it is talking to that binary, and any deviation is detectable.”

That is a meaningful change. It is also not magic. PCC and confidential computing are not the same thing — PCC focuses on hardening the communication path with verifiable software transparency, while confidential computing focuses on encrypting workloads in use within trusted execution environments, defending against malicious operating systems and hypervisors. The two approaches share primitives — TEEs, remote attestation, sealed channels — but they are answering slightly different questions. Whether either of them gives you the same guarantee that classical end-to-end encryption gives is a more subtle conversation than the marketing usually allows.

The broader pattern across the industry in 2026 is something called a private AI cloud. These architectures lean on three primitives — trusted execution environments that hardware-isolate memory from the host, GPU confidential compute (NVIDIA’s offering extends the trust boundary to include the accelerator), and remote attestation that lets clients verify which code is actually running. Anthropic, Google, and Meta all have variants in production or development. The honest framing is that these clouds give real privacy benefits but do not deliver the same kind of mathematical guarantee as end-to-end encryption — users still have to trust hardware vendors, attestation infrastructure, and abuse-monitoring layers.

This matters because the security argument is no longer “we cannot read your data.” It is “we have arranged the world such that, assuming the chip vendor did their job and the attestation chain holds, we should not be able to read it, and you can verify that arrangement before you send it.” The shift from cryptographic impossibility to verified arrangement is enormous, and easy to miss.

There is a strain of thought that calls this a downgrade. There is another that calls it the only practical way to get useful AI without surrendering everything. Both are partly right.

On-device, or, the small model that doesn’t tell

The other answer — quieter, less heroic, increasingly viable — is to keep the model on the device.

The economic and regulatory pressure here is genuine. Local LLMs are gaining traction precisely because they sidestep concerns about transmitting sensitive material over the internet, and the on-device AI market is projected to grow into the tens of billions of dollars by 2030. Phones in 2026 routinely run quantized 1B to 3B parameter models well enough to handle summarization, translation, dictation, basic agent workflows, and the small set of tasks that used to require a network round trip. Cross-platform inference SDKs now report sub-50ms time-to-first-token for on-device models, eliminating network latency and defaulting to total privacy.

For the threat model we have been building, this is the cleanest fit. If the model lives on your device, then “the LLM saw it” and “your local app saw it” are the same statement. The tree of trust does not get an extra branch. The bytes do not leave. The Welcome message gets opened, the epoch keys get derived, the plaintext gets handed to a small model that runs on the same silicon as your photo library, and nothing crosses the network that was not already going to.

This is not a complete answer. The on-device models are smaller and dumber than their cloud cousins. Some workloads still demand the bigger brain. The compromise is roughly: do small things locally, and fall back to a verified private cloud only when you have to. That is the architecture quietly emerging in shipped products — Apple Intelligence, Proton’s Lumo, certain enterprise RAG stacks. Lumo’s privacy story explicitly grapples with what end-to-end encryption even means when one of the ends is a language model rather than a person, and tries to encrypt both in transit and at rest while still letting the model do useful work.

The protocol designers do not get to settle this debate for the product designers. But protocol designers can at least insist that the seam between “encrypted message” and “AI feature” be honest. If the assistant runs locally, say so. If it does not, say where it runs and what it is allowed to remember. Anything else is a rendering trick, and rendering tricks tend to age badly.

The server as a dumb and useful machine

One of the most attractive architectural consequences of doing this properly is that the server gets demoted. Not removed, not romanticized, just demoted.

It authenticates users and devices. It enforces access control. It stores blobs. It forwards notifications. It keeps monotonic epoch state so clients do not fork the group history accidentally or maliciously.

A server-side epoch check can look as plain as this:

if req.epoch != mls.epoch + 1:
    raise HTTPException(
        status_code=status.HTTP_409_CONFLICT,
        detail=f"epoch out of order: expected {mls.epoch + 1}, got {req.epoch}",
    )

That is almost disappointingly simple, which is a sign of a good separation of concerns. The server should not be doing cryptographic interpretation of runtime group content. It should not need private keys. It should not parse more than it must. It should not become the wise central brain that every future compromise wants to interrogate.

The beauty of opaque blob storage is not elegance alone. It is strategic humility. If later you swap a reference implementation for OpenMLS or AWS’s mls-rs, or move from purely classical suites to hybrid post-quantum ones, the server ideally barely notices. If you decide to add an on-device assistant, the server still does not learn anything new. The server’s ignorance is the system’s strength.

The threat model, without neon and lies

There is always a temptation in security writing to sound apocalyptic on the sales pages and omnipotent in the architecture docs. Better to stay sober.

What this kind of design defends against: passive network attackers, database readers, many classes of server compromise, leaked backups, and some forms of device compromise once the group rotates. Clients verify cryptographic commits, so a malicious admin can censor or destroy availability, but cannot forge valid group evolution undetected.

What it does not defend against: metadata analysis, compromised clients, an attacker who is legitimately added to a group, a future quantum adversary powerful enough to break the classical public-key assumptions underneath X25519 and Ed25519, and — newly important — any AI assistant that has been granted plaintext access to the conversation, regardless of how nicely its inference is wrapped.

Metadata remains a live wound. Even if content is sealed, the delivery service still sees patterns. Presence, read receipts, typing indicators, retention, and logs all become political choices as much as technical ones. Every extra signal you store is another little lantern for a future adversary.

A compromised client is game over for that endpoint. This is a hard truth worth stating plainly. If the software that performs encryption has already been subverted, then “but the protocol is sound” is a eulogy, not a defense. The 2026 corollary is harder still: an LLM-enabled client is a client whose attack surface includes prompt injection, model jailbreaks, and any data flow the assistant can reach. Securing the cryptography buys you nothing against an assistant that is smoothly persuaded to exfiltrate the very chat it just helped you summarize.

The quantum ghost in the corner, now wearing a name tag

In the last edition of this essay, the quantum threat felt like a long shadow at the end of a hallway. It is closer now, and it has acquired specific names.

NIST has finalized its first post-quantum standards. ML-KEM, the lattice-based KEM previously known as Kyber, was standardized as FIPS 203 — strong security with relatively small keys and ciphertexts of around 1.5 KB, efficient on modest hardware, and the post-quantum half of most hybrid TLS and VPN deployments today. ML-DSA, formerly Dilithium, is its signature counterpart. In 2025 NIST also selected HQC, a code-based KEM, as a backup standard alongside Kyber, with a final standard expected by 2027. Two different mathematical families, in case one of them turns out to have a hidden window.

The migration is no longer hypothetical. AWS has begun rolling out ML-KEM support across services such as KMS, ACM, and Secrets Manager, with the older pre-standard Kyber implementations slated for removal across all AWS endpoints in 2026. The IETF has a working draft for MLS post-quantum ciphersuites that pairs ML-KEM with ML-DSA inside the existing protocol framing. The hybrid HPKE draft does the same for the envelope primitive everyone leans on.

The reason ciphersuite agility was always emphasized: this is exactly the moment it has to pay off. Architectures that hardwired X25519 are now staring down expensive surgery. Architectures that treated the ciphersuite as opaque metadata and let the clients decide are doing migrations as configuration changes.

A future hybrid suite looks like this:

mls-128-dhkemx25519hybridmlkem768-aes128gcm-sha256-ed25519

The idea is to combine a classical component and a post-quantum KEM. Break both or fail. That is how sane migrations happen: hybrid first, caution always, no messianic declarations that one shiny new primitive has already replaced decades of cryptanalysis.

If your architecture stores the ciphersuite as opaque metadata and leaves cryptographic interpretation to the clients, then supporting such a migration is not a religious conversion. It is a controlled evolution.

That is what good protocol design looks like. Not immortality. Replaceable parts.

The harvest-now-decrypt-later concern, once theoretical, is the operational reason for moving early. If an adversary is recording your ciphertext in 2026 and a cryptographically relevant quantum computer exists in 2036, anything you sent under classical-only crypto today is potentially in their reading queue. The hybrid suites exist to take that bet off the table, retroactively, for the messages you send now.

The ugly bits that are not the same as broken bits

Every real system has debt. The important distinction is between debt that is operational and debt that is cryptographic.

Race conditions in KeyPackage claims on a lightweight database. In-process pubsub that will not survive multi-worker scale. No finalized key backup and recovery design. Garbage collection of consumed prekeys left for later. An LLM integration that helpfully summarizes the chat into a third-party API without telling users where the bytes ended up. These are flaws, some annoying, some serious, but not all of them are cryptographic in nature.

A system can ship with operational debt and live to improve. A system that ships with nonce reuse, unauthenticated state transitions, or home-rolled key derivation is a crime scene wearing sneakers. A system that ships with an undisclosed AI assistant in the trust boundary is somewhere between the two — not a cryptographic break, but a category of breach that the user could not have consented to because they were not told it existed.

That distinction matters because protocol work attracts a lot of performative purity. Better to be exact. Some compromises are survivable. Others poison the well. And some, increasingly, look survivable in the architecture diagram and become catastrophic the moment you check what the assistant is actually doing with the plaintext.

Why these old protocols still feel strange and new

There is a David Lynch quality to good security engineering. The surface looks domesticated. A room, a lamp, a clean UI, a person typing, a friendly assistant offering to summarize the day. But behind the wall there is another room, and behind that room there is machinery, and behind the machinery there is an old mathematics that does not care about your aesthetic preferences, and lately, behind all of that, there is a model with weights large enough to encode an unsettling amount of human writing, humming away on a chip in a building you have never visited.

You send a message and what really happens is a signature binds a canonical object, an ephemeral keypair blooms and dies, HKDF stretches a shared secret into clean material, a nonce is derived with priestly care, a tree path rotates, a transcript hash ratifies continuity, an epoch advances, a server shrugs and forwards bytes it cannot read — and then, optionally, on one device, the bytes are turned back into language and handed to a model that chooses words back. Ordinary life on top, algebra underneath, a speaking animal in a cage at the end.

This is why secure messaging is so easy to misunderstand. It looks like product behavior, but it is protocol behavior. It looks like UX, but under the UX is a chain of assumptions so brittle and exact that one reused nonce or one badly scoped secret can turn the whole palace into vapor. And with the assistant in the room, the brittle exactness extends now to a new question: what is the model allowed to remember, who decided, and how would you ever know.

And yet the result, when done right, is almost poetic in a severe, technical way. A group changes shape and the cryptography changes with it. A stolen key does not mean the attacker owns the future forever. A server becomes less powerful by design. Trust is pushed outward to endpoints and bounded by verifiable math rather than institutional promises. And, in the best versions of the assistant story, the model lives close to the user, the prompt never leaves, the helpful little voice is bounded by the same silicon that holds the keys.

The protocol you like is going to come back in style because the world keeps rediscovering the same hard lesson: if the middle can read everything, then eventually the middle matters too much. And when the middle matters too much, someone always tries to own it. The middle has a new shape now — it can be a database, a delivery service, a side channel, or a 70-billion-parameter model in a rack — but the lesson is unchanged.

So the old names return. Diffie–Hellman. EdDSA. HKDF. AEAD. HPKE. MLS. They do not return as retro chic. They return because the conditions that made them necessary never really left. They were just waiting under the stage lights for everybody else to catch up.

The new names are joining them. ML-KEM. ML-DSA. HQC. Confidential inference. Attested enclaves. On-device models. They are not replacements. They are companions. The cathedral is still being built. The bricks have just gotten a little bigger and a little stranger.

We Used to Talk About Tor. Well We’ve Got LLM Agents

2026-04-14T00:00:00+00:00

There was a time when internet privacy debates had a relatively stable shape.

We talked about Tor, VPNs, encrypted email, browser fingerprinting, metadata retention, and traffic analysis. The underlying model was clear enough: a human user interacted with a networked environment, and the primary risk was that this interaction could be observed, recorded, correlated, and ultimately exploited. The goal of security and privacy technologies was therefore to reduce visibility, distribute trust, and make surveillance more expensive or less reliable.

Those concerns remain valid. But they are no longer sufficient to describe the current landscape.

What has changed is not only how data is transmitted or stored, but where action itself takes place. Increasingly, users are not acting alone. They are accompanied by software systems - LLM-based assistants, copilots, and agents - that read across data sources, interpret intent, retrieve context, and, in many cases, take action on their behalf. These systems do not simply protect or expose user activity. They participate in it.

This introduces a qualitatively different problem. The question is no longer only who can observe the user. It is also what can act for the user, what information that system must consume in order to do so, and how its decision-making process can be influenced or subverted.

Tor addressed concealment. Agents introduce delegated authority.

That distinction is not superficial. It alters the level at which security needs to be reasoned about.

From protecting communication to governing execution

The traditional privacy stack focused largely on protecting communication paths. Systems like Tor obscured origin through layered routing; TLS secured content in transit; end-to-end encryption attempted to ensure that even service providers could not access message contents. Anti-tracking tools reduced the ability of platforms to correlate user behavior across contexts.

These mechanisms were designed for a world in which the user initiated discrete actions. The system’s responsibility was to carry or protect those actions, not to originate them.

Agentic systems shift this boundary upward. They are not merely transporting user intent; they are interpreting it, extending it, and, in some cases, generating new actions that the user did not explicitly specify in detail. This moves the security problem away from transport and storage and toward interpretation, planning, and execution.

In practical terms, this means that the integrity of the system no longer depends only on whether data is encrypted or access-controlled, but on whether the system correctly understands what it is supposed to do and whose authority it is operating under.

What “agent” means in real systems

The term “agent” is often used loosely, so it is useful to ground it in actual system design.

In most current implementations, an agent consists of a language model acting as a central planner within a control loop. It receives user input and contextual state, retrieves additional information through search or RAG pipelines, and has access to a set of tools - these might include APIs, file systems, browsers, databases, or code execution environments. Based on the combined context, the model proposes actions, which are executed by the system, and the results are fed back into the loop until some completion condition is reached.

This architecture effectively turns the model into a coordination layer across heterogeneous systems. It is not simply generating text; it is orchestrating operations. The critical point is that the same mechanism used to interpret natural language is now also responsible for selecting actions that have real side effects.

This coupling between interpretation and execution is what creates new risk.

A striking amount of the current ecosystem still builds agents in a way that, from a security perspective, is essentially equivalent to letting the model read everything, decide everything, and call everything. In pseudo-code, the unsafe version looks something like this:

def run_agent(user_request: str):
    context = []
    context.append({"role": "user", "content": user_request})

    retrieved_docs = rag_search(user_request)
    for doc in retrieved_docs:
        context.append({"role": "system", "content": doc.text})

    while True:
        response = llm.generate(context, tools=ALL_TOOLS)

        if response.type == "tool_call":
            result = execute_tool(
                name=response.tool_name,
                args=response.tool_args,
            )
            context.append({"role": "tool", "content": str(result)})
        else:
            return response.content

At first glance this looks clean. It is also a compact summary of the problem. Retrieved documents are inserted into the same effective decision space as the user’s request. The model is trusted to decide which tool to call. All tools are available in the same loop. There is no external policy layer, no trust separation, no approval boundary, and no explicit identity scoping. If one of the retrieved documents contains adversarial instructions, or if the model simply infers the wrong next step, the system has no meaningful brake.

This is the architectural equivalent of saying: “Here is a probabilistic parser of ambiguous language. Let it sit in the middle of our infrastructure.”

The expansion of the attack surface into context

Traditional software systems treat input as data that must be validated before use. Agentic systems, by design, ingest large volumes of heterogeneous input and treat it as part of the reasoning process.

This input may include webpages, documents, emails, chat messages, code, logs, and prior outputs from the system itself. Importantly, there is no inherent distinction between “data” and “instructions” in natural language. Once incorporated into the model’s context window, any piece of text can influence subsequent decisions.

This is the essence of prompt injection, but it is better understood as a broader class of semantic attacks. A malicious document does not need to exploit a memory vulnerability if it can alter the model’s understanding of the task. A webpage does not need to execute code if it can persuade the system that a particular action is necessary or authorized.

In classical systems, we work hard to separate code from data. In agentic systems, that separation is blurred by design. The model must interpret meaning across inputs, and meaning in natural language often carries implicit instructions.

This makes context itself an attack surface.

Retrieval as a security boundary

Retrieval-augmented generation is commonly framed as a technique for improving accuracy by grounding the model in external knowledge. In an agentic setting, however, retrieval becomes a critical security boundary.

When external or semi-trusted content is introduced into the model’s working context, it gains the ability to influence decision-making. If all retrieved content is treated equally, then untrusted sources may acquire the same effective authority as system policies or user instructions.

A robust design therefore needs to treat different classes of input differently. User intent, system policy, structured internal state, and externally retrieved content should not be merged into a single undifferentiated prompt. Each should carry metadata about its origin, trust level, and permissible influence.

Without such separation, the model is left to infer authority relationships from patterns in text, which is not a reliable basis for security-critical decisions.

This is where safer designs begin to look less like chat wrappers and more like security middleware. A more defensible control loop usually has a very different shape:

def run_agent(user_request: str, user_identity: Identity):
    plan_context = {
        "user_request": user_request,
        "trusted_policy": load_policy_bundle(user_identity),
        "structured_state": load_structured_state(user_identity),
        "retrieved_untrusted": retrieve_untrusted_context(user_request),
    }

    proposed_action = llm_plan(plan_context, tool_catalog=SAFE_TOOL_SCHEMAS)

    decision = policy_engine.evaluate(
        actor=user_identity,
        action=proposed_action,
        trust_context=plan_context,
    )

    if not decision.allowed:
        return deny(decision.reason)

    if decision.requires_approval:
        approved = request_human_approval(
            actor=user_identity,
            action=proposed_action,
            reason=decision.reason,
        )
        if not approved:
            return "Action cancelled."

    result = execute_tool_as_principal(
        principal=user_identity,
        tool=decision.tool_name,
        args=decision.filtered_args,
        scope=decision.scope,
    )

    write_audit_log(
        actor=user_identity,
        action=decision.tool_name,
        args=decision.filtered_args,
        provenance=plan_context,
        result_summary=summarize(result),
    )

    return result

The difference between these two designs is not stylistic. It is the difference between using an LLM as a helpful component inside a controlled system and using it as the system itself.

This is still only pseudo-code, but it reflects a radically different philosophy. The model proposes; it does not authorize. Retrieved content is not silently merged with policy. Tool access is scoped. Identity is explicit. Arguments can be filtered before execution. High-risk actions can be routed through human approval. The system is designed around the assumption that the model may be manipulated, confused, or simply wrong.

That is the mindset agentic systems require.

Tool use and the materialization of errors

The introduction of tool use fundamentally changes the impact of model errors.

In a purely conversational system, a hallucination is often limited to incorrect text. In an agentic system, the same misinterpretation can result in a concrete action: an email sent to the wrong recipient, a file deleted, a database query executed, or sensitive data exported.

The design of tools therefore becomes central to system safety. Tools should be narrowly scoped, with well-defined schemas and constrained capabilities. They should enforce least privilege, require explicit confirmation for sensitive operations, and ideally operate in sandboxed environments. Importantly, the system should treat model outputs as proposals rather than authoritative commands.

A secure architecture places policy enforcement outside the model. The model may suggest an action, but a separate control layer should determine whether that action is permitted given the current context, identity, and risk profile.

The reappearance of the confused deputy

The confused deputy problem provides a useful lens for understanding many of these risks. In that scenario, a system with legitimate authority is tricked into misusing that authority on behalf of an unauthorized party.

Agents are particularly susceptible to this pattern because they aggregate multiple sources of input and operate across multiple systems. They may receive instructions from users, colleagues, documents, and external content, and must continuously decide which signals are authoritative.

If an agent misattributes authority - for example, by treating a statement in a retrieved document as a valid instruction - it may perform actions that appear legitimate from a technical perspective but are semantically unauthorized.

The challenge is that this is not a traditional exploit. It is a failure of interpretation under conditions of ambiguity and adversarial input.

Identity, privilege, and execution context

Another area where agentic systems introduce complexity is identity management.

Actions may be executed under different identities: the user’s account, a service account, an API key, or an authenticated browser session. If these identities are not clearly separated and bound to specific scopes, the system may inadvertently escalate privileges.

For example, an agent might fulfill a request using a backend API with broader access than the user’s own permissions, simply because that path is available. From the system’s perspective, this is efficient. From a security perspective, it violates the principle of least privilege.

Each tool invocation should therefore be explicitly associated with an identity, a scope, and a justification. These bindings should be visible, auditable, and enforceable.

Memory as a persistent risk surface

Memory is often presented as a feature that enhances usability by allowing systems to retain context across interactions. From a security standpoint, however, memory is also a form of persistent state that can accumulate sensitive information, stale assumptions, or adversarial inputs.

Different types of memory - short-term conversational context, task-level state, and long-term user profiles - have different risk profiles, but all require lifecycle management. Systems should define what can be stored, how it is validated, how long it is retained, and how it can be inspected or deleted.

Without such controls, memory can become both a source of leakage and a vector for long-lived manipulation.

Natural language as a control interface

One of the more subtle challenges in agent design is the reliance on natural language as a control interface.

Natural language is inherently ambiguous, context-dependent, and open to interpretation. While this flexibility is what makes it attractive for user interaction, it is also what makes it difficult to use safely for high-authority operations.

In traditional systems, commands are expressed in structured formats with well-defined semantics. In agentic systems, similar levels of authority may be triggered by loosely phrased instructions whose exact meaning depends on context.

This places a significant burden on the system to correctly interpret intent and distinguish between instructions, suggestions, and irrelevant information. It also creates opportunities for adversarial inputs to exploit ambiguity.

The need for external policy enforcement

Given these challenges, it is not sufficient to rely on the model itself to enforce all constraints.

A robust agent architecture should include external policy mechanisms that evaluate proposed actions before execution. These mechanisms can enforce rules related to access control, data sensitivity, action scope, and risk thresholds.

This separation ensures that even if the model is misled or makes an incorrect inference, the system as a whole can prevent unsafe actions from being carried out.

A shift in the core question

The technologies we built around Tor and related systems addressed a fundamental question: how can users interact with digital systems without exposing themselves unnecessarily to observation and control?

That question remains important. But agentic systems introduce a second, equally important question: how can users retain control over systems that act on their behalf?

This is not merely an extension of the original problem. It is a shift in focus from visibility to authority, from communication to execution, and from protecting data to governing action.

If we fail to recognize this shift, we risk applying the wrong solutions to the wrong layer of the system.

We used to worry about who could see us.

We now need to worry about what can act for us, how it makes decisions, and how easily those decisions can be influenced.

That is a more complex problem, and one that will require more than incremental adjustments to existing security models.

It will require treating agentic systems not as enhanced interfaces, but as intermediaries with real power - systems whose design must be constrained, audited, and governed accordingly.

The Owls Are Not What They Seem

2026-04-06T00:00:00+00:00

First post. The blog is called Fire Walk With Middleware. That should tell you enough about the tone.

Writing about software, LLMs, and whatever I’m currently breaking or building. No schedule, no niche.

More soon.