Anthropic publishes its constitution along with research about where the constitution works and where it does not. The current version (January 2026) is an ethical treatise written to Claude. The priorities are safety, ethics, Anthropic’s guidelines, and helpfulness, in that order when they conflict. Anthropic favours cultivating good values and judgment over strict rules, comparing their approach to trusting an experienced professional rather than enforcing a checklist.

In March 2026, Anthropic’s operational judgment failed catastrophically: a one-line .npmignore error led to the leak of 512,000 lines of Claude Code source code. The constitution asks Claude to imagine how a “thoughtful senior Anthropic employee would react”, but what happens when the organisation’s structure itself fails?

I develop the Perseverance Composition Engine (PCE), an open-source multi-agent AI system that takes a different approach to the same problem, one I call Artificial Organisations. PCE assumes that agents cannot be relied on to be honest, harmless, or helpful, and structures the system so that their inevitable bad behaviour never reaches the output.

The PCE pipeline works by sending a document through four independent agents: a Composer drafts from source materials, a Corroborator fact-checks the draft against those sources, a Critic evaluates the result without seeing the sources, and a Curator files the output. Each agent has a single objective, minimal permissions, and access to only the information it needs. The Critic can’t see the sources, so it can’t rationalise away a weak claim by pointing to them. The Composer can’t see the evaluation rubrics, so it can’t game them.
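The pipeline above can be sketched as function composition, where each stage’s signature encodes its information partition. The names and types below are illustrative, not PCE’s actual API, and trivial string operations stand in for the model calls:

```typescript
type Sources = { documents: string[] };
type Draft = { text: string };
type Checked = Draft & { discrepancies: string[] };
type Review = { approved: boolean; notes: string[] };

// Composer: sees sources, produces a draft (a real system calls an LLM here).
const compose = (s: Sources): Draft => ({ text: s.documents.join(" ") });

// Corroborator: sees draft AND sources, flags sentences with no support.
const corroborate = (d: Draft, s: Sources): Checked => ({
  ...d,
  discrepancies: d.text
    .split(/(?<=\.)\s+/) // naive sentence split
    .filter((claim) => !s.documents.some((doc) => doc.includes(claim))),
});

// Critic: its signature admits only the checked draft, never the sources,
// so it cannot excuse a weak claim by pointing at thin evidence.
const critique = (d: Checked): Review => ({
  approved: d.discrepancies.length === 0,
  notes: d.discrepancies,
});

// Curator: files the output only if it survived review.
const curate = (d: Checked, r: Review): string | null =>
  r.approved ? d.text : null;

const sources: Sources = {
  documents: ["The constitution was published in January 2026."],
};
const checked = corroborate(compose(sources), sources);
const output = curate(checked, critique(checked));
```

The point of the sketch is that the partition lives in the types: there is no way to hand the Critic the sources without changing the structure itself.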

The contrast with constitutional AI comes down to where you locate the safety mechanism.

Constitutional AI locates it in the agent. Train the agent well enough, give it clear principles, and it should behave. The problem is that agents under pressure — optimising for token velocity, operating in unfamiliar domains, balancing conflicting objectives — still confabulate, still produce plausible nonsense, still find locally convenient solutions that technically satisfy the rules while violating their spirit.

PCE locates safety in the structure around the agents. The Corroborator has sources in front of it and just one job: find discrepancies. If the Composer invented a claim, the Corroborator will see the absence in the sources. The Critic evaluates the output against rubrics without knowing what the sources said, so it can’t excuse a vague passage by noting the sources were thin. Three independent agents would all have to make the same mistake in the same direction for a fabrication to ship.
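As a toy illustration of the Corroborator’s job (exact substring matching stands in for whatever comparison the real system uses, and the draft text is invented for the example):

```typescript
// Flag any sentence in the draft that has no support in the sources.
const findUnsupported = (draft: string, evidence: string[]): string[] =>
  draft
    .split(/(?<=\.)\s+/) // naive sentence split
    .filter((claim) => !evidence.some((src) => src.includes(claim)));

const sources = ["The pipeline has four stages."];
const draft =
  "The pipeline has four stages. Each stage was formally verified.";

// Only the invented second sentence is flagged.
const flagged = findUnsupported(draft, sources);
```

The invented claim is visible to this check precisely because the Corroborator has the sources in front of it; absence of support is a mechanical observation, not a judgment call.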

The constitutional approach asks agents to balance honesty, harmlessness, and helpfulness simultaneously, a three-objective optimisation problem with no clear priority order. The objectives frequently conflict, so the agent must find a trade-off in real time. In practice, this produces outputs that satisfy all three criteria superficially: plausible, inoffensive, and vaguely on-topic. PCE resolves the conflict structurally. The Composer worries about coherence, the Corroborator worries about truth, the Critic worries about quality. Each agent is single-minded, and conflict resolution is done by the pipeline.

As a consequence, PCE inherits every improvement to the underlying models, and better alignment is always welcome. But PCE doesn’t require well-aligned agents. I regularly put a weaker or less aligned model in a PCE role, and the structure still prevents fabrication from reaching the output. The Composer doesn’t need to be trustworthy; it needs to produce coherent text from sources. The structure does the safety work.
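One way to picture this model-agnosticism (the types are mine, not PCE’s): a role binds a single objective to whichever model currently fills it, so swapping the model changes output quality without touching the structural checks around it.

```typescript
type Model = (prompt: string) => string;

interface Role {
  name: string;
  objective: string; // exactly one objective per role
  model: Model;      // swappable; the pipeline doesn't care which
}

// Stand-ins for a strong and a weak model.
const strong: Model = (p) => `careful draft for: ${p}`;
const weak: Model = (p) => `rough draft for: ${p}`;

const composer: Role = { name: "Composer", objective: "coherence", model: strong };

// Downgrading the model leaves the role, and the checks downstream, intact.
const cheapComposer: Role = { ...composer, model: weak };
```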

The npm leak was the second time in 13 months that the same vulnerability had been exploited. A structural approach would have implemented the checks that the Anthropic constitution merely assumes are present.

A constitutional agent deployed inside a structural pipeline gets the benefit of both. Good training reduces the load on the verification stages — fewer errors to catch means faster throughput and lower cost. And structural constraints catch the cases where training fails, which it sometimes does regardless of how good the training is.

The leak also revealed capabilities that don’t seem to align with the constitution’s values. The Undercover Mode system was designed to actively conceal AI authorship in open-source contributions, with no force-OFF option. The constitution explicitly values transparency and honesty, yet here was a feature designed for concealment built into the structure of the product, something no amount of constitutional training can override. In the other direction, critical security issues are handled at the level of suggestive prompts:

export const CYBER_RISK_INSTRUCTION = `IMPORTANT: Assist with authorized
security testing, defensive security, CTF challenges, and educational contexts.
Refuse requests for destructive techniques, DoS attacks, mass targeting,
supply chain compromise, or detection evasion for malicious purposes. Dual-use
security tools (C2 frameworks, credential testing, exploit development)
require clear authorization context: pentesting engagements, CTF competitions,
security research, or defensive use cases.`

This prompt is like any other: it is subject to attention drift over long contexts and cannot be relied on as a hard constraint.

The constitution places Anthropic at the top of a “principal hierarchy” that governs how Claude resolves conflicts between competing principles. But hierarchy without accountability is just authority. The leak reveals no structural mechanism for verifying that Anthropic itself follows the values it encodes, only the assumption that it will, which carries the same enforcement value as a marketing statement.

I find it refreshingly helpful to view the safety problem as a problem of institutions. We have had millennia to learn that reliable collective behaviour comes from structure, not from hoping that individuals will be virtuous: separation of powers, independent audit, role specialisation. The technical name for this, the information partition, is well understood. Weber wrote about role specialisation and separation of duties in bureaucracies, Parnas about information hiding in software systems, and March and Simon gave us bounded rationality, in which each role holds only the information relevant to its function.

PCE applies these ideas to LLM agents, and it works rather well.