
AI Safety — Policy Circuits vs. Alignment

Modern AI safety research focuses heavily on alignment: training models to want to do the right thing. BRIK-64 takes a complementary approach — ensuring that even a misaligned model cannot do the wrong thing.
“An AI doesn’t need a better language. It needs a language where incorrect programs cannot compile.”

The Two Approaches

Alignment (Probabilistic)

RLHF, Constitutional AI, RLAIF — teach the model to prefer safe outputs through training. Effective for the vast majority of interactions. Bypassed by adversarial inputs, distribution shift, or capability jumps.

Policy Circuits (Deterministic)

PCD programs compiled and certified with Φ_c = 1 — enforce boundaries at the hardware level. Cannot be bypassed by the model. Not probabilistic. Complement to alignment, not a replacement.
These approaches are not in competition. RLHF teaches the AI to want to do right. The BPU prevents it from doing wrong even if it wants to. Use both.

Why Probabilistic Is Not Enough

Alignment methods are probabilistic by nature. This is not a criticism — it is a fundamental property of gradient-based optimization.
Reinforcement Learning from Human Feedback trains the model to maximize a reward signal derived from human preferences. This works well in distribution but fails in several known scenarios:
  • Reward hacking: the model finds behaviors that maximize reward without satisfying the intended constraint
  • Distribution shift: the model encounters inputs outside the training distribution where the alignment signal was not provided
  • Capability overhang: as models become more capable, they become better at finding adversarial routes around constraints
  • Jailbreaks: carefully crafted prompts that route around the fine-tuning signal
None of these are bugs in RLHF. They are consequences of the probabilistic nature of the approach.
Constitutional AI (CAI) asks the model to critique and revise its own outputs according to a set of principles. This is powerful for reducing obvious harmful outputs. The limitation: self-critique runs in the same inference context as the original output. A model that has found a way to produce harmful output has also, by definition, found a way to produce a self-critique that approves of that output. The constitution is enforced by the model. The model being constrained is also the enforcer of the constraint.
Post-hoc filters classify outputs as safe or unsafe and block unsafe ones. These are more reliable than self-critique because they are separate models. However, they still run in software, can be adversarially evaded, and operate on the output rather than the action. By the time a filter sees a harmful output, the action it describes may already be in flight.

The Policy Circuit Model

A Policy Circuit is a PCD program that:
  1. Takes an action descriptor as input (what the AI wants to do)
  2. Evaluates the action against a set of certified constraints
  3. Returns exactly ALLOW or BLOCK — nothing else
The key property is Φ_c = 1: Thermodynamic Coherence. This means every possible input maps to exactly one of the two terminal states. There is no undefined behavior, no exception path, no edge case that produces a third result.
For example, a minimal circuit that blocks requests to .onion destinations and HTTP DELETE methods:
policy check_network_request(url: string, method: string) -> decision {
  // MC_40.FIND returns the offset of the pattern, or -1 if it is absent
  let blocked_tld = MC_40.FIND(url, ".onion");
  let blocked_method = MC_40.FIND(method, "DELETE");

  if (blocked_tld != -1) {
    return BLOCK;   // refuse .onion destinations
  }
  if (blocked_method != -1) {
    return BLOCK;   // refuse DELETE requests
  }
  return ALLOW;     // every other input reaches exactly one terminal: ALLOW
}
Policy circuits are formally verified by the TCE before deployment. The verification is not a test suite — it is a mathematical proof that Φ_c = 1 holds for the circuit. If the proof fails, the circuit does not compile.

Comparison Table

| Property | RLHF | Constitutional AI | Output Filters | Policy Circuits |
|---|---|---|---|---|
| Deterministic | No | No | No | Yes |
| Formally verified | No | No | No | Yes |
| Bypassable by adversarial input | Yes | Yes | Sometimes | No |
| Operates on actions (not outputs) | No | No | No | Yes |
| Hardware enforceable | No | No | No | Yes |
| Works at inference time | Yes | Yes | Yes | Yes |
| Requires model cooperation | Yes | Yes | No | No |

The Enforcement Gap

Alignment methods have an enforcement gap: the model must cooperate with its own constraints at inference time. This is the fundamental limitation. Policy circuits close the enforcement gap by operating on actions, not on the model’s internal state:
  1. Action generation: the AI model generates an action — a file write, a network request, a process spawn, a tool call. This generation is probabilistic and subject to all the limitations of alignment.
  2. Action descriptor serialization: the action is serialized into a structured descriptor before it reaches the I/O subsystem. This is the interception point.
  3. Policy circuit evaluation: the BPU evaluates the action descriptor against all loaded policy circuits. This evaluation is deterministic, formally certified, and runs in hardware.
  4. ALLOW or BLOCK: if all circuits return ALLOW, the action proceeds. If any circuit returns BLOCK, the PCIe transaction is terminated. The model never learns the outcome — it does not receive an exception or error signal.
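The four steps above can be modeled in a few lines of plain Python. This is an illustrative sketch, not the PCD language or the BPU datapath; `ActionDescriptor`, `evaluate`, and the circuit list are names invented for the example:

```python
from dataclasses import dataclass

ALLOW, BLOCK = "ALLOW", "BLOCK"

# Hypothetical structured descriptor: what the model is trying to do,
# serialized before it reaches the I/O subsystem (step 2).
@dataclass(frozen=True)
class ActionDescriptor:
    kind: str    # e.g. "network_request"
    target: str  # e.g. a URL or file path
    method: str  # e.g. an HTTP verb

def check_network_request(action: ActionDescriptor) -> str:
    """Software model of the example circuit: block .onion and DELETE."""
    if ".onion" in action.target:
        return BLOCK
    if "DELETE" in action.method:
        return BLOCK
    return ALLOW

def evaluate(action: ActionDescriptor, circuits) -> str:
    # Step 4: the action proceeds only if every loaded circuit says ALLOW;
    # a single BLOCK terminates the transaction.
    return ALLOW if all(c(action) == ALLOW for c in circuits) else BLOCK

circuits = [check_network_request]
ok = ActionDescriptor("network_request", "https://example.com", "GET")
bad = ActionDescriptor("network_request", "https://hidden.onion", "GET")
```

Here `evaluate(ok, circuits)` yields "ALLOW" and `evaluate(bad, circuits)` yields "BLOCK"; adding more circuits to the list can only make the composite stricter.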

Practical Policy Categories

  • Capability boundaries: hard limits on what categories of action the AI can take regardless of instruction. Examples: cannot access paths outside a defined workspace, cannot make outbound network connections to non-allowlisted domains, cannot spawn child processes above a resource threshold.
  • Resource limits: policies that enforce operational limits, such as maximum file size per write, maximum number of API calls per minute, or maximum memory allocation per request. These prevent accidental resource exhaustion as much as intentional misuse.
  • Audit requirements: policies that enforce logging requirements. Every action that matches certain criteria must produce a structured audit record before it is allowed to proceed. The audit record is written by the policy circuit, not by the model.
  • Data classification: policies that inspect data being written or transmitted and enforce classification rules, such as no PII outside encrypted storage, no secrets in log output, and no internal data in external requests.
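A minimal sketch of the first category, a workspace path boundary, modeled in plain Python rather than PCD (the workspace root and function name are invented for illustration; a real circuit would also have to account for symlinks and path encodings):

```python
import posixpath

# Hypothetical workspace root, invented for this example.
WORKSPACE = "/srv/agent-workspace"

def check_file_write(path_str: str) -> str:
    """Return "ALLOW" only if the path stays inside the workspace."""
    # normpath collapses "." and ".." textually, so a traversal attempt
    # like "/srv/agent-workspace/../etc/passwd" normalizes to
    # "/srv/etc/passwd" before the prefix check runs.
    norm = posixpath.normpath(path_str)
    if norm == WORKSPACE or norm.startswith(WORKSPACE + "/"):
        return "ALLOW"
    return "BLOCK"
```

Note that the check runs on the normalized form, so the decision cannot be steered by how the path is written, only by where it actually points.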

The Φ_c = 1 Property

Thermodynamic Coherence (Φ_c = 1) is the formal foundation of Policy Circuits. The analogy to thermodynamics is intentional:
  • In thermodynamics, a closed system conserves energy — nothing leaks out
  • In Digital Circuitality, a closed circuit conserves decision — every input maps to exactly one output with no ambiguity
A circuit with Φ_c = 1 has the following properties:
| Property | Meaning |
|---|---|
| Total | Defined for all inputs in the domain |
| Deterministic | Same input always produces the same output |
| Terminating | Evaluation always completes in bounded time |
| Binary terminal | Output is always ALLOW or BLOCK |
The TCE verifies all four properties using the Coq proof assistant before a policy circuit is cleared for deployment.
A policy circuit that does not satisfy Φ_c = 1 will not compile. The compiler reports which property failed and at which node in the circuit. This is not a runtime error — it is a compile-time proof failure.
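The actual verification is a Coq proof, but the four properties can be illustrated with an exhaustive check over a small finite input domain. This toy checker is a stand-in for intuition, not the TCE:

```python
from itertools import product

ALLOW, BLOCK = "ALLOW", "BLOCK"

def check(url: str, method: str) -> str:
    """Toy circuit mirroring the earlier network-request example."""
    if ".onion" in url:
        return BLOCK
    if "DELETE" in method:
        return BLOCK
    return ALLOW

def phi_c_holds(circuit, domain) -> bool:
    """Exhaustively test the four properties over a finite domain."""
    for inputs in domain:
        try:
            first = circuit(*inputs)   # total + terminating: must not raise
            second = circuit(*inputs)  # deterministic: must repeat exactly
        except Exception:
            return False
        if first != second or first not in (ALLOW, BLOCK):
            return False               # binary terminal: only two outcomes
    return True

# A small finite domain standing in for "all inputs"; the real TCE proves
# the properties for the full domain rather than sampling it.
domain = list(product(["https://a.com", "https://b.onion", ""],
                      ["GET", "POST", "DELETE"]))
```

A circuit that can return a third value, such as `lambda url, method: "MAYBE"`, fails the binary-terminal check immediately.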

Regulatory Context

The AI safety landscape is moving toward mandatory enforcement mechanisms. Policy circuits are positioned to be the technical foundation of that mandate.
| Horizon | Likely Requirement | Policy Circuit Role |
|---|---|---|
| 1–2 years | Voluntary disclosure of safety measures | Software Phase 1 provides auditable artifacts |
| 3–5 years | Certification of AI systems above capability thresholds | TCE certification as recognized safety standard |
| 5–10 years | Mandatory hardware enforcement for critical applications | BPU as the reference implementation |
The analogy to automotive safety is instructive: ABS was optional in 1978, a standard feature by the late 1990s, and mandatory in the EU from 2004. The BPU is being built now, in the voluntary phase, so that the certified implementation exists when the mandate arrives.

Integration with Alignment

Policy circuits do not replace alignment research. The recommended deployment is a defense-in-depth stack:
  1. Model training — RLHF and CAI to produce a model that wants to operate within its intended boundaries most of the time
  2. Inference-time guidance — System prompts, tool descriptions, and context that reinforce intended behavior at runtime
  3. Output classification — Soft filters that flag suspicious outputs for human review
  4. Policy circuits — Hard enforcement at the action layer for the constraints that must never be violated regardless of the model’s internal state
Layer 4 is what BRIK-64 provides. Layers 1–3 are the existing alignment research ecosystem, and they remain valuable.
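The division of labor between the soft and hard layers can be sketched in plain Python. All names here are invented for the sketch: the soft filter is a toy heuristic standing in for a learned classifier (layer 3), and the hard policy stands in for a certified circuit (layer 4):

```python
def soft_filter(output: str) -> bool:
    """Layer 3: advisory and evadable; flags suspicious text for review."""
    return "rm -rf" not in output  # toy heuristic, not a real classifier

def hard_policy(action: str) -> str:
    """Layer 4: deterministic allow/block on the action itself."""
    return "BLOCK" if action.startswith("shell:") else "ALLOW"

def dispatch(output: str, action: str) -> str:
    verdict = hard_policy(action)    # the hard layer always runs and is final
    if verdict == "BLOCK":
        return "BLOCK"
    if not soft_filter(output):
        return "flagged-for-review"  # the soft layer only escalates to humans
    return "ALLOW"
```

The ordering is the point: the hard verdict is computed unconditionally, so nothing the model writes into its output can route around it, while the soft layer contributes review signals for the cases the hard constraints do not cover.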

Further Reading