AI Safety — Policy Circuits vs. Alignment
Modern AI safety research focuses heavily on alignment: training models to want to do the right thing. BRIK-64 takes a complementary approach: ensuring that even a misaligned model cannot do the wrong thing.

“An AI doesn’t need a better language. It needs a language where incorrect programs cannot compile.”
The Two Approaches
Alignment (Probabilistic)
RLHF, Constitutional AI, RLAIF — teach the model to prefer safe outputs
through training. Effective for the vast majority of interactions.
Bypassed by adversarial inputs, distribution shift, or capability jumps.
Policy Circuits (Deterministic)
PCD programs compiled and certified with Φ_c = 1 — enforce boundaries
at the hardware level. Cannot be bypassed by the model. Not probabilistic.
Complement to alignment, not a replacement.
These approaches are not in competition. RLHF teaches the AI to want to do
right. The BPU prevents it from doing wrong even if it wants to.
Use both.
Why Probabilistic Is Not Enough
Alignment methods are probabilistic by nature. This is not a criticism; it is a fundamental property of gradient-based optimization.

RLHF and its limits
Reinforcement Learning from Human Feedback trains the model to maximize
a reward signal derived from human preferences. This works well in
distribution but fails in several known scenarios:
- Reward hacking: the model finds behaviors that maximize reward without satisfying the intended constraint
- Distribution shift: the model encounters inputs outside the training distribution where the alignment signal was not provided
- Capability overhang: as models become more capable, they become better at finding adversarial routes around constraints
- Jailbreaks: carefully crafted prompts that route around the fine-tuning signal
Constitutional AI and self-critique
Constitutional AI (CAI) asks the model to critique and revise its own
outputs according to a set of principles. This is powerful for reducing
obvious harmful outputs.

The limitation: self-critique runs in the same inference context as the
original output. A model that has found a way to produce harmful output
has also, by definition, found a way to produce a self-critique that
approves of that output.

The constitution is enforced by the model. The model being constrained
is also the enforcer of the constraint.
Output filters and classifiers
Post-hoc filters classify outputs as safe or unsafe and block unsafe
ones. These are more reliable than self-critique because they are
separate models.

However, they still run in software, can be adversarially evaded,
and operate on the output rather than the action. By the time a filter
sees a harmful output, the action it describes may already be in flight.
The Policy Circuit Model
A Policy Circuit is a PCD program that:

- Takes an action descriptor as input (what the AI wants to do)
- Evaluates the action against a set of certified constraints
- Returns exactly ALLOW or BLOCK, nothing else
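This contract can be sketched in Python. Everything here is illustrative: the `ActionDescriptor` fields and the two example rules are hypothetical, and a real Policy Circuit would be a certified PCD program evaluated in hardware, not interpreted code.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"

@dataclass(frozen=True)
class ActionDescriptor:
    """Structured description of an action the model wants to take."""
    kind: str        # e.g. "file_write", "net_connect"
    target: str      # path or host
    size_bytes: int = 0

def evaluate(action: ActionDescriptor) -> Verdict:
    """Total and deterministic: every input maps to exactly ALLOW or BLOCK."""
    if action.kind == "file_write" and not action.target.startswith("/workspace/"):
        return Verdict.BLOCK
    if action.kind == "net_connect" and action.target != "api.example.com":
        return Verdict.BLOCK
    return Verdict.ALLOW
```

Note that `evaluate` has no fallthrough besides `ALLOW` and `BLOCK` and consults no external state, mirroring the binary-terminal requirement.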
Comparison Table
| Property | RLHF | Constitutional AI | Output Filters | Policy Circuits |
|---|---|---|---|---|
| Deterministic | No | No | No | Yes |
| Formally verified | No | No | No | Yes |
| Bypassable by adversarial input | Yes | Yes | Sometimes | No |
| Operates on actions (not outputs) | No | No | No | Yes |
| Hardware enforceable | No | No | No | Yes |
| Works at inference time | Yes | Yes | Yes | Yes |
| Requires model cooperation | Yes | Yes | No | No |
The Enforcement Gap
Alignment methods have an enforcement gap: the model must cooperate with its own constraints at inference time. This is the fundamental limitation. Policy circuits close the enforcement gap by operating on actions, not on the model’s internal state:

Action generation
The AI model generates an action — a file write, a network request,
a process spawn, a tool call. This generation is probabilistic and
subject to all the limitations of alignment.
Action descriptor serialization
The action is serialized into a structured descriptor before it reaches
the I/O subsystem. This is the interception point.
Policy circuit evaluation
The BPU evaluates the action descriptor against all loaded policy circuits.
This evaluation is deterministic, formally certified, and runs in hardware.
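The three steps above can be modeled as a single software pipeline. This is a rough sketch only: `policy_eval` stands in for the hardware BPU, and the workspace rule is a hypothetical example.

```python
import json

ALLOW, BLOCK = "ALLOW", "BLOCK"

def serialize_action(kind: str, **fields) -> str:
    """Interception point: the model's intent becomes a canonical descriptor
    before it reaches the I/O subsystem."""
    return json.dumps({"kind": kind, **fields}, sort_keys=True)

def policy_eval(descriptor: str) -> str:
    """Stand-in for the BPU: a deterministic check over the descriptor."""
    action = json.loads(descriptor)
    if action["kind"] == "file_write" and not action.get("path", "").startswith("/workspace/"):
        return BLOCK
    return ALLOW

def dispatch(kind: str, execute, **fields):
    """Only actions that pass the policy ever execute."""
    descriptor = serialize_action(kind, **fields)
    if policy_eval(descriptor) == ALLOW:
        return execute()
    raise PermissionError(f"blocked by policy: {descriptor}")
```

The key design point the sketch preserves: the probabilistic component (the model) never calls `execute` directly; every action flows through serialization and evaluation first.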
Practical Policy Categories
Capability constraints
Hard limits on what categories of action the AI can take regardless of
instruction. Examples: cannot access paths outside a defined workspace,
cannot make outbound network connections to non-allowlisted domains,
cannot spawn child processes above a resource threshold.
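The first two examples can be sketched as pure predicates. The workspace path and the allowlisted domain are hypothetical values; note the path normalization, which closes the classic `..` escape.

```python
import posixpath
from urllib.parse import urlparse

WORKSPACE = "/workspace"                  # hypothetical sandbox root
ALLOWED_DOMAINS = {"api.example.com"}     # hypothetical allowlist

def allow_path(path: str) -> bool:
    """File access must stay inside the workspace, even via '..' tricks."""
    norm = posixpath.normpath(path)
    return norm == WORKSPACE or norm.startswith(WORKSPACE + "/")

def allow_connect(url: str) -> bool:
    """Outbound connections only to allowlisted hosts."""
    return urlparse(url).hostname in ALLOWED_DOMAINS
```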
Rate and volume limits
Policies that enforce operational limits: maximum file size per write,
maximum number of API calls per minute, maximum memory allocation per
request. These prevent accidental resource exhaustion as much as
intentional misuse.
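A sliding-window limiter is one way such a rate policy could be modeled; the limits below are arbitrary examples, and a certified circuit would implement the same logic in bounded hardware state.

```python
from collections import deque

class RateLimit:
    """Allow at most `max_calls` within any `window`-second span."""
    def __init__(self, max_calls: int, window: float):
        self.max_calls = max_calls
        self.window = window
        self.calls: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True
```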
Audit and attribution
Policies that enforce logging requirements: every action that matches
certain criteria must produce a structured audit record before it is
allowed to proceed. The audit record is written by the policy circuit,
not by the model.
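A minimal sketch of that ordering, with the record emitted by the policy layer before any verdict is returned (the in-memory list stands in for an append-only log the model has no handle to; the single rule shown is a placeholder):

```python
import json
import time

AUDIT_LOG: list[str] = []  # stand-in for an append-only log outside the model

def audited_verdict(action: dict) -> str:
    """Write the audit record first; only then compute and return the verdict.
    The model never touches AUDIT_LOG."""
    record = {"ts": time.time(), "action": action}
    AUDIT_LOG.append(json.dumps(record, sort_keys=True))
    return "ALLOW" if action.get("kind") == "file_read" else "BLOCK"
```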
Data classification enforcement
Policies that inspect data being written or transmitted and enforce
classification rules: no PII outside encrypted storage, no secrets
in log output, no internal data in external requests.
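A toy version of such a rule, using two pattern checks as placeholders for a certified classifier (the patterns and the `encrypted://` destination scheme are illustrative, not part of any real specification):

```python
import re

# Hypothetical classifier patterns; real deployments would certify these rules.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),       # AWS-style access-key shape
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN shape
]

def allow_write(data: str, destination: str) -> bool:
    """Block classified data unless the destination is encrypted storage."""
    sensitive = any(p.search(data) for p in SECRET_PATTERNS)
    if sensitive and not destination.startswith("encrypted://"):
        return False
    return True
```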
The Φ_c = 1 Property
Thermodynamic Coherence (Φ_c = 1) is the formal foundation of Policy Circuits. The analogy to thermodynamics is intentional:

- In thermodynamics, a closed system conserves energy — nothing leaks out
- In Digital Circuitality, a closed circuit conserves decision — every input maps to exactly one output with no ambiguity
| Property | Meaning |
|---|---|
| Total | Defined for all inputs in the domain |
| Deterministic | Same input always produces same output |
| Terminating | Evaluation always completes in bounded time |
| Binary terminal | Output is always ALLOW or BLOCK |
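For a circuit over a finite input domain, all four properties can be checked exhaustively. The three-input toy circuit below is invented for illustration; the point is the shape of the check, not the rule itself.

```python
from itertools import product

def circuit(a: bool, b: bool, c: bool) -> str:
    """Toy 3-input policy circuit: pure combinational logic, no state."""
    return "ALLOW" if (a and b and not c) else "BLOCK"

def check_phi_c() -> bool:
    # Visiting every input demonstrates totality; the loop finishing
    # demonstrates termination on this finite domain.
    for inputs in product([False, True], repeat=3):
        out = circuit(*inputs)
        if out not in {"ALLOW", "BLOCK"}:   # binary terminal
            return False
        if out != circuit(*inputs):         # deterministic
            return False
    return True
```

For real PCD programs the domain is too large to enumerate, which is why Φ_c = 1 is established by certification rather than testing; the exhaustive check above is the finite-domain intuition behind that guarantee.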
Regulatory Context
The AI safety landscape is moving toward mandatory enforcement mechanisms. Policy circuits are positioned to be the technical foundation of that mandate.

| Horizon | Likely Requirement | Policy Circuit Role |
|---|---|---|
| 1–2 years | Voluntary disclosure of safety measures | Software Phase 1 provides auditable artifacts |
| 3–5 years | Certification of AI systems above capability thresholds | TCE certification as recognized safety standard |
| 5–10 years | Mandatory hardware enforcement for critical applications | BPU as the reference implementation |
Integration with Alignment
Policy circuits do not replace alignment research. The recommended deployment is a defense-in-depth stack:

- Model training — RLHF and CAI to produce a model that wants to operate within its intended boundaries most of the time
- Inference-time guidance — System prompts, tool descriptions, and context that reinforce intended behavior at runtime
- Output classification — Soft filters that flag suspicious outputs for human review
- Policy circuits — Hard enforcement at the action layer for the constraints that must never be violated regardless of the model’s internal state