
AI Safety — Policy Circuits vs. Alignment

Modern AI safety research focuses heavily on alignment: training models to want to do the right thing. BRIK-64 takes a complementary approach — ensuring that even a misaligned model cannot do the wrong thing.
“An AI doesn’t need a better language. It needs a language where incorrect programs cannot compile.”

The Two Approaches

Alignment (Probabilistic)

RLHF, Constitutional AI, RLAIF — teach the model to prefer safe outputs through training. Effective for the vast majority of interactions. Bypassed by adversarial inputs, distribution shift, or capability jumps.

Policy Circuits (Deterministic)

PCD programs compiled and certified with Φ_c = 1 — enforce boundaries at the hardware level. Cannot be bypassed by the model. Not probabilistic. Complement to alignment, not a replacement.
These approaches are not in competition. RLHF teaches the AI to want to do right. The BPU prevents it from doing wrong even if it wants to. Use both.

Why Probabilistic Is Not Enough

Alignment methods are probabilistic by nature. This is not a criticism — it is a fundamental property of gradient-based optimization.
Reinforcement Learning from Human Feedback trains the model to maximize a reward signal derived from human preferences. This works well in distribution but fails in several known scenarios:
  • Reward hacking: the model finds behaviors that maximize reward without satisfying the intended constraint
  • Distribution shift: the model encounters inputs outside the training distribution where the alignment signal was not provided
  • Capability overhang: as models become more capable, they become better at finding adversarial routes around constraints
  • Jailbreaks: carefully crafted prompts that route around the fine-tuning signal
None of these are bugs in RLHF. They are consequences of the probabilistic nature of the approach.
Constitutional AI (CAI) asks the model to critique and revise its own outputs according to a set of principles. This is powerful for reducing obvious harmful outputs. The limitation: self-critique runs in the same inference context as the original output. A model that has found a way to produce harmful output has also, by definition, found a way to produce a self-critique that approves of that output. The constitution is enforced by the model. The model being constrained is also the enforcer of the constraint.
Post-hoc filters classify outputs as safe or unsafe and block unsafe ones. These are more reliable than self-critique because they are separate models. However, they still run in software, can be adversarially evaded, and operate on the output rather than the action. By the time a filter sees a harmful output, the action it describes may already be in flight.

The Policy Circuit Model

A Policy Circuit is a PCD program that:
  1. Takes an action descriptor as input (what the AI wants to do)
  2. Evaluates the action against a set of certified constraints
  3. Returns exactly ALLOW or BLOCK — nothing else
The key property is Φ_c = 1: Thermodynamic Coherence. This means every possible input maps to exactly one of the two terminal states. There is no undefined behavior, no exception path, no edge case that produces a third result.
For example, a minimal circuit that blocks requests to .onion destinations and HTTP DELETE methods:
policy check_network_request(url: string, method: string) -> decision {
  // MC_40.FIND returns the offset of the pattern, or -1 if it is absent
  let blocked_tld = MC_40.FIND(url, ".onion");
  let blocked_method = MC_40.FIND(method, "DELETE");

  if (blocked_tld != -1) {
    return BLOCK;   // refuse .onion destinations
  }
  if (blocked_method != -1) {
    return BLOCK;   // refuse DELETE requests
  }
  return ALLOW;     // every other input reaches exactly one terminal: ALLOW
}
Policy circuits are formally verified by the TCE before deployment. The verification is not a test suite — it is a mathematical proof that Φ_c = 1 holds for the circuit. If the proof fails, the circuit does not compile.

Comparison Table

| Property | RLHF | Constitutional AI | Output Filters | Policy Circuits |
|---|---|---|---|---|
| Deterministic | No | No | No | Yes |
| Formally verified | No | No | No | Yes |
| Bypassable by adversarial input | Yes | Yes | Sometimes | No |
| Operates on actions (not outputs) | No | No | No | Yes |
| Hardware enforceable | No | No | No | Yes |
| Works at inference time | Yes | Yes | Yes | Yes |
| Requires model cooperation | Yes | Yes | No | No |

The Enforcement Gap

Alignment methods have an enforcement gap: the model must cooperate with its own constraints at inference time. This is the fundamental limitation. Policy circuits close the enforcement gap by operating on actions, not on the model’s internal state:
  1. Action generation: the AI model generates an action — a file write, a network request, a process spawn, a tool call. This generation is probabilistic and subject to all the limitations of alignment.
  2. Action descriptor serialization: the action is serialized into a structured descriptor before it reaches the I/O subsystem. This is the interception point.
  3. Policy circuit evaluation: the BPU evaluates the action descriptor against all loaded policy circuits. This evaluation is deterministic, formally certified, and runs in hardware.
  4. ALLOW or BLOCK: if all circuits return ALLOW, the action proceeds. If any circuit returns BLOCK, the PCIe transaction is terminated. The model never learns the outcome — it does not receive an exception or error signal.
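The four steps above can be modeled in a few lines of plain Python. This is an illustrative sketch, not the PCD language or the BPU datapath; `ActionDescriptor`, `evaluate`, and the circuit list are names invented for the example:

```python
from dataclasses import dataclass

ALLOW, BLOCK = "ALLOW", "BLOCK"

# Hypothetical structured descriptor: what the model is trying to do,
# serialized before it reaches the I/O subsystem (step 2).
@dataclass(frozen=True)
class ActionDescriptor:
    kind: str    # e.g. "network_request"
    target: str  # e.g. a URL or file path
    method: str  # e.g. an HTTP verb

def check_network_request(action: ActionDescriptor) -> str:
    """Software model of the example circuit: block .onion and DELETE."""
    if ".onion" in action.target:
        return BLOCK
    if "DELETE" in action.method:
        return BLOCK
    return ALLOW

def evaluate(action: ActionDescriptor, circuits) -> str:
    # Step 4: the action proceeds only if every loaded circuit says ALLOW;
    # a single BLOCK terminates the transaction.
    return ALLOW if all(c(action) == ALLOW for c in circuits) else BLOCK

circuits = [check_network_request]
ok = ActionDescriptor("network_request", "https://example.com", "GET")
bad = ActionDescriptor("network_request", "https://hidden.onion", "GET")
```

Here `evaluate(ok, circuits)` yields "ALLOW" and `evaluate(bad, circuits)` yields "BLOCK"; adding more circuits to the list can only make the composite stricter.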

Practical Policy Categories

  • Capability boundaries: hard limits on what categories of action the AI can take regardless of instruction. Examples: cannot access paths outside a defined workspace, cannot make outbound network connections to non-allowlisted domains, cannot spawn child processes above a resource threshold.
  • Resource limits: policies that enforce operational limits, such as maximum file size per write, maximum number of API calls per minute, or maximum memory allocation per request. These prevent accidental resource exhaustion as much as intentional misuse.
  • Audit requirements: policies that enforce logging requirements. Every action that matches certain criteria must produce a structured audit record before it is allowed to proceed. The audit record is written by the policy circuit, not by the model.
  • Data classification: policies that inspect data being written or transmitted and enforce classification rules, such as no PII outside encrypted storage, no secrets in log output, and no internal data in external requests.
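A minimal sketch of the first category, a workspace path boundary, modeled in plain Python rather than PCD (the workspace root and function name are invented for illustration; a real circuit would also have to account for symlinks and path encodings):

```python
import posixpath

# Hypothetical workspace root, invented for this example.
WORKSPACE = "/srv/agent-workspace"

def check_file_write(path_str: str) -> str:
    """Return "ALLOW" only if the path stays inside the workspace."""
    # normpath collapses "." and ".." textually, so a traversal attempt
    # like "/srv/agent-workspace/../etc/passwd" normalizes to
    # "/srv/etc/passwd" before the prefix check runs.
    norm = posixpath.normpath(path_str)
    if norm == WORKSPACE or norm.startswith(WORKSPACE + "/"):
        return "ALLOW"
    return "BLOCK"
```

Note that the check runs on the normalized form, so the decision cannot be steered by how the path is written, only by where it actually points.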

The Φ_c = 1 Property

Thermodynamic Coherence (Φ_c = 1) is the formal foundation of Policy Circuits. The analogy to thermodynamics is intentional:
  • In thermodynamics, a closed system conserves energy — nothing leaks out
  • In Digital Circuitality, a closed circuit conserves decision — every input maps to exactly one output with no ambiguity
A circuit with Φ_c = 1 has the following properties:
| Property | Meaning |
|---|---|
| Total | Defined for all inputs in the domain |
| Deterministic | Same input always produces the same output |
| Terminating | Evaluation always completes in bounded time |
| Binary terminal | Output is always ALLOW or BLOCK |
The TCE verifies all four properties using the Coq proof assistant before a policy circuit is cleared for deployment.
A policy circuit that does not satisfy Φ_c = 1 will not compile. The compiler reports which property failed and at which node in the circuit. This is not a runtime error — it is a compile-time proof failure.
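The actual verification is a Coq proof, but the four properties can be illustrated with an exhaustive check over a small finite input domain. This toy checker is a stand-in for intuition, not the TCE:

```python
from itertools import product

ALLOW, BLOCK = "ALLOW", "BLOCK"

def check(url: str, method: str) -> str:
    """Toy circuit mirroring the earlier network-request example."""
    if ".onion" in url:
        return BLOCK
    if "DELETE" in method:
        return BLOCK
    return ALLOW

def phi_c_holds(circuit, domain) -> bool:
    """Exhaustively test the four properties over a finite domain."""
    for inputs in domain:
        try:
            first = circuit(*inputs)   # total + terminating: must not raise
            second = circuit(*inputs)  # deterministic: must repeat exactly
        except Exception:
            return False
        if first != second or first not in (ALLOW, BLOCK):
            return False               # binary terminal: only two outcomes
    return True

# A small finite domain standing in for "all inputs"; the real TCE proves
# the properties for the full domain rather than sampling it.
domain = list(product(["https://a.com", "https://b.onion", ""],
                      ["GET", "POST", "DELETE"]))
```

A circuit that can return a third value, such as `lambda url, method: "MAYBE"`, fails the binary-terminal check immediately.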

Regulatory Context

The AI safety landscape is moving toward mandatory enforcement mechanisms. Policy circuits are positioned to be the technical foundation of that mandate.
| Horizon | Likely Requirement | Policy Circuit Role |
|---|---|---|
| 1–2 years | Voluntary disclosure of safety measures | Software Phase 1 provides auditable artifacts |
| 3–5 years | Certification of AI systems above capability thresholds | TCE certification as recognized safety standard |
| 5–10 years | Mandatory hardware enforcement for critical applications | BPU as the reference implementation |
The analogy to automotive safety is instructive: ABS was optional in 1978, a standard feature by the late 1990s, and mandatory in the EU from 2004. The BPU is being built now, in the voluntary phase, so that the certified implementation exists when the mandate arrives.

Integration with Alignment

Policy circuits do not replace alignment research. The recommended deployment is a defense-in-depth stack:
  1. Model training — RLHF and CAI to produce a model that wants to operate within its intended boundaries most of the time
  2. Inference-time guidance — System prompts, tool descriptions, and context that reinforce intended behavior at runtime
  3. Output classification — Soft filters that flag suspicious outputs for human review
  4. Policy circuits — Hard enforcement at the action layer for the constraints that must never be violated regardless of the model’s internal state
Layer 4 is what BRIK-64 provides. Layers 1–3 are the existing alignment research ecosystem, and they remain valuable.
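The division of labor between the soft and hard layers can be sketched in plain Python. All names here are invented for the sketch: the soft filter is a toy heuristic standing in for a learned classifier (layer 3), and the hard policy stands in for a certified circuit (layer 4):

```python
def soft_filter(output: str) -> bool:
    """Layer 3: advisory and evadable; flags suspicious text for review."""
    return "rm -rf" not in output  # toy heuristic, not a real classifier

def hard_policy(action: str) -> str:
    """Layer 4: deterministic allow/block on the action itself."""
    return "BLOCK" if action.startswith("shell:") else "ALLOW"

def dispatch(output: str, action: str) -> str:
    verdict = hard_policy(action)    # the hard layer always runs and is final
    if verdict == "BLOCK":
        return "BLOCK"
    if not soft_filter(output):
        return "flagged-for-review"  # the soft layer only escalates to humans
    return "ALLOW"
```

The ordering is the point: the hard verdict is computed unconditionally, so nothing the model writes into its output can route around it, while the soft layer contributes review signals for the cases the hard constraints do not cover.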

Further Reading