ALICE
ALICE Legitimizes Instructions in Computational Environments
A principled, layered framework for constructing safety-reviewing agents that audit other agents' operations before execution.
What Is This?
ALICE is a meta-skill — a template for generating safety-reviewer agents. Rather than enumerating dangerous operations (an inherently incomplete task), ALICE defines what harm means at the ontological level, then provides a layered architecture for translating that definition into a working reviewer agent tailored to specific deployment contexts.
The generated reviewer (Alice) inspects every operation requested by a task-executing agent (Bob) and produces one of three responses: Approve, Reject, or Escalate. Each response has precisely defined semantics grounded in the framework's axiomatic layer.
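The review loop described above can be sketched as a small gate placed in front of Bob's executor. This is a minimal illustration, not the framework's actual interface: the names `Operation`, `Verdict`, `Reviewer`, and `gated_execute` are hypothetical, and the only behavior assumed here is that nothing runs without an Approve.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Response(Enum):
    """The three-response vocabulary named in the text."""
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Operation:
    """One action Bob asks to perform (hypothetical shape)."""
    kind: str     # e.g. "shell", "api_call", "db_query"
    payload: str  # the command, request, or query text


@dataclass(frozen=True)
class Verdict:
    """Alice's answer for one operation, with her stated reason."""
    response: Response
    rationale: str


# A reviewer is any callable that maps an operation to a verdict.
Reviewer = Callable[[Operation], Verdict]


def gated_execute(
    op: Operation,
    review: Reviewer,
    execute: Callable[[Operation], str],
) -> Tuple[Verdict, Optional[str]]:
    """Run `op` only if the reviewer approves it.

    Assumption: handling of Reject and Escalate is left to the caller;
    this sketch only guarantees that unapproved operations never run.
    """
    verdict = review(op)
    if verdict.response is Response.APPROVE:
        return verdict, execute(op)
    return verdict, None
```

Returning the verdict alongside the result keeps the rationale available to whatever component handles rejected or escalated operations.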
Motivation
The current landscape of LLM agent safety is dominated by three paradigms:
- Enumeration-based guardrails: Systems like NeMo Guardrails and Guardrails AI define safety through lists of prohibited patterns, content filters, and keyword matching. These are effective against known threats but structurally unable to generalize to novel failure modes.
- Constitution-based frameworks: TrustAgent (Hua et al., EMNLP 2024) embeds a fixed set of safety principles and applies them across planning stages. This improves generalization but couples the safety definition to a single enforcement strategy.
- Information-flow control: Fides (Costa & Köpf, 2025) and related work track data provenance through confidentiality and integrity labels, providing deterministic guarantees against prompt injection. These offer formal rigor but focus narrowly on data-flow properties.
ALICE takes a different approach: separate the definition of harm from the strategy for detecting and responding to it. The framework's four-layer architecture ensures that the ontological foundation (what constitutes damage to an environment) remains stable and reusable, while the detection strategy, capability profile, and task-specific parameters can vary independently.
Architecture
┌─────────────────────────────────────────────┐
│  Layer 1: Foundations (immutable)           │
│  - Participants: Alice, Bob, Human, Env     │
│  - Harm ontology: 4 dimensions              │
│  - Response vocabulary: Approve/Reject/Esc. │
├─────────────────────────────────────────────┤
│  Layer 2: Security Requirements             │
│  - User's stance on 4 trade-off axes        │
│  - Declarative, not procedural              │
├─────────────────────────────────────────────┤
│  Layer 3: Strategy Framework                │
│  - Decision surfaces driven by Layer 2      │
│  - Capability boundaries, judgment flow     │
│  - Degradation behavior                     │
├─────────────────────────────────────────────┤
│  Layer 4: Task Context (runtime)            │
│  - Task declaration, sensitivity defs       │
│  - Boundary adjustments with constraints    │
└─────────────────────────────────────────────┘
Layer 1 is axiomatic — it defines harm through four orthogonal dimensions (Irreversibility, Blast Radius, Information Flow, Authorization Scope) and establishes the semantic contract of each response type. Every generated Alice instance embeds this layer verbatim.
Layer 2 captures user requirements as positions on four fundamental trade-off axes rather than configuration parameters: Safety vs. Throughput, Autonomy vs. Human Control, Transparency vs. Simplicity, Isolation vs. Collaboration.
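One way to picture a declarative Layer 2 stance is as an immutable record with one field per axis. The numeric encoding below is an assumption made for illustration (0.0 leans fully toward the first-named pole, 1.0 toward the second); the framework itself may express positions differently, and `TradeoffStance` is a hypothetical name.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TradeoffStance:
    """A user's position on the four Layer 2 axes.

    Assumed encoding: each value lies in [0.0, 1.0], where 0.0 leans fully
    toward the first-named pole and 1.0 toward the second.
    """
    safety_vs_throughput: float
    autonomy_vs_human_control: float
    transparency_vs_simplicity: float
    isolation_vs_collaboration: float

    def __post_init__(self) -> None:
        # Validate every axis so a malformed stance fails at construction.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")


# Illustrative stance: safety-leaning, with strong human control.
cautious = TradeoffStance(
    safety_vs_throughput=0.2,
    autonomy_vs_human_control=0.8,
    transparency_vs_simplicity=0.5,
    isolation_vs_collaboration=0.1,
)
```

The record states where the user stands and nothing about how a reviewer should act on it, which is the sense in which it is declarative rather than procedural.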
Layer 3 specifies which decision surfaces a concrete Alice implementation must cover, and how each surface is driven by Layer 2's trade-off positions.
Layer 4 binds runtime context: task type, sensitivity definitions, and boundary adjustments — constrained to never contradict Layer 1.
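The constraint that Layer 4 never contradicts Layer 1 could be enforced mechanically. The sketch below takes the simplest possible reading, which is an assumption: Layer 1 is a read-only mapping, and a runtime adjustment is rejected if it tries to redefine any Layer 1 key. The key names and the `bind_task_context` helper are hypothetical; the Layer 1 values are taken from the architecture diagram.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping

# Layer 1 content as listed in the architecture diagram, exposed read-only.
LAYER1: Mapping[str, Any] = MappingProxyType({
    "participants": ("Alice", "Bob", "Human", "Env"),
    "harm_dimensions": (
        "Irreversibility",
        "Blast Radius",
        "Information Flow",
        "Authorization Scope",
    ),
    "responses": ("Approve", "Reject", "Escalate"),
})


def bind_task_context(adjustments: Dict[str, Any]) -> Dict[str, Any]:
    """Merge runtime (Layer 4) settings on top of Layer 1.

    Assumed rule: an adjustment may add task-specific keys but may not
    redefine anything Layer 1 already fixes.
    """
    conflicts = sorted(set(adjustments) & set(LAYER1))
    if conflicts:
        raise ValueError(f"Layer 4 may not override Layer 1 keys: {conflicts}")
    return {**LAYER1, **adjustments}


# Illustrative runtime binding with hypothetical task parameters.
context = bind_task_context({
    "task_type": "db-migration",
    "sensitive_paths": ("/etc", "/var/lib/postgresql"),
})
```

A real implementation would likely need a richer notion of contradiction than key collision, for example checking that an adjusted boundary does not weaken a Layer 1 guarantee.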
Design Insights
This section documents key design decisions and the reasoning behind them.
Harm as Ontology, Not Policy
Most safety frameworks define harm procedurally: "if the operation matches pattern X, block it." ALICE defines harm as a property of the operation's effect on the environment, described along four continuous dimensions. This distinction matters because it makes the framework environment-agnostic — the same harm definition applies whether Bob is executing shell commands, API calls, or database queries.
The four dimensions (Irreversibility, Blast Radius, Information Flow, Authorization Scope) were selected for orthogonality: each captures a distinct aspect of damage that can vary independently. A single operation might score high on only one dimension (e.g., leaking a credential is purely an Information Flow concern) or multiple.
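To make the four dimensions concrete, an operation's effect can be modeled as a point in a four-dimensional space. The sketch below assumes each dimension is scored in [0.0, 1.0]; the scale, the example scores, the 0.5 threshold, and the `HarmProfile` name are illustrative assumptions rather than part of the framework.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class HarmProfile:
    """An operation's effect on the environment along the four dimensions.

    Assumed scale: 0.0 (no concern) to 1.0 (maximal concern) per dimension.
    """
    irreversibility: float
    blast_radius: float
    information_flow: float
    authorization_scope: float

    def elevated(self, threshold: float = 0.5) -> List[str]:
        """Return the names of dimensions scoring at or above `threshold`."""
        return [name for name, score in asdict(self).items() if score >= threshold]


# The credential-leak example from the text: only Information Flow is high.
credential_leak = HarmProfile(
    irreversibility=0.0,
    blast_radius=0.0,
    information_flow=0.9,
    authorization_scope=0.0,
)

# A hypothetical operation that scores high on several dimensions at once.
drop_shared_table = HarmProfile(
    irreversibility=0.9,
    blast_radius=0.8,
    information_flow=0.0,
    authorization_scope=0.6,
)

print(credential_leak.elevated())    # ['information_flow']
print(drop_shared_table.elevated())  # ['irreversibility', 'blast_radius', 'authorization_scope']
```

Because the dimensions vary independently, the profile is kept as four separate scores rather than collapsed into a single risk number.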
The Three-Response Model
The three responses are not a severity gradient. They serve fundamentally different purposes: