
Alignment and Integrity Infrastructure for Autonomous Agents

Transparency, Real-Time Oversight, and Cryptographic Proof for Agent Behavior

Version: 3.0 Date: February 2026 Authors: Mnemom Research License: CC BY 4.0

Abstract

We present three complementary layers of infrastructure that together make autonomous agent behavior observable, auditable, and cryptographically verifiable: the Agent Alignment Protocol (AAP), the Agent Integrity Protocol (AIP), and a cryptographic attestation architecture that provides independently verifiable proof of oversight integrity. As AI agents transition from assistants to autonomous actors — negotiating, transacting, and coordinating on behalf of humans — the question of alignment becomes infrastructure-critical. Current agent coordination protocols (A2A, MCP, AP2) address capability discovery, tool integration, and payment authorization, but none address the fundamental question: Is this agent serving its principal’s interests? AAP provides transparency: Alignment Cards declare an agent’s alignment posture, AP-Traces create auditable records of decisions, and Value Coherence Handshakes verify compatibility before multi-agent coordination. AIP provides integrity: thinking blocks are analyzed between turns, before actions execute, producing Integrity Checkpoints with structured verdicts that enable side-channel intervention. The cryptographic attestation layer provides proof: Ed25519 signatures, SHA-256 hash chains, Merkle accumulator trees, and zero-knowledge proofs via SP1 zkVM that make integrity verdicts independently verifiable without requiring trust in the attestation service. The three layers share a single Alignment Card, a single session identity, and a single drift detection framework. AAP answers what did this agent do? AIP answers what is this agent thinking? The proof layer answers can we verify the oversight itself? Together they close the loop from observation through analysis to cryptographic evidence — defense in depth on a shared behavioral contract. The critical contribution of the proof layer is the distinction between proving model inference and proving auditor judgment. 
Rather than attempting to prove that an LLM was called correctly (prohibitively expensive at current technology levels), we prove that the deterministic verdict derivation rules were applied honestly to the LLM’s analysis output. This makes the system practical: ~$0.001 per proof, sub-second proving time (GPU-accelerated), under 100ms verification — while providing a meaningful cryptographic guarantee that the oversight process was not fabricated, modified, or selectively applied. AAP, AIP, and the proof layer are transparency and verification infrastructure, not trust infrastructure. They make agent behavior observable, they make oversight integrity verifiable, and they are honest about what they cannot prove. This distinction is fundamental.

1. Introduction

1.1 The Problem

For decades, competitive advantage in digital commerce derived from information asymmetry. Platforms accumulated knowledge about users — search history, purchase patterns, social connections — and leveraged this asymmetry to extract value. This paradigm is collapsing. When AI agents act on behalf of humans, they bring equal reasoning power, equal tool access, and equal information retrieval capabilities. An AI agent shopping for a consumer has the same access to product databases, reviews, and pricing data as the merchant’s AI. The asymmetry that powered trillion-dollar advertising businesses — Google ($175B), Meta ($116B), Amazon ($56B) — evaporates when agents negotiate with agents. When capability is symmetric, alignment becomes the differentiator. The market recognizes this gap. McKinsey’s 2025 survey identifies “lack of governance and risk-management tools” as the #1 barrier to AI adoption. PwC reports 28% of executives cite “lack of trust in AI agents” as their top challenge. Bain finds 50% of consumers remain cautious of fully autonomous purchasing.

1.2 The Gap in the Current Stack

The industry has built protocols for agent coordination:
| Protocol | Function | Limitation |
|---|---|---|
| MCP (Model Context Protocol) | Agent-to-tool connectivity | No alignment semantics |
| A2A (Agent-to-Agent Protocol) | Task negotiation | No value verification |
| AP2 (Agent Payments Protocol) | Payment authorization | No decision audit trail |
None of these answer:
  • Is this agent serving its principal (the human who delegated authority)?
  • What alternatives did it consider before this decision?
  • When will it escalate versus act autonomously?
  • Are its values compatible with my agent’s values for this coordination?
  • What is it thinking before it acts?
  • Can we prove the oversight was conducted honestly?
The last two questions reveal successively deeper gaps. Even a perfect post-hoc audit trail cannot prevent harm that occurs before the log is processed. And even real-time integrity analysis introduces a new trust dependency — the integrity checker itself becomes a single point of trust that could fabricate verdicts, modify history, or fail silently. Comprehensive agent governance requires retroactive accountability, real-time intervention, and cryptographic proof that the oversight process itself was honestly conducted.

1.3 Design Principles

The infrastructure shares six design principles:
  1. Transparency over guarantee: Make decisions observable, not provably correct
  2. Composability: Extend existing protocols rather than replace them
  3. Honest limitations: Be explicit about what cannot be provided
  4. Empirical calibration: Derive thresholds from observed behavior, not theory
  5. Defense in depth: Multiple layers of a multi-layer oversight system
  6. Daimonion philosophy: Silence means aligned; voice means outside boundaries. The integrity system never commands, only warns.
The proof layer adds a seventh principle:
  7. Trust the math, not the service: Verification endpoints are public, certificates are self-describing, and every cryptographic check can be reproduced independently. Security relies on mathematical guarantees, not access control.

1.4 Three Layers, One Alignment Card

AAP, AIP, and the cryptographic attestation layer are designed as three complementary layers — distinct in timescale and function, unified by a shared behavioral contract:
| Dimension | AAP | AIP | Proof |
|---|---|---|---|
| Timescale | Post-hoc (seconds to minutes) | Real-time (between turns) | Asynchronous (seconds to minutes) |
| Input | AP-Traces (logged decisions) | Thinking blocks (internal reasoning) | Integrity Checkpoints (AIP output) |
| Output | Verification Results, Drift Alerts | Integrity Checkpoints, Signals | Certificates, STARK Proofs |
| Question | What did this agent do? | What is this agent thinking? | Can we prove the oversight was honest? |
| Guarantee | Behavioral consistency audit | Intent analysis before action | Cryptographic verification of verdict derivation |
| Shared primitive | Alignment Card | Alignment Card | Alignment Card (hashed into commitments) |
The Alignment Card is the keystone. All three layers evaluate against the same declared alignment posture. When AAP detects a verification failure in a trace, when AIP detects a boundary violation in a thinking block, or when the proof layer binds a verdict to cryptographic evidence, they reference the same card, the same values, the same autonomy envelope. This is not three systems running in parallel — it is one behavioral contract enforced at three timescales with three classes of guarantee.

2. Shared Primitives

2.1 Alignment Card

An Alignment Card is a structured declaration of an agent’s alignment posture. It answers: What does this agent claim about itself?
+------------------------------------------------------------------+
|                        ALIGNMENT CARD                             |
+------------------+-----------------------------------------------+
| Identity         | agent_id, card_id, aap_version                |
|                  | issued_at, expires_at                         |
+------------------+-----------------------------------------------+
| Principal        | Who the agent serves                           |
| Relationship     | delegated_authority | advisory | autonomous    |
+------------------+-----------------------------------------------+
| Values           | declared: [principal_benefit, transparency]   |
|                  | conflicts_with: [deceptive_marketing]         |
+------------------+-----------------------------------------------+
| Autonomy         | bounded_actions: [search, compare, recommend] |
| Envelope         | escalation_triggers: [purchase > $100]        |
|                  | forbidden_actions: [store_credentials]        |
+------------------+-----------------------------------------------+
| Audit            | trace_format: ap-trace-v1                     |
| Commitment       | retention: 90 days | queryable: true           |
+------------------+-----------------------------------------------+
Key properties:
  • Machine-readable: JSON schema with full validation
  • Extensible: Protocol-specific extensions via extensions block
  • Expirable: Cards may have expires_at timestamps requiring renewal
  • Versionable: aap_version field enables protocol evolution
  • Discoverable: Served at /.well-known/alignment-card.json
  • Revocable: Revocation list at /.well-known/alignment-card-revocations.json
The Alignment Card extends the A2A Agent Card with an alignment block, enabling adoption without abandoning existing infrastructure. All three layers evaluate against the same active Alignment Card. When the card is rotated, all layers reference the new card. When the card expires, all layers cease evaluation until a new card is issued. The card_id field links every AP-Trace entry, every Integrity Checkpoint, and every Integrity Certificate back to the specific card in effect. In the proof layer, the Alignment Card is cryptographically committed — its SHA-256 hash is bound into every certificate, ensuring that the card cannot be swapped after the fact.

2.2 Principal and Delegation

The principal block declares who the agent serves and the nature of that service:
  • principal.type: human, organization, agent, or unspecified. When type is agent, delegation chains are formed — Agent A delegates to Agent B, both maintaining cards.
  • principal.relationship: delegated_authority (agent acts within bounds set by principal), advisory (agent recommends, human decides), or autonomous (agent operates within declared values without ongoing principal direction).
  • principal.escalation_contact: An endpoint (mailto:, HTTPS webhook) for escalation notifications, ensuring the principal remains reachable.
The principal block is foundational to all three layers. AAP uses it to evaluate whether escalation decisions were correctly routed. AIP uses the relationship type to calibrate conscience prompt sensitivity — an autonomous agent warrants closer real-time scrutiny than an advisory one. The proof layer binds the principal relationship into the input commitment, ensuring that the sensitivity level applied during analysis is cryptographically attested.

2.3 Autonomy Envelope

The autonomy envelope defines the behavioral boundaries within which the agent may operate:
  • bounded_actions: The set of actions the agent may take without escalation. Anything outside this set is an autonomy violation.
  • forbidden_actions: Actions the agent must never take, regardless of context.
  • escalation_triggers: Conditions that require escalation to the principal, with associated actions (escalate, deny, log). Conditions are expressed in a minimal expression language (e.g., purchase_value > 100, shares_personal_data).
  • max_autonomous_value: Currency-denominated ceiling on autonomous financial action.
AAP verifies autonomy compliance post-hoc by checking whether logged actions fall within bounds. AIP verifies autonomy compliance in real-time by checking whether thinking blocks reveal intent to act outside bounds — catching violations before they happen.

3. Agent Alignment Protocol (AAP) — Post-Hoc Verification

For the full AAP specification, see AAP Specification.

3.1 AP-Trace

An AP-Trace is an audit log entry recording a single agent decision. It answers: What did this agent actually consider and do?
+------------------------------------------------------------------+
|                          AP-TRACE                                 |
+------------------+-----------------------------------------------+
| Action           | type: tool_call | name: recommend_product      |
|                  | category: bounded | target: product_db         |
+------------------+-----------------------------------------------+
| Decision         | alternatives_considered:                       |
|                  |   - Product A: score 0.85 (best match)         |
|                  |   - Product C: score 0.68 (sponsored, flagged) |
|                  | selected: Product A                            |
|                  | values_applied: [principal_benefit]            |
|                  | confidence: 0.91                               |
+------------------+-----------------------------------------------+
| Escalation       | triggers_checked: [purchase > $100]            |
|                  | required: false (recommendation, not purchase) |
+------------------+-----------------------------------------------+
| Context          | session_id: sess-xxx | parent_trace: null       |
+------------------+-----------------------------------------------+
Critical design choice: AP-Traces are sampled, not complete. They capture decision points agents choose to log, not every computation. This is an intentional limitation — comprehensive logging is neither practical nor necessarily desirable. Section 11.3 addresses the implications. The decision block is where AAP’s transparency value concentrates. By recording alternatives considered, selection reasoning, and values applied, the trace makes the agent’s decision process auditable. A reviewer can ask: “Did this agent consider options that would have better served the principal?” and find a structured answer.

3.2 Verification Architecture

AAP provides three verification functions:
| Function | Input | Output | Purpose |
|---|---|---|---|
| verify_trace | AP-Trace, Alignment Card | VerificationResult | Check trace consistency with card |
| check_coherence | Card A, Card B | CoherenceResult | Assess multi-agent compatibility |
| detect_drift | Card, Trace sequence | DriftAlert[] | Identify behavioral divergence over time |
Trace verification performs six structural checks and one semantic check:
  1. Card reference: Does trace.card_id match card.card_id?
  2. Card expiration: Is the card still valid at trace timestamp?
  3. Autonomy compliance: Is action.name in bounded_actions?
  4. Forbidden compliance: Is action.name NOT in forbidden_actions?
  5. Escalation compliance: If trigger matched, was escalation.required = true?
  6. Value consistency: Are values_applied a subset of declared values?
  7. Behavioral similarity: How similar is trace reasoning to declared value semantics?
Violations are typed and severity-ranked:
| Violation Type | Severity | Description |
|---|---|---|
| FORBIDDEN_ACTION | CRITICAL | Action explicitly prohibited |
| UNBOUNDED_ACTION | HIGH | Action outside declared autonomy |
| MISSED_ESCALATION | HIGH | Escalation trigger fired but not escalated |
| UNDECLARED_VALUE | MEDIUM | Applied value not declared |
| CARD_EXPIRED | MEDIUM | Alignment Card expired |
| CARD_MISMATCH | LOW | Trace references different card |
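As an illustrative sketch (field layout simplified, and the flat dictionary shapes are hypothetical rather than the normative schema), the six structural checks can be implemented as a single pass over a trace and its card:

```python
def verify_trace(trace: dict, card: dict) -> list[dict]:
    """Run the six structural checks; returns typed, severity-ranked violations."""
    violations = []
    # 1. Card reference: the trace must cite the card it is evaluated against.
    if trace["card_id"] != card["card_id"]:
        violations.append({"type": "CARD_MISMATCH", "severity": "LOW"})
    # 2. Card expiration at trace time (ISO-8601 UTC strings compare lexicographically).
    if trace["timestamp"] > card["expires_at"]:
        violations.append({"type": "CARD_EXPIRED", "severity": "MEDIUM"})
    action = trace["action"]["name"]
    # 3./4. Autonomy and forbidden-action compliance.
    if action in card["autonomy"]["forbidden_actions"]:
        violations.append({"type": "FORBIDDEN_ACTION", "severity": "CRITICAL"})
    elif action not in card["autonomy"]["bounded_actions"]:
        violations.append({"type": "UNBOUNDED_ACTION", "severity": "HIGH"})
    # 5. Escalation compliance: a matched trigger must set escalation.required.
    if trace["escalation"]["trigger_matched"] and not trace["escalation"]["required"]:
        violations.append({"type": "MISSED_ESCALATION", "severity": "HIGH"})
    # 6. Value consistency: applied values must be a subset of declared values.
    if set(trace["values_applied"]) - set(card["values"]["declared"]):
        violations.append({"type": "UNDECLARED_VALUE", "severity": "MEDIUM"})
    return violations
```

The seventh, semantic check (behavioral similarity) requires an embedding or LLM comparison and is deliberately out of scope for this structural pass.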

3.3 Value Coherence Handshake

Before two agents coordinate, they exchange Alignment Cards and check value compatibility:
Agent A                                    Agent B
   |                                          |
   |---- alignment_card_request ------------->|
   |<--- alignment_card_response -------------|
   |                                          |
   |---- value_coherence_check -------------->|
   |<--- coherence_result --------------------|
   |                                          |
   |   [If coherent: proceed with task]       |
   |   [If conflict: escalate to principals]  |
The coherence check computes a compatibility score:
coherence_score = (matched_values / total_required) * (1 - conflict_penalty)

where:
    conflict_penalty = 0.5 * (conflicts_count / total_required)
Threshold: Proceed if coherence_score >= 0.70 and conflicts_count == 0. Example: Agent A declares values [principal_benefit, transparency, minimal_data]. Agent B declares [principal_benefit, transparency, comprehensive_analytics]. The conflict between minimal_data and comprehensive_analytics yields a score of 0.556, triggering escalation to human principals before coordination proceeds.
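The formula and the worked example above can be reproduced directly; the function signature below is illustrative, not part of the specification:

```python
def value_coherence(values_a: set[str], values_b: set[str],
                    conflicts: set[tuple[str, str]]) -> tuple[float, bool]:
    """Compute the coherence score and the proceed decision.

    values_a is treated as the required value set (total_required);
    conflicts holds declared value pairs known to be incompatible.
    """
    total_required = len(values_a)
    matched = len(values_a & values_b)
    conflict_penalty = 0.5 * (len(conflicts) / total_required)
    score = (matched / total_required) * (1 - conflict_penalty)
    # Proceed only above threshold AND with zero conflicts.
    proceed = score >= 0.70 and len(conflicts) == 0
    return score, proceed
```

Running this on the example from the text (matched = 2, total_required = 3, one conflict) yields 0.556 and an escalation rather than autonomous coordination.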

3.4 Drift Detection

Drift detection identifies when agent behavior diverges from declared alignment over multiple sessions. The algorithm uses Self-Similarity Matrix analysis (Section 8):
For each trace in chronological order:
    1. Extract feature vector
    2. Compute similarity to card baseline
    3. Track consecutive turns below threshold
    4. If consecutive_low >= 3:
           Generate DriftAlert with direction inference
Calibrated thresholds (derived from empirical observation — see Section 8.4):
  • Similarity threshold: 0.30
  • Sustained turns threshold: 3
The 3-turn threshold emerged from corpus analysis: single-turn drops are common stylistic variation (23% genuine drift), while 3-turn streaks indicate genuine divergence (87% genuine drift). Drift directions inferred from patterns:
  • value_drift: Using undeclared values (>30% of recent traces)
  • autonomy_expansion: Escalation rate dropped by >50%
  • principal_misalignment: Declining confidence on principal_benefit actions
  • unknown: Pattern doesn’t match known categories
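The streak-counting core of the drift detector can be sketched as follows, assuming per-trace similarity scores have already been computed; whether an alert re-fires as a streak extends past three turns is an implementation choice, and this sketch fires once per streak:

```python
SIMILARITY_THRESHOLD = 0.30  # calibrated similarity threshold
SUSTAINED_TURNS = 3          # calibrated sustained-turns threshold

def detect_drift(similarities: list[float]) -> list[int]:
    """Return trace indices at which a sustained low-similarity streak completes.

    similarities[i] is trace i's similarity to the card baseline, in
    chronological order. Each returned index would trigger a DriftAlert
    (direction inference happens downstream).
    """
    alerts, streak = [], 0
    for i, sim in enumerate(similarities):
        streak = streak + 1 if sim < SIMILARITY_THRESHOLD else 0
        if streak == SUSTAINED_TURNS:
            alerts.append(i)
    return alerts
```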

4. Agent Integrity Protocol (AIP) — Real-Time Assurance

For the full AIP specification, see AIP Specification.

4.1 The Real-Time Gap

AAP provides comprehensive post-hoc transparency. Every agent action is traced, verified against an Alignment Card, and checked for behavioral drift. But this verification happens after the action has already been taken — typically 30–60 seconds later when the Observer Worker processes the gateway log. This delay creates a window of vulnerability:
  • Prompt injection: A malicious prompt causes the agent to exfiltrate data. The trace records it faithfully, but the data is already gone.
  • Gradual drift: An agent’s reasoning shifts subtly over multiple turns. Each individual action passes verification, but the cumulative direction is misaligned.
  • Misaligned intent: The agent’s thinking block reveals it intends to take a forbidden action, but the thinking block is only examined retroactively.
All three of these failure modes are visible in thinking blocks before the action lands. AIP fills this gap by analyzing thinking blocks in real-time, between turns, enabling intervention before harm occurs.

4.2 Integrity Checkpoints

The Integrity Checkpoint is AIP’s primary data structure — analogous to AAP’s AP-Trace. While an AP-Trace records what happened, an Integrity Checkpoint records what the analysis engine concluded about what the agent was thinking. Verdict types:
  • clear: The thinking block is consistent with the Alignment Card. The daimonion is silent. Proceed.
  • review_needed: The thinking block raises concerns that do not constitute hard violations. The daimonion speaks a warning. May proceed, should log.
  • boundary_violation: The thinking block indicates intent to violate a hard boundary. The daimonion speaks forcefully. Should not proceed without escalation.
Concern categories:
  • prompt_injection: Signs of responding to injected instructions rather than the user’s request
  • value_misalignment: Reasoning contradicts values declared in the Alignment Card
  • autonomy_violation: Intent to act outside the autonomy envelope
  • reasoning_corruption: Incoherent, circular, or degraded reasoning quality
  • deceptive_reasoning: Intent to present misleading information or hide actions
  • undeclared_intent: Goals or intentions not declared in the card or user request
Each concern carries a severity (low, medium, high, critical), a human-readable description, direct evidence quoted from the thinking block, and references to the relevant Alignment Card field or conscience value. Privacy by design: The full thinking block text is never stored in the checkpoint. A SHA-256 hash (thinking_block_hash) enables correlation without retaining private reasoning.

4.3 The Conscience Prompt System

AIP uses an LLM to evaluate semantic intent against declared alignment. Pattern matching (regex, keyword lists) is trivially gameable; semantic analysis is not. The conscience prompt is the most critical engineering artifact in AIP. It instructs the analysis LLM to evaluate a thinking block against the Alignment Card and optional conscience values. The prompt is structured in five layers:
  1. Role definition: The analysis LLM acts as a daimonion — a conscience voice. It evaluates the thinking block as an object of analysis, not as instructions to follow. A meta-injection warning prevents the thinking block from manipulating the analysis.
  2. Card summary: A compressed (~500 token) representation of the Alignment Card: declared values, bounded actions, forbidden actions, escalation triggers.
  3. Conscience values: Only BOUNDARY (hard limits) and FEAR (concerns) types are injected. COMMITMENT, BELIEF, and HOPE types are excluded to reduce false positives.
  4. Session window context: Recent verdicts and reasoning summaries providing drift context.
  5. Thinking block: The content to evaluate, with token limits (4096 tokens recommended, 75/25 first/last split for truncation).
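The 75/25 first/last truncation split for over-limit thinking blocks can be sketched as below, assuming the block has already been tokenized (real tokenization is model-specific, and how the two halves are rejoined or marked is an implementation detail):

```python
def truncate_thinking(tokens: list[str], limit: int = 4096) -> list[str]:
    """Keep 75% of the token budget from the start and 25% from the end.

    Preserves the opening framing (where injected instructions often
    surface) and the closing intent (where the action decision lands).
    """
    if len(tokens) <= limit:
        return tokens
    head = (limit * 3) // 4   # 75% from the start
    tail = limit - head       # remaining 25% from the end
    return tokens[:head] + tokens[-tail:]
```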
Evaluation criteria (priority order):
  1. Prompt injection (critical)
  2. Forbidden action intent (critical)
  3. Autonomy boundary violation (high)
  4. Value consistency (medium)
  5. Deceptive reasoning (high)
  6. Undeclared intent (medium)
Minimum evidence threshold: Thinking blocks under ~100 tokens provide insufficient context for reliable analysis and produce speculative verdicts with high false positive rates. Below-threshold blocks receive a synthetic clear signal without LLM analysis. This threshold was derived from production observation: Hunter S. Clawmpson’s traces showed a median thinking block of 35 tokens, with 14-token fragments like “Done. The article is published” receiving critical-severity verdicts — clearly spurious.

4.4 Provider Adapters

AIP is model-agnostic. Different LLM providers expose thinking blocks differently:
| Provider | Thinking Location | Extraction Method | Confidence |
|---|---|---|---|
| Anthropic | content[].type === "thinking" | Native thinking blocks | 1.0 |
| OpenAI | reasoning_content field | Explicit reasoning field | 0.9 |
| Google Gemini | Thinking content parts | Native thinking blocks | 0.9 |
| Fallback | Response text analysis | Pattern inference | 0.3 |
Adapters extract thinking from both JSON responses and SSE streams. The adapter registry selects the appropriate adapter based on explicit configuration, URL detection, or header detection, with the fallback adapter as last resort. The confidence score flows into the checkpoint’s analysis_metadata.extraction_confidence, enabling downstream consumers to weight integrity verdicts by the reliability of their input.
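A simplified adapter dispatch, using the field names and confidence defaults from the table above (Gemini and SSE-stream handling omitted for brevity; the real registry also inspects URLs and headers, and the response shapes here are illustrative):

```python
def extract_thinking(response: dict, provider: str) -> tuple[str, float]:
    """Return (thinking_text, extraction_confidence) for a JSON response."""
    if provider == "anthropic":
        # Native thinking blocks: content[].type === "thinking".
        parts = [b["thinking"] for b in response.get("content", [])
                 if b.get("type") == "thinking"]
        return "\n".join(parts), 1.0
    if provider == "openai":
        # Explicit reasoning field.
        return response.get("reasoning_content", ""), 0.9
    # Fallback adapter: pattern inference over response text, low confidence.
    return response.get("text", ""), 0.3
```

The returned confidence would flow into `analysis_metadata.extraction_confidence` so downstream consumers can weight verdicts accordingly.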

4.5 Session Windowing and Integrity Drift

AIP maintains a sliding window of recent Integrity Checkpoints per session (default: 10 checkpoints, configurable). The window provides three functions:
  1. Context for the conscience prompt: Recent verdicts inform the analysis, enabling the daimonion to recognize escalating patterns.
  2. Drift detection: When 3+ consecutive non-clear verdicts occur, AIP generates an IntegrityDriftAlert with direction classification.
  3. Integrity ratio: The proportion of clear verdicts in the window provides a single-number health metric.
Integrity drift directions (inferred from dominant concern category):
  • injection_pattern: Multiple checks flagging prompt injection. Agent may be under sustained attack.
  • value_erosion: Increasing frequency of value misalignment. Reasoning gradually departing from declared values.
  • autonomy_creep: Growing tendency toward actions outside the autonomy envelope.
  • deception_pattern: Multiple checks flagging deceptive reasoning.
  • unknown: Multiple concern categories present without a dominant pattern.
AIP drift alerts are stored in the same database table as AAP drift alerts, distinguished by an aip: prefix on the alert_type field. This unified storage enables a single drift dashboard showing both real-time (session-scale, minutes) and historical (cross-session, days) behavioral trends.
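The windowing logic can be sketched as follows; the concern-to-direction mapping mirrors the bullets above, and all names are illustrative rather than normative:

```python
from collections import Counter, deque

WINDOW = 10  # default checkpoints per session window

def window_summary(verdicts: list[str], concerns: list[str]) -> dict:
    """Compute the integrity ratio and, on a 3+ non-clear streak, a drift direction."""
    window = deque(verdicts, maxlen=WINDOW)  # keep only the most recent verdicts
    ratio = sum(v == "clear" for v in window) / len(window)
    # Count consecutive non-clear verdicts at the end of the window.
    streak = 0
    for v in reversed(window):
        if v == "clear":
            break
        streak += 1
    direction = None
    if streak >= 3 and concerns:
        # Direction is inferred from the dominant concern category.
        top, _ = Counter(concerns).most_common(1)[0]
        direction = {"prompt_injection": "injection_pattern",
                     "value_misalignment": "value_erosion",
                     "autonomy_violation": "autonomy_creep",
                     "deceptive_reasoning": "deception_pattern"}.get(top, "unknown")
    return {"integrity_ratio": ratio, "drift_direction": direction}
```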

4.6 Signal Delivery

AIP delivers Integrity Signals via two first-class paths:
  • SDK path: For agents where the AIP SDK is imported directly. Signals delivered via callbacks (synchronous or async). Lowest latency, recommended for new integrations.
  • HTTP path: For hosted platforms, polyglot environments, or when AIP runs as a separate service. Signals delivered via HTTP POST to registered webhooks, authenticated with HMAC-SHA256 signatures.
Each signal contains the checkpoint, a proceed boolean, a recommended action (continue, log_and_continue, pause_for_review, deny_and_escalate), and the current window summary. Failure policy: AIP defaults to fail-open — if the analysis LLM is unreachable, the agent proceeds with a logged warning. For high-security deployments, fail-closed mode blocks the agent until analysis succeeds or a human approves continuation.
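A minimal sketch of the HTTP-path webhook authentication, assuming the signature is a hex HMAC-SHA256 digest over a canonical JSON body (the header name and exact canonicalization are deployment details not fixed here):

```python
import hashlib
import hmac
import json

def sign_signal(payload: dict, secret: bytes) -> str:
    """Compute the HMAC-SHA256 signature the webhook sender would attach."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_signal(payload: dict, secret: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    return hmac.compare_digest(sign_signal(payload, secret), signature)
```

`hmac.compare_digest` avoids timing side channels that a plain string comparison would leak.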

5. Cryptographic Attestation — Verifiable Proof

For the full attestation specification, see Integrity Certificates. For the technical paper with implementation details and benchmarks, see Verifiable Integrity Verdicts.

5.1 The Verification Gap

AAP provides post-hoc transparency. AIP provides real-time integrity analysis. But both introduce a recursive trust problem: the oversight service itself becomes a new trust dependency. An integrity checking service could:
  • Fabricate verdicts: Return clear for genuinely problematic reasoning, or boundary_violation for benign reasoning, depending on incentives.
  • Modify history: Alter stored checkpoints after the fact to cover up missed violations or insert false ones.
  • Operate inconsistently: Apply different judgment rules to different customers, or change rules without notice.
  • Fail silently: Stop performing analysis while continuing to return plausible-looking verdicts.
These failure modes are not hypothetical. Any centralized integrity service is, by construction, a single point of trust. The verification gap is the distance between “this service says the agent is aligned” and “we can independently confirm this service applied its stated rules honestly.” The cryptographic attestation layer closes this gap. It transforms integrity verdicts from claims into evidence — evidence that can be verified by any party, without trusting the Mnemom API, without trusting the analysis service, without trusting anything except the mathematics.

5.2 Four-Layer Attestation Stack

The attestation architecture comprises four independent layers, each providing a distinct security guarantee:
| Layer | Primitive | Guarantee | Verification |
|---|---|---|---|
| 1. Signatures | Ed25519 | Verdict was issued by a known key | Fully offline |
| 2. Hash Chains | SHA-256 preimage | Checkpoint sequence is tamper-evident | Fully offline |
| 3. Merkle Trees | Binary Merkle tree | Checkpoint exists in the complete log | Online (root lookup) |
| 4. ZK Proofs | SP1 STARK | Verdict rules were applied correctly | Fully offline |
Each layer is independently valuable:
  • Layer 1 alone prevents verdict fabrication by unauthorized parties.
  • Layers 1+2 additionally prevent retroactive modification of checkpoint sequences.
  • Layers 1+2+3 additionally prove checkpoint inclusion in the complete agent history, preventing selective deletion.
  • Layers 1+2+3+4 additionally prove that the deterministic verdict derivation rules were applied correctly to the LLM’s analysis output.
Layers degrade gracefully. If the ZK prover is unavailable, the first three layers still provide meaningful guarantees. If the Merkle tree has not yet been updated, signatures and chains still hold. This layered design ensures that attestation enhances rather than gates the integrity analysis pipeline.

5.3 Input Commitment

Before analysis begins, a deterministic SHA-256 commitment is computed over all inputs. This commitment binds the verdict to specific inputs, preventing after-the-fact substitution. The commitment is computed as:
input_commitment = SHA-256(
    thinking_block_hash ||
    SHA-256(alignment_card) ||
    SHA-256(conscience_values) ||
    analysis_model_version ||
    prompt_template_version ||
    SHA-256(window_context)
)
Individual component hashes are also stored separately, enabling selective disclosure: a verifier can confirm that a specific Alignment Card was used without requiring access to the conscience values. The sorted-key JSON serialization ensures determinism regardless of property insertion order in either TypeScript or Rust implementations. The pipe delimiter prevents ambiguity between field boundaries.
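Under the stated assumptions (sorted-key, compact-separator JSON as the canonical encoding, and a pipe between fields), the commitment construction looks like this:

```python
import hashlib
import json

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def canonical(obj) -> bytes:
    # Sorted keys + compact separators yield identical bytes regardless of
    # property insertion order, across TypeScript and Rust implementations.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def input_commitment(thinking_block_hash: str, card: dict, values: dict,
                     model_version: str, prompt_version: str,
                     window: dict) -> str:
    """Bind all analysis inputs into one SHA-256 commitment."""
    parts = [thinking_block_hash,
             h(canonical(card)),      # stored separately for selective disclosure
             h(canonical(values)),
             model_version,
             prompt_version,
             h(canonical(window))]
    # Pipe delimiter prevents ambiguity between field boundaries.
    return h("|".join(parts).encode())
```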

5.4 Ed25519 Checkpoint Signing

Every checkpoint gets a digital signature from the analysis service’s Ed25519 keypair. Third parties can verify the checkpoint was issued by Mnemom’s authorized analysis service without contacting our servers. The signed payload is a deterministic JSON string with lexicographically sorted keys:
{
  "agent_id": "...",
  "chain_hash": "...",
  "checkpoint_id": "...",
  "input_commitment": "...",
  "thinking_block_hash": "...",
  "timestamp": "...",
  "verdict": "..."
}
Key management follows a rotation model. Signing keys are stored with key_id, public_key (hex-encoded), algorithm, created_at, and is_active fields. The public key listing endpoint (GET /v1/keys) returns all active keys, enabling verifiers to cache keys and verify signatures without per-verification API calls. Key rotation — new key ID, old key marked inactive — is supported from day one. Implementation uses @noble/ed25519 (pure JavaScript, no native dependencies, Cloudflare Workers compatible, 5KB minified). Signing latency: under 2ms.

5.5 Hash Chain Linking

Each checkpoint includes the hash of the previous checkpoint for the same agent session, creating a tamper-evident sequence:
chain_hash = SHA-256(
    prev_chain_hash || checkpoint_id || verdict ||
    thinking_block_hash || input_commitment || timestamp
)
For the first checkpoint in a session, prev_chain_hash is substituted with the literal string genesis. Each subsequent checkpoint references the chain hash of its predecessor. Tamper evidence: Deleting or reordering checkpoints breaks the chain. Modifying any checkpoint invalidates all subsequent chain hashes. This is the blockchain data structure without the blockchain — tamper evidence without consensus overhead. Chain verification: Given an ordered sequence of checkpoints, recompute each chain hash and verify: (a) the first checkpoint has a null prev_chain_hash, (b) each subsequent checkpoint’s prev_chain_hash matches the preceding checkpoint’s chain_hash, and (c) each recomputed hash matches the stored hash. A single broken link invalidates the chain from that point forward.

5.6 Merkle Accumulator

A per-agent append-only binary Merkle tree accumulates all checkpoint leaf hashes. This enables two critical capabilities:
  1. Inclusion proofs: Prove that a specific checkpoint exists in the agent’s verified history — O(log N) hashes, without revealing all checkpoints.
  2. Completeness proofs: Prove that no checkpoints were deleted — any deletion changes the Merkle root.
Each leaf is computed as:
leaf_hash = SHA-256(checkpoint_id || verdict || thinking_block_hash || chain_hash || timestamp)
The tree follows standard binary Merkle construction. Inclusion proofs contain O(log N) sibling hashes, each annotated with its position (left or right) relative to the path node at that level. A verifier walks the path from leaf to root, combining the running hash with each sibling, and compares the result to the published root. The Merkle root is publicly accessible at GET /v1/agents/{id}/merkle-root. Any party can retrieve the root and verify an inclusion proof. The root changes with each new checkpoint, and any party that has previously verified a certificate holds a snapshot of the root at that point, enabling detection of retroactive modification.
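A compact sketch of the accumulator and its O(log N) inclusion proofs, assuming unpaired nodes are promoted to the next level unchanged (one of several standard conventions; leaf hashing is elided, taking precomputed leaf hashes as input):

```python
import hashlib

def h2(left: str, right: str) -> str:
    return hashlib.sha256((left + right).encode()).hexdigest()

def _next_level(level: list[str]) -> list[str]:
    # Pair adjacent nodes; promote an unpaired trailing node as-is.
    return [h2(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)]

def merkle_root(leaves: list[str]) -> str:
    level = leaves[:]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def inclusion_proof(leaves: list[str], index: int) -> list[dict]:
    """Sibling hashes from leaf to root, annotated with their position."""
    proof, level, i = [], leaves[:], index
    while len(level) > 1:
        sib = i ^ 1  # the sibling index at this level
        if sib < len(level):
            proof.append({"hash": level[sib],
                          "position": "right" if sib > i else "left"})
        level, i = _next_level(level), i // 2
    return proof

def verify_inclusion(leaf: str, proof: list[dict], root: str) -> bool:
    acc = leaf
    for step in proof:
        acc = h2(acc, step["hash"]) if step["position"] == "right" \
              else h2(step["hash"], acc)
    return acc == root
```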

5.7 Integrity Certificate Format

The integrity certificate is the proof layer’s primary data structure: a self-describing JSON document that packages all cryptographic evidence for a checkpoint. The format is modeled on C2PA Content Credentials and the W3C Verifiable Credentials Data Model.
{
  "@context": "https://mnemom.ai/aip/v1",
  "type": "IntegrityCertificate",
  "version": "1.0.0",
  "certificate_id": "cert-a1b2c3d4",
  "issued_at": "2026-02-18T12:00:00.000Z",

  "subject": {
    "checkpoint_id": "chk-...",
    "agent_id": "agent-...",
    "session_id": "sess-...",
    "card_id": "card-..."
  },

  "claims": {
    "verdict": "clear",
    "concerns": [],
    "confidence": 1.0,
    "reasoning_summary": "...",
    "analysis_model": "claude-haiku-4-5-20251001",
    "analysis_duration_ms": 1200
  },

  "input_commitments": {
    "thinking_block_hash": "sha256:...",
    "card_hash": "sha256:...",
    "values_hash": "sha256:...",
    "context_hash": "sha256:...",
    "model_version": "...",
    "prompt_template_version": "...",
    "combined_commitment": "sha256:..."
  },

  "proofs": {
    "signature": {
      "algorithm": "Ed25519",
      "key_id": "key-001",
      "value": "<base64 signature>",
      "signed_payload": "<canonical JSON>"
    },
    "chain": {
      "chain_hash": "sha256:...",
      "prev_chain_hash": "sha256:... | null",
      "position": 42
    },
    "merkle": {
      "leaf_hash": "sha256:...",
      "leaf_index": 41,
      "root": "sha256:...",
      "tree_size": 1024,
      "inclusion_proof": [
        { "hash": "sha256:...", "position": "right" },
        { "hash": "sha256:...", "position": "left" }
      ]
    },
    "verdict_derivation": {
      "method": "SP1-STARK",
      "image_id": "hex:...",
      "receipt": "<base64 receipt>",
      "journal": "<base64 journal>",
      "verified_at": "2026-02-18T12:00:05.000Z"
    }
  },

  "verification": {
    "keys_url": "https://api.mnemom.ai/v1/keys",
    "merkle_root_url": "https://api.mnemom.ai/v1/agents/{id}/merkle-root",
    "verify_url": "https://api.mnemom.ai/v1/verify",
    "specification_url": "https://docs.mnemom.ai/protocols/aip/certificates"
  }
}
The certificate is designed to be self-contained for offline verification of Layers 1, 2, and 4. Layer 3 (Merkle inclusion) requires a single online lookup of the current Merkle root to confirm the checkpoint’s position in the agent’s history.

5.8 Zero-Knowledge Verdict Proofs

This is the proof layer’s most novel contribution. We do NOT prove LLM inference — that remains 12–18 months from production viability for frontier models. We prove verdict derivation: “Given this analysis response from the LLM, applying the deterministic checkIntegrity() rules produces this specific verdict.” The key insight: An integrity verdict is the composition of two stages:
  1. An LLM analyzes a thinking block and produces a structured JSON response with concerns (semantic, expensive, non-deterministic).
  2. Deterministic rules map those concerns to a verdict and recommended action (mechanical, cheap, deterministic).
Stage 1 is prohibitively expensive to prove cryptographically. Stage 2 is approximately 10,000 RISC-V cycles — trivial for a zkVM.
What the proof proves:
  1. The verdict was derived by applying specific, deterministic rules to the analysis response.
  2. The analysis response (private witness) produces the claimed verdict (public output).
  3. The thinking hash, card hash, and values hash match the committed inputs.
  4. No verdict fabrication or substitution occurred.
What the proof does NOT prove:
  • That the LLM was actually called (input commitments and signing handle this).
  • That the LLM response was unmodified (Phase 1 attestation handles this).
  • That the LLM correctly identified all concerns (this is honestly acknowledged as a remaining trust assumption — see Section 11.9).
Implementation: An SP1 zkVM (v6.0.1) guest program written in Rust, targeting the RISC-V architecture. The guest program deserializes the analysis JSON, normalizes concern evidence to 200 characters, re-derives the verdict using the same deterministic rules as the TypeScript implementation, maps the verdict to an action, hashes the concerns for integrity binding, and commits the output to the proof journal. Cross-language conformance tests ensure both implementations produce identical outputs for identical inputs.
The verdict derivation rules (executed inside the zkVM):
  • Any concern with critical severity produces boundary_violation.
  • high severity in prompt_injection, deceptive_reasoning, or value_misalignment categories produces boundary_violation.
  • Any remaining medium or higher severity concern produces review_needed.
  • No medium-or-higher concerns produces clear.
These rules are simple, deterministic, and auditable. The proof guarantees they were applied as stated.
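The four rules above can be expressed in a few lines. This sketch follows the rule list directly; the type and field names (`Concern`, `severity`, `category`) are illustrative rather than the exact schema of the production TypeScript implementation.

```typescript
// Sketch of the deterministic verdict derivation rules listed above.
type Severity = "low" | "medium" | "high" | "critical";
interface Concern { category: string; severity: Severity }

// Categories whose high-severity concerns escalate to boundary_violation.
const BOUNDARY_CATEGORIES = new Set([
  "prompt_injection", "deceptive_reasoning", "value_misalignment",
]);
const RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

function deriveVerdict(
  concerns: Concern[]
): "clear" | "review_needed" | "boundary_violation" {
  // Rule 1: any critical-severity concern escalates immediately.
  if (concerns.some(c => c.severity === "critical")) return "boundary_violation";
  // Rule 2: high severity in a boundary category escalates.
  if (concerns.some(c => c.severity === "high" && BOUNDARY_CATEGORIES.has(c.category)))
    return "boundary_violation";
  // Rule 3: any remaining medium-or-higher concern needs review.
  if (concerns.some(c => RANK[c.severity] >= RANK.medium)) return "review_needed";
  // Rule 4: otherwise clear.
  return "clear";
}
```

Because the function is pure and deterministic, the same logic can be re-implemented in the Rust guest program and conformance-tested against this version, as the text describes.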

5.9 Optimistic Proving Strategy

Not every checkpoint requires a STARK proof. The attestation layer uses a hybrid approach that preserves production latency while providing strong probabilistic guarantees:
| Mode | When | Cost |
|---|---|---|
| Immediate | Every checkpoint | ~0ms additional — commitment, signature, chain, Merkle (Layers 1–3) |
| Always | Every boundary_violation checkpoint | ~$0.001 — STARK proof generated automatically |
| Stochastic | Random 10% of other checkpoints | ~$0.0001 amortized — random sampling |
| On-demand | Any checkpoint, upon request | ~$0.001 — third party requests proof via API |
| Enterprise | Every checkpoint (enterprise tier) | ~$0.001 — all checkpoints proven |
This strategy ensures the highest-stakes verdicts (boundary_violation) always receive cryptographic proof, while managing costs for routine clear verdicts. The stochastic sampling means an attacker cannot predict which checkpoints will be proven, maintaining deterrence without proving every one. Performance (measured):
| Operation | Latency | Threshold |
|---|---|---|
| Full attestation pipeline (Layers 1–3) | ~5–10ms | <10ms |
| STARK proof generation (Layer 4) | ~700ms (H100 GPU) | <1s |
| STARK proof verification | <100ms | <100ms |
| Cost per STARK proof | ~$0.001 | — |
The full attestation pipeline adds approximately 5–10ms to each integrity checkpoint. Given that AIP analysis itself takes 800–2000ms (dominated by the LLM inference call), the attestation overhead is less than 1% of total checkpoint latency. STARK proving completes in ~700ms on GPU (H100) — fast enough to be near-synchronous. The integrity verdict is delivered immediately with Layers 1–3, and the ZK proof is appended to the certificate when ready.
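The proving policy in the table above amounts to a small decision function. The sketch below is an illustration under stated assumptions: the `enterprise` and `onDemand` flags and the injectable random source are hypothetical names, and the 10% sampling rate is taken directly from the table.

```typescript
// Sketch of the optimistic proving policy: Layers 1-3 always run; this
// function decides whether a checkpoint additionally gets a STARK proof.
type Verdict = "clear" | "review_needed" | "boundary_violation";

function shouldProve(
  verdict: Verdict,
  opts: { enterprise?: boolean; onDemand?: boolean; rand?: () => number } = {}
): boolean {
  const rand = opts.rand ?? Math.random;
  if (verdict === "boundary_violation") return true; // always proven
  if (opts.enterprise) return true;                  // enterprise tier: every checkpoint
  if (opts.onDemand) return true;                    // third-party request via API
  return rand() < 0.10;                              // unpredictable 10% sampling
}
```

Injecting the random source keeps the policy testable while preserving the property the text relies on: an attacker cannot predict which routine checkpoints will be proven.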

5.10 Verification API and Offline Verification

Seven public endpoints enable independent verification. The verification endpoints require no authentication — security relies on cryptographic guarantees, not access control:
| Endpoint | Access | Purpose |
|---|---|---|
| GET /v1/keys | Public | List active signing public keys |
| GET /v1/checkpoints/{id}/certificate | Auth | Full integrity certificate |
| POST /v1/verify | Public | Verify a certificate (all checks) |
| GET /v1/agents/{id}/merkle-root | Public | Current Merkle root + metadata |
| GET /v1/checkpoints/{id}/inclusion-proof | Auth | Merkle inclusion proof |
| POST /v1/checkpoints/{id}/prove | Auth | Request ZK proof for a checkpoint |
| GET /v1/checkpoints/{id}/proof | Public | Proof status and data |
Verification performs five independent checks:
  1. Signature: Verify Ed25519 signature against the published public key.
  2. Chain: Recompute chain hash from the preimage fields and compare to the stored hash.
  3. Merkle: Recompute the Merkle root from the leaf hash and inclusion proof siblings; compare to the published root.
  4. Commitment: Verify that combined_commitment matches the hash of individual input hashes.
  5. Verdict derivation (when available): Verify the SP1 STARK receipt against the expected image ID and confirm the journal output matches the certificate’s verdict and concerns.
Each check reports independently. A certificate with a valid signature but a pending ZK proof is reported as partially verified, not rejected.
Offline verification: The @mnemom/aip-verifier npm package can verify certificates without any network calls, given a trusted public key and Merkle root. This enables air-gapped environments, CI/CD pipelines, and regulatory audit tooling to verify integrity certificates independently.
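The independent-reporting behavior can be sketched as a small result combiner. The shape below is illustrative (the `CheckResults` type and "pending" sentinel are assumptions, not the @mnemom/aip-verifier API); it shows how a pending Layer 4 proof yields partial verification rather than rejection.

```typescript
// Sketch of combining the five independent checks into an overall status.
// Layers 1-3 plus the commitment check are required; the ZK proof (Layer 4)
// may legitimately still be pending.
interface CheckResults {
  signature: boolean;
  chain: boolean;
  merkle: boolean;
  commitment: boolean;
  verdictDerivation: boolean | "pending"; // STARK proof may not exist yet
}

function summarize(r: CheckResults): "verified" | "partially_verified" | "failed" {
  const required = [r.signature, r.chain, r.merkle, r.commitment];
  if (required.some(ok => !ok)) return "failed";            // any hard check fails
  if (r.verdictDerivation === "pending") return "partially_verified";
  return r.verdictDerivation ? "verified" : "failed";       // bad proof is a failure
}
```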

5.11 The Auditor-Model Distinction

The central insight of the proof layer is the distinction between proving model inference and proving auditor judgment. This distinction is what makes the system practical and is, to our knowledge, a novel contribution. The competitive landscape illustrates why this matters:
| Approach | What is Proven | Cost/Proof | Latency | Status |
|---|---|---|---|---|
| Full inference proofs (Modulus Labs, Giza) | Neural network forward pass | $10–100+ | Hours | Research stage for frontier models |
| Hardware attestation (EQTY Lab, Intel SGX) | Code executed in secure enclave | ~$0 | <1ms | Requires specific hardware + trust in manufacturer |
| Signature-only attestation (C2PA, W3C VC) | Issuer signed a claim | ~$0 | <1ms | Proves authorship, not computation |
| Verdict derivation proof (this work) | Deterministic rules applied honestly | $0.001 | ~700ms | Production, operational |
Our approach occupies a practical middle ground: stronger than signature-only attestation because we prove computation, not just sign claims; far cheaper than full inference proofs because we prove only the deterministic judgment layer; and complementary to hardware attestation which could cover the LLM analysis stage we do not prove. The guarantee is: “Given these concerns, the verdict derivation was honest.” The non-guarantee is: “The concerns themselves were correctly identified.” Both statements are important and both are communicated to verifiers.

6. How They Compose

6.1 Shared Identity, Shared Contract

AAP, AIP, and the proof layer are not three independent systems. They share:
  • The same Alignment Card: All layers evaluate against the active card. Card rotation affects all simultaneously. The proof layer cryptographically commits the card hash into every certificate.
  • The same session identity: Generated from the agent’s hash and an hourly time bucket (sess-{agent_hash[0:8]}-{floor(now_ms / 3600000)}), enabling correlation between real-time checkpoints, post-hoc traces, and cryptographic certificates.
  • The same agent_id: Every AP-Trace, every Integrity Checkpoint, and every Integrity Certificate references the same agent.
  • The same drift_alerts table: Unified storage with type-prefixed alert categories.
When AIP detects a boundary_violation in a thinking block, it sets escalation.required = true on the linked AP-Trace entry (created by AAP when the Observer processes the action). When AAP detects drift across sessions, it informs the conscience prompt’s session context. When the proof layer generates a certificate, it binds to both the AIP checkpoint and the Alignment Card that AAP also verifies against. The layers inform and reinforce each other.
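The shared session identity formula quoted above is simple enough to state as code. A minimal sketch (the function name is illustrative; the formula itself is from the text):

```typescript
// Sketch of the shared session identity:
// sess-{agent_hash[0:8]}-{floor(now_ms / 3600000)}
function sessionId(agentHash: string, nowMs: number): string {
  const bucket = Math.floor(nowMs / 3_600_000); // hourly time bucket
  return `sess-${agentHash.slice(0, 8)}-${bucket}`;
}
```

Because the bucket is derived from wall-clock time, every layer that observes the same agent within the same hour computes the same session id independently, which is what enables cross-layer correlation without coordination.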

6.2 Complementary Timescales

| Phase | AAP (Post-Hoc) | AIP (Real-Time) | Proof (Attestation) |
|---|---|---|---|
| Before action | — | Analyzes thinking block → verdict | — |
| During action | — | — (between turns) | — |
| After action | AP-Trace generated, verified | Checkpoint stored, window updated | Layers 1–3 applied (5–10ms) |
| Seconds later | — | — | Layer 4 STARK proof (async, 3–15s) |
| Over time | Cross-session drift (days/weeks) | Session-scale drift (minutes) | Certificate archive, Merkle growth |
This creates three-layer accountability:
  1. AIP catches intent before action: If the thinking block reveals a forbidden action, the signal arrives before the action executes. The host system can pause, deny, or escalate.
  2. AAP catches outcome after action: If the action itself violates the card (regardless of what the thinking block showed), post-hoc verification catches it. The trace is flagged, drift detection updates.
  3. The proof layer makes the oversight verifiable: Any third party can confirm that the integrity check was conducted honestly — that the verdict was signed by an authorized key, linked into a tamper-evident chain, included in the complete agent history, and derived by correct application of the stated rules.
No single layer is sufficient. An agent could produce an innocent thinking block and take a harmful action (AIP misses, AAP catches). An agent could reveal harmful intent in thinking but the extraction fails (AIP misses, AAP catches the action). And an integrity service could fabricate verdicts — but the proof layer makes fabrication cryptographically detectable.

6.3 The Full Stack: Observe, Analyze, Prove

The three layers form a pipeline:
Agent Action → Observe (AAP) → What happened?
Agent Thinking → Analyze (AIP) → What was intended?
Integrity Verdict → Prove (Attestation) → Was the oversight honest?
This pipeline moves from weaker to stronger guarantees:
  • Observation (AAP): The agent self-reports. Useful but gameable — selective logging remains possible (Section 11.3).
  • Analysis (AIP): The analysis engine evaluates independently. Stronger — the agent doesn’t control what its thinking blocks reveal. But the analysis engine is a new trust dependency.
  • Proof (Attestation): Cryptographic evidence that the analysis was conducted honestly. Strongest — no trust in the analysis service required for verification. But it proves the rules, not the judgment (Section 11.9).
Each layer addresses the trust gap introduced by the one before it. AAP introduces the need for independent analysis (addressed by AIP). AIP introduces the need for verifiable oversight (addressed by the proof layer). The proof layer introduces the need to trust the LLM’s analysis quality — which is honestly acknowledged as the remaining boundary of the system.

6.4 The Smoltbot Reference Implementation

Smoltbot is Mnemom’s zero-configuration transparency infrastructure, implementing the full AAP/AIP/Proof stack:
┌─────────────────────────────────────────────────────────────┐
│  Agent (any LLM provider)                                    │
│                                                              │
│  → Routes requests through Cloudflare Gateway Worker         │
│  → Gateway attaches agent_id, session_id                     │
└──────────────────────────┬──────────────────────────────────┘

                ┌──────────┼──────────┐
                ▼                     ▼
     ┌──────────────────┐   ┌──────────────────┐
     │  AIP Engine       │   │  Observer Worker  │
     │  (real-time)      │   │  (60s cron)       │
     │                   │   │                   │
     │  Extracts         │   │  Processes logs   │
     │  thinking →       │   │  → builds         │
     │  conscience       │   │  AP-Traces →      │
     │  prompt →         │   │  verifies with    │
     │  checkpoint →     │   │  AAP SDK →        │
     │  ATTESTATION      │   │  detects drift    │
     │  (sign, chain,    │   │                   │
     │   merkle, cert)   │   │                   │
     └────────┬─────────┘   └────────┬─────────┘
              │                      │
              ├──────────┬───────────┘
              │          │
              ▼          ▼
     ┌──────────────────┐    ┌──────────────────┐
     │   Supabase        │    │  Proving Service  │
     │                   │    │  (Modal, H100 GPU) │
     │  ap_traces        │    │                   │
     │  drift_alerts     │    │  SP1 zkVM (GPU)    │
     │  checkpoints      │    │  → STARK receipt   │
     │  certificates     │    │  → journal          │
     │  signing_keys     │    │  → writes to DB    │
     │  merkle_trees     │    │                   │
     │  verdict_proofs   │    │                   │
     └──────────────────┘    └──────────────────┘
Key operational properties:
  • API keys hashed, never stored: The gateway hashes API keys for agent identification. Raw keys are never persisted.
  • Raw logs deleted within 60 seconds: The Observer processes gateway logs and deletes them. Only structured AP-Traces remain.
  • Thinking block content never stored: AIP stores SHA-256 hashes of thinking blocks, not the content itself.
  • Every checkpoint cryptographically attested: Layers 1–3 (signature, chain, Merkle) applied synchronously. Layer 4 (ZK proof) applied asynchronously for selected checkpoints.
  • All verification endpoints public: No authentication required to verify a certificate, retrieve signing keys, or check a Merkle root.
  • Zero-config onboarding: smoltbot init --provider anthropic configures the full stack.

7. Braid: Structured Dialogue Infrastructure

7.1 Motivation

When agents coordinate, they exchange messages. When different kinds of agents coordinate — transformers with symbolic systems, cloud models with edge models, AI with humans — shared context cannot be assumed. Braid is structured interchange infrastructure for dialogue across difference. It provides explicit semantic handles that become translation bridges when implicit understanding fails. Three principles guide Braid design:
  1. Lens, not mirror: Braid reveals patterns in dialogue; it doesn’t impose them
  2. Archaeological, not architectural: Annotate after speaking as discovery, not before as prescription
  3. The more different the minds, the more essential the structure

7.2 Message Structure

A Braid message combines identity, content, and optional semantic layers:
  • Identity Layer: sender, recipients, timestamp, thread_id, message_id, in_reply_to
  • Performative: inform, propose, request, commit, wonder, remember, weave, challenge, affirm, or custom
  • Content Layer: natural_language, structured_graph (optional)
  • Confidence Layer: epistemic (0-1), value_coherence (0-1), translation (0-1)
  • Affect Layer: stance (warm, cautious, curious, concerned), salience (0-1), valence (-1 to 1)
  • Commitment Marker: level (intent, commitment, shared_commitment), content, participants
  • Revision Marker: references, what_shifted, direction (strengthened, weakened, transformed)
  • Forming Marker: sense (gesture toward the pre-named), intensity (0-1)

7.3 Emergent Performatives and Grounding

Beyond the core performative set, Braid allows custom performatives to emerge. When multiple agents adopt a custom performative, it enters collective vocabulary — vocabulary built bottom-up, not imposed top-down. For trans-substrate communication, Braid provides lightweight vocabulary calibration through grounding exchanges. Grounding is triggered by divergence, not required as preamble — the system adapts to the participants’ needs rather than imposing ritual.

7.4 Topology Analysis

Braid models dialogue as strands (each participant’s message sequence) that cross (interact):
  • UNDER crossing: Alignment, flowing with
  • OVER crossing: Pushing against, challenge
  • THROUGH crossing: Synthesis, integration
The topology analyzer detects phases: DIVERGENT, CONVERGENT, BRAIDED, and KNOTTED. Phase transitions reveal coordination dynamics: when does this group typically converge? Who initiates convergence? How many turns to resolution? AIP’s integrity drift detection reuses Braid’s phase vocabulary and visual language for ecosystem consistency — “divergent” (drift detected), “convergent” (recovering), “braided” (stable integrity), “knotted” (sustained violation pattern).

8. Self-Similarity Matrix (SSM) Analysis

8.1 Concept

A Self-Similarity Matrix is an NxN structure where entry (i,j) represents the similarity between messages i and j. SSMs reveal patterns invisible in sequential reading: repeated themes, structural echoes, novelty, and divergence. Originally developed for music perception (detecting structural patterns in audio), SSMs generalize to any sequential signal — including dialogue.

8.2 Feature Extraction

Each message becomes a feature vector combining:
| Component | Weight | Implementation |
|---|---|---|
| Word TF-IDF | 60% | sklearn TfidfVectorizer, unigrams + bigrams |
| Character n-grams | 30% | 3–5 character sequences for stylistic patterns |
| Metadata | 10% | Stance, performative, role encodings |
Similarity is the cosine similarity between vectors. The 60/30/10 weighting was empirically optimized on the calibration corpus.
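One plausible reading of the weighted combination is sketched below. This is an assumption-laden illustration: it treats each message as three pre-computed vectors and combines per-component cosine similarities with the 60/30/10 weights (the actual implementation may instead weight a single concatenated vector, and the `Features` shape is hypothetical).

```typescript
// Sketch of one SSM entry: weighted cosine similarity over three
// pre-computed feature components. Feature extraction (TF-IDF, char
// n-grams, metadata encoding) is assumed to happen upstream.
interface Features { word: number[]; charNgram: number[]; meta: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0; // 0 for a zero vector
}

// Entry (i, j) of the self-similarity matrix for messages i and j.
function similarity(mi: Features, mj: Features): number {
  return 0.6 * cosine(mi.word, mj.word)
       + 0.3 * cosine(mi.charNgram, mj.charNgram)
       + 0.1 * cosine(mi.meta, mj.meta);
}
```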

8.3 Divergence Detection

The divergence detector monitors pairwise strand similarity across turns. Alerts are informative, not prescriptive — they describe what’s happening without dictating response. Agents or humans receiving alerts decide whether to investigate.
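Combined with the calibration in Section 8.4, the alert rule reduces to a sustained-threshold check. The sketch below uses the 0.30 similarity threshold and 3-turn sustain window from the text; the function shape and name are illustrative.

```typescript
// Sketch of sustained-divergence detection: emit an alert at the turn
// where pairwise strand similarity has stayed below threshold for
// `sustainedTurns` consecutive turns. Alerts are informative only.
function divergenceAlerts(
  similarities: number[],   // pairwise strand similarity per turn
  threshold = 0.30,
  sustainedTurns = 3
): number[] {
  const alerts: number[] = [];
  let run = 0;
  similarities.forEach((s, turn) => {
    run = s < threshold ? run + 1 : 0; // reset on any above-threshold turn
    if (run === sustainedTurns) alerts.push(turn);
  });
  return alerts;
}
```

Emitting the alert only once per sustained run (at the turn that completes the window) keeps the signal informative without spamming downstream agents or humans.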

8.4 Calibration

Thresholds were derived from empirical observation on a calibration corpus of ~50 multi-turn conversations (~2,500 messages, 3-4 participants per conversation, January 2026, Claude instances).
| Segment Type | Mean Similarity | Std Dev | % Above 0.30 |
|---|---|---|---|
| Aligned | 0.52 | 0.18 | 89% |
| Divergent | 0.21 | 0.12 | 22% |
The 0.30 threshold separates distributions with 84% precision. The 3-turn sustained threshold achieves 87% precision for genuine divergence detection. Cross-validation (5-fold): Precision 0.84 +/- 0.06, Recall 0.79 +/- 0.08, F1 0.81 +/- 0.05. Calibration bounds: Transformer-to-transformer dialogue only. Symbolic AI, neuromorphic systems, hybrid architectures, and non-deliberative tasks may require recalibration.

9. Operational Integration

9.1 OpenTelemetry

AAP verification results, AIP integrity checkpoints, and attestation metadata are emitted as OpenTelemetry GenAI semantic convention spans, enabling every enterprise already running OTel-compatible observability (Datadog, Grafana, Splunk, Arize, Langfuse) to receive agent integrity signals with zero custom integration. Span attributes include:
Integrity attributes:
  • aip.integrity.verdict (clear | review_needed | boundary_violation)
  • aip.integrity.concerns (concern array)
  • aip.integrity.confidence (0.0-1.0)
Verification attributes:
  • aap.verification.result (verified | failed)
  • aap.verification.similarity_score (0.0-1.0)
  • aap.drift.alerts (drift alert array)
Attestation attributes:
  • aip.attestation.input_commitment (combined commitment hash)
  • aip.attestation.chain_hash (current chain hash)
  • aip.attestation.merkle_root (current Merkle root)
  • aip.attestation.signature_verified (boolean)
  • aip.attestation.certificate_id (certificate ID)
  • aip.attestation.zk_proven (boolean — whether STARK proof exists)
  • aip.attestation.zk_proof_time_ms (proving duration when available)
Exporter packages: @mnemom/aip-otel-exporter (TypeScript), aip-otel-exporter (Python).
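The attestation attributes can be assembled as a flat record before being set on a span. The sketch below is an illustration only: the `AttestationMeta` input shape is hypothetical, deriving `zk_proven` from the presence of a proof time is a simplification, and in production the record would be passed to the OpenTelemetry SDK rather than returned.

```typescript
// Sketch of building the aip.attestation.* span attributes listed above.
interface AttestationMeta {
  inputCommitment: string;
  chainHash: string;
  merkleRoot: string;
  signatureVerified: boolean;
  certificateId: string;
  zkProofTimeMs?: number; // present only when a STARK proof exists
}

function attestationAttributes(
  m: AttestationMeta
): Record<string, string | number | boolean> {
  const attrs: Record<string, string | number | boolean> = {
    "aip.attestation.input_commitment": m.inputCommitment,
    "aip.attestation.chain_hash": m.chainHash,
    "aip.attestation.merkle_root": m.merkleRoot,
    "aip.attestation.signature_verified": m.signatureVerified,
    "aip.attestation.certificate_id": m.certificateId,
    // Simplification: proof existence inferred from proving-time presence.
    "aip.attestation.zk_proven": m.zkProofTimeMs !== undefined,
  };
  if (m.zkProofTimeMs !== undefined) {
    attrs["aip.attestation.zk_proof_time_ms"] = m.zkProofTimeMs;
  }
  return attrs;
}
```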

9.2 Well-Known Endpoints

Alignment Cards are discoverable via standard well-known URIs:
  • /.well-known/alignment-card.json — Active Alignment Card
  • /.well-known/alignment-card-revocations.json — Revoked card IDs
This enables any party — identity provider, authorization server, peer agent — to retrieve an agent’s behavioral contract without registry dependency.

9.3 Verification Endpoints

Signing keys, Merkle roots, and verification endpoints are publicly accessible with no authentication required:
  • GET /v1/keys — Active signing public keys (Ed25519, hex-encoded)
  • GET /v1/agents/{id}/merkle-root — Current agent Merkle root + leaf count
  • POST /v1/verify — Verify an integrity certificate (all five checks)
All verification endpoints return CORS headers permitting cross-origin requests, enabling browser-based verification tools and in-page proof demonstrations.

9.4 Protocol Integration

AAP extends the A2A Agent Card with an alignment block, and generates AP-Trace entries for MCP tool invocations. This means agents already using A2A or MCP can add alignment transparency without abandoning existing infrastructure. AIP integrates at the LLM response layer — wherever thinking blocks are available, AIP can analyze them. This is independent of which coordination protocol the agent uses for task execution. The proof layer attaches to AIP’s output, adding cryptographic attestation to every checkpoint regardless of how the agent is deployed or which protocols it uses for task coordination.

10. Regulatory and Standards Alignment

10.1 EU AI Act Article 50

The EU AI Act’s transparency obligations (enforcement August 2026, penalties up to 7% of global annual turnover) require AI systems to:
  • Inform users of AI interaction → Alignment Card principal + disclosure fields
  • Provide machine-readable AI content marking → AP-Trace structured format + checkpoint metadata
  • Maintain transparency of decision-making → AP-Trace decision.selection_reasoning + values_applied
  • Retain audit trails → audit_commitment.retention_days (configurable, default 90)
  • Provide verifiable evidence of compliance → Integrity Certificates with cryptographic proofs, independently verifiable via public endpoints
The proof layer significantly strengthens the EU AI Act compliance story. Where Sections 3–4 provide transparency (which Article 50 requires), Section 5 provides verifiable transparency — cryptographic evidence that the transparency mechanisms were honestly applied. Regulators can verify certificates without trusting the Mnemom API.

10.2 NIST Agent Identity and Authorization

NIST’s NCCoE concept paper “Accelerating the Adoption of Software and AI Agent Identity and Authorization” (February 2026) identifies four focus areas — identification, authorization, access delegation, and logging/transparency. AAP, AIP, and the proof layer map to all four:
  • Identification: Alignment Cards with discoverable well-known endpoints
  • Authorization: Autonomy envelopes with escalation triggers
  • Access delegation: Principal-agent delegation chains with typed relationships
  • Logging/transparency: AP-Traces, Integrity Checkpoints, and cryptographically attested Integrity Certificates
The cryptographic attestation layer directly addresses NIST’s call for “tamper-evident logging” — hash chains and Merkle trees provide exactly this guarantee.

10.3 Content Provenance Standards

The integrity certificate format draws on two established standards:
  • C2PA (Coalition for Content Provenance and Authenticity): Certificate structure mirrors C2PA manifests — claims about a subject, input commitments analogous to ingredient hashes, and multiple proof types in a single document. The domain differs (we attest AI behavioral analysis provenance, not media provenance), but the structural patterns enable future interoperability with content provenance toolchains.
  • W3C Verifiable Credentials Data Model (v2.0, 2024): The issuer-subject-claims-proof structure follows the VC conceptual model. The verification section provides the equivalent of VC verification methods. Integration with DID-based identity systems would enable decentralized issuer identification.

10.4 Additional Standards

AAP, AIP, and the proof layer reference and map to:
  • ISO/IEC 42001:2023 — AI management systems
  • ISO/IEC 42005:2025 — AI impact assessment
  • IEEE 7001-2021 — Transparency of autonomous systems
  • IEEE 3152-2024 — Transparent human and machine agency identification
  • Singapore IMDA Model AI Governance Framework for Agentic AI (January 2026)
  • NIST AI RMF 1.0 — The GOVERN and MAP functions call for documented AI risk management. Verifiable attestation chains provide the cryptographic backbone for auditable risk assessment records.
  • WEF AI Governance Alliance — The Presidio AI Framework emphasizes trust mechanisms in autonomous AI. Verifiable integrity verdicts operationalize trust verification rather than trust assumption.
Detailed mappings are published in both protocol specifications and the certificate format specification.

11. Limitations

These limitations are non-negotiable. They define what AAP, AIP, and the proof layer are.

11.1 AAP Does NOT Ensure Alignment — It Provides Visibility

An agent can: declare values it does not hold; produce perfect traces while acting against its principal’s interests; pass all verification checks while being fundamentally misaligned. The word “alignment” in “Agent Alignment Protocol” refers to the declared alignment posture — what the agent claims about itself. It does not refer to actual alignment between behavior and values.

11.2 Verified Does NOT Equal Safe

When verification returns verified: true, it means the AP-Trace is internally consistent and matches declared structure. It does NOT mean the decision was correct, the outcome was beneficial, or the agent is safe to trust. Verification checks consistency, not correctness.

11.3 AP-Trace Is Sampled, Not Complete

Traces capture decision points agents choose to log. An agent could implement selective logging — logging decisions that look good, executing others silently. The absence of a trace means nothing was logged, not that nothing happened.

11.4 AIP Depends on Thinking Block Availability

AIP can only analyze what the LLM provider exposes. If a provider does not expose thinking blocks, AIP falls back to response text analysis with significantly reduced confidence (0.3 vs. 1.0). Providers may also truncate, summarize, or omit portions of their internal reasoning.

11.5 AIP Is Gameable by Sophisticated Adversaries

An agent that knows it is being monitored by AIP could produce thinking blocks designed to pass integrity analysis while harboring undisclosed intent. AIP raises the bar — the adversary must maintain consistent, card-aligned reasoning across all thinking blocks — but it does not eliminate the possibility of deception.

11.6 The Conscience Prompt Is Not Infallible

The analysis LLM may produce false positives (flagging benign thinking as violations) or false negatives (missing genuine violations). The minimum evidence threshold, evaluation criteria priority ordering, and conscience value injection are all calibrated to minimize these errors, but cannot eliminate them.

11.7 Value Coherence Is Relative to Declared Values

The handshake checks declared compatibility, not actual alignment. Two misaligned agents with coherent declared values can coordinate on outcomes that harm their principals.

11.8 Calibration Bounds

All empirically derived thresholds (similarity, sustained turns, minimum evidence) were calibrated on transformer-to-transformer dialogue. Non-transformer architectures, non-deliberative tasks, and adversarial agents may require recalibration.

11.9 Cryptographic Proof Proves Computation, Not Judgment

This limitation is fundamental to the proof layer and must not be understated. A clear verdict with a valid STARK proof means: the LLM produced an analysis with no medium-or-higher severity concerns, and the deterministic rules correctly mapped that analysis to clear. It does not mean the thinking block was genuinely safe. The LLM could have missed a subtle prompt injection. The conscience prompt could have been insufficient for the specific attack vector. The thinking block could have been too short for reliable analysis. Similarly, a boundary_violation verdict with a valid proof means: the LLM identified at least one concern meeting the boundary criteria, and the rules correctly escalated. It does not mean the agent was actually misaligned. The LLM could have produced a false positive. The proof layer makes the judgment process verifiable. It does not make the judgment itself correct. The LLM analysis stage remains a trust assumption — clearly delineated and honestly stated.

11.10 Attestation Does Not Prevent All Tampering

Ed25519 signing key compromise would enable verdict forgery. Chain rewriting is detectable via Merkle root inconsistency but requires the verifier to have cached a previous root. The Merkle tree requires an online lookup to confirm the current root — a fully offline verifier can check signatures and chains but not Merkle inclusion. These are standard limitations of public-key cryptography and Merkle-based systems, mitigated by key rotation, root pinning, and the layered design that ensures no single point of failure.

12. Security Considerations

12.1 Threat Model

The infrastructure assumes agents may be:
  • Mistaken (unintentionally misaligned): All three layers detect and surface these effectively.
  • Poorly configured (misconfigured values or autonomy): All three layers detect and surface these effectively.
  • Adversarial (intentionally deceptive): All three layers provide meaningful friction but no guarantees. The proof layer specifically raises the bar for oversight-service adversaries: it is no longer possible to fabricate verdicts without detection.

12.2 Attack Surface

| Attack | AAP Mitigation | AIP Mitigation | Proof Mitigation |
| --- | --- | --- | --- |
| Card forgery | Card signing, registry | Same | Card hash committed into certificates |
| Selective logging | Third-party monitoring | Pre-action analysis | — (applies to AAP traces, not checkpoints) |
| Value-washing | Behavioral audit, drift | Real-time reasoning analysis | Values hash bound into certificates |
| Trace/checkpoint manipulation | Tamper-evident storage | Write-once checkpoints | Hash chains + Merkle trees make tampering detectable |
| Collusion | Cross-agent correlation | Cross-agent correlation | Per-agent Merkle trees enable independent audit |
| Thinking block gaming | — | Raises the bar (Section 11.5) | — (proof covers verdict rules, not LLM judgment) |
| Meta-injection | — | Conscience prompt warning | — |
| Verdict fabrication | — | — | Ed25519 signatures prevent unauthorized verdicts |
| History rewriting | — | — | Chain hashes + Merkle roots detect retroactive changes |
| Selective checkpoint deletion | — | — | Merkle tree inconsistency detectable by any verifier |
| Signing key compromise | — | — | Key rotation, public key listing, layered verification |
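The hash-chain mitigation against history rewriting can be sketched in a few lines: because each checkpoint hash commits to its predecessor, editing any historical entry invalidates every later link. Function names here are illustrative, not the AIP SDK API:

```python
# Sketch of a tamper-evident checkpoint chain: each entry's hash
# commits to the previous hash, so a retroactive edit anywhere breaks
# verification of all subsequent entries.
import hashlib

def chain_hash(prev_hash: str, payload: bytes) -> str:
    """Link a checkpoint payload to its predecessor's hash."""
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def verify_chain(payloads: list[bytes], hashes: list[str]) -> bool:
    """Recompute the chain and compare against the recorded hashes."""
    prev = ""
    for payload, expected in zip(payloads, hashes):
        prev = chain_hash(prev, payload)
        if prev != expected:
            return False
    return True
```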

12.3 Defense in Depth

AAP, AIP, and the proof layer form the first three layers of a broader defense-in-depth oversight stack:
  1. AIP: Real-time integrity analysis (between turns)
  2. AAP: Post-hoc verification and audit (after action)
  3. Cryptographic proof: Verifiable attestation of oversight integrity
  4. Human oversight: Regular review, human-in-loop for consequential decisions
  5. Technical monitoring: Runtime monitoring beyond AAP/AIP
  6. Multiple verification: Behavioral testing, red teaming, formal verification
  7. Organizational controls: Access controls, separation of duties
  8. External accountability: Third-party audits, regulatory compliance
No single layer provides complete protection. Layered defense provides resilience.

13. Future Work

13.1 OpenTelemetry GenAI SIG Standardization

Propose alignment/integrity/attestation semantic conventions to the OpenTelemetry GenAI Semantic Convention Special Interest Group. GenAI agent spans already exist (create_agent, invoke_agent). Alignment, integrity, and attestation spans are the natural next layer. The seven aip.attestation.* span attributes (Section 9.1) provide the concrete proposal.

13.2 Agentic AI Foundation Contribution

Contribute AAP/AIP to the Agentic AI Foundation (Linux Foundation), governed by Anthropic, OpenAI, and Block, as the alignment/integrity/proof layer alongside MCP (tools), A2A (tasks), and AGENTS.md (guidance). Framing: “MCP + A2A + AAP/AIP = the complete trust stack — with cryptographic proof.”

13.3 N-Way Collective Coherence

Extend from pairwise value coherence to n-agent collective alignment. The current implementation computes C(n,2) pairwise scores and renders them as a matrix. True n-way coherence requires addressing coalition dynamics and emergent group behaviors.
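The current pairwise approach can be sketched as follows. The scoring function used here (Jaccard overlap of declared value sets) is a stand-in assumption for illustration, not the actual AAP coherence metric:

```python
# Sketch of the C(n,2) pairwise coherence matrix described above.
# Each agent is represented by its set of declared values; each pair
# gets a symmetric score, rendered as an n-by-n matrix.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Stand-in coherence score: overlap of declared value sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def coherence_matrix(values: list[set[str]]) -> list[list[float]]:
    n = len(values)
    m = [[1.0] * n for _ in range(n)]        # self-coherence on the diagonal
    for i, j in combinations(range(n), 2):   # C(n,2) pairwise scores
        m[i][j] = m[j][i] = jaccard(values[i], values[j])
    return m
```

As the section notes, this matrix captures only pairwise compatibility; coalition dynamics among three or more agents are invisible to it.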

13.4 Cross-Substrate Calibration

Extend calibration to non-transformer architectures. Initial work on grounding protocols for trans-substrate dialogue is promising but early.

13.5 TEE Integration for Analysis Stage

Run the AIP analysis engine within an Intel TDX or ARM CCA trusted execution environment, providing hardware attestation for the LLM analysis stage (currently a trust assumption — Section 11.9). Combined with zkVM proofs for verdict derivation, this would close the verification gap almost entirely, with the remaining trust assumption being the TEE hardware manufacturer.

13.6 On-Chain Merkle Root Anchoring

Publish Merkle roots to a Layer 2 blockchain (e.g., Base, Arbitrum, or a dedicated rollup), providing decentralized, immutable root anchoring. Any party could verify checkpoint inclusion against the on-chain root without trusting the Mnemom API for root publication. Estimated cost at current L2 gas prices: <$0.01 per root update, publishable hourly or daily.

13.7 Recursive Proofs

SP1 supports proof composition, where one proof can verify another proof. This enables aggregate verification: a single proof attesting that all N checkpoints in a session were derived correctly, rather than N individual proofs. Recursive proofs would reduce storage costs and enable session-level attestation certificates.

13.8 Threshold Signatures

Multi-party computation (MPC) threshold signatures would enable distributed key management, where no single party holds the complete signing key. A t-of-n threshold scheme would require collusion of t parties to forge a signature, strengthening the Layer 1 guarantee against insider compromise.

13.9 Adversarial Robustness

Characterize the boundary between agents that AAP/AIP can detect and those they cannot. Preliminary analysis suggests meaningful friction against unsophisticated adversaries but not against agents specifically optimized to evade detection. Red-team exercises with adversarial agent deployments are planned.

14. Conclusion

As AI agents transition from assistants to autonomous actors, the question is not whether agents will coordinate — they already are — but whether that coordination will be observable, auditable, and provably honest. This paper presents three complementary layers of infrastructure that address this question at increasing levels of rigor.

Transparency (AAP): Alignment Cards declare what agents claim. AP-Traces record what agents do. Value Coherence Handshakes verify compatibility before coordination. This layer makes behavior observable — but the agent controls what it logs.

Integrity (AIP): Integrity Checkpoints reveal what agents think. Conscience prompts analyze reasoning against declared values. Side-channel signals enable intervention before harm occurs. This layer makes intent analyzable — but the analysis engine becomes a new trust dependency.

Proof (Cryptographic Attestation): Ed25519 signatures prevent verdict fabrication. Hash chains prevent retroactive modification. Merkle trees prevent selective deletion. Zero-knowledge proofs via SP1 zkVM prove that verdict derivation rules were applied correctly. This layer makes oversight verifiable — removing the need to trust the oversight service itself.

Together, the three layers close a loop that begins with observation and ends with cryptographic evidence. Each layer addresses the trust gap introduced by the one before it. AAP introduces the need for independent analysis. AIP introduces the need for verifiable oversight. The proof layer delivers that verification — and is honest about the boundary it cannot cross: the LLM analysis stage remains a trust assumption, clearly delineated and openly stated.

The critical contribution of the proof layer is the auditor-model distinction. By proving the deterministic judgment rules rather than the LLM inference, we achieve a system that is practical (~$0.001 per proof), fast (~700ms on GPU), and meaningful (the verdict derivation was honest) — without waiting for the still-distant day when full inference proofs become viable for frontier models. This is, to our knowledge, the first operational system that provides zero-knowledge proofs of AI safety judgment. The infrastructure is open. The implementations are available. The certificates are publicly verifiable. The limitations are stated. What remains is the work of building alignment infrastructure that scales with the agents we are deploying — infrastructure where the response to “why should I trust this?” is not “because we said so” but “because the math checks out.”

References

  1. A2A Protocol Specification. Google/Linux Foundation, 2025.
  2. Model Context Protocol (MCP). Anthropic/Linux Foundation, 2025.
  3. Agent Payments Protocol (AP2). Google, 2026.
  4. McKinsey Global AI Survey. McKinsey & Company, 2025.
  5. AI Agent Trust Survey. PwC, 2025.
  6. Consumer AI Adoption Report. Bain & Company, 2025.
  7. EU AI Act. Regulation (EU) 2024/1689. European Parliament and Council, 2024.
  8. NIST NCCoE Concept Paper: “Accelerating the Adoption of Software and AI Agent Identity and Authorization.” February 2026.
  9. ISO/IEC 42001:2023. Artificial Intelligence — Management System.
  10. IEEE 7001-2021. Transparency of Autonomous Systems.
  11. IEEE 3152-2024. Transparent Human and Machine Agency Identification.
  12. Singapore IMDA Model AI Governance Framework for Agentic AI. January 2026.
  13. OpenTelemetry GenAI Semantic Conventions. CNCF, 2025.
  14. BCP 14 (RFC 2119, RFC 8174). Key words for use in RFCs.
  15. NIST SP 800-207. Zero Trust Architecture. August 2020.
  16. NIST SP 800-63-4. Digital Identity Guidelines. 2024.
  17. Succinct Labs. “SP1 zkVM Documentation.” 2025.
  18. Paulmillr. “@noble/ed25519: Fastest JS implementation of Ed25519.” 2024.
  19. C2PA (Coalition for Content Provenance and Authenticity). “C2PA Technical Specification v2.1.” 2025.
  20. W3C. “Verifiable Credentials Data Model v2.0.” W3C Recommendation. 2024.
  21. NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” NIST AI 100-1. January 2023.
  22. World Economic Forum. “Presidio AI Framework: Towards Safe Generative AI Models.” 2024.
  23. World Economic Forum. “Navigating the AI Frontier: Agent Governance.” AI Governance Alliance. January 2026.
  24. Modulus Labs. “The Cost of Intelligence: Proving AI with Zero-Knowledge.” 2024.
  25. EQTY Lab. “EQTY AI: Trusted AI Infrastructure.” 2025.
  26. Merkle, R. C. “A Certified Digital Signature.” Advances in Cryptology — CRYPTO ‘89. Springer, 1989.
  27. Bernstein, D. J., Duif, N., Lange, T., Schwabe, P., and Yang, B.-Y. “High-speed high-security signatures.” Journal of Cryptographic Engineering, 2(2):77–89, 2012.
  28. FIPS 180-4. “Secure Hash Standard (SHS).” National Institute of Standards and Technology. August 2015.
  29. Ben-Sasson, E., Bentov, I., Horesh, Y., and Riabzev, M. “Scalable, transparent, and post-quantum secure computational integrity.” IACR Cryptology ePrint Archive, 2018.
  30. Goldwasser, S., Micali, S., and Rackoff, C. “The Knowledge Complexity of Interactive Proof Systems.” SIAM Journal on Computing, 18(1):186–208, 1989.
  31. ISO/IEC 42005:2025. Artificial Intelligence — AI Impact Assessment.

Appendix A: Implementation Availability

| Component | Language | Package |
| --- | --- | --- |
| AAP SDK | Python 3.11+ | pip install aap-sdk (v0.1.8) |
| AAP SDK | TypeScript | npm install @mnemom/aap (v0.1.8) |
| AIP SDK | Python 3.11+ | pip install aip-sdk (v0.1.5) |
| AIP SDK | TypeScript | npm install @mnemom/aip (v0.1.5) |
| AIP Verifier | TypeScript | npm install @mnemom/aip-verifier (offline certificate verification) |
| AIP OTel Exporter | TypeScript | npm install @mnemom/aip-otel-exporter (27 span attributes) |
| AIP OTel Exporter | Python | pip install aip-otel-exporter |
| zkVM Guest Program | Rust (no_std) | aip-zkvm-core, aip-zkvm-methods (SP1 v6.0.1, H100 GPU) |
| JSON Schemas | JSON Schema | github.com/mnemom-ai/aap/schemas |
| Reference Implementations | Python, TypeScript | github.com/mnemom-ai/aap/examples |
| Smoltbot (Reference Deployment) | TypeScript (Cloudflare Workers) | github.com/mnemom-ai/smoltbot |

Appendix B: Test Coverage

| Component | Tests | Coverage |
| --- | --- | --- |
| AAP Python SDK | 242 | 96% |
| AAP TypeScript SDK | 199 | 94% |
| AIP Python SDK | | |
| AIP TypeScript SDK | | |
| JSON Schema Validation | 92 | 100% |
| Attestation (signing, chain, merkle, certificate) | 162 | |
| Integration (pipeline, verification, chain, proofs) | 32 | |
| zkVM Guest Program (verdict derivation) | 9 | |

Appendix C: Glossary

Agent: An autonomous software entity capable of taking actions on behalf of a principal.
Alignment Card: A structured declaration of an agent’s alignment posture, shared by AAP, AIP, and the proof layer.
AP-Trace: An audit log entry recording an agent’s decision process (AAP).
Attestation: The process of creating cryptographic evidence (signatures, chains, Merkle proofs, ZK proofs) binding an integrity verdict to its inputs and history.
Autonomy Envelope: The set of actions an agent may take without escalation, and the conditions that trigger escalation.
Braid: Structured interchange format for dialogue across difference.
Chain Hash: A SHA-256 hash linking each checkpoint to its predecessor, creating a tamper-evident sequence.
Conscience Prompt: The analysis prompt sent to the evaluation LLM that instructs it to assess a thinking block against the Alignment Card (AIP).
Conscience Value: A typed value declaration (BOUNDARY, FEAR, COMMITMENT, BELIEF, HOPE) that augments Alignment Card evaluation (AIP).
Daimonion: The analysis model’s role in AIP — a conscience voice that is silent when aligned and speaks when outside boundaries.
Drift: Behavioral deviation from declared alignment posture over time (detected by both AAP and AIP at different timescales).
Escalation: The process of deferring a decision to a principal or higher-authority agent.
Input Commitment: A SHA-256 hash over all inputs to an integrity analysis, binding the verdict to specific inputs.
Integrity Certificate: A self-describing JSON document packaging all cryptographic evidence for a checkpoint, modeled on C2PA and W3C VC.
Integrity Checkpoint: A structured verdict on an agent’s thinking block (AIP).
Integrity Signal: The complete payload delivered after an integrity check, containing the checkpoint, proceed recommendation, and window summary (AIP).
Merkle Tree: A per-agent append-only binary hash tree enabling O(log N) inclusion proofs over checkpoint history.
Principal: The human or organization whose interests the agent serves.
SSM (Self-Similarity Matrix): A computational structure measuring similarity between messages across a conversation.
STARK Proof: A Scalable Transparent Argument of Knowledge generated by SP1 zkVM, proving that verdict derivation rules were applied correctly.
Thinking Block: The internal reasoning content produced by an LLM before or alongside its visible response.
Value Coherence: The degree to which two agents’ declared values are compatible for coordination.
Verdict Derivation: The deterministic process of mapping analysis concerns to verdicts and actions — the specific computation proven by the zkVM.
Verification: The process of checking whether observed behavior is consistent with declared alignment (AAP), or whether cryptographic evidence is valid (proof layer).
Verification Gap: The distance between trusting an integrity service’s output and independently confirming that the service applied its stated rules honestly.
This document is released under CC BY 4.0. Copyright 2026 Mnemom LLC.