Wrap your LLM calls. Get cryptographically signed receipts. Hand them to an auditor who can verify offline -- no access to your server needed. Two lines of code. Four exit codes. Done.
For teams shipping AI agents, RAG pipelines, and autonomous workflows.
Requires Python 3.9+. No API key needed. ~10s install. Uses synthetic data. If bare pip is not on PATH on macOS, use python3 -m pip.
Installing Assay gives you the tools. Instrumentation makes receipts happen.
assay patch . or patch() wires receipt emission into your runtime.
assay run -- ... launches your app with a trace id and packages emitted receipts into proof_pack_*. assay verify-pack checks the artifact offline.
No instrumentation means no receipts.
We scanned the ecosystem. Nobody has tamper-evident audit trails yet.
Scan study methodology (Feb 2026) ↗
1,100+ tests · 5 integrations · Apache-2.0 · on PyPI since Jan 2026
Your agent runs. Something goes wrong. You check the logs. But the logs live on your server, and anyone with access can edit them.
Assay makes this harder to fake. Every LLM call gets a cryptographically signed receipt bundled into a portable proof pack. Edit one byte and verification fails. Skip a contracted call site (for supported frameworks) and completeness checks catch it. The verifier doesn't need access to your server, your database, or your trust.
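To see why "edit one byte and verification fails" holds, here is a minimal, self-contained sketch of the tamper-evidence property. Assay uses Ed25519 signatures; this illustration substitutes an HMAC-SHA256 tag from the Python standard library purely as a stand-in, and every name in it (`sign_receipt`, `verify_receipt`, the demo key) is hypothetical, not Assay's API:

```python
import hashlib
import hmac
import json

# Stand-in secret. The real tool signs with Ed25519; HMAC is used here
# only because it ships in the stdlib and shows the same property.
SIGNING_KEY = b"demo-key-not-a-real-ed25519-key"

def sign_receipt(receipt: dict) -> bytes:
    """Produce a tag over the canonical JSON encoding of a receipt."""
    payload = json.dumps(receipt, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()

def verify_receipt(receipt: dict, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    payload = json.dumps(receipt, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

receipt = {"model": "gpt-4o", "prompt_hash": "ab12", "tokens": 512}
tag = sign_receipt(receipt)
assert verify_receipt(receipt, tag)        # untouched evidence verifies

receipt["tokens"] = 513                    # edit one field of the evidence
assert not verify_receipt(receipt, tag)    # verification now fails
```

The point of the sketch is the asymmetry: verification needs only the artifact and the key material, not access to the system that produced it.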
Run assay patch . to auto-insert the integration into your source files (it finds your LLM call sites and adds the import for you), or add it manually.
Then assay run. Every LLM call now produces a signed receipt.
The proof pack is a 5-file evidence bundle that anyone can verify independently.
Works with OpenAI, Anthropic, Google Gemini, LiteLLM, and LangChain.
Your code calls OpenAI. No evidence trail.
Add one import. Every call emits a cryptographically signed receipt.
assay run wraps your program, collects receipts, and bundles them into a proof pack.
-c receipt_completeness runs the built-in check that all receipts are present.
Everything after -- is your normal run command.
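Conceptually, the integration patch is a wrapper around the SDK call that records evidence after the call returns. The sketch below is hypothetical (the `with_receipt` decorator, `RECEIPTS` list, and `call_llm` stub are illustrations, not Assay's actual integration code), assuming receipts carry hashes rather than content:

```python
import functools
import hashlib
import time

RECEIPTS = []  # in a real run, receipts are signed and bundled into the proof pack

def with_receipt(fn):
    """Hypothetical sketch of what an integration patch does: wrap the
    SDK method and emit a receipt once the call has returned."""
    @functools.wraps(fn)
    def wrapper(prompt: str, **kwargs):
        response = fn(prompt, **kwargs)
        RECEIPTS.append({
            "model": kwargs.get("model", "unknown"),
            "timestamp": time.time(),
            # hashes, not content: content capture is opt-in
            "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
            "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        })
        return response
    return wrapper

@with_receipt
def call_llm(prompt: str, model: str = "demo-model") -> str:
    return "stub response"  # stand-in for a real provider SDK call

call_llm("What is 2+2?", model="demo-model")
assert len(RECEIPTS) == 1 and RECEIPTS[0]["model"] == "demo-model"
```

Because the wrapper runs after the response arrives, the instrumented call behaves exactly like the original from the application's point of view.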
Drop this into your pipeline. The lockfile catches config drift. Verify-pack catches tampering. Diff catches regressions and budget overruns.
Decision Escrow — the protocol model behind this workflow.
Five files. One signature. Independently verifiable.
| Integrity | Claims | Exit | Meaning |
|---|---|---|---|
| PASS | PASS | 0 | Evidence checks out, behavior meets standards |
| PASS | FAIL | 1 | Honest failure: authentic evidence of standards violation |
| FAIL | -- | 2 | Evidence has been tampered with |
| -- | -- | 3 | Input validation error (bad arguments, missing files) |
The split is the point. Systems that can prove they failed honestly are more trustworthy than systems that always claim to pass.
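A CI wrapper can act on that split directly. This is a hypothetical sketch of a gate that branches on the four exit codes described above (the `gate` function and its action strings are illustrative, not part of Assay):

```python
# Hypothetical CI gate: map Assay's documented exit codes to pipeline actions.
ACTIONS = {
    0: "merge",                                             # integrity PASS, claims PASS
    1: "block: honest failure, evidence is authentic",      # integrity PASS, claims FAIL
    2: "block: evidence tampered with -- escalate",         # integrity FAIL
    3: "block: bad invocation -- fix arguments or paths",   # input validation error
}

def gate(exit_code: int) -> str:
    return ACTIONS.get(exit_code, "block: unknown exit code")

assert gate(0) == "merge"
assert gate(1).startswith("block: honest failure")
```

Note that exit 1 and exit 2 trigger different responses: an honest failure is a behavior problem to fix, while a tampered pack is an integrity incident.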
Scanning tells you where the gaps are. The completeness contract closes the loop: it bridges static analysis to runtime evidence, so you can prove what percentage of your LLM call sites actually emitted receipts.
1. AST scan writes a contract file listing every LLM call site with a stable ID:
   assay scan --emit-contract coverage.json
2. Integration patches tag each receipt with its callsite_id at runtime:
   assay run -- python app.py
3. Match receipt IDs against the contract; fail if coverage falls below the threshold:
   assay verify-pack --coverage-contract coverage.json --min-coverage 0.8
No other tool connects static scan results to runtime proof. The contract turns "we think we instrumented everything" into "we can prove 82% of call sites emitted signed evidence."
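The coverage arithmetic itself is simple set intersection. A hedged sketch, with made-up callsite IDs (the real contract file format is not shown here), reproducing the kind of ~82% figure quoted above:

```python
# Hypothetical data: callsite IDs from the static-scan contract vs. the
# callsite_id fields found on runtime receipts.
contract_sites = {"app.py:12", "app.py:40", "rag.py:7", "rag.py:88",
                  "agent.py:5", "agent.py:51", "tools.py:19", "tools.py:63",
                  "tools.py:90", "eval.py:14", "eval.py:31"}
receipt_sites = {"app.py:12", "app.py:40", "rag.py:7", "rag.py:88",
                 "agent.py:5", "agent.py:51", "tools.py:19", "tools.py:63",
                 "eval.py:14"}

covered = contract_sites & receipt_sites
coverage = len(covered) / len(contract_sites)

assert abs(coverage - 9 / 11) < 1e-9  # ~82% of call sites emitted evidence
assert coverage >= 0.8                # so --min-coverage 0.8 would pass
```

Two contract sites with no matching receipt (here `tools.py:90` and `eval.py:31`) are exactly the uninstrumented gaps the check surfaces.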
No API key needed. Synthetic data. Real cryptography.
Two-act scenario: a passing run, then someone swaps the model and drops the guardian.
```bash
# Act 1: PASS (exit 0). Act 2: honest FAIL (exit 1).
python3 -m pip install assay-ai
assay demo-incident
```
You'll see: same system, different behavior, caught by the same contract.
CTF-style challenge. One pack is legit, one has been tampered with. Can you tell which?
```bash
# Good pack vs tampered pack
assay demo-challenge
```
You'll see: one pack exits 0, the other exits 2 (tampered). Inline verification shows which bytes changed.
Find every uninstrumented LLM call site. Get a self-contained HTML gap report.
```bash
# Interactive HTML report
assay scan . --report
```
Freeze the verification contract. Block merges when evidence degrades.
```bash
# Generate GitHub Actions workflow
assay ci init github
assay lock write --cards receipt_completeness,coverage_contract
```
Step-by-step flows for trying, adopting, CI, MCP, and audit handoff.
```bash
# See the plan, then --apply to execute
assay flow try
assay flow ci --apply
```
Bundle evidence for an auditor. Self-contained archive with verify instructions.
```bash
# Create portable audit bundle
assay audit bundle ./proof_pack_*/
assay verify-signer ./proof_pack_*/
```
We scanned 30 popular open-source AI projects for tamper-evident audit trails. High-confidence = direct SDK calls (OpenAI, Anthropic). Click column headers to sort.
| Repo | Stars | High | Med | Low | Total | Coverage |
|---|---|---|---|---|---|---|
No. Logs record what you say happened. Receipts prove the integrity and tamper-evidence of recorded events -- a third party can verify the evidence artifact was not modified after creation. Logs live on your server and you can edit them. Proof packs are portable and tamper-evident -- change one byte and verification fails.
Low overhead. Ed25519 signing takes microseconds. Receipt emission happens after the LLM call returns and adds minimal latency to your application. The proof pack is assembled at the end of the run. Benchmark in your environment to confirm.
Each receipt records: model name, provider, timestamp, prompt hash, response hash, and token counts. Full prompt/response content is optional and off by default. You control what goes into the evidence bundle.
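A sketch of that privacy default, assuming the field list above; the `make_receipt` helper is hypothetical, not Assay's API:

```python
import hashlib
import time

def make_receipt(model, provider, prompt, response, usage,
                 include_content=False):
    """Hypothetical receipt builder: hashes always, content only on opt-in."""
    receipt = {
        "model": model,
        "provider": provider,
        "timestamp": time.time(),
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "tokens": usage,
    }
    if include_content:  # off by default
        receipt["prompt"] = prompt
        receipt["response"] = response
    return receipt

r = make_receipt("gpt-4o", "openai", "hi", "hello", {"in": 1, "out": 1})
assert "prompt" not in r               # content excluded by default
assert len(r["prompt_hash"]) == 64     # but the SHA-256 hash is always there
```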
Run assay verify-pack ./proof_pack_*/ on the bundle. Exit 0 = pass, 1 = claims failed, 2 = tampered, 3 = bad input. No server, no account, no internet connection required. The verifier checks Ed25519 signatures, Merkle tree consistency, manifest completeness, and any declared claims.
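Merkle consistency is what lets one root hash commit to every receipt in the pack. A minimal, self-contained sketch of the idea (this is a generic binary Merkle tree, not Assay's exact tree layout or leaf encoding):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Generic binary Merkle root over receipt bytes (illustrative only)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0]

receipts = [b"receipt-1", b"receipt-2", b"receipt-3", b"receipt-4"]
root = merkle_root(receipts)

tampered = [b"receipt-1", b"receipt-2", b"receipt-X", b"receipt-4"]
assert merkle_root(tampered) != root   # any edited receipt changes the root
assert merkle_root(receipts) == root   # recomputation is deterministic
```

Because the manifest pins the root, dropping, reordering, or editing any receipt makes recomputation disagree with the signed value.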
Run these in order:
1. assay scan . — finds LLM call sites in your code
2. assay scan . --report — generates an HTML gap report
3. assay run -- python app.py — wraps your program and collects receipts
4. assay doctor — runs 15 preflight checks on your setup
Together these tell you whether your project has supported call sites, whether those files are instrumented, and whether your run command is wired correctly (including the required -- separator).
Five integrations ship built-in: OpenAI, Anthropic, Google Gemini, and LiteLLM use a one-line monkey-patch that wraps SDK methods transparently. LangChain uses a callback handler you pass to your LLM. Your application code doesn't change. The scanner also detects LlamaIndex call sites.
Yes. assay quickstart guards against accidentally scanning your home directory or other system-sized trees. Run it from your project root, or bypass the guard when intentional: assay quickstart --force.
Assay is an evidence integrity and completeness layer, not a lie detector. Here's exactly what it catches and what it doesn't.
The cost of cheating scales with the complexity of the lie. Assay doesn't make fraud impossible -- it makes fraud expensive.
The EU AI Act (Regulation 2024/1689, Articles 12 & 19) requires tamper-resistant logging and retention for high-risk AI systems. These obligations take effect August 2, 2026 under the current timeline. Assay provides one building block for these requirements -- it does not constitute full compliance on its own. See For Compliance Teams for mapping to SOC 2, ISO 42001, and NIST AI RMF.
Open source. Apache-2.0. Works with OpenAI, Anthropic, Google Gemini, LiteLLM, and LangChain today.