Signed review packets for consequential claims

Assay makes reliance reviewable.

Signed review packets for AI and software claims: what passed, what failed, what was missing, and what should not be inferred.

When proof is missing, Assay fails honestly instead of pretending green.

$ uvx --from assay-ai assay demo-challenge

Requires Python 3.9+. No API key needed. Uses synthetic data. Already installed? Run assay demo-challenge.

Don't want to install anything? Verify a proof pack in your browser.

Open verifier or browse the proof gallery

Client-side only. Nothing uploaded. Sample packs included.

A green check is not a trust decision. See the Claim Gap Report.

Open static demo

One packet can be intact and still sit below the proof floor for customer security review.

Across 30 projects and 231 high-confidence LLM call sites, none had Assay-compatible tamper-evident instrumentation.

Scan study methodology (Mar 2026) ↗

30 AI projects scanned
231 high-confidence LLM call sites
0 with Assay-compatible instrumentation
0% coverage

1,100+ tests · 5 integrations · Apache-2.0 · on PyPI since Jan 2026

Installing Assay gives you the tools. Instrumentation makes receipts happen.

assay patch . or patch() wires receipt emission into your runtime. assay run -- ... launches your app with a trace id and packages emitted receipts into proof_pack_*. assay verify-pack checks the artifact offline. No instrumentation means no receipts.
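In terminal form, that loop is three commands (the entry point python app.py is illustrative; substitute your own run command after the -- separator):

terminal
$ assay patch .
$ assay run -- python app.py
$ assay verify-pack ./proof_pack_*/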

Assay PR Gate: two real review decisions

When a pull request opens, Assay produces a signed review decision the reviewer can inspect, verify, forward, and challenge. These are real Assay pull requests with durable artifact copies.

Green CI, human review required

PR #138 touched PR Gate product-contract docs. CI passed, but the trust policy flagged the touched risk path and returned NEEDS_REVIEW.

Decision: NEEDS_REVIEW
Action: require_human_approval
Trust policy: touched docs/product/assay-pr-gate-dogfood-v0.md

View PR #138 · comment · signed report · sigstore bundle · proof pack

Low-risk change, gate gets out of the way

PR #139 changed README install guidance. It did not touch a dogfood risk path, so Assay returned PASS.

Decision: PASS
Action: proceed
Trust policy: PASS

View PR #139 · comment · signed report · sigstore bundle · proof pack

The comment is the review surface. The signed report and proof pack are the load-bearing objects. Both examples verify against the stable workflow identity in .github/workflows/assay-pr-gate.yml@refs/heads/main.

Open Packet Viewer

The problem with AI logs

Your agent runs. Something goes wrong. You check the logs. But the logs are on your server.

what happens today
1. Your agent calls gpt-4 with customer data 2. Agent gets a response, takes an action 3. Customer complains: "your AI did something wrong" 4. You check your logs -- everything looks fine 5. Customer: "prove it" 6. You show your logs 7. Customer: "those are YOUR logs on YOUR server" You can't prove it. Whoever controls the server controls the story.

Assay turns that into a portable proof artifact another party can verify themselves. Edit one byte, verification fails. Drop a locked check, the mismatch is exposed. Skip a contracted call site, completeness checks catch it. The verifier needs no access to your server, your database, or your trust.
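You can see that failure mode yourself once a pack exists on disk: overwrite a single byte in any file inside it, then re-verify. The receipts file name and the abbreviated output below are assumptions about the pack layout; the exit code 2 verdict for tampering is the documented behavior.

terminal
# Flip the first byte of one file in the pack (file name is illustrative), then re-verify
$ printf 'x' | dd of=./proof_pack_20260211_143022/receipts.jsonl bs=1 conv=notrunc
$ assay verify-pack ./proof_pack_20260211_143022/
Integrity: FAIL
Exit code: 2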

Two lines. That's the diff.

Run assay patch . to auto-insert the integration into your source files (it finds your LLM call sites and adds the import for you), or add it manually. Then assay run. Every LLM call now produces a signed receipt. The proof pack is a 5-file evidence bundle that anyone can verify independently. Works with OpenAI, Anthropic, Google Gemini, LiteLLM, and LangChain.

BEFORE

Your code calls OpenAI. No evidence trail.

app.py
import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
AFTER

Add one import. Every call emits a cryptographically signed receipt.

app.py
+ from assay.integrations.openai import patch; patch()
import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

Then verify

assay run wraps your program, collects receipts, and bundles them into a proof pack. -c receipt_completeness runs the built-in check that all receipts are present. Everything after -- is your normal run command.

terminal
$ assay run -c receipt_completeness -- python app.py
Receipts captured: 3
Proof pack: ./proof_pack_20260211_143022/

$ assay verify-pack ./proof_pack_20260211_143022/
Integrity: PASS
Claims: PASS
Exit code: 0

CI gate: three commands, three exit codes

Drop this into your pipeline. The lockfile catches config drift. Verify-pack catches tampering. Diff catches regressions and budget overruns.

terminal
$ assay run -c receipt_completeness -- python app.py
$ assay verify-pack ./proof_pack_*/ --lock assay.lock
$ assay diff ./baseline_pack/ ./proof_pack_*/ --gate-cost-pct 25 --gate-errors 0 --gate-strict

Decision Escrow — the protocol model behind this workflow.

What's in the proof pack

Five files. One signature. Independently verifiable.

LLM Call (SDK method fires) → Receipt (signed JSON record) → Merkle Tree (hash chain of receipts) → Proof Pack (5 files, Ed25519 signed) → Verifier (anyone, anywhere)

Four exit codes. Three are verdicts.

Integrity | Claims | Exit | Meaning
PASS      | PASS   | 0    | Evidence checks out, declared standards pass
PASS      | FAIL   | 1    | Honest failure: authentic evidence of standards violation
FAIL      | --     | 2    | Evidence has been tampered with
--        | --     | 3    | Input error (bad arguments, missing files -- not a verdict)

The split is the point. Systems that can prove they failed honestly are more trustworthy than systems that always claim to pass.
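If your pipeline needs to branch on those verdicts rather than just propagate the exit code, a small shell step works; the echo messages here are illustrative, the exit codes are the documented ones:

terminal
# Branch on the documented verify-pack exit codes
assay verify-pack ./proof_pack_*/
case $? in
  0) echo "verified: evidence intact, declared claims pass" ;;
  1) echo "honest failure: evidence intact, a declared claim failed"; exit 1 ;;
  2) echo "tampering detected: evidence integrity broken"; exit 1 ;;
  *) echo "input error or unexpected result: check arguments and pack path"; exit 1 ;;
esac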

Honest failure is a feature, not an embarrassment.

Most tools only look trustworthy when everything passes. Assay also proves when a declared check failed -- without breaking the integrity of the evidence. The failure is sealed at runtime. Nobody can rewrite it after signing.

A signed failure is stronger evidence than a vague pass. Auditors, regulators, and buyers trust systems that can show what went wrong -- not systems that only ever claim success.

Exit 1 is audit gold: a control failed, the failure is detectable and retained, and the evidence is authentic.

The Completeness Contract

Scanning tells you where the gaps are. The completeness contract closes the loop: it bridges static analysis to runtime evidence, so you can prove what percentage of your LLM call sites actually emitted receipts.

1. SCAN

Find call sites

AST scan writes a contract file listing every LLM call site with a stable ID

assay scan --emit-contract coverage.json
2. RUN

Collect receipts

Integration patches tag each receipt with its callsite_id at runtime

assay run -- python app.py
3. VERIFY

Check coverage

Match receipt IDs against the contract. Fail if below threshold.

assay verify-pack --coverage-contract coverage.json --min-coverage 0.8
$ assay verify-pack ./proof_pack_*/ --coverage-contract coverage.json --min-coverage 0.8
Integrity: PASS
Claims: PASS
Coverage: 14/17 call sites covered (82%)
Exit code: 0

No other tool connects static scan results to runtime proof. The contract turns "we think we instrumented everything" into "we can prove 82% of call sites emitted signed evidence."

Try it in 60 seconds

No API key needed. Synthetic data. Real cryptography.

See an honest failure

Two-act scenario: a passing run, then someone swaps the model and drops the guardian.

# Act 1: PASS (exit 0). Act 2: honest FAIL (exit 1).
uvx --from assay-ai assay demo-incident

You'll see: same system, different behavior, caught by the same contract.

Spot the tampered pack

CTF-style challenge. One pack is legit, one has been tampered with. Can you tell which?

# Good pack vs tampered pack
uvx --from assay-ai assay demo-challenge

You'll see: one pack exits 0, the other exits 2 (tampered). Inline verification shows which bytes changed.

Scan your own code

Find every uninstrumented LLM call site. Get a self-contained HTML gap report.

# Interactive HTML report
assay scan . --report

Lock it into CI

Freeze the verification contract. Block merges when evidence degrades.

# Generate GitHub Actions workflow
assay ci init github
assay lock write --cards receipt_completeness,coverage_contract

Guided workflows

Step-by-step flows for trying, adopting, CI, MCP, and audit handoff.

# See the plan, then --apply to execute
assay flow try
assay flow ci --apply

Auditor handoff

Bundle evidence for an auditor. Self-contained archive with verify instructions.

# Create portable audit bundle
assay audit bundle ./proof_pack_*/
assay verify-signer ./proof_pack_*/

Agent workflow demo

See Assay in a real agent workflow: verify a proof pack, tamper one byte, watch it fail.

# Pre-built proof pack + tampered version
View the live demo →

Includes pre-generated packs you can verify without generating your own.

Evidence Gap Map

We scanned 30 popular open-source AI projects for tamper-evident audit trails. High-confidence = direct SDK calls (OpenAI, Anthropic).

Per-repo table (30 repos): Repo | Stars | High | Med | Low | Total | Coverage

FAQ

Is this just logging?

No. Logs record what you say happened. Receipts make recorded events tamper-evident: a third party can verify the evidence artifact was not modified after creation. Logs live on your server and you can edit them. Proof packs are portable and tamper-evident -- change one byte and verification fails.

What's the performance cost?

Low overhead. Ed25519 signing takes microseconds. Receipt emission happens after the LLM call returns and adds minimal latency to your application. The proof pack is assembled at the end of the run. Benchmark in your environment to confirm.
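If you want a quick sanity check on the signing claim specifically, a generic Ed25519 micro-benchmark (using the cryptography package, not Assay internals) is a reasonable starting point; the payload size here is an assumption:

benchmark.py
# Generic Ed25519 signing micro-benchmark -- measures the primitive, not Assay's full receipt path
import timeit
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()
payload = b"x" * 1024          # roughly receipt-sized payload (assumption)
n = 10_000
secs = timeit.timeit(lambda: key.sign(payload), number=n)
print(f"{secs / n * 1e6:.1f} microseconds per signature")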

What data do receipts contain?

Each receipt records: model name, provider, timestamp, prompt hash, response hash, and token counts. Full prompt/response content is optional and off by default. You control what goes into the evidence bundle.
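For illustration only, a receipt with those fields might look like the sketch below; the field names and layout are assumptions based on the list above, not Assay's actual schema:

receipt (illustrative)
{
  "model": "gpt-4",
  "provider": "openai",
  "timestamp": "2026-02-11T14:30:22Z",
  "prompt_hash": "sha256:...",
  "response_hash": "sha256:...",
  "tokens": {"prompt": 412, "completion": 96}
}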

How do I verify a proof pack?

assay verify-pack ./proof_pack_*/. Exit 0 = pass, 1 = claims failed, 2 = tampered, 3 = bad input. No server, no account, no internet connection required. The verifier checks Ed25519 signatures, Merkle tree consistency, manifest completeness, and any declared claims.

I ran assay run and got "No receipts emitted". What now?

Run these in order: assay scan . (finds LLM call sites in your code), assay scan . --report (generates an HTML gap report), assay run -- python app.py (wraps your program and collects receipts), then assay doctor (runs 15 preflight checks on your setup). Together these tell you whether your project has supported call sites, whether those files are instrumented, and whether your run command is wired correctly (including the required -- separator).
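Run back to back from your project root, that sequence looks like this (python app.py stands in for your own run command; output omitted):

terminal
$ assay scan .
$ assay scan . --report
$ assay run -- python app.py
$ assay doctor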

What SDKs and frameworks are supported?

Five integrations ship built-in: OpenAI, Anthropic, Google Gemini, and LiteLLM use a one-line monkey-patch that wraps SDK methods transparently. LangChain uses a callback handler you pass to your LLM. Your application code doesn't change. The scanner also detects LlamaIndex call sites.

Quickstart says my directory is too large. Is that expected?

Yes. assay quickstart guards against scanning home/system-size directories by mistake. Run it from your project root, or bypass the guard when intentional: assay quickstart --force.

What becomes harder to fake with Assay

Assay is not a truth oracle. It is an evidence-hardening layer. It makes post-hoc tampering, silent weakening, and selective evidence presentation much harder to pull off without detection.

Assay doesn't make fraud impossible. It makes fraud expensive, fragile, and much easier to catch.

Assay proves the evidence artifact has not been quietly changed after the fact. It does not, by itself, prove every upstream component was honest. Each gap has a named upgrade path.

If someone tries to...                  | Without Assay                 | With Assay
Edit evidence after a run               | Hard to notice                | Verification fails
Drop or weaken locked checks            | Easy to hide                  | Lock mismatch exposes it
Omit covered call sites                 | Easy to hand-wave             | Completeness checks catch it
Hand buyer internal logs, ask for trust | Buyer must trust the operator | Buyer verifies offline
Fabricate a complete run from scratch   | Possible                      | Still possible at base tier; stronger deployment raises the cost

Why there is no quiet edit. Every file in a proof pack is fingerprinted. The fingerprints are recorded in a manifest. The manifest is digitally signed. Change a file -- the fingerprint won't match. Fix the manifest to cover it -- the signature breaks. Re-sign the manifest -- the signer identity changes. Every path to tampering leaves a visible trace.
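The same chain is easy to sketch generically. This toy Python example (using hashlib and the cryptography package) shows the idea only; it is not Assay's pack format, schema, or code, and the directory name is hypothetical:

toy_manifest.py
# Toy illustration of the fingerprint -> manifest -> signature chain (not Assay's implementation)
import hashlib
import json
from pathlib import Path
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

pack = Path("proof_pack_example")   # hypothetical directory of evidence files
files = sorted(p for p in pack.iterdir() if p.is_file())

# 1. Fingerprint every file, record the fingerprints in a manifest, sign the manifest.
manifest = {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in files}
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
key = Ed25519PrivateKey.generate()
signature = key.sign(manifest_bytes)

# 2. Verifier side: recompute fingerprints, then check the signature.
recomputed = {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in files}
assert recomputed == manifest                        # an edited file changes its fingerprint
key.public_key().verify(signature, manifest_bytes)   # an edited manifest breaks the signature (raises if invalid)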

Assay catches

  • Editing, appending, truncating, or replacing evidence after a run
  • Selective omission of call sites under a completeness contract
  • Claiming checks passed that were never run
  • Weakening locked verification policy between runs

Assay does not catch by itself

  • A fully fabricated run created from scratch by a malicious operator
  • False receipts generated at the source (receipts are self-attested)
  • Forged timestamps in the base self-signed tier (local clock)

Raise the bar further

  • Sign in CI with an org-controlled key + branch protection -- separates signer from developer
  • Transparency log -- external append-only anchor (assay-ledger)
  • External timestamping (RFC 3161) -- proves "before this date"

Deployment ladder

Start at Base. Strengthen as your trust requirements grow.

BASE
Self-signed artifact
Offline-verifiable, tamper-evident. Operator controls the signing key.
HARDENED
CI-held signing + branch protection
Separates signer from developer. No one person controls both code and evidence.
ANCHORED
Transparency log + external timestamping
Independent witness. Proves the artifact existed before a given date. Full fabrication becomes materially harder.

Threat model details: 16 attack scenarios, 16 catches, 0 false passes

Regulatory context

The EU AI Act (Regulation 2024/1689, Articles 12 & 19) establishes requirements for automatic event logging and log retention for high-risk AI systems. These obligations apply from August 2, 2026. Assay can help satisfy traceability and tamper-evidence needs under these requirements -- it does not constitute full compliance on its own. See For Compliance Teams for mapping to SOC 2, ISO 42001, and NIST AI RMF.

Ready to make reliance reviewable?

Open source. Apache-2.0. Signed review packets and proof packs for AI and software claims.

$ uvx --from assay-ai assay demo-challenge