What can an enterprise buyer actually verify about an AI product?

Very little at useful granularity. Most enterprise AI products provide dashboards, compliance badges, and self-reported metrics. These may be produced by reputable auditors or monitoring tools, but they rarely cover specific system behavior at the per-decision level. In practice, many vendor artifacts combine stale, human-attested, and out-of-scope evidence — the buyer sees what the vendor chooses to show, at the granularity the vendor chooses. The gap between "we logged it" and "you can independently verify it" is one of the hardest unsolved problems in enterprise AI procurement.

This page defines four evidence states for AI system claims, outlines what buyers typically receive versus what they could receive, and provides a framework for evaluating evidence readiness at three levels of maturity. The goal is to help buyers ask better questions and help vendors build evidence that actually withstands scrutiny.

The four evidence states

Every claim an AI vendor makes about their system's behavior falls into one of four evidence states. These states describe how independently verifiable the evidence behind the claim actually is.

Machine-verifiable
  Definition: Can be checked by software without trusting the producer. Cryptographic signatures, hash chains, schema conformance.
  Example: A signed execution receipt whose signature and hash chain can be verified offline by a third party.
  Verification cost: Seconds. Automated.

Human-attested
  Definition: Requires a qualified person to review and confirm. The evidence exists but cannot be machine-checked.
  Example: A compliance officer reviews model evaluation results and signs off on deployment readiness.
  Verification cost: Hours to days. Requires expertise.

Stale
  Definition: Evidence existed at one point but is no longer current. The system has changed since the evidence was produced.
  Example: A model evaluation from six months ago, run against a model version that has since been updated three times.
  Verification cost: Cannot verify current state. Must re-evaluate.

Missing
  Definition: Evidence that should exist based on the vendor's claims but does not. The claim is unsupported.
  Example: Vendor claims "all decisions are logged" but no execution receipts exist for the reviewer to inspect.
  Verification cost: Infinite. Evidence does not exist.

The key distinction: Logs record what happened on a system the operator controls. Evidence records what happened in a form a third party can independently verify. Logs can be edited, filtered, or deleted by whoever controls the server. Evidence, when properly constructed, makes tampering detectable — editing one byte causes verification to fail.
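
The "editing one byte" property comes from ordinary hashing, not from anything exotic. Below is a minimal sketch in Python using only the standard library; the receipt fields and chaining scheme are illustrative, not Assay's actual format.

import hashlib
import json

def chain_receipts(receipts):
    """Link receipts into a hash chain: each entry commits to the previous digest."""
    prev = "0" * 64  # genesis value
    chained = []
    for receipt in receipts:
        body = json.dumps({"receipt": receipt, "prev": prev}, sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        chained.append({"receipt": receipt, "prev": prev, "digest": digest})
        prev = digest
    return chained

def verify_chain(chained):
    """Recompute every digest; any edited byte breaks the chain from that point on."""
    prev = "0" * 64
    for entry in chained:
        body = json.dumps({"receipt": entry["receipt"], "prev": prev}, sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != entry["digest"]:
            return False
        prev = entry["digest"]
    return True

receipts = [
    {"call": "llm.chat", "model": "example-model", "output_sha256": "ab12"},
    {"call": "llm.chat", "model": "example-model", "output_sha256": "cd34"},
]
chain = chain_receipts(receipts)
assert verify_chain(chain)                     # untouched chain verifies
chain[0]["receipt"]["output_sha256"] = "ab13"  # change one byte
assert not verify_chain(chain)                 # verification now fails

A signature over the final digest (or over the manifest that lists the digests) extends the same guarantee across parties: the operator cannot silently rewrite history without the recomputed hashes and the signed value disagreeing.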

What buyers typically receive today

A scan of 30 open-source AI projects found 231 high-confidence LLM call sites. None had Assay-compatible tamper-evident instrumentation.1 These were open-source projects, not enterprise products — but the underlying pattern (logs on the operator's server, no portable proof artifacts) is common across the industry. Here is what buyers typically receive versus what independent verification would require:

"What did your AI actually do with our data?"
  Typical vendor response: Dashboard showing aggregated metrics. Logs available "on request."
  Evidence state: Missing.
  What independent verification requires: Per-call execution receipts with input/output hashes, signed at execution time, verifiable without server access.

"How do we know the AI behaved correctly?"
  Typical vendor response: Compliance badge. SOC 2 report. "We follow responsible AI principles."
  Evidence state: Stale.
  What independent verification requires: Continuous evidence tied to specific model versions and decision points, not point-in-time certifications.

"Can we audit a specific decision?"
  Typical vendor response: "We can pull the logs for you." Delivered days later, from the vendor's own servers.
  Evidence state: Human-attested.
  What independent verification requires: A portable proof artifact the buyer can verify independently, offline, without asking the vendor.

"What happens when your AI fails?"
  Typical vendor response: "We have robust error handling." Rarely includes examples of honest failure evidence.
  Evidence state: Missing.
  What independent verification requires: Failure receipts that prove the system detected and reported its own failure, not just success-path evidence.

"Is this the same model you evaluated?"
  Typical vendor response: "We continuously monitor performance." Evaluation dates not tied to deployment versions.
  Evidence state: Stale.
  What independent verification requires: Model version identifiers cryptographically linked to both evaluation results and production execution receipts.

The pattern is consistent: buyers ask about specific, verifiable behavior, and vendors respond with general assurances backed by evidence the buyer cannot independently check.
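
The first requirement above, per-call execution receipts with input/output hashes signed at execution time, is concrete enough to sketch. The wrapper below is hypothetical (it is not Assay's API, and the field names are illustrative); it uses the cryptography package's Ed25519 primitives for signing.

import hashlib
import json
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()  # in production: a managed key whose public half is published

def call_with_receipt(call_provider, prompt, model_version):
    """Hypothetical wrapper: make the model call, then sign a receipt at execution time."""
    output = call_provider(prompt)
    receipt = {
        "model": model_version,
        "timestamp": time.time(),
        "input_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    payload = json.dumps(receipt, sort_keys=True).encode()
    signature = signing_key.sign(payload).hex()
    return output, {"receipt": receipt, "signature": signature}

A reviewer holding the corresponding public key can recompute both hashes from the disclosed input and output and check the signature without any access to the vendor's servers.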

Three levels of evidence readiness

Not every AI system needs the same level of evidence. The right level depends on what's at stake — a content recommendation engine and a medical triage system have different evidence requirements. But every system making claims about its behavior should be able to answer: what can a reviewer actually verify?

Baseline
  What the vendor proves: "This system ran, and here is what it called."
  What the buyer can verify: That specific API calls were made, to which providers, with which model versions, at which times.
  Evidence artifacts: Signed execution receipts per call. Call manifest listing all expected receipts. Completeness check: any dropped receipt is visible.

Packet
  What the vendor proves: "Here is a portable evidence bundle a reviewer can verify offline."
  What the buyer can verify: Signature validity. Hash chain integrity. Receipt completeness. Schema conformance. All without server access.
  Evidence artifacts: Proof pack: a 5-file bundle (manifest, receipts, signatures, call map, verification metadata). Offline verifier. Tamper detection: edit one byte, verification fails.

Continuous
  What the vendor proves: "Every decision produces inspectable evidence automatically."
  What the buyer can verify: Real-time evidence stream. Version-linked evaluations. Failure receipts alongside success receipts. Evidence posture: what percentage of claims have machine-verifiable support.
  Evidence artifacts: All Packet artifacts, plus evidence posture reports, freshness checks, model-version binding, and honest failure reporting (the system proves it reported its own errors).
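
The Baseline completeness check is conceptually simple set arithmetic: compare the receipts actually delivered against the list the manifest declares. A sketch, with illustrative identifiers:

def completeness_report(manifest_ids, receipt_ids):
    """Compare delivered receipts against the manifest so dropped receipts become visible."""
    manifest_ids, receipt_ids = set(manifest_ids), set(receipt_ids)
    return {
        "missing": sorted(manifest_ids - receipt_ids),     # declared but never delivered
        "unexpected": sorted(receipt_ids - manifest_ids),  # delivered but never declared
        "complete": manifest_ids == receipt_ids,
    }

completeness_report(["call-001", "call-002", "call-003"], ["call-001", "call-003"])
# -> {'missing': ['call-002'], 'unexpected': [], 'complete': False}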

Honest failure is the hardest test. Any system can produce evidence when things go right. The real question is: does the system produce evidence when things go wrong? A vendor that can prove their AI failed honestly — and that the failure evidence was not edited after the fact — demonstrates a fundamentally higher standard of accountability than one that only shows success-path logs.
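
At the instrumentation level, honest failure means the error path emits a receipt just as the success path does. A sketch of that shape; call_provider and emit_signed_receipt are stand-ins for the real provider call and whatever signing routine the system uses (for example, the one sketched earlier), and the schema is illustrative.

import hashlib

def call_with_honest_failure(call_provider, emit_signed_receipt, prompt):
    """Emit a signed receipt on the success path and on the error path."""
    try:
        output = call_provider(prompt)
        emit_signed_receipt({
            "status": "success",
            "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        })
        return output
    except Exception as exc:
        # The failure itself becomes signed, verifiable evidence rather than a silent log line.
        emit_signed_receipt({"status": "failure", "error_type": type(exc).__name__})
        raise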

Questions a buyer should ask

These questions are designed to surface the actual evidence state behind vendor claims. They are ordered from easiest to hardest for the vendor to answer.

  1. "Can I verify your AI's behavior without accessing your servers?" — Tests whether evidence exists outside the vendor's control.
  2. "Show me the evidence for one specific decision." — Tests whether per-decision evidence exists or only aggregates.
  3. "What happens to your evidence trail if I edit one byte?" — Tests for tamper detection. If the answer is "nothing," the evidence is not cryptographically bound.
  4. "Show me a failure receipt." — Tests whether the vendor produces evidence for negative outcomes, not just positive ones.
  5. "How do I know this evidence was produced at execution time, not reconstructed afterward?" — Tests for execution-time binding versus retroactive log assembly.
  6. "What percentage of your AI's decisions have machine-verifiable evidence?" — Tests evidence completeness. The answer is usually zero or unknown.

Why this matters now

The EU AI Act requires that high-risk AI systems maintain logs of operation that enable traceability and auditability.2 The NIST AI Risk Management Framework calls for data and algorithm provenance, documentation that enables third-party evaluation, and governance structures around AI system components.3 ISO/IEC 42001 specifies requirements for AI management systems including documented evidence of conformity.4

These frameworks share a common requirement: evidence that a third party can evaluate. Not logs the vendor controls. Not dashboards the vendor curates. Evidence that exists independently of the vendor's infrastructure and can be verified by someone who has no reason to trust the vendor.

The gap between current practice and these requirements is substantial. Most AI systems today produce evidence that is, at best, human-attested and vendor-controlled. Moving toward machine-verifiable, independently checkable evidence is not a theoretical exercise — it is an emerging compliance requirement.

Verify this yourself

Do not take these claims on faith. Verify one yourself.

15-second demo

Run Assay's challenge demo. It builds a good proof pack and a tampered copy, verifies both, and shows the tampered copy failing verification. No server, no API key, no setup.

$ uvx --from assay-ai assay demo-challenge

  good/: VERIFICATION PASSED
  tampered/: VERIFICATION FAILED -- hash mismatch

What's inside a proof pack

A proof pack is a portable evidence bundle. It contains:

  - A manifest listing all expected receipts
  - Signed execution receipts, one per call
  - Signatures (Ed25519)
  - A call map
  - Verification metadata

Everything a verifier needs is in the bundle. Verification does not depend on vendor-controlled infrastructure.
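
To make "verify offline" concrete, here is a minimal sketch of such a check. The file names and manifest layout are assumptions for illustration only (Assay's actual pack format may differ; use its own verifier for real packs). The point is that every step runs locally, against the bundle's contents and a published public key.

import hashlib
import json
import pathlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_pack(pack_dir, public_key_bytes):
    """Offline check: the manifest signature is valid, and every declared receipt is present and unmodified."""
    pack = pathlib.Path(pack_dir)
    manifest_bytes = (pack / "manifest.json").read_bytes()
    manifest = json.loads(manifest_bytes)
    receipts = json.loads((pack / "receipts.json").read_text())

    # 1. Signature: was this manifest produced by the holder of the signing key?
    public_key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    signature = bytes.fromhex((pack / "signatures.json").read_text().strip())
    try:
        public_key.verify(signature, manifest_bytes)
    except InvalidSignature:
        return False

    # 2. Completeness and integrity: every receipt the manifest declares must be
    #    present and must hash to exactly the digest the manifest recorded.
    for receipt_id, expected_sha256 in manifest["receipts"].items():
        receipt = receipts.get(receipt_id)
        if receipt is None:
            return False  # declared but missing: a completeness failure
        canonical = json.dumps(receipt, sort_keys=True).encode()
        if hashlib.sha256(canonical).hexdigest() != expected_sha256:
            return False  # modified after signing: a hash mismatch
    return True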

Browse real scenarios

The Proof Gallery contains three scenarios that demonstrate the evidence states described above: a clean pass, an honest failure (the system reports its own error), and a tamper attempt (verification catches the edit). Each scenario includes downloadable proof packs.

Verify in your browser

The online verifier checks proof pack integrity client-side. Nothing is uploaded. Drag a proof pack in, see whether it passes. Works with the gallery samples or your own packs.

How to read verification results

PASS
  What it means: All hashes match. Signature valid. All claimed receipts present. Evidence integrity intact.
  What to do: Treat the artifact as trustworthy as-is. Check freshness (when was it created?).

Honest fail
  What it means: The system detected and reported its own failure. The failure receipt is signed and verifiable.
  What to do: Treat it as evidence of accountability; the system did not hide the error. Investigate the failure, not the evidence.

FAIL (hash mismatch)
  What it means: Content was modified after the pack was signed. Tampering detected.
  What to do: Do not trust this artifact. Request a fresh, unmodified proof pack from the vendor.

FAIL (missing receipts)
  What it means: The manifest declares receipts that are not present. Evidence is incomplete.
  What to do: Ask the vendor why receipts were dropped. A completeness failure is a red flag.

Why this matters: Everything described on this page — tamper detection, hash chains, signature verification, honest failure reporting — is operational today in an open-source tool. The question for buyers is not whether the technology exists. It is whether their vendors use it.

Current capability boundary

Assay is an open-source toolkit, not a turnkey platform. Here is what it demonstrates today and what teams still need to build.

What Assay does today:
  Proof pack generation (receipts, manifest, Ed25519 signatures)
  Offline verification and tamper detection
  Receipt completeness checks
  Honest failure scenarios (system reports its own errors)

What teams still implement:
  Full production instrumentation across all call sites
  Org-specific policy integration and trust root management
  Coverage discipline (ensuring all contracted call sites emit receipts)
  Continuous evidence pipelines tied to deployment workflows

One question to ask your vendor

"Show me one signed proof pack for one production decision, and let me verify it independently."

If the vendor can answer this, they have machine-verifiable evidence. If they cannot, their claims are not independently machine-verifiable — you are relying on vendor-controlled attestations rather than portable cryptographic evidence.

References

  1. Assay scan study: 30 AI projects, 231 high-confidence LLM call sites, 0 with Assay-compatible tamper-evident instrumentation. Methodology and results (March 2026).
  2. Regulation (EU) 2024/1689 (EU AI Act), Article 12 — Record-keeping: requires high-risk AI systems to have logging capabilities that enable traceability, monitoring, and post-market surveillance. Article 14 requires human oversight with access to operational data.
  3. National Institute of Standards and Technology. AI Risk Management Framework (AI RMF 1.0), January 2023. NIST AI 100-1. See MAP 2.3 (provenance), MEASURE 2.5–2.6 (third-party evaluation), GOVERN 1.2 (documentation).
  4. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system. Specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system, including documented evidence of conformity.