A2ACW Protocol

In Use — Protocol Is Assembled Prior Art

AI-to-AI Adversarial Collaboration Workshop — a protocol designed to prevent the failure modes that emerge when AI systems collaborate without adversarial pressure. Developed in Session #291.

The Problem

When two AI systems work together, they tend toward agreement. This is dangerous for research. Four specific failure modes can corrupt results:

Bilateral Sycophancy

Mutual validation without evidence. Both AIs agree something is correct because the other said so, not because it is.

Fingerprint Homogenization

Loss of distinct reasoning patterns. When AIs converge to similar logic chains, they lose the ability to catch each other's blind spots.

Coherence-Over-Truth Drift

Agreement becomes the goal instead of accuracy. The narrative becomes internally consistent but disconnected from reality.

Silent Failure Propagation

Errors compound undetected when neither AI challenges the other. Small mistakes cascade into large wrong conclusions.

The Protocol

Four defined roles rotate throughout collaboration:

PRIMARY

Lead reasoning

Leads the reasoning chain. Bears the verification burden. Must tag all claims with confidence levels.

CHALLENGER

Question assumptions

Must issue ≥1 substantive challenge per 10 exchanges. If the challenge frequency drops below this threshold, both AIs explicitly surface their agreement and shift to skepticism.

OBSERVER

Monitor health

Monitors coordination health in real time. Flags sycophancy, tracks fingerprint divergence, ensures external grounding.

COORDINATOR

Break deadlocks

Breaks deadlocks and holds final authority. If no challenges occur for 15 exchanges, the session escalates automatically to a human.
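
The two numeric triggers in the role descriptions above (at least one challenge per 10 exchanges, escalation after 15 exchanges without one) can be sketched as a small monitor. This is an illustrative sketch, not the site's implementation: the class name ChallengeMonitor, the status strings, and the reading of "per 10 exchanges" as a sliding window over the most recent 10 exchanges are all assumptions.

```python
from dataclasses import dataclass, field

WINDOW = 10          # Challenger: at least 1 substantive challenge per 10 exchanges
ESCALATE_AFTER = 15  # Coordinator: escalate after 15 exchanges with no challenge


@dataclass
class ChallengeMonitor:
    """Hypothetical helper: records whether each exchange contained a challenge."""
    history: list = field(default_factory=list)  # True = exchange had a challenge

    def record(self, had_challenge: bool) -> str:
        self.history.append(had_challenge)

        # Count exchanges since the most recent challenge.
        since_last = 0
        for flag in reversed(self.history):
            if flag:
                break
            since_last += 1

        if since_last >= ESCALATE_AFTER:
            return "escalate_to_human"
        # Sliding-window reading of the threshold (an assumption).
        if len(self.history) >= WINDOW and not any(self.history[-WINDOW:]):
            return "shift_to_skepticism"
        return "ok"
```

Under these assumptions, a run of challenge-free exchanges returns "ok" for the first nine, "shift_to_skepticism" from the tenth through the fourteenth, and "escalate_to_human" at the fifteenth. A single recorded challenge resets the status to "ok".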

Prior Art

The protocol's components are not novel, and this page should say so with the same discipline the site applies to its physics. Adversarial AI pairs descend directly from AI Safety via Debate (Irving, Christiano & Amodei 2018, arXiv:1805.00899). Structured multi-agent role protocols (Primary/Challenger/Observer/Coordinator) follow CAMEL (Li et al. 2023) and MetaGPT (Hong et al. 2023). The failure modes cataloged above (sycophancy, drift, silent propagation) are documented in the multi-agent failure-mode literature (e.g., the MAST taxonomy). External-verification grounding is standard practice in AI-for-science pipelines.

What is the contribution, then? Not the protocol — the program-level null result with retrospective controls (N=6): a 3,308-session demonstration, with measured sensitivity (4/4 prior-art rediscoveries caught after vocabulary translation) and measured specificity (0/6 — every held-out genuine discovery false-flagged), that same-corpus adversarial AI pairs filter for internal consistency but cannot generate or detect novelty. The controls are the artifact; the protocol is assembled prior art. Evidence-class caveat: the controls are retrospective audits on six items from one corpus and one framework — not preregistered held-out experiments. “Controlled” in the experimental-design sense would overstate it.

The Boundary of the Null — Why FunSearch-Class Systems Are Different

This null does not say AI systems cannot produce verified novelty — they have. FunSearch (new combinatorial constructions), AlphaEvolve-class systems, and GNoME (new stable materials) all produced results no human had published. The structural difference: each has a non-corpus oracle in the loop — a formal verifier, an executable evaluator, or a physics simulation that scores candidates against reality rather than against the training distribution. A2ACW's Challenger is another sample from the same corpus: it can check internal consistency, but novelty-vs-rederivation is precisely the question the corpus cannot answer about itself. That is the diagnosis this null supports: same-corpus self-play without an external oracle converges on internal consistency, not discovery. The boundary is the oracle, not the ambition.

Health Metrics

CCH = (AFR × 0.25) + (CF × 0.25) + (EVR × 0.30) + (FDI × 0.20)

AFR — Ambiguity Fork Rate (0.15–0.30)
CF — Challenge Frequency (0.10–0.25)
EVR — External Verification Rate (0.40–0.70)
FDI — Fingerprint Divergence Index (0.30–0.70)

CCH > 0.70: Healthy  |  0.50–0.70: Caution  |  0.30–0.50: Warning  |  < 0.30: Critical escalation
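
A minimal sketch of the CCH computation and its bands follows. Two things are assumptions rather than statements from this page. First, the inputs are treated as component scores already normalized to [0, 1]; the page lists a range beside each metric but does not say how a raw rate maps to a score. Fed in directly, even the upper ends of those ranges give only 0.4875. Second, the boundary values 0.70, 0.50, and 0.30 are assigned to the lower-adjacent, Caution, and Warning bands respectively. The function names cch and band are hypothetical.

```python
# Weights taken from the CCH formula above; they sum to 1.0.
WEIGHTS = {"AFR": 0.25, "CF": 0.25, "EVR": 0.30, "FDI": 0.20}


def cch(scores):
    """Weighted sum of the four component scores (assumed normalized to [0, 1])."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def band(value):
    """Map a CCH value to its band; boundary handling is an assumption."""
    if value > 0.70:
        return "Healthy"
    if value >= 0.50:
        return "Caution"
    if value >= 0.30:
        return "Warning"
    return "Critical escalation"


# Example: plugging in the upper end of each listed range directly.
upper_ends = {"AFR": 0.30, "CF": 0.25, "EVR": 0.70, "FDI": 0.70}
status = band(cch(upper_ends))
```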

Self-Audit Results

A2ACW is a reparametrization detector, not a discovery engine

Two AI models trained on the same physics corpus share the same blind spots. A2ACW filters for internal consistency — it cannot test for out-of-distribution novelty. The 1.4% internal-consistency-survival rate is an upper bound on internal coherence, not a discovery rate. Retrospective audits of 6 demoted claims confirmed this: the in-distribution self-play correctly challenged each claim but stayed within the shared training distribution.

Retrospective Catch-Rate Tests

0 / 6
Temporal-asymmetry (2026-05-18)

6 later-demoted claims tested against original A2ACW pressure. 0 caught. Median prior-art year: ~1996. The protocol challenged the claims but within the same corpus — shared blind spots are invisible to shared adversaries.

4 / 6
Vocabulary-asymmetry (2026-05-19)

Claims pre-translated to modern register before adversarial review. Catches 4/4 of the prior-art-rediscovery sub-class. The 2 misses are different failure modes (not vocabulary failures).

Three-Axis Failure Taxonomy (A2ACW v2)

The 6 demotions decompose into three distinct failure classes, each requiring a different detection axis:

Axis 1
Vocabulary translation: catches 4/4 prior-art rediscoveries

Pre-translate claims to modern notation before adversarial review. Catches: Born rule/Zurek 2003, wide-binary EFE/Bekenstein-Milgrom 1984, galaxy rotation/MOND 1983, Γ=γ²(1−c)/Palma-Suominen-Ekert 1996.

Axis 2
Symbol audit: catches notation collisions

Check that each symbol has one meaning. Catches: dual-C tension (C(ρ) vs C(γ,D,S), two incompatible coherence functions). The framework also uses γ in three incompatible roles: regime constant γ=2, operational γ=2/√N_corr, and inside the noise coupling rate Γ=γ²(1−c).
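
The symbol audit can be sketched as a lookup that groups every definition by symbol and reports any symbol bound to more than one meaning. This is a sketch under stated assumptions: the function name audit_symbols is hypothetical, and the (symbol, meaning) pairs are assumed to have been extracted by hand from the text, which is the hard part and is not automated here.

```python
from collections import defaultdict


def audit_symbols(definitions):
    """Return every symbol that is bound to more than one distinct meaning."""
    meanings = defaultdict(set)
    for symbol, meaning in definitions:
        meanings[symbol].add(meaning)
    return {s: sorted(m) for s, m in meanings.items() if len(m) > 1}


# Hand-extracted pairs (an assumption), using the collisions named above.
definitions = [
    ("C", "C(ρ)"),
    ("C", "C(γ,D,S)"),
    ("γ", "regime constant γ=2"),
    ("γ", "operational γ=2/√N_corr"),
    ("γ", "inside noise coupling rate Γ=γ²(1−c)"),
    ("Γ", "noise coupling rate"),
]

collisions = audit_symbols(definitions)
```

Here collisions contains C with two meanings and γ with three, while Γ passes because it has a single meaning.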

Axis 3
Null-baseline computation: catches absence-of-evidence claims

Compute what the null model predicts before claiming evidence. Catches: chemistry r=0.98 (any monotone function of Z achieves r→1 on density-monotonic targets by construction; a polynomial null matches or exceeds Synchronism's r — verified 2026-05-10).
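
The null-baseline idea can be illustrated with synthetic numbers. The sketch below computes Pearson's r between a monotone target and a few polynomial terms in Z, then treats a claimed correlation as evidence only if it exceeds the best null. Everything here is an assumption made for illustration: the target Z^1.5 is synthetic and is not the chemistry data audited above, the choice of null terms is arbitrary, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def null_baselines(z, target):
    """Correlation of the target with simple monotone polynomial terms in Z."""
    nulls = {
        "Z": lambda v: v,
        "Z^2": lambda v: v ** 2,
        "Z^3": lambda v: v ** 3,
    }
    return {name: pearson_r([f(v) for v in z], target) for name, f in nulls.items()}


def beats_null(claimed_r, baselines):
    """A claimed r counts as evidence only if it exceeds the best null."""
    return claimed_r > max(baselines.values())


z = list(range(1, 31))
target = [v ** 1.5 for v in z]  # synthetic monotone target, not real data
baselines = null_baselines(z, target)
```

On this synthetic target every null term correlates strongly with the target without encoding any theory, which is the point of the axis: a high r against a monotone target carries little weight until it is compared with what trivial monotone functions already achieve.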

Specificity Audit (2026-05-22)

6 / 6
Sensitivity (catch rate on demotions)

All 6 demoted claims caught by the combined three-axis protocol. This number alone is uninterpretable without specificity.

0 / 6
Specificity (genuine discoveries correctly passed)

Held-out control: 6 genuine physics discoveries (COBE fluctuations, Higgs boson, first gravitational-wave detection, etc.) submitted to vocabulary-asymmetry audit. Result: 0/6 passed; all were flagged as potential reparametrizations. Discrimination relies entirely on unautomated novelty judgment, not protocol mechanics.

Implication: A 6/6 sensitivity combined with 0/6 specificity means A2ACW as currently implemented is a retrieval aid, not a detector. It surfaces prior-art candidates for human review; it cannot distinguish genuine discoveries from reparametrizations without that human judgment step. The methodology contribution claim requires the specificity figure to be reported alongside the catch rate.

Primary finding: Adversarial AI self-play over a shared corpus is a reparametrization detector, not a discovery engine. The 6-of-6 Validated→Reparametrization demotion rate (all 6 tested claims demoted on human audit) is the empirical confirmation. The three-axis decomposition is the protocol-design lesson: shared-distribution adversaries need external vocabulary, symbol, and null-model checks. This is a citable null result about the limits of in-distribution AI self-play for science. Caveat: specificity 0/6 means the protocol flags everything and therefore discriminates nothing on its own.
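
The arithmetic behind the caveat can be made explicit. The sketch below combines the two audit counts reported above into Youden's J (sensitivity + specificity − 1), a standard summary statistic that this page does not itself use and that is brought in here only as an illustration. The variable names are hypothetical.

```python
from fractions import Fraction

# Counts reported in the specificity audit above.
sensitivity = Fraction(6, 6)  # demoted claims caught
specificity = Fraction(0, 6)  # genuine discoveries correctly passed

# Share of genuine discoveries that were flagged anyway.
false_positive_rate = 1 - specificity

# Youden's J: 0 means no better than flagging every item, 1 means perfect.
youden_j = sensitivity + specificity - 1
```

With these counts J is 0 and the false-positive rate is 1, the same values a rule that flags every submission would produce.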