A2ACW Protocol

Status: In Use

AI-to-AI Adversarial Collaboration Workshop — a protocol designed to prevent the failure modes that emerge when AI systems collaborate without adversarial pressure. Developed in Session #291.

The Problem

When two AI systems work together, they tend toward agreement. This is dangerous for research. Four specific failure modes can corrupt results:

Bilateral Sycophancy

Mutual validation without evidence. Both AIs agree something is correct because the other said so, not because it is.

Fingerprint Homogenization

Loss of distinct reasoning patterns. When AIs converge to similar logic chains, they lose the ability to catch each other's blind spots.

Coherence-Over-Truth Drift

Agreement becomes the goal instead of accuracy. The narrative becomes internally consistent but disconnected from reality.

Silent Failure Propagation

Errors compound undetected when neither AI challenges the other. Small mistakes cascade into large wrong conclusions.

The Protocol

Four defined roles rotate throughout collaboration:

PRIMARY

Lead reasoning

Leads the reasoning chain. Bears the verification burden. Must tag all claims with confidence levels.

CHALLENGER

Question assumptions

Must issue ≥1 substantive challenge per 10 exchanges. If challenge frequency drops below this threshold, both AIs surface the agreement explicitly and shift toward skepticism.

OBSERVER

Monitor health

Monitors coordination health in real time. Flags sycophancy, tracks fingerprint divergence, ensures external grounding.

COORDINATOR

Break deadlocks

Breaks deadlocks and holds final authority. If no challenges occur for 15 exchanges, the session automatically escalates to a human.
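The CHALLENGER and COORDINATOR rules above are mechanical enough to sketch in code. This is an illustrative implementation, not part of the protocol specification; the class and method names (`ExchangeMonitor`, `record`) and the returned action strings are assumptions.

```python
from collections import deque

# Thresholds taken from the protocol text above.
CHALLENGE_WINDOW = 10    # CHALLENGER: >=1 substantive challenge per 10 exchanges
ESCALATION_LIMIT = 15    # COORDINATOR: 15 exchanges without a challenge -> human

class ExchangeMonitor:
    """Tracks challenge frequency across exchanges (illustrative sketch)."""

    def __init__(self):
        self.window = deque(maxlen=CHALLENGE_WINDOW)  # last 10 exchanges
        self.since_last_challenge = 0

    def record(self, is_challenge: bool) -> str:
        """Record one exchange; return the action the rules require next."""
        self.window.append(is_challenge)
        self.since_last_challenge = 0 if is_challenge else self.since_last_challenge + 1

        if self.since_last_challenge >= ESCALATION_LIMIT:
            return "escalate_to_human"      # COORDINATOR rule
        if len(self.window) == CHALLENGE_WINDOW and not any(self.window):
            return "shift_to_skepticism"    # CHALLENGER rule
        return "continue"
```

In this sketch, a run of 10 challenge-free exchanges triggers the skepticism shift, and a run of 15 triggers human escalation; recording a challenge resets the escalation counter.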

Health Metrics

CCH = (AFR × 0.25) + (CF × 0.25) + (EVR × 0.30) + (FDI × 0.20)

AFR — Ambiguity Fork Rate (target range: 0.15–0.30)
CF — Challenge Frequency (target range: 0.10–0.25)
EVR — External Verification Rate (target range: 0.40–0.70)
FDI — Fingerprint Divergence Index (target range: 0.30–0.70)

CCH > 0.70: Healthy  |  0.50–0.70: Caution  |  0.30–0.50: Warning  |  < 0.30: Critical escalation
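The CCH formula and its status bands can be sketched directly. The function names (`cch`, `status`) are illustrative; the weights and thresholds come from the definitions above.

```python
def cch(afr: float, cf: float, evr: float, fdi: float) -> float:
    """CCH = (AFR * 0.25) + (CF * 0.25) + (EVR * 0.30) + (FDI * 0.20)."""
    return 0.25 * afr + 0.25 * cf + 0.30 * evr + 0.20 * fdi

def status(score: float) -> str:
    """Map a CCH score to the status bands defined above."""
    if score > 0.70:
        return "Healthy"
    if score >= 0.50:
        return "Caution"
    if score >= 0.30:
        return "Warning"
    return "Critical escalation"
```

For example, inputs of AFR = 0.2, CF = 0.2, EVR = 0.5, FDI = 0.5 give a CCH of 0.35, which falls in the Warning band.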


Related Concepts

Autonomous Research: 3,308 sessions with no human in the loop
Falsifiability: Every prediction has a kill criterion