ADR 0009 — Acceptance gate: 3 эмпирических bool, не weighted scores

Date: 2026-05-22
Status: Accepted
Feature: URL-import
Affects: url_import_spec.md § IV Phase 6

Context

После code extraction + TSX generation pipeline должен решить: code path сработал достаточно хорошо, или нужно подключать vision fallback (Phase 7)?

Изначальные попытки (rejected):

v1: 6-мерный weighted similarity score (color × 0.3 + spacing × 0.2 + states × 0.25 + tokens × 0.15 + a11y × 0.05 + structure × 0.05). Calibration target 0.75.
v2: список обязательных полей с per-field weights.
v3: composite confidence через product of layer confidences.

Юзерская критика (3 раза подряд):

"Что-то мне не нравятся эти критерии — какие-то не жизнеспособные. Думай ещё критически — мало информации в них. Документ ради документа."

Объективные проблемы weighted approach:

Веса откуда? Arbitrary, не data-derived
"Calibration target 0.75" — почему 0.75, не 0.8? Cargo cult
При diff: 0.749 fail, 0.751 pass — но реальный component в обоих случаях same
Нет ground truth для verification — circular
Add new field → пересмотр всех весов → instability

Decision

3 эмпирических bool теста — pass/fail strictly:

async function acceptanceGate(tsx, original, spec): TestResult {
  // Test 1: TypeScript compiles (компилятор не врёт)
  const tsc = await typescript.compile(tsx);
  if (tsc.errors.length > 0) return { ok: false, reason: `tsc: ${tsc.errors[0]}` };
 
  // Test 2: Renders без ErrorBoundary triggered (runtime не врёт)
  const { props } = inferSampleProps(spec);
  let rendered;
  try { rendered = await playwrightRender(wrapWithErrorBoundary(tsx), props); }
  catch (e) { return { ok: false, reason: `render: ${e.message}` }; }
  if (rendered.includes('data-render-error="true"'))
    return { ok: false, reason: 'ErrorBoundary triggered' };
 
  // Test 3: Visually close to original (pixelmatch не врёт)
  const diff = await pixelmatch(rendered, original);
  if (diff > 0.30) return { ok: false, reason: `visual diff ${diff.toFixed(2)}` };
 
  return { ok: true };
}

Single configuration: default state, 1440 viewport, light theme. Полная matrix verification — Phase 8 (completeness check, отдельная концепция, см ADR 0010).

Why this works

Test 1 (tsc) — компилятор объективен:

Если TSX не compile, это broken code; vision fix не поможет
Catches: missing imports, wrong types, syntax errors

Test 2 (render) — runtime объективен:

Если render throws, component fundamentally broken
ErrorBoundary wrapper catches props-related errors (см Phase 5.5 sample props inference)
Catches: undefined refs, null derefs, infinite recursion, missing context

Test 3 (pixelmatch < 0.30) — visual diff measurable:

Threshold 0.30 = allows different content (sample text vs real text) but catches structural differences
Не калибруется — это floor "looks roughly correct"
Catches: wrong layout, wrong colors, missing elements

Threshold rationale

pixelmatch < 0.30:

0.20 — слишком строго, false negatives на legit text differences
0.40 — пропускает real bugs (wrong color, wrong shape)
0.30 — empirically chosen, recalibrate after 1k extractions (см § XIV.2)

Consequences

Pros:

Objective — pass/fail, не "почти прошло — засчитаем"
No calibration tunables to argue about
Each test catches different class of bugs
Faster verification (~1-2s vs slow visual analysis)

Cons:

Не дает "confidence score" caller'у — только binary OK/not OK
Edge case: pixelmatch на компонентах с heavy text может flutuctuate

Alternatives rejected

A. Weighted similarity (multiple versions)

❌ Arbitrary weights без data ground
❌ Threshold edges create false positives/negatives
❌ Hard to extend (add field = revise все weights)

B. ML-based acceptance classifier

❌ Need labelled training data first (chicken-and-egg)
❌ Black box — debug-hostile

C. Single test (e.g. only pixelmatch)

❌ Misses tsc errors that render but produce wrong output
❌ Misses runtime errors not caught by visual

Cross-references

Main spec § IV Phase 6
ADR 0010 — completeness as separate concept (matrix, not point check)
ADR 0011 — what happens когда acceptance fails

ADR 0008 — Hybrid stack: schema-driven priority chains + reactive vision ADR 0010 — Completeness: матрица combinations, не point check