ADR 0017: Shadow data via ToS-based disclosure, not a registration checkbox

Context

The data flywheel (ADR 0013) requires a shadow dataset: a log of production extractions used for quarterly LoRA training (ADR 0014).

The UX/legal question: how do we obtain consent for data collection?

Initial proposal (rejected): registration checkbox

☑ I own the rights to this URL
☐ Use my import to improve ARNO (anonymously)  ← default OFF

Problems:

  • Default OFF → opt-in rate ~15% (industry baseline for secondary checkboxes)
  • The flywheel runs 5-10× slower → the Q1 LoRA training run gets 1-2k examples instead of 10k
  • Accuracy stays below baseline → no LoRA deployment → the § 0.8 requirement ("the longer we live, the less we pay") is broken
  • Data flywheel multiplier dies

User's counter-proposal:

"Why can't we just do this in the background and not show it to the user? It's simply a feature of the system."

Decision

ToS-based disclosure (industry pattern Linear / Figma / Notion / Sentry):

  1. At registration: the user accepts the Terms of Service, which disclose shadow data collection
  2. No registration checkbox — clean UX, less friction
  3. Settings → Privacy has opt-out toggle (default ON)
  4. Privacy Policy footer discloses Tranco attribution + third-party services

ToS clause text

By using ARNO ("Service"), you acknowledge that we collect anonymized
usage data including:
  - Anonymized URLs (cryptographically hashed via SHA-256)
  - Component metadata (without personally identifiable information)
  - Extraction outcomes and any corrections you make

This data is used solely to improve Service accuracy and quality. We
do not sell or share this data with third parties.

You may opt out at any time via Settings → Privacy → "Anonymous data
contribution". Past contributions can be deleted via Settings →
Privacy → "Delete past contributions".

For European users (GDPR), this processing relies on legitimate
interest in service improvement (Art. 6(1)(f)). You have the right
to object at any time without affecting Service availability.

Full draft: url_import_tos_clause_draft.md. Legal review pending (~$500-1000 with a privacy lawyer).

Why this works

Legal foundation

  1. GDPR Art. 6(1)(f), legitimate interest: service improvement is a recognized purpose
  2. GDPR Art. 13/14: disclosure is satisfied via ToS + Privacy Policy
  3. GDPR Art. 17/21: "Delete past contributions" covers the right to erasure (Art. 17); the opt-out toggle covers the right to object (Art. 21)
  4. CCPA: disclosure at the collection point + opt-out mechanism

Industry pattern

| Company | Approach |
|---|---|
| Linear | Usage analytics ToS-disclosed, opt-out in settings |
| Figma | Anonymized usage data, ToS-based, opt-out |
| Notion | Telemetry ToS-disclosed, settings toggle |
| Sentry | Performance data ToS-disclosed |
| GitHub Copilot | Code suggestions, ToS-disclosed, settings opt-out |

Industry baseline — default ON, ToS-disclosed, settings opt-out.

Realistic opt-out rate

Industry data:

  • Default OFF + checkbox: ~15% opt-in
  • Default ON + settings opt-out: ~85-95% effective participation (5-15% opt-out)

With ToS-based disclosure, the estimate is ~95-100% effective participation (only ~0-5% of users are expected to seek out the settings opt-out).

Flywheel mathematics:

  • 15% participation → 1.5k examples/quarter → insufficient for LoRA
  • 95% participation → 9.5k examples/quarter → meets LoRA training threshold
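
The arithmetic behind these two figures can be sketched as follows. The base of 10,000 eligible extractions per quarter is an assumption inferred from the "10k" target in the Context section, and `examplesPerQuarter` is a hypothetical helper name, not part of the ARNO codebase.

```javascript
// Assumption: ~10,000 eligible extractions per quarter (inferred from the 10k target).
const ELIGIBLE_PER_QUARTER = 10000;

// Expected shadow-dataset size for a given participation rate (0..1).
function examplesPerQuarter(participationRate, eligible = ELIGIBLE_PER_QUARTER) {
  if (participationRate < 0 || participationRate > 1) {
    throw new RangeError("participationRate must be between 0 and 1");
  }
  return Math.round(eligible * participationRate);
}

console.log(examplesPerQuarter(0.15)); // 1500
console.log(examplesPerQuarter(0.95)); // 9500
```

If the real quarterly volume differs from the assumed base, both figures scale linearly with it.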

UX implementation

Registration UI (clean)

URL: [_______________________]
☑ I own the rights to this URL
License: ⊙ Owned  ○ Licensed  ○ Public domain
⚠️ [warning for Tranco top 15k commercial sites]

[By accepting the Terms of Service, you agree to the use of
 anonymized data. See the Privacy Policy for details.]

[Import]

No registration checkbox is shown for shadow data; that is covered by the ToS.

Settings → Privacy

Privacy Settings
  ☑ Anonymous data contribution  (default ON)
     Help improve ARNO by allowing anonymized usage data.
     [Learn more] [Delete past contributions]

The Settings opt-out applies to ALL data collection (shadow + analytics).
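
One way to make a single toggle govern every collection path is a shared gate that each logger calls first. This is a minimal sketch: `canCollect` and `COLLECTION_KINDS` are hypothetical names, and `user.settings.contributeData` is taken from the backend snippet in this ADR. It assumes the default (ON) is written to the user's settings at account creation, so a missing value is treated as opted out.

```javascript
// Hypothetical list of collection paths governed by the single Privacy toggle.
const COLLECTION_KINDS = Object.freeze(["shadow", "analytics"]);

// Returns true only when the user's contribution toggle is explicitly ON.
// Missing settings fail closed (no collection).
function canCollect(user, kind) {
  if (!COLLECTION_KINDS.includes(kind)) {
    throw new Error(`Unknown collection kind: ${kind}`);
  }
  return user?.settings?.contributeData === true;
}

const optedIn = { settings: { contributeData: true } };
const optedOut = { settings: { contributeData: false } };
console.log(canCollect(optedIn, "shadow"));     // true
console.log(canCollect(optedOut, "analytics")); // false
```

Routing both shadow logging and analytics through one function keeps the opt-out from drifting out of sync between code paths.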

Backend implementation

// `sha256` (hex digest helper) and `b2` (storage client) are assumed to be defined elsewhere.
async function logToShadowDataset(extraction, user) {
  if (!user.settings.contributeData) return;  // user opted out

  // Read corrections once so the gold check and the stored field cannot disagree.
  const corrections = extraction.corrections ?? null;
  const isGold = corrections !== null;  // user-corrected extractions are "gold"
  const shouldSample = isGold || Math.random() < 0.10;  // 100% gold + 10% uncorrected
  if (!shouldSample) return;

  await b2.upload({
    url_hash: sha256(extraction.url),  // raw URL is never stored
    capture_matrix: extraction.screenshots,
    component_spec: extraction.spec,
    tsx: extraction.tsx,
    user_corrections: corrections,
    mode: extraction.mode,
    cost_usd: extraction.cost,
    latency_ms: extraction.latency,
  });
}
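
The sampling rule is easier to unit-test when the random source is injected. The sketch below isolates that decision; `shouldSample`, `UNCORRECTED_SAMPLE_RATE`, and the `rng` parameter are hypothetical names introduced for illustration, and the `corrections` field name is an assumption taken from the snippet above.

```javascript
// 10% of uncorrected extractions are kept; corrected ("gold") ones are always kept.
const UNCORRECTED_SAMPLE_RATE = 0.10;

// rng is injectable so tests can be deterministic; it defaults to Math.random.
function shouldSample(extraction, rng = Math.random) {
  const isGold = extraction.corrections != null;
  return isGold || rng() < UNCORRECTED_SAMPLE_RATE;
}

console.log(shouldSample({ corrections: ["fix padding"] }, () => 0.99)); // true
console.log(shouldSample({ corrections: null }, () => 0.05));            // true
console.log(shouldSample({ corrections: null }, () => 0.5));             // false
```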

Technical anonymization

URL → SHA-256 hash (raw URL not stored). Component metadata stripped of:

  • User IDs
  • Email addresses
  • Auth tokens
  • Personal data in URLs (query params containing PII)

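Target: the GDPR "anonymized" standard (irreversible, no re-identification possible). An unsalted SHA-256 of a public URL can be matched by hashing candidate URLs, so whether this qualifies as anonymization rather than pseudonymization should be confirmed in the pending legal review.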
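
The steps above might look like the following sketch, which uses Node's built-in `crypto` module and the WHATWG `URL` class. The names `SENSITIVE_KEYS`, `stripQuery`, and `scrubMetadata` are hypothetical, as are the metadata key names; this ADR does not specify the real list. Only top-level metadata fields are handled here.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical metadata keys treated as identifying.
const SENSITIVE_KEYS = new Set(["userId", "email", "authToken"]);

// Hex-encoded SHA-256 digest of a string.
function sha256(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// For http(s) URL strings, drops the query string and fragment.
// Any other value is returned unchanged.
function stripQuery(value) {
  if (typeof value !== "string" || !/^https?:\/\//i.test(value)) return value;
  try {
    const url = new URL(value);
    url.search = "";
    url.hash = "";
    return url.toString();
  } catch {
    return value; // not a parseable URL
  }
}

// Returns a copy of the metadata without sensitive keys and without URL query strings.
function scrubMetadata(metadata) {
  const clean = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (SENSITIVE_KEYS.has(key)) continue;
    clean[key] = stripQuery(value);
  }
  return clean;
}

console.log(scrubMetadata({
  email: "a@example.com",
  source: "https://example.com/page?email=a@example.com",
  variant: "primary",
}));
// { source: 'https://example.com/page', variant: 'primary' }
```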

Consequences

Pros:

  • Flywheel mathematics work: sufficient data for quarterly LoRA training
  • Cleaner registration UX — less friction
  • Industry-standard approach — proven legal pattern
  • § 0.8 requirement satisfied

Cons:

  • Requires ToS/Privacy Policy legal review (~$500-1000)
  • Settings opt-out path must be prominent (UX work)
  • Discovery: some users will find the opt-out only after months

Risks

| Risk | Mitigation |
|---|---|
| Privacy backlash if discovered | Industry-standard disclosure, prominent settings toggle, transparent docs |
| GDPR fine if found insufficient | Pre-launch legal review, anonymization meets standard, opt-out within 30 days |
| EU AI Act (Aug 2026) changes requirements | Re-review legal clauses post-effective date, update if needed |

Alternatives rejected

A. Registration checkbox default OFF

  • ❌ Flywheel mathematics broken (15% opt-in insufficient)
  • ❌ Breaks § 0.8 hard requirement

B. Registration checkbox default ON

  • ❌ Marginal benefit over ToS approach
  • ❌ Adds registration friction
  • ❌ Dark pattern accusation risk

C. Hidden collection (no disclosure)

  • ❌ GDPR violation
  • ❌ Reputational disaster if discovered
  • ❌ ILLEGAL; not to be considered

D. Hidden collection + ToS clause only

  • ❌ Settings opt-out missing = no right-to-object
  • ❌ GDPR Art. 21 violation

Cross-references