ADR 0014 — Legal-clean distillation: Qwen teacher, не Anthropic/OpenAI

Date: 2026-05-22
Status: Accepted
Feature: URL-import (data flywheel — ADR 0013)
Affects: url_import_spec.md § X.5

Context

Cost reduction multiplier 2 (ADR 0013) — data flywheel через quarterly LoRA training. Это требует teacher модели для labelling unlabelled production extractions.

Юзерский вопрос:

"А можно используя платный ресурс на нём обучать свой — что если 100 000 долларов прогнанные через платный у нас будет свой бесплатный?"

Naive answer: использовать Claude API ($X) для generating labels → training data → in-house модель → $0 long-term.

Critical issue (discovered после research):

Anthropic Usage Policy (Feb 2026) — explicit prohibition:

You may not use outputs from Anthropic services to:
  (a) train, fine-tune, or develop AI/ML models that compete
      with our services
  (b) extract embeddings or representations for downstream
      model training

OpenAI ToS — similar restrictions (analogous clauses).

Если использовать Claude или OpenAI outputs для training → ToS violation → legal exposure + service ban risk.

Decision

Teacher selection ограничена Apache 2.0 / MIT моделями (commercial use + redistribution + derivative works разрешены):

Teacher	License	Quality vs Claude 4.6	Cost per 1M out
Qwen3-VL-235B-A22B-Instruct	Apache 2.0	~92%	$0.50
DeepSeek-V3	DeepSeek License (commercial OK)	~88%	$0.27
Llama 3.3 70B	Llama Community License	~85%	$0.59

Primary teacher: Qwen3-VL-235B (best quality + cleanest Apache 2.0). Backup: DeepSeek-V3 (cheaper, slightly less quality, commercial-friendly license).

Pipeline

Quarterly training:
  1. Collect ~10k production extractions с user corrections (gold labels)
  2. ~50k uncorrected extractions → Qwen3-VL-235B teacher labels them
  3. Split 90/10 train/holdout
  4. LoRA fine-tune Qwen3-VL-32B (Apache 2.0 student)
     - Rank 16, alpha 32, lr 1e-4
     - A100 на Modal, ~$110-300/run
  5. Validate ≥ pareto criteria (см § X.4)
  6. Deploy через 5% A/B → ramp

Все используемые модели Apache 2.0 → output redistributable, modifiable, без ToS restrictions на downstream training.

Why Qwen primary

Best quality среди Apache 2.0 models (Qwen3 series)
Same architecture family как student (Qwen3-VL-32B) → smoother distillation
Active development — Alibaba commits to open-source releases
Multilingual — handles RU/EN/CN sites uniformly
Already adopted на Together AI, Replicate, Modal — easy infra

Why exclude Claude/OpenAI

ToS violation — direct prohibition (Feb 2026)
Conflict of interest — ARNO не должен зависеть от competitors as teacher
Legal exposure — even hidden distillation discoverable through canary prompts; ban + lawsuit risk

Anthropic ToS detail

Quote из ADR 0015 (detailed analysis):

"You may not use outputs from Anthropic services to train, fine-tune,
or develop AI/ML models that compete with our services, or extract
embeddings or representations for downstream model training."

Это explicitly excludes:

Generating training labels via Claude (даже indirect через synthetic data)
Embedding extraction для downstream training
Fine-tuning student model on Claude responses

Pipeline исключает Claude entirely (см ADR 0015).

Consequences

Pros:

Pipeline fully legally compliant
No vendor lock-in to commercial AI providers
Student model Apache 2.0 → can be open-sourced eventually if ARNO chooses
Multilingual quality preserved (Qwen strong on EN/RU/CN)

Cons:

Slightly lower teacher quality vs Claude 4.6 (92% vs 100% baseline)
- Mitigated: user corrections (gold labels) > teacher labels weight в training
Active migration if Qwen license changes
- Mitigated: 3 teacher options, can switch

Risks

Risk	Mitigation
Qwen license changes (unlikely but possible)	DeepSeek backup, Llama secondary
Quality gap teacher → student affects deployment	Pareto criteria + escape valve (§ X.4)
New teacher model emerges better than Qwen	Quarterly evaluation, switch if pareto-improvement

Alternatives rejected

A. Use Claude API anyway (hope ToS not enforced)

❌ ToS violation = service termination risk
❌ Legal exposure on training data provenance audit
❌ Conflict of interest (ARNO using competitor's product)

B. No distillation, pay Gemini forever

❌ Breaks ADR 0013 cost reduction requirement
❌ Vendor lock-in to Google

C. Train from scratch (no teacher)

❌ 100×+ training cost
❌ Need massive labelled dataset upfront

Cross-references

Main spec § X — flywheel implementation
ADR 0013 — cost decrease requirement
ADR 0015 — detailed Claude exclusion rationale

ADR 0013 — "Чем дольше живём — тем меньше платим" как hard requirement ADR 0015 — Explicit non-use of Claude API в URL-import stack