- Date: 2026-05-22
- Status: Accepted
- Feature: URL-import (data flywheel — ADR 0013)
- Affects: url_import_spec.md § X.5
Context
Cost reduction multiplier 2 (ADR 0013) — data flywheel через quarterly LoRA training. Это требует teacher модели для labelling unlabelled production extractions.
Юзерский вопрос:
"А можно используя платный ресурс на нём обучать свой — что если 100 000 долларов прогнанные через платный у нас будет свой бесплатный?"
Naive answer: использовать Claude API ($X) для generating labels → training data → in-house модель → $0 long-term.
Critical issue (discovered после research):
Anthropic Usage Policy (Feb 2026) — explicit prohibition:
You may not use outputs from Anthropic services to:
(a) train, fine-tune, or develop AI/ML models that compete
with our services
(b) extract embeddings or representations for downstream
model trainingOpenAI ToS — similar restrictions (analogous clauses).
Если использовать Claude или OpenAI outputs для training → ToS violation → legal exposure + service ban risk.
Decision
Teacher selection ограничена Apache 2.0 / MIT моделями (commercial use + redistribution + derivative works разрешены):
| Teacher | License | Quality vs Claude 4.6 | Cost per 1M out |
|---|---|---|---|
| Qwen3-VL-235B-A22B-Instruct | Apache 2.0 | ~92% | $0.50 |
| DeepSeek-V3 | DeepSeek License (commercial OK) | ~88% | $0.27 |
| Llama 3.3 70B | Llama Community License | ~85% | $0.59 |
Primary teacher: Qwen3-VL-235B (best quality + cleanest Apache 2.0). Backup: DeepSeek-V3 (cheaper, slightly less quality, commercial-friendly license).
Pipeline
Quarterly training:
1. Collect ~10k production extractions с user corrections (gold labels)
2. ~50k uncorrected extractions → Qwen3-VL-235B teacher labels them
3. Split 90/10 train/holdout
4. LoRA fine-tune Qwen3-VL-32B (Apache 2.0 student)
- Rank 16, alpha 32, lr 1e-4
- A100 на Modal, ~$110-300/run
5. Validate ≥ pareto criteria (см § X.4)
6. Deploy через 5% A/B → rampВсе используемые модели Apache 2.0 → output redistributable, modifiable, без ToS restrictions на downstream training.
Why Qwen primary
- Best quality среди Apache 2.0 models (Qwen3 series)
- Same architecture family как student (Qwen3-VL-32B) → smoother distillation
- Active development — Alibaba commits to open-source releases
- Multilingual — handles RU/EN/CN sites uniformly
- Already adopted на Together AI, Replicate, Modal — easy infra
Why exclude Claude/OpenAI
- ToS violation — direct prohibition (Feb 2026)
- Conflict of interest — ARNO не должен зависеть от competitors as teacher
- Legal exposure — even hidden distillation discoverable through canary prompts; ban + lawsuit risk
Anthropic ToS detail
Quote из ADR 0015 (detailed analysis):
"You may not use outputs from Anthropic services to train, fine-tune,
or develop AI/ML models that compete with our services, or extract
embeddings or representations for downstream model training."Это explicitly excludes:
- Generating training labels via Claude (даже indirect через synthetic data)
- Embedding extraction для downstream training
- Fine-tuning student model on Claude responses
Pipeline исключает Claude entirely (см ADR 0015).
Consequences
Pros:
- Pipeline fully legally compliant
- No vendor lock-in to commercial AI providers
- Student model Apache 2.0 → can be open-sourced eventually if ARNO chooses
- Multilingual quality preserved (Qwen strong on EN/RU/CN)
Cons:
- Slightly lower teacher quality vs Claude 4.6 (92% vs 100% baseline)
- Mitigated: user corrections (gold labels) > teacher labels weight в training
- Active migration if Qwen license changes
- Mitigated: 3 teacher options, can switch
Risks
| Risk | Mitigation |
|---|---|
| Qwen license changes (unlikely but possible) | DeepSeek backup, Llama secondary |
| Quality gap teacher → student affects deployment | Pareto criteria + escape valve (§ X.4) |
| New teacher model emerges better than Qwen | Quarterly evaluation, switch if pareto-improvement |
Alternatives rejected
A. Use Claude API anyway (hope ToS not enforced)
- ❌ ToS violation = service termination risk
- ❌ Legal exposure on training data provenance audit
- ❌ Conflict of interest (ARNO using competitor's product)
B. No distillation, pay Gemini forever
- ❌ Breaks ADR 0013 cost reduction requirement
- ❌ Vendor lock-in to Google
C. Train from scratch (no teacher)
- ❌ 100×+ training cost
- ❌ Need massive labelled dataset upfront
Cross-references
- Main spec § X — flywheel implementation
- ADR 0013 — cost decrease requirement
- ADR 0015 — detailed Claude exclusion rationale