URL-import
Spec

Status: Accepted (after 3 rounds of /au). Master spec: unparks §V "Small-company path / ARNO Studio" → first feature: URL-import onboarding. Implementation gate: see § XV blockers before launch (4 pre-V1 items).

URL-import = a new user provides a URL at registration → within ~30s, React TSX components with all states, themes, and viewports appear in the ARNO staging area. They are ready for editing in the ARNO editor; V2 adds push to git.

Cross-references: ADRs 0007-0018 contain the rationale for each pivot. This document is the canonical "what exists now".


§ 0. Decision history

| # | Pivot | From → To | Why | ADR |
|---|---|---|---|---|
| 1 | Approach | vision-first → code-first | A URL gives source code, not a picture | 0007 |
| 2 | Hybrid stack | monolith → schema-driven priority chains + reactive vision | The user specified: "if it does not fill 100%, we bring in screenshots" | 0008 |
| 3 | Acceptance gate | weighted scores → 3 empirical bools | "The criteria are not viable" ×3 | 0009 |
| 4 | Completeness (separate from acceptance) | weighted → matrix (state × viewport × theme) | The user explicitly separated the concepts | 0010 |
| 5 | Vision activation | predictive → reactive | 0 false activations of the expensive path | 0011 |
| 6 | Uniqueness | algorithmic → top-K → user | 0 false reuse | 0012 |
| 7 | Cost reduction | "the stack is cheap" → 3 independent multipliers | User requirement: "the longer we live, the less we pay" | 0013 |
| 8 | Distillation | Anthropic → Apache 2.0 only | ToS Feb 2026 + conflict of interest | 0014, 0015 |
| 9 | Shadow data UX | checkbox → ToS-based disclosure | Industry pattern, flywheel works | 0017 |
| 10 | Position | parked §V → unparked V1 onboarding | Master spec v1.2 → v1.3 | |
| 11 | Atom embeddings | DINOv2 visual → E5-small text | Atoms are comparable semantically, not visually | 0018 |
| 12 | V1 integration | direct git → staging area | 70% of small businesses have no git, drop-off risk | 0016 |

Three rounds of /au fixes to the architecture closed 18 P0 + 44 P1 issues at the seams. Further iteration in chat is unproductive; PoC code is needed next (Task #6 atom decomposition validation).


§ I. Concept

| Trigger | Registration → URL + copyright checkbox + license_type |
| Outcome V1 | ~30s → N TSX components in ARNO staging (Modal Volume + B2 async) |
| In-scope V1 | React, HTML, static Vue. Public URLs. Light + dark theme. 4 viewports. Staging without git auth |
| Out V1 | Auth-gated sites, canvas-rendered UI, Svelte/SolidJS, mobile-only, multi-page crawl |
| V2 | GitHub App + direct PR, sitemap crawl |
| Parked | HAR upload for auth, native mobile import |

§ II. 13 principles

  1. Code-first, vision-reactive. A URL gives source code.
  2. Schema-driven per-field priority chains. A partial layer failure ≠ a pipeline failure.
  3. Empirical acceptance (3 bools): tsc + render + pixelmatch < 0.30.
  4. Reactive activation. Vision only after acceptance gate fail.
  5. Provenance everywhere. source + confidence + model_version for every field.
  6. The user decides uniqueness. Top-K neighbors → user decision.
  7. Completeness = matrix of combinations. 90% pass pixelmatch < 0.15.
  8. Cost decreases over time. 3 independent multipliers (cache, flywheel, tier routing).
  9. Legal-clean distillation. Teacher is Apache 2.0/MIT only.
  10. Shadow mode = training data. ToS-disclosed, anonymized.
  11. Self-host where possible, paid where critical. Floor cost is minimal.
  12. Atomization A1-A8. Catalog for reuse.
  13. Failure modes are explicit. "Why it broke", not "something went wrong".

§ III. Stack

3.1 We use

| Layer | Tool | License | Price |
|---|---|---|---|
| Browser automation | Playwright | Apache 2.0 | free |
| Anti-bot fallback | Hyperbrowser | commercial | $0.01/req, ≤5% URLs |
| UI detection (Tier 2+) | OmniParser v2.0 | MIT | self-host Modal keep-warm $288/mo |
| Text LLM + VLM | Gemini 2.5 Flash-Lite | API | $0.10 / $0.40 per 1M |
| Code gen | Cerebras Cloud | API | 1M tok/day free, $0.60/1M after |
| Custom VLM (Q2+) | Qwen3-VL-32B-Instruct + LoRA | Apache 2.0 | self-host Modal |
| Distillation teacher | Qwen3-VL-235B-A22B-Instruct | Apache 2.0 | $0.50/1M via Together |
| GPU compute | Modal Labs | commercial | $0.40/hr A10G, $1.10/hr A100 |
| Visual similarity (Phase 9) | DINOv2 ViT-S/14 ONNX CPU (384 dim) | Apache 2.0 | self-host, ~$10/mo at 100k URLs |
| Atom embeddings | E5-small multilingual | MIT (Xenova) | CPU inference, 80MB model |
| Vector storage | pgvector in Postgres | PostgreSQL | Supabase free → $25 |
| Persistent queue | Modal Volumes + SQLite | commercial | $0.30/GB-mo |
| Storage | Backblaze B2 | commercial | $0.006/GB-mo |
| TSX sanitization | ts-morph (AST, whitelist) + postcss-safe-parser | MIT | free |
| Domain reputation | Tranco top 15k + Bloom filter | CC-BY 4.0 | free (attribution req.) |
| Image dedup | pHash | BSD | free |
| Pixel diff | pixelmatch | MIT | free |
| Status page | Cachet on Cloudflare Workers | open | free |

3.2 We do NOT use

| Tool | Why |
|---|---|
| Claude API | Conflict of interest + Anthropic ToS distillation (Feb 2026) |
| OpenAI | 5-10× more expensive than Gemini Flash-Lite + ToS distillation risk |
| Vercel v0 / Builder.io / Locofy | Closed source, vendor lock-in, $20-30+/mo min |
| Browserless | More expensive than Hyperbrowser with comparable anti-bot |
| DOMPurify for JSX | Does not understand JSX expressions; ts-morph AST is used instead |

3.3 Tier escalation (realistic)

Tier 0 ($15-25/mo):   Playwright + OmniParser skip + Cerebras free→paid
                      + Gemini paid + Modal spot
                      → ~1-2k URLs/mo

Tier 1 ($50-100/mo):  + Modal A10G keep-warm for OmniParser
                      → 5-10k URLs/mo

Tier 2 ($300-600/mo): + Hyperbrowser anti-bot + Supabase $25
                      → 50-100k URLs/mo

Tier 3 ($2-5k/mo):    + A100 LoRA + dedicated pgvector
                      → 500k+ URLs/mo

3.4 CSS-in-JS handling

styled-components / emotion → runtime-injected styles, which computed_style captures. CSS variables are not extracted (there is no :root). Mitigation: detect via JS bundle pattern → tokens are created programmatically from observed values. Real-world bad CSS → PostCSS lenient (safe-parser) with error capture.
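
Creating tokens programmatically from observed values could look roughly like the sketch below. It assumes the computed styles have already been collected as plain property/value records; `ObservedStyle`, `deriveColorTokens`, `toRootBlock`, and the `--color-N` naming scheme are illustrative names, not part of the spec.

```typescript
// Hypothetical shape: one record per element, as captured from computed styles.
type ObservedStyle = Record<string, string>;

// Properties whose values are treated as colors in this sketch (assumption).
const COLOR_PROPS = ['color', 'background-color', 'border-color'];

// Counts how often each color value occurs and names the most frequent ones
// --color-1, --color-2, ... so generated CSS can reference variables
// instead of hardcoded values.
function deriveColorTokens(styles: ObservedStyle[], maxTokens = 8): Map<string, string> {
  const counts = new Map<string, number>();
  for (const style of styles) {
    for (const prop of COLOR_PROPS) {
      const value = style[prop];
      if (!value || value === 'transparent') continue;
      counts.set(value, (counts.get(value) ?? 0) + 1);
    }
  }
  const ranked = Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, maxTokens);
  // Map from observed value to the generated variable name.
  return new Map(ranked.map(([value], i): [string, string] => [value, `--color-${i + 1}`]));
}

// Renders the tokens as a :root block suitable for tokens.css.
function toRootBlock(tokens: Map<string, string>): string {
  const lines = Array.from(tokens.entries()).map(([value, name]) => `  ${name}: ${value};`);
  return `:root {\n${lines.join('\n')}\n}`;
}
```

Ranking by frequency is only one possible heuristic; spacing, radii, and typography tokens would need their own property lists.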


§ IV. Pipeline (10 phases)

Phase 1 — Code extraction

async function phase1(url: string): Promise<RawCapture> {
  const { ip } = await safeFetchGuard(url);  // SSRF + random pick from safe IPs
  const parsed = new URL(url);
 
  let browser;
  try {
    browser = await playwright.chromium.launch({
      args: [`--host-resolver-rules=MAP ${parsed.hostname} ${ip}`]
    });
    const page = await browser.newPage({
      userAgent: 'ARNOBot/1.0',
      viewport: { width: 1440, height: 900 }
    });
 
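    const responses: NetworkResponse[] = [];
    const captures: Promise<void>[] = [];
    // Playwright's Response exposes headers()/text(), not contentType/body, so each
    // response is normalized on arrival (NetworkResponse shape assumed from its use below)
    page.on('response', r => captures.push(
      r.text()
        .then(body => { responses.push({ url: r.url(), contentType: r.headers()['content-type'], body }); })
        .catch(() => { /* body unavailable (e.g. redirects): skip */ })
    ));
 
    try {
      await page.goto(url, { waitUntil: 'networkidle', timeout: 30_000 });
    } catch (e) {
      if (isAntiBotSignal(e)) return phase1ViaHyperbrowser(url);
      throw new ExtractionError('navigation_failed', e);
    }
    await Promise.all(captures);  // make sure every body has been read
 
    const { ast: cssAST, errors: cssErrors } = await parseCSSGracefully(responses);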
 
    return {
      dom: await page.content(),
      cssAST,
      cssParseErrors: cssErrors.length,
      cssInJsDetected: detectCssInJs(responses),
      computedStyles: await captureComputedStyles(page),
      sourceMaps: await tryExtractSourceMaps(responses),  // jackpot on 30-50% of sites
      framework: detectFramework(responses),
      jsBundle: await captureJsBundle(responses)
    };
  } finally {
    await browser?.close();  // P1 leak fix
  }
}
 
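// safeParser is assumed to be the default export of postcss-safe-parser
async function parseCSSGracefully(responses: NetworkResponse[]) {
  const errors: string[] = [];
  const cssTexts = responses.filter(r => r.contentType?.includes('css')).map(r => r.body);
  const asts = cssTexts.map(css => {
    try { return postcss.parse(css); }
    catch (e) {
      // Strict parse failed: record the error, then fall back to the lenient parser.
      // postcss.parse ignores a `parser` option, so safeParser is called directly.
      errors.push(e instanceof Error ? e.message : String(e));
      return safeParser(css);
    }
  });
  return { ast: asts, errors };
}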

Source maps jackpot: 30-50% of sites expose them → the original TSX/JSX + propTypes + component tree become available.

Anti-bot signals: 403/429/CAPTCHA HTML / Cloudflare challenge / unusual TTFB → Hyperbrowser fallback (1 retry).
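
One way to classify these signals is sketched below. It works on a summary of the navigation rather than on a thrown error, because a 403 or 429 normally arrives as a regular response and not as an exception. `NavigationProbe`, `antiBotSignal`, the marker strings, and the 10s TTFB limit are all assumptions made for illustration.

```typescript
// Hypothetical summary of a navigation attempt; field names are assumptions.
interface NavigationProbe {
  status: number;   // HTTP status of the main document
  html: string;     // response body (may be empty)
  ttfbMs: number;   // time to first byte, in milliseconds
}

// Marker strings are illustrative examples, not an exhaustive list.
const CHALLENGE_MARKERS = ['captcha', 'cf-challenge', 'just a moment'];

// Returns the first matching signal, or null when the page looks normal.
function antiBotSignal(probe: NavigationProbe, ttfbLimitMs = 10_000): string | null {
  if (probe.status === 403 || probe.status === 429) return `http_${probe.status}`;
  const body = probe.html.toLowerCase();
  const marker = CHALLENGE_MARKERS.find(m => body.includes(m));
  if (marker) return `challenge:${marker}`;
  if (probe.ttfbMs > ttfbLimitMs) return 'slow_ttfb';
  return null;
}
```

A non-null result would trigger the single Hyperbrowser retry; returning the signal name rather than a boolean keeps the failure reason available for logging.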

Phase 2 — Themes + States + Viewports

Theme detection cascade:

  1. @media (prefers-color-scheme: dark) in the CSS AST
  2. [data-theme] / .dark selectors in the DOM
  3. localStorage.theme pattern in the JS bundle
  4. setTheme/toggleTheme pattern in the JS bundle
  5. CSS property color-scheme: dark on :root
  6. Vision fallback (3-sample median): emulateMedia({colorScheme}) light vs dark, wait 500ms, pixelmatch > 0.20 → dark-also

States capture via a Playwright CDP session, using CSS.forcePseudoState for pseudo-class states: default, hover, focus, active, disabled, custom (open/closed for dropdowns, loading/error for async).

Viewports: 320 (mobile) / 768 (tablet) / 1024 (laptop) / 1440 (desktop).

Output: CaptureMatrix with up to 40 screenshots per component (2 themes × 4 viewports × ~5 states).
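
The 40-screenshot ceiling follows from enumerating every combination. A minimal sketch of that enumeration, with `CaptureCombo` and `captureCombos` as illustrative names:

```typescript
type Theme = 'light' | 'dark';

interface CaptureCombo {
  theme: Theme;
  viewport: number;  // width in CSS pixels
  state: string;
}

// The four viewport widths listed in Phase 2.
const VIEWPORTS = [320, 768, 1024, 1440];

// Cartesian product of themes × viewports × states.
function captureCombos(themes: Theme[], states: string[]): CaptureCombo[] {
  const combos: CaptureCombo[] = [];
  for (const theme of themes) {
    for (const viewport of VIEWPORTS) {
      for (const state of states) {
        combos.push({ theme, viewport, state });
      }
    }
  }
  return combos;
}
```

With two themes and the five standard states this yields 2 × 4 × 5 = 40 entries; a light-only site with the same states yields 20.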

Phase 3 — Coverage gate

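function coverageGate(spec: PartialComponentSpec): 'pass' | 'enrich' {
  // REQUIRED_FIELDS holds dotted paths ('meta.name'), so resolve them segment by segment
  const resolve = (path: string): any =>
    path.split('.').reduce((node: any, key) => node?.[key], spec);
  const filled = REQUIRED_FIELDS.filter(f => {
    const field = resolve(f);
    // Fields without provenance (e.g. meta.name) count as filled when present (assumption)
    const confidence = field?.provenance?.confidence ?? field?.confidence ?? 1;
    return field !== undefined && field !== null && confidence > 0.6;
  });
  return (filled.length / REQUIRED_FIELDS.length) >= 0.90 ? 'pass' : 'enrich';
}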

Phase 4 — Text LLM enrichment (conditional)

Only if coverage < 0.90. No screenshots (text-only Gemini Flash-Lite), ~$0.0005/component.

Phase 5 — TSX generation

async function generateTSX(spec): Promise<GeneratedCode> {
  let raw: GeneratedCode;
  try {
    if (isAtomicLevel(spec)) raw = templateGenerate(spec);
    else raw = await cerebras.generate({ model: 'llama-3.1-70b', prompt: buildTsxPrompt(spec) });
  } catch {
    raw = await gemini.generate({ model: 'gemini-2.5-flash-lite', prompt: buildTsxPrompt(spec) });
  }
  raw.tsx = sanitizeTsx(raw.tsx).tsx;  // see § XI.8
  return raw;
}

Output: index.tsx + types.ts + tokens.css (CSS variables, not hardcoded values!) + stories.tsx.

Phase 5.5 — Sample props inference (cycle-safe)

function inferSampleProps(spec, visited = new Set(), depth = 0): { props, unknownTypes } {
  if (depth > 3) return { props: {}, unknownTypes: ['__max_depth__'] };
 
  const props: Record<string, any> = {};
  const unknownTypes: string[] = [];
 
  for (const [name, def] of Object.entries(spec.props)) {
    if (def.default !== undefined) { props[name] = def.default; continue; }
 
    // String literal union: '"primary" | "secondary"'
    const literals = (String(def.type)).match(/"([^"]+)"/g);
    if (literals) { props[name] = literals[0].replace(/"/g, ''); continue; }
 
    // Primitives
    const primitives = { string: name, number: 0, boolean: false,
                         ReactNode: 'Sample', function: () => {}, array: [] };
    // hasOwn avoids matching inherited keys such as 'toString' or 'constructor'
    if (Object.hasOwn(primitives, def.type)) { props[name] = primitives[def.type]; continue; }
 
    // TS union array → first
    if (Array.isArray(def.type)) { props[name] = def.type[0]; continue; }
 
    // Object recursive (cycle-safe)
    if (def.type?.fields) {
      const key = JSON.stringify(def.type);
      if (visited.has(key)) {
        props[name] = null;
        unknownTypes.push(`${name}__cycle__`);
        continue;
      }
      visited.add(key);
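      const nested = inferSampleProps({ props: def.type.fields }, visited, depth + 1);
      props[name] = nested.props;
      // Surface unknown types found in nested fields instead of dropping them
      unknownTypes.push(...nested.unknownTypes.map(t => `${name}.${t}`));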
      continue;
    }
 
    // Unknown — placeholder + flag
    props[name] = null;
    unknownTypes.push(name);
  }
 
  return { props, unknownTypes };
}
 
// forwardRef detection + generic instantiation
const forwardRefMatch = tsx.match(/(?:React\.)?forwardRef<[^,>]+,\s*([^>]+)>/);
if (forwardRefMatch) spec.props = extractPropsFromType(forwardRefMatch[1]);
// Generic T → string substitution
for (const def of Object.values(spec.props)) {
  if (typeof def.type === 'string') def.type = def.type.replace(/\bT\b/g, 'string');
}

Phase 6 — Acceptance gate (explicit configuration)

Renders ONE configuration: default state, 1440 viewport, light theme. Full matrix verification happens in Phase 8.

async function acceptanceGate(tsx, original, spec): Promise<TestResult> {
  // Test 1: TypeScript compiles
  const tsc = await typescript.compile(tsx);
  if (tsc.errors.length > 0) return { ok: false, reason: `tsc: ${tsc.errors[0].message}` };
 
  // Test 2: Renders without triggering the ErrorBoundary
  const { props } = inferSampleProps(spec);
  const wrapped = wrapWithErrorBoundary(tsx);
  let rendered;
  try { rendered = await playwrightRender(wrapped, props); }
  catch (e) { return { ok: false, reason: `render: ${e.message}` }; }
 
  // Assumes playwrightRender returns { html, screenshot }: markup for the
  // ErrorBoundary check, pixels for the visual diff
  if (rendered.html.includes('data-render-error="true"')) {
    return { ok: false, reason: 'ErrorBoundary triggered', failedFields: ['props_inference'] };
  }
 
  // Test 3: Visually close (diff is a 0-1 mismatch ratio)
  const diff = await pixelmatch(rendered.screenshot, original);
  if (diff > 0.30) {
    return { ok: false, reason: `visual diff ${diff.toFixed(2)}`,
             failedFields: identifyMismatchRegions(rendered.screenshot, original) };
  }
 
  return { ok: true, sampleProps: props };
}
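
The gate compares `diff` against 0.30, which implies a 0-1 mismatch ratio. The pixelmatch library itself returns a count of mismatched pixels, so some wrapper has to divide by the total. The sketch below shows that calculation with a naive per-channel comparison standing in for pixelmatch so that it stays self-contained; `mismatchRatio` and `channelTolerance` are hypothetical names.

```typescript
// Naive stand-in for pixelmatch: counts pixels where any RGBA channel differs by
// more than channelTolerance, then divides by the pixel count to get a 0-1 ratio.
function mismatchRatio(
  a: Uint8Array,
  b: Uint8Array,
  width: number,
  height: number,
  channelTolerance = 16
): number {
  const total = width * height;
  if (a.length !== total * 4 || b.length !== total * 4) {
    throw new Error('buffers must be RGBA with width * height pixels');
  }
  let mismatched = 0;
  for (let p = 0; p < total; p++) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[p * 4 + c] - b[p * 4 + c]) > channelTolerance) {
        mismatched++;
        break;  // count each pixel at most once
      }
    }
  }
  return total === 0 ? 0 : mismatched / total;
}
```

With the real library, the same ratio would come from dividing pixelmatch's returned mismatch count by width * height. Both images must share dimensions, so the generated render has to use the same viewport as the original capture.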

Phase 7 — Vision fallback (reactive, partial-failure tolerant)

async function visionEnrich(spec, failedFields, screenshot) {
  // Tier 2+: OmniParser pre-filter (×5-20 token savings)
  // Tier 0-1: skip OmniParser, send full screenshot to Gemini VLM
  const targetRegions = await getTargetRegions(failedFields, screenshot);
 
  for (const field of failedFields) {
    try {
      const cropped = cropScreenshot(screenshot, targetRegions[field].bbox);
      const response = await gemini.vlm({
        model: 'gemini-2.5-flash-lite',
        image: cropped,
        prompt: `Extract ${field}. Context: ${JSON.stringify(spec.meta)}.`,
        timeout: 15_000
      });
      spec[field] = {
        value: response.value,
        source: 'vision',
        confidence: response.confidence,
        model_version: response.modelVersion ?? 'gemini-2.5-flash-lite-2026-02'
      };
    } catch (e) {
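      // Partial failure: continue, do not abort. model_version is still set
      // because validateProvenance requires it for source='vision'
      spec[field] = { value: null, source: 'vision', confidence: 0,
                      model_version: 'gemini-2.5-flash-lite', error: e.message };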
    }
  }
  return spec;
}

Phase 8 — Completeness verification (full matrix)

async function completenessCheck(component, captureMatrix): Promise<CompletenessReport> {
  const combinations = generateCombinations({
    themes: Object.keys(captureMatrix),
    viewports: [320, 768, 1024, 1440],
    states: detectAllStates(component)
  });
 
  const results = await Promise.all(combinations.map(async combo => {
    const original = captureMatrix[combo.theme][combo.viewport][combo.state];
    const generated = await renderGenerated(component.tsx, combo);
    const diff = await pixelmatch(original, generated, {
      ignoreText: true,    // the user will insert their own content
      ignoreImages: true
    });
    return { combo, diff, pass: diff < 0.15 };
  }));
 
  const passed = results.filter(r => r.pass).length;
  // Guard against an empty matrix, which would otherwise yield NaN
  const coverage = results.length > 0 ? passed / results.length : 0;
  return {
    complete: coverage >= 0.90,
    coverage,
    failedCombos: results.filter(r => !r.pass).map(r => r.combo)
  };
}

Phase 9 — Uniqueness check (graceful degradation)

async function uniquenessCheck(component): Promise<UniquenessResult> {
  try {
    const embedding = await dinov2.embedONNXCPU(component.screenshot);
    const neighbors = await pgvector.query({
      table: 'component_embeddings_visual',
      vector: embedding,
      limit: 5,
      distance: 'cosine'
    });
 
    if (neighbors.length === 0 || neighbors[0].distance > 0.4) {
      return { decision: 'new', auto: true };
    }
    return { decision: 'pending', neighbors };  // UI shows neighbors → the user chooses
  } catch (e) {
    return { decision: 'new', auto: true, flag: 'uniqueness_check_skipped' };
  }
}
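
The 0.4 threshold above applies to cosine distance (1 minus cosine similarity). A minimal in-memory sketch of the same decision rule, with no pgvector involved; `cosineDistance` and `decide` are illustrative names and the zero-vector handling is an assumption:

```typescript
// Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1;  // treat zero vectors as unrelated (assumption)
  return 1 - dot / Math.sqrt(normA * normB);
}

type Decision = 'new' | 'pending';

// Mirrors the Phase 9 rule: auto-accept as new only when there is no neighbor
// or the nearest neighbor is farther than the threshold.
function decide(candidate: number[], catalog: number[][], threshold = 0.4): Decision {
  if (catalog.length === 0) return 'new';
  const nearest = Math.min(...catalog.map(v => cosineDistance(candidate, v)));
  return nearest > threshold ? 'new' : 'pending';
}
```

Anything at or below the threshold goes to the user as 'pending', which is what keeps false reuse at zero: the algorithm never merges on its own.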

Phase 10 — ARNO integration (Modal Volume persistent + B2 async)

const QUEUE_VOLUME = '/mnt/arno-queue';  // Modal Volume — persistent across restarts
 
// user is passed in explicitly because the body reads user.id
async function integrate(component, userDecision, manifest, user): Promise<void> {
  const arnoId = uuidv7();
  const localPath = `${QUEUE_VOLUME}/${arnoId}/`;
 
  // 1. Persistent local storage (survives worker restart)
  await fs.writeFiles(localPath, component.files);
 
  // 2. Queue entry in local SQLite (on the same Volume)
  await queueDb.insert({
    arno_id: arnoId,
    user_id: user.id,
    local_path: localPath,
    target_b2_path: `staging/${user.id}/${arnoId}/`,
    created_at: now(),
    next_retry_at: now(),  // eligible immediately; the uploader filters on next_retry_at <= now
    attempts: 0,
    status: 'pending'
  });
 
  // 3. DB record — serving from local until uploaded
  await db.components.create({
    arno_id: arnoId,
    user_id: user.id,
    status: 'queued',
    serving_from: 'local',
    manifest
  });
 
  // NB: Yjs init is lazy, only on editor open (see openComponentEditor)
}
 
// Background uploader worker
async function b2UploaderWorker() {
  while (true) {
    const pending = await queueDb.where({
      status: 'pending',
      next_retry_at: lte(now())
    }).limit(10);
 
    for (const entry of pending) {
      try {
        await b2.uploadDir(entry.local_path, entry.target_b2_path);
        await queueDb.update(entry.id, { status: 'uploaded' });
        await db.components.update({ arno_id: entry.arno_id }, {
          status: 'staged',
          serving_from: 'b2'
        });
        await fs.rm(entry.local_path, { recursive: true, force: true });  // local_path is a directory
      } catch (e) {
        const backoff = [60, 300, 900, 3600][entry.attempts] ?? 3600;
        await queueDb.update(entry.id, {
          attempts: entry.attempts + 1,
          next_retry_at: now() + backoff * 1000
        });
        if (entry.attempts > 3) alert.send('B2 outage, queue depth growing');
      }
    }
    await sleep(60_000);
  }
}
 
// ARNO editor — Yjs lazy init
async function openComponentEditor(arnoId: string) {
  let yjsDoc = await yjs.getDocument(arnoId);
  if (!yjsDoc) {
    const component = await loadFromStaging(arnoId);  // local OR b2
    yjsDoc = await yjs.initialize(arnoId, component);
  }
  return yjsDoc;
}
 
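// V2: the user connects git, then pushes from staging
async function pushStagedToGit(userId: string, githubToken: string) {
  const user = await db.users.findOne({ id: userId });  // assumed lookup; supplies user.repo below
  const staged = await db.components.where({ user_id: userId, status: 'staged' });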
  for (const c of staged) {
    const branch = `import/${sanitizeBranchName(c.name)}-${Date.now()}-${randomBytes(2).toString('hex')}`;
    await git.createBranch(user.repo, branch, githubToken);
    await git.commitFiles(user.repo, branch, await b2.fetch(c.staging_path));
    await git.createPR(user.repo, branch, { title: `Import: ${c.name}` });
    await db.components.update(c.id, { status: 'pushed' });
  }
}
 
function sanitizeBranchName(name: string): string {
  return name.toLowerCase()
    .replace(/[^a-z0-9-]/g, '-')
    .replace(/-+/g, '-')
    .slice(0, 60);  // GitHub branch ~250 limit, 60 + ts + rand4 = ~85
}
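
The retry schedule used by b2UploaderWorker above can be isolated into a small pure function, which makes the cap easier to see and to test. `nextRetryAt` is an illustrative name; the delays are the ones from the worker.

```typescript
// Delays in seconds for the 1st, 2nd, 3rd, and 4th failure; later failures reuse the last value.
const BACKOFF_SECONDS = [60, 300, 900, 3600];

// attempts = number of failures recorded before the current one.
function nextRetryAt(attempts: number, nowMs: number): number {
  const backoff = BACKOFF_SECONDS[attempts] ?? 3600;
  return nowMs + backoff * 1000;
}
```

The first failure retries after one minute, the fourth and every later one after an hour, so a prolonged B2 outage settles at one attempt per entry per hour.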

Re-import same URL: detect via sha256(url) → UI prompt 3 options (update existing / new version / cancel). Manifest parent_import_id + version_number++.
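
A possible shape for the re-import key and the "new version" branch is sketched below. The spec only states sha256(url); the normalization rules (drop the fragment, trim a trailing slash), the `importKey` and `asNewVersion` names, and the idea that parent_import_id holds the previous arno_id are assumptions of this sketch.

```typescript
import { createHash } from 'node:crypto';

// Normalizes a URL before hashing so trivially different spellings map to one key.
// The URL parser already lowercases the hostname.
function importKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = '';
  let normalized = url.toString();
  if (normalized.endsWith('/')) normalized = normalized.slice(0, -1);
  return createHash('sha256').update(normalized).digest('hex');
}

interface ManifestVersion {
  arno_id: string;
  version_number: number;
  parent_import_id: string | null;
}

// "New version" option: link back to the previous import and bump the counter.
function asNewVersion(prev: ManifestVersion, newArnoId: string): ManifestVersion {
  return {
    arno_id: newArnoId,
    version_number: prev.version_number + 1,
    parent_import_id: prev.arno_id,
  };
}
```

Query strings are left intact on purpose, since they frequently select different content.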


§ V. Schemas

5.1 ComponentSpec

type ComponentSpec = {
  meta: {
    name: string;
    arno_id: string;           // UUIDv7 (timestamp-sortable)
    type: 'atomic' | 'molecule' | 'organism';
    origin_url: string;
    origin_selector: string;     // CSS path
    extraction_timestamp: ISO8601;  // CANONICAL timestamp
    extraction_mode: ExtractionMode;
  };
 
  props: {
    [name: string]: {
      type: TypeScriptType;
      required: boolean;
      default?: any;
      provenance: Provenance;
    }
  };
 
  variants: Array<{
    name: string;          // 'primary' | 'secondary' | etc
    when: Predicate;
    overrides: Partial<ComponentSpec>;
    provenance: Provenance;
  }>;
 
  states: {
    [name: string]: {
      // default | hover | focus | active | disabled | custom
      style_overrides: CSSProperties;
      attribute_overrides?: { [attr: string]: string };
      provenance: Provenance;
    }
  };
 
  tokens: {
    colors: { [name: string]: { value: string; provenance: Provenance } };
    spacing: { [name: string]: { value: string; provenance: Provenance } };
    typography: { [name: string]: TypographyToken & { provenance: Provenance } };
    shadows: { [name: string]: { value: string; provenance: Provenance } };
    radii: { [name: string]: { value: string; provenance: Provenance } };
    transitions: { [name: string]: { value: string; provenance: Provenance } };
  };
 
  responsive: {
    [breakpoint: number]: {
      style_overrides: CSSProperties;
      provenance: Provenance;
    }
  };
 
  accessibility: {
    aria: { [attr: string]: string };
    role: string;
    keyboard: KeyboardSpec;
    contrast_ratio_target: 4.5;          // WCAG AA
    contrast_ratio_actual?: number;
    provenance: Provenance;
  };
 
  composition?: {
    atoms: Array<{
      type: AtomType;       // A1-A8 (see url_import_atoms_a1_a8.md)
      instance_id: string;
      props_override: any;
    }>;
  };
};
 
const REQUIRED_FIELDS = [
  'meta.name', 'meta.type',
  'tokens.colors.primary',
  'states.default',
  'accessibility.role'
];

5.2 Provenance (model_version validated)

type Provenance = {
  source: 'source_map' | 'css_variable' | 'computed_style' | 'dom' | 'aria'
        | 'llm_inference' | 'vision';
  layer: 1 | 2 | 3 | 4 | 5 | 6 | 7;
  confidence: number;          // 0-1
  raw_value?: any;
  extracted_at: ISO8601;
  model_version?: string;      // REQUIRED if source in ['llm_inference', 'vision']
  model_canary_checksum?: string;  // daily-computed drift detection
};
 
function validateProvenance(p: Provenance) {
  if (['llm_inference', 'vision'].includes(p.source) && !p.model_version) {
    throw new ValidationError(`model_version required when source=${p.source}`);
  }
}
 
// Read-side migration for pre-v4 data
function readProvenance(raw): Provenance {
  if (['llm_inference', 'vision'].includes(raw.source) && !raw.model_version) {
    raw.model_version = 'legacy_pre_v4';
    metrics.increment('provenance.legacy_read');
  }
  return raw;
}
 
// Daily model canary cron — drift detection
async function dailyModelCanary() {
  const canary = 'Reply exactly: "arno-canary-2026"';
  const response = await gemini.generate({ model: 'gemini-2.5-flash-lite', prompt: canary });
  const checksum = sha256(response.text);
 
  const prev = await db.modelCanaries.previous('gemini-2.5-flash-lite');
  if (prev && prev.checksum !== checksum) {
    alert.send('Model weights drifted without a version change');
  }
  await db.modelCanaries.create({ model: 'gemini-2.5-flash-lite', checksum });
}

5.3 Manifest (.arno/manifest.json)

{
  "arno_id": "01923f8a-...-7b2c",
  "arno_version": "1.3",
  "version_number": 1,
  "parent_import_id": null,
  "imported_from": "https://example.com/products",
  "imported_at": "2026-05-22T10:30:00Z",
  "extraction_mode": "code+vision",
  "user_attestation": {
    "ownership_confirmed": true,
    "confirmed_at": "2026-05-22T10:29:55Z",
    "user_id": "uuid",
    "license_type": "owned",
    "domain_reputation_check": "passed"
  },
  "completeness": {
    "coverage": 0.94,
    "failed_combinations": [{ "theme": "dark", "viewport": 320, "state": "hover" }]
  },
  "provenance_summary": {
    "source_map_fields": 12,
    "dom_fields": 23,
    "vision_fields": 4,
    "llm_fields": 2,
    "model_versions_used": {
      "vision": "gemini-2.5-flash-lite-2026-02",
      "llm_inference": "gemini-2.5-flash-lite-2026-02"
    }
  },
  "uniqueness_decision": "new",
  "atoms": ["A1:Surface", "A2:Label", "A4:InteractionState"],
  "cost_actual_usd": 0.0024,
  "sanitization": {
    "rejected_expressions": [],
    "dangerous_attributes": 0,
    "sanitizer": "ts-morph-whitelist-3.0.0"
  },
  "css_parse_errors": 0,
  "sample_props_used": {
    "variant": "primary",
    "label": "Sample",
    "_unknown_types": []
  },
  "integration_status": "staged",
  "staging_path": "staging/user-uuid/01923f8a/.../",
  "serving_from": "b2"
}

§ VI. Mode taxonomy

| Mode | When | Cost/component | Bootstrap M1-3 | Steady M6+ |
|---|---|---|---|---|
| code-only | Phase 6 ✅ first try | $0.001 | 33% | 60% |
| code+vision | Phase 7 vision enrich → ✅ | $0.005 | 38% | 30% |
| vision-only | Full VLM pass → ✅ | $0.020 | 16% | 8% |
| code-only-degraded | LLM enrich failed, partial spec | varies | 5% | 1% |
| failed | Full vision ❌, manual review | $0.025 | 8% | 1% |

§ VII. Atom decomposition (A1-A8)

Full implementation → url_import_atoms_a1_a8.md. This section is a brief reference.

| ID | Atom | Describes |
|---|---|---|
| A1 | Surface | Background, border, shadow, radius |
| A2 | Label | Text + typography token |
| A3 | Icon | SVG/icon + size + color |
| A4 | InteractionState | Hover/focus/active/disabled visuals |
| A5 | Spacing | Margin/padding system |
| A6 | Layout | Flex/grid container |
| A7 | Media | Image/video container |
| A8 | FormField | Input/textarea/select primitive |

Embedding: E5-small multilingual (384 dim), CPU inference, batched over all atoms of a component (~300ms per URL). Storage: pgvector table atom_embeddings_text (separate from component_embeddings_visual, 384 dim DINOv2 ViT-S/14; same dim by coincidence, different spaces).

Seeding: pre-load shadcn/ui (MIT, ~50 components → ~200 atoms) → bootstrap L3 hit immediately 10-15%.

Lifecycle: atoms not referenced > 6 months → deprecated → > 12 months → physical delete (anonymized atoms preserved).

PoC pending (Task #6): validation on 20-30 real components before wide rollout.


§ VIII. Caching (3 levels)

| Level | Mechanism | Hit signal | Latency |
|---|---|---|---|
| L1 | pHash exact | Identical bytes (re-imports, SaaS templates) | ~0ms |
| L2 | DINOv2 ViT-S/14 ONNX CPU (384 dim) + pgvector (cosine > 0.95) | Visual similarity | ~50ms |
| L3 | E5-small atoms + pgvector (cosine > 0.85) | Semantic composition | ~100ms |

Conservative trajectory (L3 pending E5 PoC):

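| Volume | L1 | L2 | L3 | Total |
|---|---|---|---|---|
| 100 URLs | 5% | 8% | 1% | 14% |
| 1k | 12% | 18% | 4% | 34% |
| 10k | 20% | 25% | 9% | 54% |
| 100k | 25% | 30% | 13% | 68% |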

Atom merging cron (weekly, transactional):

async function mergeAtom(canonical: Atom, duplicate: Atom) {
  await db.transaction(async tx => {
    const dup = await tx.atoms.findOne(
      { id: duplicate.id },
      { lockMode: 'pessimistic_write' }
    );
    if (!dup || dup.merged_into) return;
    await tx.atoms.update({ id: duplicate.id }, {
      merged_into: canonical.id, merged_at: now()
    });
    await tx.componentAtomRefs.update(
      { atom_id: duplicate.id }, { atom_id: canonical.id }
    );
  });
}
// Reduces vector count ~30%, which pushes back the Supabase paid tier breakpoint

§ IX. Cost trajectory

9.0 Bootstrap reality (M1-3)

LoRA is not yet trained, the atom catalog is still growing, and the cache is empty → the distribution is skewed toward expensive modes.

| Period | $/URL | Distribution (code-only/code+vision/vision-only/degraded/failed) |
|---|---|---|
| M1-3 Bootstrap | $0.06-0.10 | 33/38/16/5/8 |
| M4-6 Ramping | $0.02-0.04 | 50/35/12/2/1 |
| M6-12 Steady | $0.005-0.01 | 60/30/8/1/1 |
| M12+ Mature | $0.002-0.005 | 75/18/5/1/1 |

Unit economics check: small-biz LTV $300-600 vs bootstrap onboard cost $0.10 × ~3 imports = $0.30 = trivial. The business case holds.

9.1 Per-mode breakdown (steady)

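| Mode | Share | $/component | Weighted |
|---|---|---|---|
| code-only | 60% | $0.001 | $0.0006 |
| code+vision | 30% | $0.005 | $0.0015 |
| vision-only | 8% | $0.020 | $0.0016 |
| code-only-degraded | 1% | $0.0008 | ~0 |
| failed | 1% | $0.025 | $0.00025 |
| Avg/component | | | $0.0040 |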

URL ≈ 8 components → $0.032 steady, $0.07 bootstrap.
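
The steady-state figures can be reproduced from the mode shares and per-component costs given in 9.1. The sketch below is only the arithmetic; `ModeShare` and `averageCost` are illustrative names.

```typescript
interface ModeShare {
  mode: string;
  share: number;             // fraction of components, 0-1
  costPerComponent: number;  // USD
}

// Steady-state values from the 9.1 table.
const STEADY: ModeShare[] = [
  { mode: 'code-only', share: 0.60, costPerComponent: 0.001 },
  { mode: 'code+vision', share: 0.30, costPerComponent: 0.005 },
  { mode: 'vision-only', share: 0.08, costPerComponent: 0.020 },
  { mode: 'code-only-degraded', share: 0.01, costPerComponent: 0.0008 },
  { mode: 'failed', share: 0.01, costPerComponent: 0.025 },
];

// Weighted average: sum of share × cost over all modes.
function averageCost(modes: ModeShare[]): number {
  return modes.reduce((sum, m) => sum + m.share * m.costPerComponent, 0);
}

const perComponent = averageCost(STEADY);  // about 0.00396, shown as $0.0040 in the table
const perUrl = perComponent * 8;           // about 0.0317, i.e. ~$0.032 per URL
```

Swapping in the bootstrap shares (33/38/16/5/8) with the same per-component costs gives roughly $0.06 per URL, in line with the lower end of the bootstrap range.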

9.2 Hidden infrastructure (at 100k users scale)

| Source | Cost/mo |
|---|---|
| Staging hot (90d active) | $30 |
| Modal Volume persistent | $150 |
| DINOv2 ONNX CPU | $10 |
| OmniParser keep-warm (Tier 2+) | $288 |
| Shadow dataset (14.5% sampled) | $0.70 |
| pgvector dual tables | $50 |
| Total hidden | ~$530/mo at 100k users |

9.3 Monthly compute at volume

| URLs/mo | M1-3 Bootstrap | M6 Steady | M12 Mature |
|---|---|---|---|
| 100 | $7 | $0.80 | $0.30 |
| 1k | $70 | $8 | $3 |
| 10k | $700 | $80 | $30 |
| 100k | $7000 | $800 | $300 |

§ X. Data flywheel

10.1 Shadow mode logging (ToS-disclosed background)

Disclosure in ToS + Privacy Policy at registration (see url_import_tos_clause_draft.md). Opt-out toggle in Settings → Privacy (default ON).

Sampling rules:

  • 100% gold labels (user corrections)
  • 10% uncorrected production (random sample)
  • Stratified by segment priority (see below)

Storage: ~14.5% of all extractions → ~12GB/mo at 100k URLs/mo = ~$0.07/mo B2.
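
The 14.5% figure is consistent with the sampling rules if about 5% of extractions receive a user correction. That rate is inferred here, not stated in the spec. A short sketch of the arithmetic, with `sampledShare` and `b2MonthlyCost` as illustrative names:

```typescript
// Share of extractions that land in the shadow dataset when every corrected item
// and a fixed fraction of uncorrected items are kept.
function sampledShare(correctedRate: number, uncorrectedSampleRate = 0.10): number {
  return correctedRate + uncorrectedSampleRate * (1 - correctedRate);
}

// Monthly B2 storage cost at the $0.006/GB-mo rate listed in § III.
function b2MonthlyCost(gigabytes: number, ratePerGb = 0.006): number {
  return gigabytes * ratePerGb;
}

const share = sampledShare(0.05);  // 0.05 + 0.10 × 0.95 = 0.145
const cost = b2MonthlyCost(12);    // 12 × 0.006 = 0.072, i.e. ~$0.07
```

The cost covers one month of new data only; retained data from earlier months accumulates on top of it.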

10.2 Stratified sampling (segment priority)

const SEGMENT_PRIORITY = [
  'e-commerce',        // strongest signal /shop|store|cart|product/
  'dashboard-app',     // /app|dashboard|admin/
  'tech-blog',         // /blog|medium|substack/
  'news-media',        // /news|times|post/
  'marketing-landing'  // /landing|home|about/ — weakest, catch-all
];
 
function detectSegment(url: string): string {
  for (const segment of SEGMENT_PRIORITY) {
    if (SEGMENTS[segment].test(url)) return segment;
  }
  return 'other';
}
 
async function buildTrainingDataset() {
  const all = await db.shadowDataset.all();
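  // url_hash cannot be matched against the segment patterns, so this assumes each
  // entry stores a segment computed via detectSegment(url) before the URL was hashed
  const bySegment = groupBy(all, e => e.segment);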
  const maxPerSegment = Math.floor(all.length * 0.25);  // cap 25% per segment
  const balanced = [];
  for (const items of Object.values(bySegment)) {
    balanced.push(...sample(items, Math.min(items.length, maxPerSegment)));
  }
  return shuffle(balanced);
}
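
The SEGMENTS lookup used by detectSegment is not defined above. A minimal self-contained version, built from the patterns in the SEGMENT_PRIORITY comments, might look like this; treat the exact regexes as placeholders.

```typescript
const SEGMENT_PRIORITY = [
  'e-commerce',
  'dashboard-app',
  'tech-blog',
  'news-media',
  'marketing-landing',
] as const;

type Segment = typeof SEGMENT_PRIORITY[number];

// Patterns copied from the comments in the spec; no `g` flag, so test() is stateless.
const SEGMENTS: Record<Segment, RegExp> = {
  'e-commerce': /shop|store|cart|product/,
  'dashboard-app': /app|dashboard|admin/,
  'tech-blog': /blog|medium|substack/,
  'news-media': /news|times|post/,
  'marketing-landing': /landing|home|about/,
};

// The first matching segment in priority order wins.
function detectSegment(url: string): Segment | 'other' {
  for (const segment of SEGMENT_PRIORITY) {
    if (SEGMENTS[segment].test(url)) return segment;
  }
  return 'other';
}
```

Priority order matters because the patterns overlap: a URL such as /blog/post-1 matches both tech-blog and news-media, and the earlier entry is returned. Detection has to run on the raw URL, before it is hashed for storage.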

10.3 Quarterly training

  • Dataset: ~10k gold + augmented uncorrected (the Qwen3-VL-235B teacher generates labels for the uncorrected samples)
  • Student: Qwen3-VL-32B + LoRA (rank 16, alpha 32, lr 1e-4)
  • Hardware: A100 on Modal Labs, ~$110-300 per training run
  • Per-quarter cap: $1500. Cumulative hard cap: $5000

10.4 Pareto-front deployment criteria

ALL must pass on the holdout set:

  • cost per URL ≤ 110% baseline
  • completeness coverage ≥ baseline
  • p95 latency ≤ 110% baseline
  • acceptance_rate ≥ baseline − 2%
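
The four criteria above can be expressed as a single check that reports which ones failed. In this sketch, `HoldoutMetrics` and `failedCriteria` are illustrative names, and "baseline − 2%" is read as two percentage points, which is an assumption.

```typescript
interface HoldoutMetrics {
  costPerUrl: number;            // USD
  completenessCoverage: number;  // 0-1
  p95LatencyMs: number;
  acceptanceRate: number;        // 0-1
}

// Returns the names of failed criteria; an empty array means the candidate may deploy.
function failedCriteria(candidate: HoldoutMetrics, baseline: HoldoutMetrics): string[] {
  const failed: string[] = [];
  if (candidate.costPerUrl > baseline.costPerUrl * 1.10) failed.push('cost');
  if (candidate.completenessCoverage < baseline.completenessCoverage) failed.push('completeness');
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * 1.10) failed.push('latency');
  // "baseline − 2%" is interpreted as two percentage points (assumption).
  if (candidate.acceptanceRate < baseline.acceptanceRate - 0.02) failed.push('acceptance');
  return failed;
}
```

Returning the failed names rather than a boolean supports the escape valve, where later quarters relax only the cost and latency bounds.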

Ramp safety: 5% shadow A/B for 7 days → degradation > 5% any metric → auto rollback.

Escape valve (if 3 consecutive quarters fail):

  1. Q4 fail → relax (cost 120%, latency 115%)
  2. Q5 fail → switch teacher (Qwen ↔ DeepSeek)
  3. Q6 fail → suspend training, focus cache + atoms (2/3 multipliers still work)

10.5 Teacher selection

| Teacher | License | Quality vs Claude 4.6 | Cost |
|---|---|---|---|
| Qwen3-VL-235B-A22B-Instruct (Sep 2025) | Apache 2.0 | ~92% | $0.50/1M |
| DeepSeek-V3 | DeepSeek License (commercial OK) | ~88% | $0.27/1M |
| Llama 3.3 70B | Llama Community | ~85% | $0.59/1M |

Primary: Qwen3-VL-235B-A22B-Instruct. Native 256K context, MoE with 22B active params. Visual coding capabilities (Draw.io/HTML/CSS/JS generation) are directly applicable to the URL-import use case. See ADR 0014, ADR 0015.


§ XI. Safety / P0 fixes

11.1 SSRF + DNS random pick

const BLOCKED_NETWORKS = [
  '0.0.0.0/8', '10.0.0.0/8', '127.0.0.0/8',
  '169.254.0.0/16', '172.16.0.0/12', '192.168.0.0/16',
  '::1/128', 'fc00::/7', 'fe80::/10'
];
 
async function safeFetchGuard(url: string): Promise<{ ip: string }> {
  const parsed = new URL(url);
  if (!['http:', 'https:'].includes(parsed.protocol)) throw new SSRFError('scheme');
 
  const ips = await dns.resolve(parsed.hostname);
  const safeIps = ips.filter(ip => !isInBlockedNetwork(ip));
  if (safeIps.length === 0) throw new SSRFError('no_safe_ips');
 
  await rateLimiter.check(userId, { free: '10/hour', paid: '100/hour' });
 
  // Random pick from safe IPs — aggregate behavior = load-balanced across extractions
  return { ip: safeIps[Math.floor(Math.random() * safeIps.length)] };
}

Pinning via Playwright --host-resolver-rules=MAP hostname ip (see Phase 1). This closes DNS rebinding.
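
safeFetchGuard relies on isInBlockedNetwork, which is not shown. A minimal IPv4-only sketch of the CIDR match follows; `ipv4ToInt`, `inCidr`, and `isBlockedIpv4` are illustrative names, and the IPv6 ranges from BLOCKED_NETWORKS would need a separate matcher.

```typescript
// Converts dotted IPv4 to an unsigned 32-bit integer, or null if the input is not valid IPv4.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when ip falls inside the cidr block. Uses division instead of bit shifts
// to avoid signed 32-bit overflow.
function inCidr(ip: string, cidr: string): boolean {
  const [base = '', bitsText = ''] = cidr.split('/');
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  const bits = Number(bitsText);
  if (ipInt === null || baseInt === null || bitsText === '') return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// IPv4 subset of BLOCKED_NETWORKS.
const BLOCKED_V4 = [
  '0.0.0.0/8', '10.0.0.0/8', '127.0.0.0/8',
  '169.254.0.0/16', '172.16.0.0/12', '192.168.0.0/16',
];

// Fails closed: anything that does not parse as IPv4 is treated as blocked in this sketch.
function isBlockedIpv4(ip: string): boolean {
  if (ipv4ToInt(ip) === null) return true;
  return BLOCKED_V4.some(cidr => inCidr(ip, cidr));
}
```

Failing closed on unparseable input is a deliberate choice for an SSRF guard: an address the matcher does not understand is never fetched.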

11.2 ARNO identity (UUIDv7)

Timestamp-embedded, lexicographically sortable. DB constraint UNIQUE NOT NULL, retry on collision.

11.3 Copyright/IP + ToS-based shadow disclosure

Registration UI (copyright checkbox only, NO shadow opt-in):

URL: [_______________________]

☑ I own the rights to this URL or hold a license
   License: ⊙ Owned  ○ Licensed  ○ Public domain

⚠️ [for Tranco top 15k commercial sites, an extra checkbox is required]

[By accepting the Terms of Service, you agree to the use of
 anonymized data. See Privacy Policy.]

[Import]

Settings → Privacy (opt-out path):

Privacy Settings
  ☑ Anonymous data contribution  (default ON)
  [Learn more] [Delete past contributions]

Domain reputation check:

const commercialDomains = new BloomFilter(loadTrancoTop15k());  // buffer zone vs top 10k
 
function checkDomainReputation(url): 'normal' | 'requires_extra_confirmation' {
  return commercialDomains.has(new URL(url).hostname)
    ? 'requires_extra_confirmation' : 'normal';
}
 
// Daily cron rebuild Bloom from Tranco
async function refreshDomainReputation() {
  const trancoList = await fetch('https://tranco-list.eu/top-1m.csv.zip');
  const top15k = parseAndExtract(trancoList, 15000);
  const newBloom = BloomFilter.from(top15k, { errorRate: 0.01, size: 150_000 });
  await b2.upload('shared/tranco-bloom.bin', newBloom.serialize());
  await broadcast.send('reload-bloom-filter');
}

Tranco attribution in the Privacy Policy footer (CC-BY 4.0 requirement).

11.4 GDPR retention matrix

| Data | Retention | Reason |
|---|---|---|
| TSX, spec | User lifetime | User content |
| Original screenshots | 90 days | Debug |
| → 90-365 days | pHash + DINOv2 embedding (binary, not recoverable) | Cache |
| → after 365 | Metadata only | Audit |
| Staging active | < 90d since last_activity | Working set |
| Staging notified | 90-120 days + email "30 days until deletion" | Decision window |
| Staging deleted | 120 days (physical delete) | Final cleanup |
| Shadow url_hash | 2 years | Training |
| Shadow corrections | Anonymized after 90 days | GDPR |
| user_attestation | 7 years | Legal audit |

11.4.1 GDPR cascade deletion (saga pattern)

B2 deletes run first (idempotent retry), THEN the DB transaction. A background sweeper handles orphan B2 files.

async function deleteUserData(userId: string) {
  try {
    // 1. B2 deletes (idempotent retry)
    await retry(async () => {
      await b2.deletePrefix(`staging/${userId}/`);
      await b2.deletePrefix(`shadow/${userId}/`);
      await b2.deletePrefix(`screenshots/${userId}/`);
    }, { attempts: 3, backoff: 'exponential' });
 
    // 2. Marker — B2 cleaned
    await db.gdprDeletions.create({
      user_id_hash: sha256(userId),
      b2_deleted_at: now()
    });
 
    // 3. DB cascade transaction
    await db.transaction(async tx => {
      await tx.components.where({ user_id: userId }).delete();
      await tx.shadowDataset.where({ user_id: userId }).delete();
      await yjs.deleteUserDocuments(userId);
      await tx.atomCatalog.where({ contributed_by: userId })
        .update({ contributed_by: null, provenance: 'anonymized' });
      await tx.componentEmbeddings.where({ user_id: userId }).delete();
      await tx.users.delete(userId);
    });
 
    await db.gdprDeletions.update(
      { user_id_hash: sha256(userId) },
      { completed_at: now() }
    );
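  } catch (e) {
    // `e` is `unknown` under strict TypeScript, so narrow before reading .message
    const message = e instanceof Error ? e.message : String(e);
    await db.failedDeletions.create({ user_id_hash: sha256(userId), error: message });
    throw e;
  }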
}
 
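// Background sweeper for orphan B2 files.
// Only user-scoped roots are swept: without this guard a key such as
// 'shared/tranco-bloom.bin' would be parsed as a missing user and deleted.
const USER_SCOPED_ROOTS = new Set(['staging', 'shadow', 'screenshots']);

async function sweepOrphanB2Files() {
  // Prefixes are expected to look like '<root>/<userId>/'
  const b2Prefixes = await b2.listTopLevelPrefixes();
  for (const prefix of b2Prefixes) {
    const [root, userId] = prefix.split('/');
    if (!USER_SCOPED_ROOTS.has(root) || !userId) continue;
    if (!await db.users.exists(userId)) {
      await b2.deletePrefix(prefix);
    }
  }
}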

11.5 Latency SLO

| Metric | Target | Hard limit |
|---|---|---|
| p50 code-only | < 10s (including E5 batch ~300ms) | |
| p50 code+vision | < 20s | |
| p95 worst case | < 30s (Tier 2+ keep-warm OmniParser) | |
| Hard timeout | | 60s |
| Anti-bot retry | | 1 attempt (Hyperbrowser) |
| LLM call | | 10s timeout |
| Vision call | | 15s timeout |
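One way to check recorded import durations against these targets is a nearest-rank percentile. This is a sketch under assumptions: the spec does not say how percentiles are computed, and taking p95 over all imports combined is a guess; `percentile`, `sloBreaches` and `SLO_MS` are hypothetical names.

```typescript
// Targets from the SLO table, in milliseconds.
const SLO_MS = { p50CodeOnly: 10_000, p50CodeVision: 20_000, p95WorstCase: 30_000 };

// Nearest-rank percentile: the smallest sample with at least p of the data at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function sloBreaches(samples: { codeOnly: number[]; codeVision: number[] }): string[] {
  const breaches: string[] = [];
  if (percentile(samples.codeOnly, 0.5) >= SLO_MS.p50CodeOnly) breaches.push('p50 code-only');
  if (percentile(samples.codeVision, 0.5) >= SLO_MS.p50CodeVision) breaches.push('p50 code+vision');
  // Assumption: "worst case" is measured across both paths together.
  const all = [...samples.codeOnly, ...samples.codeVision];
  if (percentile(all, 0.95) >= SLO_MS.p95WorstCase) breaches.push('p95 worst case');
  return breaches;
}
```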

11.6 Anti-bot fallback + surgical rate-limit

async function checkHyperbrowserBudget() {
  const window = await metrics.window('hyperbrowser_usage', { minutes: 10 });
 
  // Per-domain check
  const byDomain = groupBy(window.events, e => new URL(e.url).hostname);
  for (const [domain, events] of Object.entries(byDomain)) {
    if (events.length / window.total > 0.20) {
      await domainRateLimit.set(domain, 0.1);
      await alert.send(`Domain ${domain} > 20% Hyperbrowser usage`);
    }
  }
 
  // Per-user check
  const byUser = groupBy(window.events, e => e.user_id);
  for (const [userId, events] of Object.entries(byUser)) {
    if (events.length / window.total > 0.10) {
      await userRateLimit.set(userId, 0.5);
    }
  }
 
  // Global only if true distributed attack
  if (window.unique_users > 100 && window.ratio > 0.15) {
    await globalRateLimit.set(0.5);
    await alert.pageOnCall('Distributed attack pattern');
  }
}
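// Catch failures so a metrics outage does not become an unhandled promise rejection
setInterval(() => {
  checkHyperbrowserBudget().catch(err => console.error('Hyperbrowser budget check failed', err));
}, 60_000);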

11.7 ARNO integration safety

  • V1: Modal Volume persistent + B2 async (no git auth required)
  • V2: GitHub App + direct PR
  • Branch naming is collision-safe (see Phase 10)
  • Never write to main automatically: always open a PR

11.8 XSS sanitization (AST whitelist with safe callee check)

import { Project, Node, SyntaxKind } from 'ts-morph';
 
const ALLOWED_JSX_VALUE_KINDS = [
  SyntaxKind.StringLiteral, SyntaxKind.NumericLiteral,
  SyntaxKind.TrueKeyword, SyntaxKind.FalseKeyword, SyntaxKind.NullKeyword,
  SyntaxKind.PropertyAccessExpression,
  SyntaxKind.Identifier,
  SyntaxKind.ConditionalExpression,
  SyntaxKind.BinaryExpression,
  SyntaxKind.TemplateExpression,
  SyntaxKind.ArrowFunction,        // event handlers
  SyntaxKind.FunctionExpression
];
 
const DANGEROUS_CALLEES = [
  'fetch', 'eval', 'Function', 'setTimeout', 'setInterval',
  'XMLHttpRequest', 'window', 'document', 'globalThis', 'self'
];
 
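// Assumed helper (not defined elsewhere in this spec):
// returns the leftmost expression of a property chain, e.g. a.b.c → 'a'
function getRootIdentifier(expr: Node): string {
  let current: Node = expr;
  while (Node.isPropertyAccessExpression(current)) current = current.getExpression();
  return current.getText();
}

function isCallExpressionSafe(node: Node): boolean {
  if (!Node.isCallExpression(node)) return false;
  const callee = node.getExpression();
  if (Node.isIdentifier(callee)) return !DANGEROUS_CALLEES.includes(callee.getText());
  if (Node.isPropertyAccessExpression(callee)) {
    return !DANGEROUS_CALLEES.includes(getRootIdentifier(callee));
  }
  return false;  // computed access (foo['fetch']) is blocked
}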
 
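function sanitizeTsx(tsx: string) {
  const project = new Project({ useInMemoryFileSystem: true });
  const sf = project.createSourceFile('temp.tsx', tsx);
  const report = { rejected_expressions: [] as string[], dangerous_attributes: 0 };

  // Snapshot the nodes first, then mutate: editing while forEachDescendant is
  // walking can leave the traversal on nodes an earlier edit already invalidated.
  for (const attr of sf.getDescendantsOfKind(SyntaxKind.JsxAttribute)) {
    if (attr.getNameNode().getText() === 'dangerouslySetInnerHTML') {
      attr.remove();
      report.dangerous_attributes++;
    }
  }

  for (const node of sf.getDescendantsOfKind(SyntaxKind.JsxExpression)) {
    if (node.wasForgotten()) continue;  // an enclosing expression was already replaced
    const expr = node.getExpression();
    if (!expr) continue;
    const kind = expr.getKind();
    const topLevelOk = ALLOWED_JSX_VALUE_KINDS.includes(kind)
      || (kind === SyntaxKind.CallExpression && isCallExpressionSafe(expr));
    // An allowed wrapper (arrow function, ternary, template) must not hide an unsafe call
    const nestedOk = expr.getDescendantsOfKind(SyntaxKind.CallExpression)
      .every(call => isCallExpressionSafe(call));
    if (topLevelOk && nestedOk) continue;

    // Record the kind BEFORE replacing: `expr` is forgotten once its parent is rewritten
    report.rejected_expressions.push(expr.getKindName());
    node.replaceWithText('{null}');
  }

  return { tsx: sf.getFullText(), removed: report };
}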

11.9 Disaster recovery

| Outage | Behavior |
|---|---|
| Gemini API down | Queue + retry 5min → fallback Cerebras |
| Cerebras down | Fallback to Gemini (cost spike, alert) |
| Modal Labs down | Pause LoRA; production continues via Gemini |
| Postgres / Supabase down | Read-only mode; new imports blocked |
| B2 down | Modal Volume queue + async retry (backoff 1m/5m/15m/1h) |
| Hyperbrowser down | Anti-bot URLs fail immediately |
| DNS down | SSRF guard fails safe (deny) |
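The B2 backoff schedule (1m/5m/15m/1h) can be expressed as a small lookup. A minimal sketch, assuming attempts are 1-based and that retries continue hourly once the schedule is exhausted (the spec does not say what happens after the fourth step); `nextB2RetryDelayMs` is a hypothetical helper:

```typescript
// Backoff steps in minutes, taken from the "B2 down" behavior.
const B2_RETRY_DELAYS_MIN = [1, 5, 15, 60];

function nextB2RetryDelayMs(attempt: number): number {
  // Clamp to the schedule: attempt 1 -> 1m, 2 -> 5m, 3 -> 15m, 4+ -> 1h (assumption).
  const idx = Math.min(Math.max(attempt, 1), B2_RETRY_DELAYS_MIN.length) - 1;
  return B2_RETRY_DELAYS_MIN[idx] * 60_000;
}
```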

Status page: self-hosted Cachet on Cloudflare Workers (free CF Workers tier, 100k req/day). URL: status.arno.app.

11.10 Cost monitoring (relative thresholds)

async function checkCostAnomaly() {
  const todayCost = await metrics.dailyCost('gemini');
  const sevenDayAvg = await metrics.avg('gemini.daily_cost', { days: 7 });
  if (todayCost > sevenDayAvg * 1.5 && todayCost > 5) {
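    // Await so a failed alert surfaces here; "$$" prints a literal dollar sign before the amount
    await alert.send(`Gemini cost $${todayCost.toFixed(2)} > 150% of 7-day avg`);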
  }
}

| Alert | Trigger |
|---|---|
| Gemini/Modal cost | > 150% of 7-day avg AND > $5 floor |
| Hyperbrowser usage | > 5% over 10 min → surgical rate-limit |
| Failure rate | > 10% per hour |
| p95 latency | > 45s sustained |
| Single-user URL volume | > 1000/day → anti-abuse review |
| Vision activation | > 150% of baseline rate |
| Staging growth | > 120% of 7-day avg |
| B2 queue depth | > 1 hour pending uploads |

11.11 Staging area specification (V1)

V1 flow:

Registration → URL submit → extract → Modal Volume local + async B2

                          Result UI: "10 components ready"
                          [Edit in ARNO] [Connect Git to push]

                  Later: GitHub OAuth → push from staging

Storage math: 50 MB × 10k users = 500 GB; with the 90d lifecycle that is ~$3/mo hot. At 100k users: ~$30/mo.
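The arithmetic behind those figures, as a small sketch. The $6 per TB-month rate is not stated in the spec; it is simply the rate implied by $3 for 500 GB, and decimal units (1 TB = 1,000,000 MB) are assumed. `stagingStorageUsdPerMonth` is a hypothetical name.

```typescript
const MB_PER_USER = 50;
const USD_PER_TB_MONTH = 6; // assumption: implied by "$3/mo for 10k users"

function stagingStorageUsdPerMonth(users: number): number {
  const terabytes = (MB_PER_USER * users) / 1_000_000; // decimal TB
  return terabytes * USD_PER_TB_MONTH;
}

console.log(stagingStorageUsdPerMonth(10_000));  // 500 GB
console.log(stagingStorageUsdPerMonth(100_000)); // 5 TB
```

This treats every user as holding the full 50 MB at once; the 90d lifecycle would lower the steady-state footprint, so the result is closer to an upper bound.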

Why staging: 70%+ of small-biz users have no git (Webflow, Squarespace). Forcing git at registration = drop-off. Staging = "try it", so conversion is higher.


§ XII. Failure modes

12.1-12.3 What each layer can NOT do

Code path:

  • Auth-gated content (login required): HAR upload parked
  • Canvas-rendered UI (Figma embed, charts): vision-only, low confidence
  • WebGL/Three.js: out of our domain
  • Cross-origin iframes: often blocked by CSP
  • Web Components Shadow DOM: V2
  • GSAP imperative animations: static snapshots only
  • :has() selectors: partial, browser-dependent
  • Server Components / React Streaming: degraded fidelity
  • CSS-in-JS runtime-generated class names: computed_style fallback

Vision path:

  • Exact design tokens (colors are only approximated)
  • Nested semantics (ARIA correctness)
  • Dynamic behavior (transitions, durations)
  • Micro-interactions

LLM enrichment:

  • Compute values (it only fills fields in)
  • Invent missing fields (returns null + confidence 0)
  • Guess props without HTML context

12.4 Decision tree (extended)

Phase 1 fail:
  → anti-bot signal? Hyperbrowser (1 retry)
  → network error? abort 'extraction_blocked'
  → timeout 30s? abort 'site_too_slow'

Phase 2 fail (capture matrix incomplete):
  → continue with partial matrix, flag missing combos

Phase 3 fail (coverage < 0.90):
  → Phase 4 LLM enrichment

Phase 4 fail (LLM timeout):
  → skip enrichment, Phase 5 with partial spec
  → mode 'code-only-degraded'

Phase 5 fail (generation error):
  → retry with alternative model (Cerebras → Gemini fallback)
  → still fails → abort 'generation_failed'

Phase 6 fail:
  → Phase 7

Phase 7 fail after full vision:
  → mode 'failed', save partial, manual review

Phase 8 fail (completeness < 0.5):
  → save with warning, flag manual review

Phase 9 fail (DINOv2/pgvector down):
  → skip uniqueness, default 'new'

Phase 10 fail (B2 down):
  → Modal Volume queue + async retry
  → > 1h queued → alert
  → 24h queued → notify user "delayed sync"
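The tree above can be sketched as data plus one branch function, which keeps the per-phase reaction in a single reviewable place. This is an illustrative encoding only: the `Decision` shape, the flag strings and the function names are assumptions, and second-step reactions (such as Phase 5 aborting with 'generation_failed' after the retry fails) are left to the caller.

```typescript
type Decision =
  | { kind: 'retry'; via: string }
  | { kind: 'abort'; code: string }
  | { kind: 'continue'; nextPhase?: number; mode?: string; flag?: string }
  | { kind: 'queue'; note: string };

type Phase1Reason = 'anti_bot' | 'network_error' | 'timeout';

// Phase 1 is the only phase whose reaction depends on the failure reason.
function onPhase1Failure(reason: Phase1Reason): Decision {
  switch (reason) {
    case 'anti_bot': return { kind: 'retry', via: 'hyperbrowser' }; // 1 retry
    case 'network_error': return { kind: 'abort', code: 'extraction_blocked' };
    case 'timeout': return { kind: 'abort', code: 'site_too_slow' };
  }
}

// First reaction for phases 2-10. nextPhase is set only where the tree names it.
const FIRST_REACTION: Record<number, Decision> = {
  2: { kind: 'continue', flag: 'missing_matrix_combos' },
  3: { kind: 'continue', nextPhase: 4 },
  4: { kind: 'continue', nextPhase: 5, mode: 'code-only-degraded' },
  5: { kind: 'retry', via: 'alternative_model' },
  6: { kind: 'continue', nextPhase: 7 },
  7: { kind: 'continue', mode: 'failed', flag: 'manual_review' },
  8: { kind: 'continue', flag: 'manual_review' },
  9: { kind: 'continue', flag: 'uniqueness_skipped_default_new' },
  10: { kind: 'queue', note: 'Modal Volume queue + async retry' },
};

function onPhaseFailure(phase: number): Decision {
  const decision = FIRST_REACTION[phase];
  if (!decision) throw new Error(`No failure policy for phase ${phase}`);
  return decision;
}
```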

12.5 Error UX

✅ Button (3 variants, 2 themes): code-only
✅ Card: code+vision (4 fields via vision)
⚠️ Modal: partial (vision-only, 78%), hover on close was not extracted
❌ CustomChart: not imported (canvas-rendered)

12.6 V2 backlog edge cases

| Case | V1 behavior | V2 plan |
|---|---|---|
| GSAP/Framer Motion | Static snapshots | JS animation parser |
| :has() selectors | Captured, may differ | Feature detection |
| Web Components | Skipped | Shadow DOM traversal |
| Streaming SSR | Final HTML only | React DevTools integration |
| Mobile touch interactions | Not captured | Touch simulation |
| Container queries | Limited | Polyfill |
| Multi-page sitemap | Single URL | Crawler |
| Auth-gated sites | Blocked | HAR upload |
| Compound components (Tabs.Item) | Detection + recurse | Full support |
| forwardRef + generics | Basic substitution | AST-aware resolution |

§ XIII. UX

Registration (clean, no shadow checkbox)

1. Email + password
2. Accept Terms of Service [← discloses shadow data usage]
3. "Tell us about your company" (optional)
4. "Do you have a website? We'll import its components"
   URL: [_______________________]
   ☑ I own the rights to this URL
   License: ⊙ Owned  ○ Licensed  ○ Public domain
   ⚠️ [warning for commercial sites in the Tranco top 15k]
   [Import] [Skip]

Progress (parallel steps marked)

Importing example.com...

✅ HTML/CSS extracted                              2s
✅ 12 components detected                          5s
🔄 Theme analysis (light + dark)    ║parallel║     8s
🔄 State capture (5 states)         ║parallel║     12s
🔄 Viewport capture (4 sizes)       ║parallel║     15s
🔄 Generating components (8/12)...                22s
🔄 Verifying fidelity (10/12)...                  27s
✅ Done                                            30s

Result

Imported: 10 of 12 components

✅ Button (3 variants)         code-only       [Open]
✅ Card                         code+vision     [Open]
✅ Navbar                       code-only       [Open]
...
⚠️ Pricing (vision-only, 78%)                    [Re-extract] [Delete]
❌ CustomChart (canvas)                          [Create manually]

[Open editor]  [Import another URL]
[Connect Git to push to repo]  ← V1 only

Settings → Privacy (opt-out path)

Privacy Settings

  ☑ Anonymous data contribution
     Help improve ARNO by allowing anonymized usage data.
     [What we collect] [Delete past contributions]

  Account
     [Download my data] [Delete my account]

§ XIV. Thresholds + calibration plan

14.1 Defaults

| Threshold | Default | Note |
|---|---|---|
| Coverage gate (Phase 3) | 0.90 | calibrate after 1k extractions OR 30 days |
| Acceptance pixel-diff (Phase 6) | 0.30 | |
| Completeness diff (Phase 8) | 0.15 | ignoreText, ignoreImages |
| Completeness coverage (Phase 8) | 0.90 | |
| Uniqueness K (Phase 9) | 5 | top-K neighbors |
| Uniqueness distance | 0.4 | cosine cutoff |
| Cache L2 (DINOv2) | 0.95 | |
| Cache L3 (E5 atoms) | 0.85 | |
| Hard timeout | 60s | |
| Hyperbrowser usage cap | 5% | auto rate-limit |
| Vision activation baseline | 40% | |
| WCAG contrast target | 4.5 | AA, flag manifest if below |
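These defaults could live in one typed config object with per-deployment overrides. A minimal sketch: the field names, the millisecond and fraction units, and `loadThresholds` are assumptions for illustration, not names from the codebase.

```typescript
interface Thresholds {
  coverageGate: number;             // Phase 3
  acceptancePixelDiff: number;      // Phase 6
  completenessDiff: number;         // Phase 8
  completenessCoverage: number;     // Phase 8
  uniquenessK: number;              // Phase 9, top-K neighbors
  uniquenessDistance: number;       // cosine cutoff
  cacheL2: number;                  // DINOv2
  cacheL3: number;                  // E5 atoms
  hardTimeoutMs: number;
  hyperbrowserUsageCap: number;     // fraction of imports
  visionActivationBaseline: number; // fraction of imports
  wcagContrastTarget: number;       // AA
}

const DEFAULT_THRESHOLDS: Readonly<Thresholds> = {
  coverageGate: 0.9,
  acceptancePixelDiff: 0.3,
  completenessDiff: 0.15,
  completenessCoverage: 0.9,
  uniquenessK: 5,
  uniquenessDistance: 0.4,
  cacheL2: 0.95,
  cacheL3: 0.85,
  hardTimeoutMs: 60_000,
  hyperbrowserUsageCap: 0.05,
  visionActivationBaseline: 0.4,
  wcagContrastTarget: 4.5,
};

// Merge overrides onto the defaults and reject values that cannot be thresholds.
function loadThresholds(overrides: Partial<Thresholds> = {}): Thresholds {
  const merged: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  for (const [key, value] of Object.entries(merged)) {
    if (!Number.isFinite(value) || value < 0) {
      throw new Error(`Invalid threshold ${key}: ${value}`);
    }
  }
  return merged;
}
```

Usage: `loadThresholds({ coverageGate: 0.85 })` changes one value after a calibration round and leaves the rest at their defaults.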

14.2 Calibration plan

Trigger: calibrate at the earliest of:
  - 1000+ extractions accumulated, OR
  - 30 days since launch

→ analyze distribution → adjust thresholds to natural breakpoints
  target: 95% of "fully extractable" cases pass the coverage gate

Re-evaluate quarterly with at least 1k new extractions.
If volume has not reached 1k/quarter → keep current values, document staleness.

All thresholds are tunable via config. No "empirical" values without data.
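The two rules above reduce to a pair of predicates. A minimal sketch with hypothetical function names, assuming "1000+" and "30 days" are inclusive bounds:

```typescript
// First calibration: whichever condition is met first.
function shouldCalibrate(extractions: number, daysSinceLaunch: number): boolean {
  return extractions >= 1000 || daysSinceLaunch >= 30;
}

// Quarterly review: recalibrate only with enough fresh data.
function quarterlyAction(
  newExtractionsThisQuarter: number,
): 'recalibrate' | 'keep_and_document_staleness' {
  return newExtractionsThisQuarter >= 1000
    ? 'recalibrate'
    : 'keep_and_document_staleness';
}
```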


§ XV. Open questions

Pre-V1 launch blockers

| # | Question | Owner |
|---|---|---|
| 1 | ✅ DINOv2 ViT-S/14 (384 dim) chosen: UI screenshots are a constrained domain; saves 50% storage vs ViT-B/14; post-launch reject-rate monitoring, upgrade triggered if > 20% | Done |
| 5 | Copyright UI + ToS legal review (~$500-1000 with privacy lawyer): brief prepared, awaiting external review | Legal counsel + Vadim send-out |
| 10 | Atom catalog seeding: confirmed, pre-load shadcn/ui | Done |
| 16 | Atom decomposition PoC validation (Task #6) | Claude |
| 17 | ✅ XSS corpus 24/24 pass: script-tag, event-handlers, dangerouslySetInnerHTML, js: URLs, alias bypass, computed access, new expressions. Implementation in packages/atom-poc/src/sanitize.ts + xss-corpus/. Run pnpm xss | Done |
| 18 | ✅ Qwen model verified: Qwen3-VL-32B-Instruct + Qwen3-VL-235B-A22B-Instruct (Apache 2.0) | Done |

Pre-scale blockers (V1 is OK without them)

| # | Question | Needed by |
|---|---|---|
| 2 | pgvector index strategy (IVFFlat vs HNSW) | Pre-50k URLs/mo |
| 3 | LoRA training infra (Modal/RunPod/Lambda) | Q2 deployment |
| 6 | Hyperbrowser cost cap (per-user vs global) | After data collected |

V2+ scope

| # | Question |
|---|---|
| 7 | Vue/Svelte priority |
| 8 | Auth-gated sites HAR upload |
| 9 | Multi-page sitemap crawl |
| 11 | Multi-tenancy caches/atoms (privacy review) |
| 12 | Component versioning UI |
| 13 | Cross-component dependencies |
| 14 | i18n / RTL handling |
| 15 | 3+ theme variants (high-contrast) |

§ XVI. Where it lives in ARNO docs

| File | Purpose |
|---|---|
| _index.md | Master spec: unpark §V in v1.3 |
| Implementation_Workflow.md | URL-import phase in roadmap |
| url_import_spec.md | This file: canonical detailed spec |
| url_import_atoms_a1_a8.md | Atom reference + decomposition algorithm |
| url_import_tos_clause_draft.md | ToS draft (legal review pending) |
| url_import_legal_review_brief.md | Q&A pack for lawyer (10 questions structured by topic) |
| adr/0007-url-import-code-first.md | code-first vs vision-first |
| adr/0008-hybrid-stack-prescription.md | Priority chains + reactive vision |
| adr/0009-empirical-acceptance.md | 3 bool tests vs weighted scores |
| adr/0010-completeness-matrix.md | Matrix (state × viewport × theme) |
| adr/0011-reactive-vision.md | Reactive activation |
| adr/0012-user-decided-uniqueness.md | Top-K → user |
| adr/0013-cost-decrease-requirement.md | 3 multipliers |
| adr/0014-legal-clean-distillation.md | Qwen teacher, not Anthropic |
| adr/0015-no-claude-in-stack.md | Explicit non-use of Claude API |
| adr/0016-staging-area-v1.md | Staging + Modal Volume + B2 async |
| adr/0017-shadow-data-tos-disclosure.md | ToS-based vs checkbox |
| adr/0018-atom-embeddings-e5-small.md | E5-small vs DINOv2 for atoms |

Post-launch (after real data):

  • adr/runbooks/url_import_failures.md
  • adr/runbooks/cost_alerts.md
  • adr/runbooks/b2_outage.md