ADR 0018 - Atom embeddings: E5-small text, not DINOv2 visual
  • Date: 2026-05-22
  • Status: Accepted
  • Feature: URL-import (atoms — cache L3)
  • Affects: url_import_spec.md § VII.3, § VIII

Context

Cache L3 (see ADR 0013, multiplier 1) matches atoms against the catalog. This requires an embedding function for atoms.

Question: which embedding model should be used for atoms?

The initial draft assumed DINOv2 (the same model as the Phase 9 component uniqueness check). But atoms are not the same kind of entity as components:

Aspect | Component | Atom
Granularity | Full UI block | Primitive (Surface, Label, Icon, etc.)
Visual fingerprint | Distinctive | Generic
Render cost | High (~1s GPU) | Low (no render needed)
Semantic load | Composition | Type + styles

For atoms, semantic similarity matters more than visual similarity:

  • <button class="btn-primary">Submit</button> and <a class="link-button" role="button">Submit</a> look alike but are semantically different
  • Conversely: two <div class="surface"> elements can look different (cards and dialogs), but semantically both are A1 Surface

DINOv2 (visual) misses this difference. A semantic-aware embedding is needed.

Decision

E5-small multilingual (384 dim) for atom embeddings:

import { pipeline } from '@xenova/transformers';

// Atom and summarizeStyles are defined elsewhere in the project.
// Load the model once; every call below reuses this pipeline instance.
const embedder = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small');

// The E5 model card asks for a "query: " or "passage: " prefix on every input;
// "query: " is the one it recommends for symmetric similarity tasks.
const E5_PREFIX = 'query: ';

function serializeAtomForEmbedding(atom: Atom): string {
  return E5_PREFIX + [
    `type:${atom.type}`,                                              // A1/A2/...
    `tag:${atom.node.tagName}`,                                       // div, button, etc
    `styles:${summarizeStyles(atom.styles).join(',')}`,              // flex,bg-blue,border
    `children:${atom.children?.map(c => c.type).join(',') ?? 'none'}` // composition
  ].join(' ');
}

async function embedAtom(atom: Atom): Promise<number[]> {
  const serialized = serializeAtomForEmbedding(atom);
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  return Array.from(output.data);  // 384 floats
}

// Batch for efficiency (~300ms for ~50 atoms)
async function batchEmbedAtoms(atoms: Atom[]): Promise<number[][]> {
  const serialized = atoms.map(serializeAtomForEmbedding);
  // For an array input the pipeline returns one tensor of shape [atoms.length, 384].
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  return output.tolist() as number[][];  // one 384-float row per atom
}

Storage: separate pgvector table atom_embeddings_text (384 dim).

  • vs component_embeddings_visual (768 dim DINOv2 for Phase 9 uniqueness)
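
A lookup against atom_embeddings_text could be sketched as follows. This is only an illustration: the column names (atom_id, embedding), the helper names, and the default limit are assumptions and do not come from the spec. The `<=>` operator is pgvector's cosine distance, so similarity is 1 minus that value.

```typescript
// Hypothetical helpers; the columns `atom_id` and `embedding` are assumed.

/** Format an embedding as a pgvector text literal, e.g. "[0.1,0.2]". */
function toPgVector(embedding: number[]): string {
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error('embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

/** Parameterized nearest-neighbour query: $1 is the vector literal, $2 the limit. */
const NEAREST_ATOMS_SQL = `
  SELECT atom_id, 1 - (embedding <=> $1::vector) AS cosine_similarity
  FROM atom_embeddings_text
  ORDER BY embedding <=> $1::vector
  LIMIT $2
`;

/** Build the parameter array for NEAREST_ATOMS_SQL. */
function nearestAtomsParams(embedding: number[], limit = 5): [string, number] {
  return [toPgVector(embedding), limit];
}
```

The query string and parameters can be passed to any Postgres client that supports positional parameters; keeping the vector as a bound parameter avoids string interpolation into SQL.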

Why E5-small wins

  1. Semantic, not visual:
    • Catches "button-like" elements regardless of visual style
    • Different tags, similar role → cluster together
  2. Cheap: CPU inference, 80MB model, ~50ms per atom (or ~300ms for a batch of ~50 atoms)
  3. No render needed: atoms serialize directly to text — no Playwright render cost
  4. Multilingual: handles EN/RU/CN class names uniformly
  5. MIT licensed (Xenova distribution): commercial use OK

Comparison

Approach | Compute | Storage | Quality for atoms
Render atom → DINOv2 | High (Playwright per atom) | 768 dim | Average (atoms are semantic, not visual)
E5-small text | Low (CPU, 50ms) | 384 dim | High (catches semantics)
CodeBERT on AST | Medium | 768 dim | Medium (code-aware but heavy)
Hybrid concat (visual+text) | Highest | 1152 dim | Best, overkill for V1

E5-small wins on cost/quality ratio for atom-level matching.

Why separate from component embedding

Component (Phase 9 uniqueness) — visual judgement ("does this look like existing Button?"). DINOv2 appropriate.

Atom (Phase 7 cache L3) — semantic match ("is this A1 Surface comparable?"). E5 appropriate.

Two different embedding spaces, two different pgvector tables. We do not attempt a unified embedding because the purposes differ.

Cache L3 hit rate trajectory

Conservative estimates (pending PoC validation):

Volume | L3 hit
100 URLs | 1%
1k | 4%
10k | 9%
100k | 13%

Pre-loaded shadcn/ui atoms (see § VII seeding) boost these figures by bootstrapping an immediate ~10% L3 hit rate.

Atom merging (cosine > 0.97)

A weekly cron job merges near-duplicate atoms: it keeps one canonical atom and aliases the duplicates to it. This reduces the vector count by ~30%.

E5 embeddings are stable enough for consistent cosine comparison. The 0.97 threshold was picked empirically and should be recalibrated after the PoC.
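
The merge step described above might look roughly like the greedy pass below. It is a sketch under assumptions: the AtomVector shape, the function names, and the "first seen becomes canonical" rule are illustrative, not the cron job's actual logic. It relies on embeddings being L2-normalized (as produced with normalize: true), so the dot product equals cosine similarity.

```typescript
// Hypothetical shape; the real atom record likely carries more fields.
type AtomVector = { id: string; embedding: number[] };

/** Dot product; equals cosine similarity when both vectors are L2-normalized. */
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

/**
 * Greedy merge: the first atom seen becomes canonical; any later atom whose
 * similarity to an existing canonical exceeds the threshold is aliased to it.
 * Cost is O(n * canonicals); a real job would likely query the vector index.
 */
function mergeNearDuplicates(atoms: AtomVector[], threshold = 0.97): Map<string, string> {
  const canonicals: AtomVector[] = [];
  const aliasOf = new Map<string, string>(); // duplicate id -> canonical id
  for (const atom of atoms) {
    const match = canonicals.find((c) => dot(c.embedding, atom.embedding) > threshold);
    if (match) aliasOf.set(atom.id, match.id);
    else canonicals.push(atom);
  }
  return aliasOf;
}
```

Because the pass is order-dependent, sorting atoms by creation time or usage count before merging would make the choice of canonical atom deterministic.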

Resource math

Volume | Atoms total | Storage
100 URLs × ~5 atoms = 500 | 500 vectors | 750KB pgvector
1k URLs × ~5 = 5k | 5k | 7.5MB
10k × ~5 = 50k (post-merge ~35k) | 35k | 52MB
100k × ~5 = 500k (post-merge ~350k) | 350k | 525MB

At 100k URLs/month: 525MB of pgvector storage for atoms. This fits the Supabase $25/month tier.
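
The storage column follows from 384 dimensions × 4 bytes = 1536 bytes, or 1.5KB per vector. The sketch below reproduces that arithmetic; it assumes float32 components and counts only the raw vector payload, so index and per-row overhead would come on top. The function and constant names are illustrative.

```typescript
const DIM = 384;            // E5-small embedding width (from this ADR)
const BYTES_PER_FLOAT = 4;  // assumption: float32 components

/** Raw vector payload in KB (1 KB = 1024 bytes), excluding index and row overhead. */
function vectorStorageKB(vectorCount: number, dim = DIM): number {
  return (vectorCount * dim * BYTES_PER_FLOAT) / 1024;
}

// Vector counts from the table above (post-merge counts where given).
for (const count of [500, 5_000, 35_000, 350_000]) {
  console.log(`${count} vectors ≈ ${vectorStorageKB(count)} KB`);
}
```

The table's MB figures divide these KB values by 1000 (for example 525,000KB is quoted as 525MB), which is close enough for capacity planning at this scale.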

Consequences

Pros:

  • Semantic atom matching — better cache hits than visual
  • CPU inference, no GPU needed
  • ~10× cheaper than the DINOv2 rendering path
  • Multilingual support free

Cons:

  • Two embedding pipelines (E5 for atoms, DINOv2 for components) add operational complexity
    • Mitigated: both are standalone and CPU-based, so there is no GPU contention
  • Different vector dimensions (384 vs 768) mean the two tables cannot be cross-searched
    • Acceptable: they serve different purposes

PoC validation criteria (Task #6)

Atom embedding quality needs validation (see the PoC section of the atoms doc):

  • Embedding clustering: same-type atoms cluster (silhouette score > 0.5)
  • Match accuracy: known similar atoms (e.g. 5 different Button atoms) → top-K neighbors include peers

If validation fails, fall back to CodeBERT (heavier but code-aware).
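
The match-accuracy criterion could be checked with something like the sketch below, which measures what fraction of an atom's top-K neighbours share its type. The LabeledAtom shape, the function names, and the type labels in the usage note are assumptions for illustration; the PoC may define the metric differently.

```typescript
// Hypothetical labeled sample; `type` is the known ground-truth atom type.
type LabeledAtom = { id: string; type: string; embedding: number[] };

/** Cosine similarity; works for non-normalized vectors too. */
function cosine(a: number[], b: number[]): number {
  let dotp = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dotp += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dotp / (Math.sqrt(na) * Math.sqrt(nb));
}

/** Fraction of the query atom's top-K neighbours that share its type. */
function peerHitRate(query: LabeledAtom, all: LabeledAtom[], k: number): number {
  const neighbours = all
    .filter((a) => a.id !== query.id)
    .map((a) => ({ atom: a, score: cosine(query.embedding, a.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
  if (neighbours.length === 0) return 0;
  return neighbours.filter((n) => n.atom.type === query.type).length / neighbours.length;
}
```

For the "5 different Button atoms" example, each Button would be used as the query with k = 4; a rate near 1 for every query would indicate that peers are found among the nearest neighbours.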

Alternatives rejected

A. DINOv2 atom embedding (initial draft)

  • ❌ Requires render per atom (Playwright overhead)
  • ❌ Visual similarity misses semantic equivalence
  • ❌ ~10× the cost of E5 text

B. CodeBERT (code-aware embedding)

  • ❌ 110MB model, slower than E5-small (~80MB)
  • ❌ Only marginally better quality, which does not offset the cost increase
  • ❌ Less multilingual

C. OpenAI embeddings (text-embedding-3-small)

  • ❌ Paid API ($0.020/1M tokens) — defeats free-tier philosophy
  • ❌ ToS risks for downstream training (see ADR 0015)

D. Hybrid concat (text + visual)

  • ❌ 1152 dim storage cost ×3
  • ❌ Marginal quality gain does not justify the complexity
  • ❌ Can be added later if the PoC shows a benefit

Cross-references