ADR 0018 - Atom embeddings: E5-small text, not DINOv2 visual

Date: 2026-05-22
Status: Accepted
Feature: URL-import (atoms — cache L3)
Affects: url_import_spec.md § VII.3, § VIII

Context

Cache L3 (see ADR 0013, multiplier 1) matches atoms against the catalog. This requires an embedding function for atoms.

Question: which embedding model should be used for atoms?

The initial draft assumed DINOv2 (the same model as the Phase 9 component uniqueness check). But atoms are not the same kind of entity as components:

Aspect	Component	Atom
Granularity	Full UI block	Primitive (Surface, Label, Icon, etc)
Visual fingerprint	Distinctive	Generic
Render cost	High (~1s GPU)	Low (no render needed)
Semantic load	Composition	Type + styles

For atoms, semantic similarity matters more than visual similarity:

<button class="btn-primary">Submit</button> and <a class="link-button" role="button">Submit</a> look alike but are semantically different
Conversely: two <div class="surface"> elements can look different (a card vs. a dialog), yet semantically both are A1 Surface

DINOv2 (visual) misses this distinction. A semantic-aware embedding is needed.

Decision

Multilingual E5-small (384 dim) for atom embeddings:

import { pipeline } from '@xenova/transformers';

// Created once and reused; model weights are fetched on first use.
const embedder = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small');
 
// Atom and summarizeStyles are defined elsewhere in the codebase.
function serializeAtomForEmbedding(atom: Atom): string {
  const fields = [
    `type:${atom.type}`,                                               // A1/A2/...
    `tag:${atom.node.tagName}`,                                        // div, button, etc.
    `styles:${summarizeStyles(atom.styles).join(',')}`,               // flex,bg-blue,border
    `children:${atom.children?.map(c => c.type).join(',') || 'none'}`, // composition; 'none' when missing or empty
  ].join(' ');
  // E5 models are trained with a "query: " or "passage: " prefix on every input;
  // "query: " is the prefix recommended for symmetric similarity tasks.
  return `query: ${fields}`;
}
 
async function embedAtom(atom: Atom): Promise<number[]> {
  const serialized = serializeAtomForEmbedding(atom);
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  return Array.from(output.data);  // 384 floats, unit length
}
 
// Batch for efficiency (~300ms for ~50 atoms)
async function batchEmbedAtoms(atoms: Atom[]): Promise<number[][]> {
  const serialized = atoms.map(serializeAtomForEmbedding);
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  // output is a single [atoms.length, 384] tensor; tolist() yields one number[] per atom.
  return output.tolist();
}
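
Because embedAtom requests normalize: true, every vector has unit length, so cosine similarity reduces to a dot product. The following is a minimal in-memory sketch of how an L3 lookup could compare a new atom against catalog embeddings. The CatalogAtom shape, the findBestMatch name, and the minSimilarity parameter are assumptions for illustration; this ADR does not fix an L3 match threshold.

```typescript
// Hypothetical catalog entry; the real schema is not specified in this ADR.
interface CatalogAtom {
  id: string;
  embedding: number[]; // unit-length vector (normalize: true)
}

// For unit-length vectors, cosine similarity equals the dot product.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the most similar catalog atom at or above minSimilarity,
// or null when nothing qualifies (a cache miss).
// minSimilarity is an assumed parameter, not a value from this ADR.
function findBestMatch(
  query: number[],
  catalog: CatalogAtom[],
  minSimilarity: number,
): { id: string; similarity: number } | null {
  let best: { id: string; similarity: number } | null = null;
  for (const entry of catalog) {
    const similarity = dot(query, entry.embedding);
    if (similarity >= minSimilarity && (best === null || similarity > best.similarity)) {
      best = { id: entry.id, similarity };
    }
  }
  return best;
}
```

A linear scan like this is only practical for small catalogs; at larger volumes the same comparison would presumably run inside pgvector.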

Storage: separate pgvector table atom_embeddings_text (384 dim).

Compare with component_embeddings_visual (768 dim, DINOv2, used for Phase 9 uniqueness).
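
As a rough sketch of what the text-embedding table and a nearest-neighbour query could look like: only the table name atom_embeddings_text and the 384 dimension come from this ADR. The column names (atom_id, embedding), the helper names, and the $1/$2 placeholder style (as used by node-postgres) are assumptions. The `<=>` operator is pgvector's cosine distance, so similarity is 1 minus that distance.

```typescript
// Embedding width of multilingual-e5-small, as stated in this ADR.
const ATOM_EMBEDDING_DIM = 384;

// Hypothetical DDL: column names are illustrative only.
const CREATE_ATOM_EMBEDDINGS_TABLE = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS atom_embeddings_text (
  atom_id   text PRIMARY KEY,
  embedding vector(${ATOM_EMBEDDING_DIM}) NOT NULL
);`;

// Formats a vector in pgvector's text input form, e.g. "[0.5,-1,2]".
function toPgVectorLiteral(embedding: number[]): string {
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error('embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds a parameterized query returning the `limit` closest atoms.
// Ordering by cosine distance ascending puts the most similar atoms first.
function buildNearestAtomsQuery(embedding: number[], limit: number): SqlQuery {
  if (embedding.length !== ATOM_EMBEDDING_DIM) {
    throw new Error(`expected ${ATOM_EMBEDDING_DIM} dimensions, got ${embedding.length}`);
  }
  return {
    text: [
      'SELECT atom_id, 1 - (embedding <=> $1::vector) AS similarity',
      'FROM atom_embeddings_text',
      'ORDER BY embedding <=> $1::vector',
      'LIMIT $2',
    ].join('\n'),
    values: [toPgVectorLiteral(embedding), limit],
  };
}
```

Keeping the dimension check next to the query builder makes it harder to send a 768-dim component vector to the 384-dim atom table by mistake.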

Why E5-small wins

Semantic, not visual:
- Catches "button-like" elements regardless of visual style
- Different tags, similar role → cluster together
Cheap: CPU inference, 80MB model, ~50ms per atom (or ~300ms per batch of ~50 atoms)
No render needed: atoms serialize directly to text — no Playwright render cost
Multilingual: handles EN/RU/CN class names uniformly
MIT licensed (Xenova distribution): commercial use OK

Comparison

Approach	Compute	Storage	Quality for atoms
Render atom → DINOv2	High (Playwright per atom)	768 dim	Average (atoms are semantic, not visual)
E5-small text	Low (CPU, 50ms)	384 dim	High (captures semantics)
CodeBERT on AST	Medium	768 dim	Medium (code-aware but heavy)
Hybrid concat (visual+text)	Highest	1152 dim	Best, but overkill for V1

E5-small wins on cost/quality ratio for atom-level matching.

Why separate from component embedding

Component (Phase 9 uniqueness) — visual judgement ("does this look like existing Button?"). DINOv2 appropriate.

Atom (Phase 7 cache L3) — semantic match ("is this A1 Surface comparable?"). E5 appropriate.

Two different embedding spaces, two different pgvector tables. We do not attempt a unified embedding, because the two serve different purposes.

Cache L3 hit rate trajectory

Conservative estimates (pending PoC validation):

Volume	L3 hit
100 URLs	1%
1k	4%
10k	9%
100k	13%

Hit rates are boosted by pre-loaded shadcn/ui atoms (see § VII seeding), which bootstrap an immediate ~10% L3 hit rate.

Atom merging (cosine > 0.97)

A weekly cron job merges near-duplicate atoms: it keeps the canonical atom and aliases the duplicates to it. This reduces the vector count by ~30%.

E5 embeddings are stable enough for consistent cosine comparison. The 0.97 threshold was picked empirically and should be recalibrated after the PoC.
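
The merge step can be sketched as a greedy pass: each atom is compared against the canonicals kept so far and is aliased to the closest one if cosine similarity exceeds the threshold. This is one possible reading of "keep canonical, alias duplicates", not the confirmed algorithm. The EmbeddedAtom and MergeResult shapes and the function names are assumptions for illustration.

```typescript
// Hypothetical input shape for the merge job.
interface EmbeddedAtom {
  id: string;
  embedding: number[];
}

interface MergeResult {
  canonicalIds: string[];
  aliases: Map<string, string>; // duplicate id -> canonical id
}

// Threshold from this ADR; flagged there for recalibration after the PoC.
const MERGE_THRESHOLD = 0.97;

// Full cosine similarity, so inputs need not be pre-normalized.
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dotProduct / denom;
}

// Greedy merge: the first atom of each near-duplicate group becomes canonical.
function mergeNearDuplicates(
  atoms: EmbeddedAtom[],
  threshold: number = MERGE_THRESHOLD,
): MergeResult {
  const canonicals: EmbeddedAtom[] = [];
  const aliases = new Map<string, string>();
  for (const atom of atoms) {
    let bestMatch: EmbeddedAtom | null = null;
    let bestSimilarity = threshold;
    for (const candidate of canonicals) {
      const similarity = cosineSimilarity(atom.embedding, candidate.embedding);
      if (similarity > bestSimilarity) {
        bestSimilarity = similarity;
        bestMatch = candidate;
      }
    }
    if (bestMatch !== null) {
      aliases.set(atom.id, bestMatch.id);
    } else {
      canonicals.push(atom);
    }
  }
  return { canonicalIds: canonicals.map((c) => c.id), aliases };
}
```

The result depends on input order, and the pairwise scan grows with the number of canonicals, so a production job would likely lean on a pgvector index instead.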

Resource math

Volume	Atoms total	Storage
100 URLs × ~5 atoms = 500	500 vectors	750KB pgvector
1k URLs × ~5 = 5k	5k	7.5MB
10k × ~5 = 50k (post-merge ~35k)	35k	52MB
100k × ~5 = 500k (post-merge ~350k)	350k	525MB

At 100k URLs/month: 525MB of pgvector storage for atoms. Fits the Supabase $25/month tier.
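
The table figures follow from 384 float32 values per vector, which is 1,536 bytes or 1.5KB. The sketch below reproduces that arithmetic under the table's own assumptions (~5 atoms per URL, ~30% reduction after merging). It counts raw vector payload only; index and row overhead are not included. The function and constant names are hypothetical.

```typescript
const DIMENSIONS = 384;        // E5-small output width
const BYTES_PER_FLOAT = 4;     // float32
const ATOMS_PER_URL = 5;       // "~5 atoms" from the table
const MERGE_KEEP_PERCENT = 70; // merging removes ~30% of vectors

interface StorageEstimate {
  vectors: number;
  kilobytes: number;
  megabytes: number;
}

// Raw vector storage for a given number of imported URLs.
// Units mirror the table: 1.5KB per vector (1,536 / 1,024),
// then 1,000KB per MB.
function estimateAtomStorage(urls: number, afterMerge: boolean): StorageEstimate {
  const rawVectors = urls * ATOMS_PER_URL;
  const vectors = afterMerge
    ? Math.round((rawVectors * MERGE_KEEP_PERCENT) / 100)
    : rawVectors;
  const kilobytes = (vectors * DIMENSIONS * BYTES_PER_FLOAT) / 1024;
  return { vectors, kilobytes, megabytes: kilobytes / 1000 };
}
```

For example, estimateAtomStorage(100000, true) yields 350,000 vectors and 525MB, which matches the last table row.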

Consequences

Pros:

Semantic atom matching — better cache hits than visual
CPU inference, no GPU needed
~10× cheaper than DINOv2 rendering path
Multilingual support free

Cons:

Two embedding pipelines (E5 for atoms, DINOv2 for components): added operational complexity
- Mitigated: both are standalone and CPU-based, so there is no GPU contention
Different vector dimensions (384 vs 768): cannot cross-search
- Acceptable: they serve different purposes

PoC validation criteria (Task #6)

Atom embedding quality needs to be validated (see the PoC section of the atoms doc):

Embedding clustering: same-type atoms cluster (silhouette score > 0.5)
Match accuracy: known similar atoms (e.g. 5 different Button atoms) → top-K neighbors include peers
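
Both criteria can be computed directly from labeled embeddings. The sketch below treats atom.type as the cluster label and uses cosine distance on unit-length vectors. The TypedAtom shape, the function names, and the choice of "at least one same-type peer in the top K" as the match-accuracy measure are assumptions; the atoms doc may define the metric differently.

```typescript
// Hypothetical labeled sample for the PoC.
interface TypedAtom {
  type: string;        // e.g. "A1"
  embedding: number[]; // unit-length vector
}

// Cosine distance for unit-length vectors: 1 - dot product.
function cosineDistance(a: number[], b: number[]): number {
  let dotProduct = 0;
  for (let i = 0; i < a.length; i++) dotProduct += a[i] * b[i];
  return 1 - dotProduct;
}

// Mean silhouette score over all atoms, clustered by type.
// Atoms in singleton clusters, or with no other cluster present, contribute 0.
function silhouetteScore(atoms: TypedAtom[]): number {
  if (atoms.length === 0) return 0;
  let total = 0;
  for (let i = 0; i < atoms.length; i++) {
    const current = atoms[i];
    const perType = new Map<string, { sum: number; count: number }>();
    for (let j = 0; j < atoms.length; j++) {
      if (i === j) continue;
      const entry = perType.get(atoms[j].type) ?? { sum: 0, count: 0 };
      entry.sum += cosineDistance(current.embedding, atoms[j].embedding);
      entry.count += 1;
      perType.set(atoms[j].type, entry);
    }
    const own = perType.get(current.type);
    if (own === undefined) continue;
    const a = own.sum / own.count; // mean distance within own cluster
    let b = Infinity;              // mean distance to nearest other cluster
    perType.forEach((value, type) => {
      if (type !== current.type) b = Math.min(b, value.sum / value.count);
    });
    if (b === Infinity) continue;
    const denom = Math.max(a, b);
    total += denom === 0 ? 0 : (b - a) / denom;
  }
  return total / atoms.length;
}

// Fraction of atoms whose k nearest neighbours include a same-type peer.
function topKPeerRate(atoms: TypedAtom[], k: number): number {
  if (atoms.length === 0) return 0;
  let hits = 0;
  for (let i = 0; i < atoms.length; i++) {
    const current = atoms[i];
    const neighbours = atoms
      .map((other, j) => ({ j, type: other.type, d: cosineDistance(current.embedding, other.embedding) }))
      .filter((n) => n.j !== i)
      .sort((x, y) => x.d - y.d)
      .slice(0, k);
    if (neighbours.some((n) => n.type === current.type)) hits++;
  }
  return hits / atoms.length;
}
```

Under this reading, the first criterion passes when silhouetteScore(sample) exceeds 0.5.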

If validation fails, fall back to CodeBERT (heavier but code-aware).

Alternatives rejected

A. DINOv2 atom embedding (initial draft)

❌ Requires render per atom (Playwright overhead)
❌ Visual similarity misses semantic equivalence
❌ Cost ×10 vs E5 text

B. CodeBERT (code-aware embedding)

❌ 110MB model, slower than E5-small (~80MB)
❌ Marginally better quality vs cost increase
❌ Less multilingual

C. OpenAI embeddings (text-embedding-3-small)

❌ Paid API ($0.020/1M tokens) — defeats free-tier philosophy
❌ ToS risks for downstream training (see ADR 0015)

D. Hybrid concat (text + visual)

❌ 1152 dim storage cost ×3
❌ Marginal quality gain does not justify the complexity
❌ Can be added later if the PoC shows a benefit

Cross-references

Main spec § VII: atom overview
Atom doc: full algorithm + PoC criteria
ADR 0012: DINOv2 for component embedding (separate)
ADR 0013: cache L3 as cost reduction multiplier
