- Date: 2026-05-22
- Status: Accepted
- Feature: URL-import (atoms — cache L3)
- Affects: url_import_spec.md § VII.3, § VIII
Context
Cache L3 (see ADR 0013, multiplier 1) matches atoms against the catalog. This requires an embedding function for atoms.
Question: which embedding model should be used for atoms?
The initial draft assumed DINOv2 (the same model as the Phase 9 component uniqueness check). But atoms are not the same kind of entity as components:
| Aspect | Component | Atom |
|---|---|---|
| Granularity | Full UI block | Primitive (Surface, Label, Icon, etc) |
| Visual fingerprint | Distinctive | Generic |
| Render cost | High (~1s GPU) | Low (no render needed) |
| Semantic load | Composition | Type + styles |
For atoms, semantic similarity matters more than visual similarity:
- `<button class="btn-primary">Submit</button>` and `<a class="link-button" role="button">Submit</a>` are visually similar but semantically different
- Conversely, two `<div class="surface">` elements can differ visually (cards vs dialogs) but are semantically the same: both are A1 Surface
DINOv2 (visual) misses this distinction. A semantic-aware embedding is needed.
Decision
Multilingual E5-small (384 dim) for atom embeddings:
import { pipeline } from '@xenova/transformers';

// Loaded once per process; top-level await requires an ES module context.
const embedder = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small');

function serializeAtomForEmbedding(atom: Atom): string {
  // An empty children array serializes as 'none' (join() alone would yield '').
  const children = atom.children?.length
    ? atom.children.map(c => c.type).join(',')
    : 'none';
  return [
    `type:${atom.type}`, // A1/A2/...
    `tag:${atom.node.tagName}`, // div, button, etc
    `styles:${summarizeStyles(atom.styles).join(',')}`, // flex,bg-blue,border
    `children:${children}` // composition
  ].join(' ');
}

async function embedAtom(atom: Atom): Promise<number[]> {
  const serialized = serializeAtomForEmbedding(atom);
  // Note: the E5 model card recommends a "query: " / "passage: " input prefix; not applied here.
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array); // 384 floats
}

// Batch for efficiency (~300ms for ~50 atoms)
async function batchEmbedAtoms(atoms: Atom[]): Promise<number[][]> {
  const serialized = atoms.map(serializeAtomForEmbedding);
  const output = await embedder(serialized, { pooling: 'mean', normalize: true });
  // output is a single Tensor with dims [atoms.length, 384]; tolist() yields one row per atom.
  return output.tolist() as number[][];
}

Storage: separate pgvector table atom_embeddings_text (384 dim).
- vs component_embeddings_visual (768 dim, DINOv2, for Phase 9 uniqueness)
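The Decision code references an Atom type and a summarizeStyles helper that this ADR does not define. The sketch below fills them in with hypothetical shapes, only to show what the serialized text fed to E5 might look like. The field layout, the style-to-token mapping, and the lowercasing of tag names are all assumptions, not the project's actual definitions.

```typescript
// Hypothetical shape: the ADR does not define Atom.
interface Atom {
  type: string; // e.g. "A1" (Surface), "A2", ...
  node: { tagName: string };
  styles: Record<string, string>; // assumed: a subset of computed CSS
  children?: Atom[];
}

// Assumed summary: map a few CSS properties to coarse tokens,
// sorted so that property order does not change the serialized text.
function summarizeStyles(styles: Record<string, string>): string[] {
  const tokens: string[] = [];
  if (styles["display"] === "flex") tokens.push("flex");
  if (styles["background-color"]) tokens.push(`bg-${styles["background-color"]}`);
  if (styles["border-width"] && styles["border-width"] !== "0px") tokens.push("border");
  return tokens.sort();
}

function serializeAtomForEmbedding(atom: Atom): string {
  const children = atom.children?.length
    ? atom.children.map((c) => c.type).join(",")
    : "none";
  return [
    `type:${atom.type}`,
    `tag:${atom.node.tagName.toLowerCase()}`, // assumption: normalize DOM's uppercase tag names
    `styles:${summarizeStyles(atom.styles).join(",")}`,
    `children:${children}`,
  ].join(" ");
}

const card: Atom = {
  type: "A1",
  node: { tagName: "DIV" },
  styles: { display: "flex", "background-color": "blue", "border-width": "1px" },
  children: [{ type: "A2", node: { tagName: "SPAN" }, styles: {} }],
};

const serializedCard = serializeAtomForEmbedding(card);
// serializedCard === "type:A1 tag:div styles:bg-blue,border,flex children:A2"
```

Sorting the style tokens keeps the text stable across runs, which matters because near-identical atoms should produce near-identical embeddings.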
Why E5-small wins
- Semantic, not visual:
  - Catches "button-like" elements regardless of visual style
  - Different tags, similar role → cluster together
- Cheap: CPU inference, 80MB model, ~50ms per atom (or ~300ms per batch of ~50 atoms)
- No render needed: atoms serialize directly to text, so there is no Playwright render cost
- Multilingual: handles EN/RU/CN class names uniformly
- MIT licensed (Xenova distribution): commercial use OK
Comparison
| Approach | Compute | Storage | Quality for atoms |
|---|---|---|---|
| Render atom → DINOv2 | High (Playwright per atom) | 768 dim | Average (atoms are semantic, not visual) |
| E5-small text | Low (CPU, 50ms) | 384 dim | High (catches semantic) |
| CodeBERT on AST | Medium | 768 dim | Medium (code-aware but heavy) |
| Hybrid concat (visual+text) | Highest | 1152 dim | Best, overkill for V1 |
E5-small wins on cost/quality ratio for atom-level matching.
Why separate from component embedding
Component (Phase 9 uniqueness) — visual judgement ("does this look like existing Button?"). DINOv2 appropriate.
Atom (Phase 7 cache L3) — semantic match ("is this A1 Surface comparable?"). E5 appropriate.
Two different embedding spaces, two different pgvector tables. We do not attempt a unified embedding because the two serve different purposes.
Cache L3 hit rate trajectory
Conservative estimates (pending PoC validation):
| Volume | L3 hit |
|---|---|
| 100 URLs | 1% |
| 1k | 4% |
| 10k | 9% |
| 100k | 13% |
Boosted by pre-loaded shadcn/ui atoms (see § VII seeding), which bootstrap an immediate ~10% L3 hit rate.
Atom merging (cosine > 0.97)
Weekly cron merges near-duplicate atoms → keep canonical, alias duplicates. Reduces vector count ~30%.
E5 embeddings are stable enough for consistent cosine comparison. The 0.97 threshold was picked empirically; recalibrate after the PoC.
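The merge pass described above can be sketched as a greedy scan. This is a minimal illustration, not the actual cron job: the EmbeddedAtom shape and mergeNearDuplicates name are hypothetical, and it assumes vectors are already L2-normalized (as the embedder's normalize: true option produces), so the dot product equals cosine similarity.

```typescript
// Hypothetical record shape for this sketch.
interface EmbeddedAtom {
  id: string;
  vector: number[]; // assumed L2-normalized
}

// For L2-normalized vectors the dot product equals cosine similarity.
function cosine(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Greedy pass: the first atom seen in a neighbourhood becomes canonical;
// any later atom with cosine > threshold to a canonical becomes its alias.
// Returns a map from atom id to canonical id (canonicals map to themselves).
function mergeNearDuplicates(
  atoms: EmbeddedAtom[],
  threshold = 0.97,
): Map<string, string> {
  const canonicals: EmbeddedAtom[] = [];
  const canonicalOf = new Map<string, string>();
  for (const atom of atoms) {
    const match = canonicals.find((c) => cosine(c.vector, atom.vector) > threshold);
    if (match) {
      canonicalOf.set(atom.id, match.id);
    } else {
      canonicals.push(atom);
      canonicalOf.set(atom.id, atom.id);
    }
  }
  return canonicalOf;
}

const merged = mergeNearDuplicates([
  { id: "btn-1", vector: [1, 0] },
  { id: "btn-2", vector: [0.999, 0.0447] }, // cosine to btn-1 is 0.999
  { id: "surface-1", vector: [0, 1] },
]);
```

The scan is quadratic in the worst case, so at the volumes in the resource table a real job would more likely lean on a pgvector nearest-neighbour query per atom; that choice is outside what this ADR specifies.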
Resource math
| Volume (URLs × ~5 atoms each) | Vectors stored | Storage (pgvector) |
|---|---|---|
| 100 URLs → 500 atoms | 500 | 750KB |
| 1k URLs → 5k atoms | 5k | 7.5MB |
| 10k URLs → 50k atoms (post-merge ~35k) | 35k | 52MB |
| 100k URLs → 500k atoms (post-merge ~350k) | 350k | 525MB |
Assumes ~1.5KB per vector (384 float32 values). At 100k URLs/month: 525MB of pgvector storage for atoms. Fits the Supabase $25/month tier.
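The storage figures follow from 384 dimensions × 4 bytes per float32 = 1536 bytes per vector. The sketch below reproduces that arithmetic; estimateStorageBytes and its parameters are hypothetical names, and it counts raw vector payload only (per-row overhead and index size are not included, so real usage will be somewhat higher).

```typescript
const DIMENSIONS = 384;
const BYTES_PER_FLOAT32 = 4;
const BYTES_PER_VECTOR = DIMENSIONS * BYTES_PER_FLOAT32; // 1536 bytes, i.e. 1.5 KiB

// Raw vector payload for a given URL volume.
// mergeReduction is the fraction of vectors removed by the weekly merge (0.3 for ~30%).
function estimateStorageBytes(
  urls: number,
  atomsPerUrl = 5,
  mergeReduction = 0,
): number {
  const vectors = Math.round(urls * atomsPerUrl * (1 - mergeReduction));
  return vectors * BYTES_PER_VECTOR;
}

const smallVolume = estimateStorageBytes(100); // 500 vectors -> 768000 bytes (750 KiB)
const largeVolume = estimateStorageBytes(100_000, 5, 0.3); // 350000 vectors -> 537600000 bytes
```

The larger table entries round 1536 bytes down to 1.5KB per vector, which is why 350k vectors appear as 525MB there rather than the ~538MB computed here.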
Consequences
Pros:
- Semantic atom matching — better cache hits than visual
- CPU inference, no GPU needed
- ~10× cheaper than DINOv2 rendering path
- Multilingual support free
Cons:
- 2 embedding pipelines (E5 for atoms, DINOv2 for components): added operational complexity
  - Mitigated: both standalone CPU-based, no GPU contention
- Different vector dimensions (384 vs 768): cannot cross-search
  - Acceptable: they serve different purposes
PoC validation criteria (Task #6)
Atom embedding quality must be validated (see the PoC section of the atoms doc):
- Embedding clustering: same-type atoms cluster (silhouette score > 0.5)
- Match accuracy: known similar atoms (e.g. 5 different Button atoms) → top-K neighbors include peers
If validation fails → fall back to CodeBERT (heavier but code-aware).
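The clustering criterion can be checked with a plain silhouette computation over the atom embeddings, using atom type as the cluster label. The sketch below is one possible way to compute it, assuming L2-normalized vectors and cosine distance (1 − cosine similarity); the function names are hypothetical and the choice of distance is an assumption, since the ADR does not state one.

```typescript
// Cosine distance for L2-normalized vectors.
function cosineDistance(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  return 1 - dot;
}

// Mean silhouette over all points.
// For point i: a = mean distance to other points with the same label,
// b = smallest mean distance to the points of any other label,
// s(i) = (b - a) / max(a, b). Singleton clusters contribute 0.
function silhouetteScore(vectors: number[][], labels: string[]): number {
  const n = vectors.length;
  if (n === 0) return 0;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const sums = new Map<string, number>();
    const counts = new Map<string, number>();
    for (let j = 0; j < n; j++) {
      if (i === j) continue;
      const label = labels[j];
      const d = cosineDistance(vectors[i], vectors[j]);
      sums.set(label, (sums.get(label) ?? 0) + d);
      counts.set(label, (counts.get(label) ?? 0) + 1);
    }
    const own = labels[i];
    const ownCount = counts.get(own) ?? 0;
    if (ownCount === 0) continue; // singleton cluster
    const a = (sums.get(own) ?? 0) / ownCount;
    let b = Infinity;
    sums.forEach((sum, label) => {
      if (label !== own) b = Math.min(b, sum / (counts.get(label) ?? 1));
    });
    if (!Number.isFinite(b)) continue; // only one cluster present
    const denom = Math.max(a, b);
    total += denom === 0 ? 0 : (b - a) / denom;
  }
  return total / n;
}

// Toy data: two perfectly separated types.
const toyVectors = [[1, 0], [1, 0], [0, 1], [0, 1]];
const wellSeparated = silhouetteScore(toyVectors, ["A1", "A1", "A2", "A2"]); // 1
const mislabeled = silhouetteScore(toyVectors, ["A1", "A2", "A1", "A2"]); // -0.5
```

On real data the score would be computed over a labeled sample of atoms and compared against the 0.5 threshold named above.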
Alternatives rejected
A. DINOv2 atom embedding (initial draft)
- ❌ Requires render per atom (Playwright overhead)
- ❌ Visual similarity misses semantic equivalence
- ❌ Cost ×10 vs E5 text
B. CodeBERT (code-aware embedding)
- ❌ 110MB model, slower than E5-small (~80MB)
- ❌ Marginally better quality vs cost increase
- ❌ Less multilingual
C. OpenAI embeddings (text-embedding-3-small)
- ❌ Paid API ($0.020/1M tokens) — defeats free-tier philosophy
- ❌ ToS risks for downstream training (see ADR 0015)
D. Hybrid concat (text + visual)
- ❌ 1152 dim storage cost ×3
- ❌ Marginal quality gain does not justify the complexity
- ❌ Can be added later if the PoC shows a benefit
Cross-references
- Main spec § VII — atom overview
- Atom doc — full algorithm + PoC criteria
- ADR 0012: DINOv2 for component embedding (separate)
- ADR 0013 — cache L3 as cost reduction multiplier