Purpose: structured Q&A pack for a privacy lawyer / GDPR specialist. Estimated cost: $500-1000 for a one-time review (per § XV blocker #5). Status: awaiting external review. Internal due diligence complete.
TL;DR for the lawyer
ARNO is about to launch a URL-import feature: at registration, a new user provides a URL → ARNO extracts React components from the public page → the user edits them in the editor → optional push to git.
Three legal concerns:
- Shadow data collection for AI model training (anonymized)
- Copyright attestation for user-imported URLs
- GDPR compliance for extracted data + retention
Pre-drafted materials:
- url_import_tos_clause_draft.md — proposed ToS/Privacy clauses (5 sections)
- url_import_spec.md § XI.3-XI.4 — technical implementation
We seek validation that this approach is GDPR/CCPA-compliant, plus identification of any gaps.
Q1. GDPR legitimate interest basis for shadow data collection
Pattern: the industry-standard approach (Linear, Figma, Notion, GitHub Copilot, Cursor) is anonymized data collection that is ON by default, disclosed in the ToS, with an opt-out in Settings.
ARNO implementation:
- URLs hashed via SHA-256 (irreversible)
- Component metadata stripped of PII (user IDs, email, auth tokens, PII in query params)
- Stored separately from user-identifiable account data
- Disclosure: ToS + Privacy Policy section at registration
- Opt-out: Settings → Privacy → "Anonymous data contribution" toggle
- Right to erasure: "Delete past contributions" button → cascading delete
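The stripping and hashing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not ARNO's actual code: the function names `strip_pii_params` and `hash_url` and the `SENSITIVE_PARAMS` deny-list are hypothetical, and the real set of stripped parameters is not specified in this document.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list; the actual PII-bearing parameter names are an assumption.
SENSITIVE_PARAMS = {"email", "user_id", "uid", "token", "auth", "session"}


def strip_pii_params(url: str) -> str:
    """Drop PII-looking query parameters, embedded credentials, and the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    # parts.hostname is already lowercased and excludes any "user:pass@" prefix.
    host = parts.hostname or ""
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme.lower(), netloc, parts.path, urlencode(kept), ""))


def hash_url(url: str) -> str:
    """Return the SHA-256 hex digest of the cleaned URL."""
    cleaned = strip_pii_params(url)
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
```

Cleaning before hashing means two URLs that differ only in stripped parameters produce the same digest, so the stored hash carries no trace of those parameters.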
Specific questions:
- Is the GDPR Art. 6(1)(f) legitimate interest basis appropriate? Or do we need Art. 6(1)(a) explicit consent (= registration checkbox)?
- Is our "anonymization" sufficient, or does it amount only to pseudonymization as defined in GDPR Art. 4(5)? We hash URLs (one-way), and the hashed URLs are linked in shadow_dataset records. The records contain no user_id or PII. But an attacker who had the original URL could rehash it and match.
- Balancing test documentation: what specific evidence shows that the legitimate interest outweighs data subjects' rights?
- EU AI Act (effective Aug 2026): does the training data provenance disclosure requirement apply here? Our model improves component extraction; it is not general-purpose AI.
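The rehash-and-match concern raised in the second question can be made concrete with a short sketch. The `plain_hash` and `matches` helpers mirror the unsalted SHA-256 approach described above; `keyed_hash` (HMAC with a server-side secret) is not part of ARNO's described implementation and is shown only as an option a reviewer might raise. Whether either variant changes the legal classification is the question for counsel.

```python
import hashlib
import hmac


def plain_hash(url: str) -> str:
    """Unsalted SHA-256, as described in the implementation list."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def matches(candidate_url: str, stored_hash: str) -> bool:
    """Anyone holding a candidate URL can test it against a stored hash."""
    return hmac.compare_digest(plain_hash(candidate_url), stored_hash)


def keyed_hash(url: str, secret_key: bytes) -> str:
    """Hypothetical alternative: HMAC-SHA256 with a secret held only by the server."""
    return hmac.new(secret_key, url.encode("utf-8"), hashlib.sha256).hexdigest()


stored = plain_hash("https://example.com/pricing")
print(matches("https://example.com/pricing", stored))  # True: the input is re-derivable
print(matches("https://example.com/other", stored))    # False
```

With the keyed variant, recomputing a digest requires the secret, so a third party who knows the URL but not the key cannot perform the match.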
Q2. Copyright attestation for user-imported URLs
Pattern: when importing a URL, the user attests that the content is (a) owned, (b) licensed, or (c) public domain.
ARNO implementation:
- Mandatory checkbox on URL submit
- Required license_type selection
- Additional confirmation for Tranco top 15k commercial domains
- Attestation stored 7 years (legal audit)
- ARNO is not liable for third-party copyright claims
Specific questions:
- Is user attestation sufficient legal protection? Or do we need a DMCA registered agent + safe harbor compliance?
- For "fair use" of extracted UI patterns (e.g. a small business imports apple.com by mistake):
  - Is design pattern extraction "transformative" under fair use?
  - Or does it constitute a derivative work requiring permission?
- Indemnification clause language, current draft: "You agree to indemnify ARNO against any third-party claims related to content you import." Sufficient? Or do we need stronger language?
- Cross-jurisdictional issues: a user in the EU imports a US-hosted site. Whose law governs?
Q3. Domain reputation system (Tranco-based)
Pattern: Tranco top 15k domains flagged as "commercial" → require extra user confirmation.
ARNO implementation:
- Daily refresh Bloom filter (memory ~10KB)
- 1% false positive rate (Bloom filter inherent)
- Attribution: "Domain classification powered by Tranco (tranco-list.eu), licensed under CC-BY 4.0" in the Privacy Policy footer
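The memory and false-positive figures above can be checked against the standard Bloom filter sizing formulas, $m = -n \ln p / (\ln 2)^2$ bits and $k = (m/n) \ln 2$ hash functions. The sketch below is plain arithmetic; the function names are illustrative and not taken from the spec. Under these formulas, 15,000 entries at a 1% false-positive rate need roughly 18 KB, so the ~10KB figure may be worth rechecking against the spec before it is quoted to the lawyer.

```python
import math


def bloom_parameters(n_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_items * math.log(2)))
    return bits, hash_count


def expected_false_positive_rate(n_items: int, bits: int, hash_count: int) -> float:
    """Approximate false-positive rate for a filter holding n_items entries."""
    return (1 - math.exp(-hash_count * n_items / bits)) ** hash_count


bits, hash_count = bloom_parameters(15_000, 0.01)
print(f"{bits / 8 / 1000:.1f} KB, {hash_count} hash functions")

# What a filter of about 10 KB (80,000 bits) would give for the same 15,000 entries:
print(f"{expected_false_positive_rate(15_000, 80_000, 4):.1%}")
```

The second figure comes out at several percent rather than 1%, which matters for Q3 because the false-positive rate determines how often a non-listed domain triggers the extra confirmation.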
Specific questions:
- Tranco CC-BY 4.0 attribution: is placement in the Privacy Policy footer sufficient? Or does it need to be more prominent (e.g. on each ARNO dashboard page)?
- Defamation risk: ARNO flags mywebsite.com as "commercial" → a false positive embarrasses the user. Liability?
- GDPR Art. 22 automated decision-making: does the warning constitute a "decision" affecting the user?
Q4. GDPR retention matrix
ARNO implementation (from spec § XI.4):
| Data | Retention |
|---|---|
| Component files (TSX, types, tokens) | User lifetime |
| Original screenshots | 90 days |
| Screenshot hash + DINOv2 binary (not recoverable) | 90-365 days |
| Screenshot metadata only | 365+ days |
| Staging area (active) | < 90 days since last_activity |
| Staging area (notified) | 90-120 days, email "30 days until deletion" |
| Staging area (deleted) | 120 days (physical) |
| Shadow url_hash | 2 years |
| Shadow corrections | Anonymized after 90 days |
| User attestation | 7 years |
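The staging-area and screenshot lifecycles in the matrix can be expressed as simple age checks. This is a sketch, not the spec's implementation: the function names and state labels are hypothetical, and the boundary handling (day 90 counts as "notified", day 120 as "deleted") is an assumption, since the table's ranges overlap at the edges.

```python
from datetime import datetime, timedelta, timezone


def staging_state(last_activity: datetime, now: datetime) -> str:
    """Classify a staging-area item by whole days since last_activity."""
    age_days = (now - last_activity).days
    if age_days < 90:
        return "active"
    if age_days < 120:
        return "notified"  # the "30 days until deletion" email window
    return "deleted"  # due for physical deletion


def screenshot_state(created_at: datetime, now: datetime) -> str:
    """Classify what is still retained for a screenshot of a given age."""
    age_days = (now - created_at).days
    if age_days < 90:
        return "original"
    if age_days < 365:
        return "hash_and_dinov2_binary"
    return "metadata_only"


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(staging_state(now - timedelta(days=100), now))  # notified
```

Writing the thresholds down this way also makes the boundary question explicit, which is useful if the Privacy Policy has to state exact retention periods.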
Specific questions:
- Right-to-erasure (Art. 17) timeline: the current promise is "30 days". Is this acceptable, or is a stricter deadline needed?
- Is hashed data "personal data" under GDPR? Some interpretations: a hash with a user-known input = pseudonymization, not anonymization. Position needed.
- "Legitimate business need" for 7-year attestation retention: sufficient? Or do we need a shorter timeframe?
- Cascade deletion across B2 + DB: saga pattern with retries. Is the 30-day deletion window achievable? Or do we need a stricter SLA?
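For context on the cascade-deletion question, a retrying deletion sequence might look like the sketch below. It assumes each step is idempotent (safe to repeat) and, because deletions cannot be undone, it has no compensating actions; a failed run is simply resumed later. All names (`delete_user_data`, `DeletionFailed`, the step labels) are hypothetical and not taken from the spec.

```python
import time
from typing import Callable


class DeletionFailed(Exception):
    """Raised when a deletion step still fails after all retries."""


def run_with_retries(step: Callable[[], None], attempts: int, base_delay: float) -> None:
    """Run one step, retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            step()
            return
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def delete_user_data(
    steps: list[tuple[str, Callable[[], None]]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> list[str]:
    """Run deletion steps in order and return the names of completed steps."""
    completed: list[str] = []
    for name, step in steps:
        try:
            run_with_retries(step, attempts, base_delay)
        except Exception as exc:
            raise DeletionFailed(
                f"step {name!r} failed after {attempts} attempts; completed: {completed}"
            ) from exc
        completed.append(name)
    return completed
```

A caller would pass something like `[("b2_objects", delete_b2_objects), ("db_rows", delete_db_rows)]`, where both callables are placeholders. The legally relevant point is that a failed run leaves a record of what was and was not deleted, which can be measured against the 30-day promise.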
Q5. Third-party processors disclosure
Services receiving user data:
| Service | What | Why |
|---|---|---|
| Google (Gemini API) | Anonymized component metadata | Text/vision analysis |
| Cerebras | TSX code generation requests | Code generation |
| Modal Labs | Hosted compute (analysis workload) | GPU inference |
| Backblaze B2 | Staging area files + shadow data | Storage |
| Hyperbrowser | Target URL (when anti-bot detected) | Browser automation |
| Liveblocks | ARNO editor real-time state | Collaboration |
Specific questions:
- GDPR Art. 28 data processing agreements (DPAs): needed for each one? We have standard DPAs available from each vendor. Sufficient to reference?
- Schrems II / EU-US Data Privacy Framework: all of these services are US-based. For EU users, do we need SCCs (Standard Contractual Clauses) or DPF certification?
- Sub-processor disclosure: must we list each vendor publicly in the Privacy Policy?
- Data localization claims: can we offer "EU data residency"? Currently no: all processing is in the US. Impact for EU customers?
Q6. EU AI Act compliance (effective Aug 2026)
ARNO's AI usage:
- Production: Gemini Flash-Lite for vision/text analysis (third-party)
- Future (Q2+): self-hosted Qwen3-VL-32B + LoRA fine-tuned on user corrections
Specific questions:
- Is ARNO's URL-import an "AI system" under the AI Act? We extract structured data; we do not make decisions about people.
- Training data provenance: Qwen3-VL-235B-A22B teacher labels, Apache 2.0 license. We don't train on Anthropic/OpenAI outputs. Does this comply with Art. 53 (training data disclosure)?
- High-risk category check against Annex III categories. We don't impact:
  - Biometric ID
  - Critical infrastructure
  - Education / employment
  - Law enforcement
  - Migration
  - Justice

  Compliance: low-risk system, basic transparency only?
Q7. California (CCPA / CPRA)
Specific questions:
- Notice at Collection: is the current ToS clause sufficient for the "categories of personal information" disclosure?
- "Sale" of personal information: we don't sell, but does sharing anonymized data with the Gemini API count?
- "Sensitive Personal Information" under CPRA: do we collect any? (Probably not, but please verify.)
- GPC (Global Privacy Control) signal honoring: are we required to implement it?
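For reference on the GPC question: browsers that send the signal do so with the `Sec-GPC: 1` request header. A minimal server-side check might look like the sketch below. Mapping the signal onto the "Anonymous data contribution" toggle is an assumption made for illustration; whether GPC must apply to that setting is exactly what counsel is being asked. The function names are hypothetical.

```python
def gpc_opt_out_requested(headers: dict[str, str]) -> bool:
    """Return True when the request carries the Global Privacy Control signal."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("sec-gpc") == "1"


def effective_contribution_setting(user_toggle_on: bool, headers: dict[str, str]) -> bool:
    """Assumed policy: a GPC signal overrides a default-ON contribution toggle."""
    return user_toggle_on and not gpc_opt_out_requested(headers)


print(effective_contribution_setting(True, {"Sec-GPC": "1"}))  # False
print(effective_contribution_setting(True, {}))                # True
```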
Q8. Children's data (COPPA, GDPR-K)
ARNO does not target minors. Users attest that they are adults during registration.
Question: do we need an explicit COPPA / GDPR-K disclaimer statement, or is a general ToS clause enough?
Q9. Liability limitations
Specific questions:
- "As-is" disclaimer language in the current ToS draft: enforceable in all jurisdictions?
- Limitation of liability cap: typically $X or 12 months of fees? What is the recommendation for ARNO (small-business target market)?
- Class action waiver / arbitration clauses: necessary for V1 launch?
Q10. Privacy Policy structure recommendation
Current draft has 5 clauses (Data Collection, URL Import, Tranco Attribution, Retention, Third-party processors).
Question: is the structure adequate, or are additional sections needed?
Standard GDPR Privacy Policy checklist (from regulators):
- Identity + contact details of controller
- DPO contact (if applicable)
- Purposes of processing + legal basis
- Recipients of personal data
- International transfers + safeguards
- Retention periods
- Data subject rights (access, rectification, erasure, etc)
- Right to withdraw consent
- Right to lodge complaint with supervisory authority
- Whether disclosure is a statutory or contractual requirement
- Automated decision-making + profiling
- Categories of personal data (if not collected from data subject)
Which are currently covered, and which are missing?
Recommended deliverables from lawyer
- Validated ToS clauses (1-5): markup or rewrite
- Validated Privacy Policy: full document for production
- Risk assessment: biggest legal exposures + mitigation priority
- Standard contract templates: DPA template for potential enterprise customers
- 30-day post-launch follow-up: track regulatory changes (EU AI Act effective date specifically)
ARNO-side preparation for the legal call
The lawyer should review BEFORE the call:
- url_import_spec.md — full technical spec (especially § XI.3-XI.4)
- url_import_tos_clause_draft.md — draft clauses
- This document — Q&A pack
Vadim should be available for:
- Specific scenarios discussion (edge cases)
- Stack decisions justification (why Tranco, why default-ON, etc)
- Cost-benefit trade-offs (what compliance costs are acceptable for V1)
Cross-references
- url_import_spec.md § XV — blocker tracking
- url_import_tos_clause_draft.md — pre-drafted clauses
- ADR 0017 — shadow data approach rationale
- ADR 0016 — staging area (retention implications)