01
Pipeline
file → vector → collection
┌──────┐    ┌─────────┐    ┌────────┐    ┌──────────┐    ┌────────────┐
│ FILE │ ─► │ EXTRACT │ ─► │ CHUNK  │ ─► │ EMBED    │ ─► │ STORE      │
│ pdf  │    │ text    │    │ ~512t  │    │ vec[1024]│    │ qdrant     │
│ md   │    │ utf-8   │    │ +64 ov │    │ ollama   │    │ /proj_id   │
│ txt  │    │         │    │        │    │ qwen3    │    │ cosine     │
└──────┘    └─────────┘    └────────┘    └──────────┘    └────────────┘
▌ ndjson · receive · extract · chunk · embed · indexed · done ─────►
Each upload passes through five stages, every one emitting a typed
ndjson event back to the client. The activity
feed is a direct render of that stream — there is no polling, no
secondary state, no out-of-band status. New stages
(graph extraction · dedup · summary) hang
off the same wire and surface in the same UI without rewiring.
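A minimal sketch of how the stages above might emit their events, assuming a sliding-window chunker of about 512 tokens with a 64-token overlap. The function names, the `type` key, and the per-event fields are hypothetical, a whitespace split stands in for a real tokenizer, and the embed and store stages are stubs that make no Ollama or Qdrant calls.

```python
import json
from typing import Iterator


def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Sliding window: each chunk starts (size - overlap) tokens after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def ingest_events(text: str, size: int = 512, overlap: int = 64) -> Iterator[str]:
    """Yield one NDJSON line per stage (hypothetical field names)."""

    def line(event_type: str, **fields: int) -> str:
        return json.dumps({"type": event_type, **fields})

    yield line("receive", bytes=len(text.encode("utf-8")))
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    yield line("extract", tokens=len(tokens))
    chunks = chunk_tokens(tokens, size, overlap)
    yield line("chunk", chunks=len(chunks))
    yield line("embed", vectors=len(chunks))   # stub: no embedding call
    yield line("indexed", points=len(chunks))  # stub: no vector store call
    yield line("done")


if __name__ == "__main__":
    for ndjson_line in ingest_events("word " * 1000):
        print(ndjson_line)
```

Because every line is a complete JSON object, a client can render the feed by parsing each line as it arrives, and a new stage only needs a new `type` value.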
02
Memory model
git, but for meaning
Most LLM context is ephemeral — pasted into a chat window, used
once, lost. embed inverts that premise: a
versioned, queryable substrate where institutional knowledge
accumulates. Every project is an isolated namespace —
one Qdrant collection, one embedding model, one fixed vector
dimension. Documents are addressable artifacts; chunks carry
provenance; payloads survive every round-trip.
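One way to picture the namespace described above is as two small records: a project that pins its collection, model, and dimension, and a chunk payload that carries provenance. This is only a sketch. The class names, field names, and the `proj_` naming scheme are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Project:
    """One isolated namespace: one collection, one model, one dimension."""
    project_id: str
    embedding_model: str
    vector_dim: int

    @property
    def collection(self) -> str:
        # Assumption: the collection name is derived from the project id.
        return f"proj_{self.project_id}"


@dataclass(frozen=True)
class ChunkPayload:
    """Provenance stored alongside each vector (hypothetical fields)."""
    document_id: str
    source_path: str
    chunk_index: int
    text: str


project = Project(project_id="demo", embedding_model="qwen3", vector_dim=1024)
payload = ChunkPayload("doc-1", "notes/design.md", 0, "first chunk")

# A payload should survive serialization unchanged.
restored = ChunkPayload(**json.loads(json.dumps(asdict(payload))))
```

Freezing both records makes the "one fixed vector dimension" rule hard to violate by accident: changing the model or dimension means constructing a new project.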
Where git tracks lines and diffs, embed
tracks meaning and proximity. Where git log
shows commits, the activity feed shows ingestion events. The
graph layer (planned) will track derivations between facts the
way git tracks derivations between commits — branchable,
auditable, blame-able. A central place for context, not a
scratch pad.
03
Invariants
load-bearing guarantees
- SCOPE Every vector op is keyed by the project's Qdrant collection. Cross-project leakage is structurally impossible.
- DIM-LOCK Collection vector size is fixed at create time. Swapping the embedding model mints new collections — old ones reject mismatched-dim upserts.
- STREAM All progress is one JSON object per line. No SSE, no polling. Trivially testable with curl.
- NO-LC No LangChain. ~150 lines of recursive splitter, Ollama wrapper, and Qdrant fetch wrapper — full control of chunk metadata, retries, and progress events.
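The SCOPE and DIM-LOCK guarantees above can be illustrated with a small in-memory stand-in for the vector store. This is a sketch under stated assumptions: `Store`, `Collection`, and `DimensionMismatch` are hypothetical names, and nothing here calls Qdrant. It only shows the shape of the checks.

```python
class DimensionMismatch(ValueError):
    """Raised when a vector's length differs from the collection's fixed size."""


class Collection:
    def __init__(self, name: str, dim: int) -> None:
        self.name = name
        self.dim = dim  # fixed at create time, never changed afterwards
        self.points: dict[str, list[float]] = {}

    def upsert(self, point_id: str, vector: list[float]) -> None:
        # DIM-LOCK: reject any vector whose size does not match.
        if len(vector) != self.dim:
            raise DimensionMismatch(
                f"{self.name}: expected dim {self.dim}, got {len(vector)}"
            )
        self.points[point_id] = vector


class Store:
    def __init__(self) -> None:
        self._collections: dict[str, Collection] = {}

    def create(self, name: str, dim: int) -> Collection:
        if name in self._collections:
            raise ValueError(f"collection {name!r} already exists")
        self._collections[name] = Collection(name, dim)
        return self._collections[name]

    def collection(self, name: str) -> Collection:
        # SCOPE: every operation goes through exactly one named collection.
        return self._collections[name]
```

Under this model, swapping to an embedding model with a different output size means calling `create` with a new name, while the old collection keeps rejecting vectors of the new size.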