docstore¶
dbt for unstructured data. Extract once, query forever.
Most tools re-read your documents every time you ask a question. docstore extracts the fields you care about once, caches them locally as JSON, and serves every subsequent query from the cache - no LLM calls, no re-reading, no waiting.
Run 1 (cold): 200 invoices → 400 LLM calls → ~$0.40
Run 2 (warm): 200 invoices → 0 LLM calls → $0.00
Query: "which invoices are unpaid?" → 0 LLM calls → <1s
Why it exists¶
LLMs made it easy to extract structured data from documents. What they didn't provide is a layer that persists that extraction and invalidates it automatically when the source file changes. Most tools either re-read the raw documents on every query (expensive, slow), require a database or vector store (overkill), or store embeddings (wrong abstraction for structured extraction).
docstore treats structured extraction as a cache over your unstructured data. Same insight as dbt applied to SQL - you define the transformation once, the system manages the state.
Three surfaces¶
- CLI -
docstore extract ./invoices/ --schema invoice_schema. See CLI reference. - Python API -
from docstore import DocStore, ExtractionSchema. Examples in Getting started. - MCP server - Claude Desktop can call
extract,query,diff, andstatsdirectly.
Cache invariants¶
The cache invalidates automatically when:
- the file content changes (file hash changes)
- the schema changes (schema version hash changes)
- a different schema is applied to the same file (different key)
Limitations vs Lumient¶
docstore is a single-document extraction cache. It does not:
- compose records across documents (invoice + Stripe → reconciliation status)
- trigger automatically when files arrive
- maintain a queryable entity layer with lineage
- support multi-step workflow logic
- provide governance and audit trails for regulated industries
For cross-document composition and maintained operational records, see Lumient.
Get going¶
- Install and run your first extraction - ~3 minutes.
- Read the CLI reference - every command with examples.