Knowledge Sources
A KnowledgeSourceConfig connects a tenant's data sources (Notion, Gmail, uploads, databases) to knowledge spaces. Each source is synced, chunked, embedded, and indexed according to the space's configuration.
KnowledgeSourceConfig
type KnowledgeSourceConfig = {
  id: string;
  tenantId: string;
  spaceId: string;

  // Source type and location
  kind: "uploaded-document" | "url" | "email"
    | "notion" | "database-query" | "raw-text"
    | "slack" | "confluence" | "google-drive";
  location: string;

  // Sync configuration
  syncPolicy: {
    interval?: string; // e.g., "1h", "24h"
    webhook?: boolean;
    manual?: boolean;
  };

  // State
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;

  // Metadata
  metadata?: Record<string, unknown>;
  createdAt: string;
  updatedAt: string;
};
Source types
Uploaded Documents
{
  kind: "uploaded-document",
  location: "s3://bucket/tenant-123/docs/product-spec.pdf",
  syncPolicy: { manual: true }
}
PDFs, Word docs, presentations uploaded by users
Notion
{
  kind: "notion",
  location: "https://notion.so/workspace/product-docs",
  syncPolicy: { interval: "1h", webhook: true }
}
Notion pages and databases with real-time webhook updates
Gmail / Email
{
  kind: "email",
  location: "support@company.com",
  syncPolicy: { webhook: true }
}
Email threads from Gmail or other providers
Database Query
{
  kind: "database-query",
  location: "SELECT * FROM products WHERE active = true",
  syncPolicy: { interval: "24h" }
}
Structured data from application databases
URL / Web Scraping
{
  kind: "url",
  location: "https://stripe.com/docs",
  syncPolicy: { interval: "24h" }
}
External documentation and web content
Sync strategies
| Strategy | When to Use | Latency |
|---|---|---|
| webhook | Real-time updates (Notion, Gmail, Slack) | Seconds |
| interval | Periodic sync (databases, URLs) | Minutes to hours |
| manual | User-triggered (uploads, one-time imports) | On-demand |
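As a rough illustration of how a scheduler might read a syncPolicy, the sketch below converts interval strings to milliseconds and lists which strategies a policy enables. The helper names (`parseIntervalMs`, `enabledStrategies`) are hypothetical, and the "m" and "d" suffixes are an assumption; the examples on this page only show hour values such as "1h" and "24h".

```typescript
// Hypothetical helpers; only the syncPolicy shape comes from this page.
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };

// Assumed unit suffixes. The docs only show hours ("1h", "24h").
const UNIT_MS: Record<string, number> = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Parse strings such as "1h" or "24h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const trimmed = interval.trim();
  const unitMs = UNIT_MS[trimmed.slice(-1)];
  const amount = Number(trimmed.slice(0, -1));
  if (unitMs === undefined || !Number.isInteger(amount) || amount <= 0) {
    throw new Error(`Unsupported interval: ${interval}`);
  }
  return amount * unitMs;
}

// List the strategies a policy enables, in the order used in the table.
function enabledStrategies(policy: SyncPolicy): string[] {
  const strategies: string[] = [];
  if (policy.webhook) strategies.push("webhook");
  if (policy.interval !== undefined) strategies.push("interval");
  if (policy.manual) strategies.push("manual");
  return strategies;
}
```

A policy such as `{ interval: "1h", webhook: true }` enables both strategies, so the interval can act as a fallback when a webhook delivery is missed.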
Example: Multi-source space
A single knowledge space can be fed by multiple sources:
// Product Canon space with multiple sources
{
  spaceId: "product-canon",
  sources: [
    {
      id: "src_database_schema",
      kind: "database-query",
      location: "SELECT * FROM schema_definitions",
      syncPolicy: { interval: "1h" }
    },
    {
      id: "src_notion_product_docs",
      kind: "notion",
      location: "https://notion.so/product-docs",
      syncPolicy: { interval: "1h", webhook: true }
    },
    {
      id: "src_uploaded_specs",
      kind: "uploaded-document",
      location: "s3://bucket/specs/",
      syncPolicy: { manual: true }
    }
  ]
}
Processing pipeline
When a source is synced, ContractSpec processes it through several stages:
- Fetch - Retrieve content from source (API, database, file)
- Parse - Extract text from documents (PDF, Word, HTML)
- Chunk - Split into semantic chunks (paragraphs, sections)
- Embed - Generate vector embeddings (OpenAI, Cohere)
- Index - Store in vector database (Qdrant) or search engine
- Audit - Log sync operation and results
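The stages above can be sketched as a single function that wires injected stage implementations together. Everything here is illustrative: the `Stages` interface, `syncSource`, and `chunkText` are hypothetical names rather than ContractSpec's actual API, and the chunker is a simple paragraph packer standing in for whatever semantic chunking the runtime uses.

```typescript
// Hypothetical types and stage signatures, for illustration only.
type Chunk = { index: number; text: string };
type IndexedChunk = Chunk & { vector: number[] };
type AuditEntry = { sourceId: string; detail: string };

type Stages = {
  fetch: (location: string) => Promise<string>;
  parse: (raw: string) => string;
  embed: (text: string) => Promise<number[]>;
  index: (chunks: IndexedChunk[]) => Promise<void>;
  audit: (entry: AuditEntry) => void;
};

// Chunk step: split on blank lines, then pack paragraphs up to maxChars.
// A single paragraph longer than maxChars becomes its own oversized chunk.
function chunkText(text: string, maxChars: number): Chunk[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: Chunk[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && candidate.length > maxChars) {
      chunks.push({ index: chunks.length, text: current });
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push({ index: chunks.length, text: current });
  return chunks;
}

// Run Fetch, Parse, Chunk, Embed, Index, Audit in order for one source.
// Resolves to the number of chunks that were indexed.
async function syncSource(
  sourceId: string,
  location: string,
  stages: Stages,
  maxChars: number = 1000,
): Promise<number> {
  const raw = await stages.fetch(location);
  const text = stages.parse(raw);
  const chunks = chunkText(text, maxChars);
  const indexed: IndexedChunk[] = [];
  for (const chunk of chunks) {
    indexed.push({ ...chunk, vector: await stages.embed(chunk.text) });
  }
  await stages.index(indexed);
  stages.audit({ sourceId, detail: `indexed ${indexed.length} chunks` });
  return indexed.length;
}
```

Passing the stages in as functions keeps the sketch independent of any particular embedding provider or vector store, and makes each stage easy to stub in tests.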
Provider delta state
Runtime-backed providers should persist a ProviderDeltaSyncState per source before sync work is acknowledged. Gmail and Google Drive adapters use that checkpoint to resume cursors, renew watches, skip tombstones, dedupe webhook events, and replay from a known point after retries.
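A minimal sketch of that checkpointing, under stated assumptions: the handler skips tombstoned sources, drops a webhook event it has already seen, resumes from the stored cursor, and saves the new state before returning (the return stands in for the acknowledgement). `DeltaState` is a reduced view whose field names mirror the ProviderDeltaSyncState type that follows; `InMemoryStateStore`, `WebhookEvent`, and `handleWebhookEvent` are hypothetical and not part of any adapter API.

```typescript
// Reduced, hypothetical view of the per-source checkpoint.
type DeltaState = {
  cursor?: { cursor?: string };
  dedupeKey?: string;
  tombstone?: { deletedAt: string; reason?: string };
};

// Assumed event shape; real provider payloads differ.
type WebhookEvent = { sourceId: string; eventId: string; nextCursor: string };
type Outcome = "processed" | "duplicate" | "tombstoned";

// Stand-in for a durable store keyed by source ID.
class InMemoryStateStore {
  private states = new Map<string, DeltaState>();

  load(sourceId: string): DeltaState {
    return this.states.get(sourceId) ?? {};
  }

  save(sourceId: string, state: DeltaState): void {
    this.states.set(sourceId, state);
  }
}

function handleWebhookEvent(
  store: InMemoryStateStore,
  event: WebhookEvent,
  syncFrom: (cursor: string | undefined) => void,
): Outcome {
  const state = store.load(event.sourceId);
  if (state.tombstone) return "tombstoned"; // deleted source: skip work
  if (state.dedupeKey === event.eventId) return "duplicate"; // redelivery

  // If syncFrom throws, nothing is saved, so a retry resumes from the
  // previous cursor instead of losing the delta.
  syncFrom(state.cursor?.cursor);

  // Persist the checkpoint before the caller acknowledges the event.
  store.save(event.sourceId, {
    ...state,
    cursor: { cursor: event.nextCursor },
    dedupeKey: event.eventId,
  });
  return "processed";
}
```

Tracking only the most recent event ID is a simplification that catches immediate redeliveries; a production adapter would likely need a bounded set of recent keys.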
type ProviderDeltaSyncState = {
  lease?: { holder: string; expiresAt: string; renewalWindowMs: number };
  cursor?: { cursor?: string; watermark?: string; watermarkVersion?: string };
  webhookChannel?: { channelId: string; resourceId?: string; expiresAt?: string };
  providerEventId?: string;
  dedupeKey?: string;
  idempotencyKey?: string;
  replayCheckpoint?: { checkpointId: string; sequence?: string | number };
  tombstone?: { deletedAt: string; reason?: string };
};
Best practices
- Use webhooks for real-time sources (Notion, Gmail) to minimize latency.
- Set appropriate sync intervals: hourly for active docs, daily for stable content.
- Monitor sync failures and set up alerts for critical sources.
- Test sources in sandbox before enabling in production.
- Document the purpose and ownership of each source for your team.
- Use manual sync for sensitive or infrequently updated content.
- Run external mutations through knowledge mutation governance before sending email, changing Drive permissions, or repairing replay state.
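
To make the monitoring practice concrete, the sketch below scans a list of sources and reports those in an error state or with no sufficiently recent successful sync. It relies only on the `status`, `lastSyncedAt`, and `lastErrorMessage` fields of KnowledgeSourceConfig. The function name, the `Alert` shape, and the staleness threshold are hypothetical, and `lastSyncedAt` is assumed to be an ISO 8601 timestamp, which this page does not state.

```typescript
// Subset of KnowledgeSourceConfig fields used for health checks.
type SourceHealthInput = {
  id: string;
  status: "active" | "paused" | "error";
  lastSyncedAt?: string; // assumed to be ISO 8601
  lastErrorMessage?: string;
};

// Hypothetical alert shape.
type Alert = { sourceId: string; reason: string };

function findSourcesNeedingAttention(
  sources: SourceHealthInput[],
  now: Date,
  maxAgeMs: number,
): Alert[] {
  const alerts: Alert[] = [];
  for (const source of sources) {
    if (source.status === "error") {
      alerts.push({
        sourceId: source.id,
        reason: source.lastErrorMessage ?? "unknown error",
      });
      continue;
    }
    if (source.status !== "active") continue; // paused sources are skipped

    const last = source.lastSyncedAt ? Date.parse(source.lastSyncedAt) : NaN;
    if (Number.isNaN(last)) {
      alerts.push({ sourceId: source.id, reason: "never synced" });
    } else if (now.getTime() - last > maxAgeMs) {
      alerts.push({ sourceId: source.id, reason: "stale" });
    }
  }
  return alerts;
}
```

Sources that sync only manually will look stale under a fixed threshold, so a real check would probably filter them out or apply a per-source limit.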