Architecture¶
Paperless NGX Dedupe is a pnpm monorepo with two packages that separate concerns cleanly between business logic and the web interface.
Monorepo Overview¶
graph TD
subgraph "packages/"
Core["core<br/><small>Business logic, algorithms, DB</small>"]
Web["web<br/><small>SvelteKit 2 app (UI + API)</small>"]
end
Web -->|"imports"| Core
Paperless["Paperless-NGX"]
Browser["Browser"]
Browser -->|"HTTP"| Web
Core -->|"REST API"| Paperless
style Core fill:#e8eaf6,stroke:#3f51b5
style Web fill:#e8f5e9,stroke:#4caf50
packages/core¶
Framework-agnostic TypeScript library containing all business logic. No web framework dependencies.
Key modules:
dedup/-- MinHash signatures, LSH indexing, fuzzy matching, discriminative scoring, union-find clusteringsync/-- Document sync from Paperless-NGX, text normalization, fingerprintingjobs/-- Worker thread launcher and job queue managerqueries/-- Database queries via Drizzle ORM (documents, duplicates, dashboard, config)schema/-- Drizzle ORM table definitions and relationspaperless/-- Paperless-NGX REST API client with Zod schema validationai/-- AI-powered metadata extraction (OpenAI), auto-apply, cost tracking, feedbackrag/-- Retrieval-augmented generation: document chunking, embeddings, vector search, conversationsexport/-- CSV and JSON export utilitiestelemetry/-- OpenTelemetry tracing and metrics instrumentationconfig.ts-- Zod-validated environment configuration
packages/web¶
SvelteKit 2 application (Svelte 5 runes) that serves both the web UI and the REST API. Uses adapter-node for Docker deployment.
Key areas:
routes/api/v1/-- REST API endpoints matching the API Referenceroutes/-- UI pages: dashboard, documents, duplicates (detail, graph, wizard), AI processing (queue, review, history), RAG ask, settingslib/components/-- Reusable Svelte components (DocumentCompare, TextDiff, etc.)lib/server/-- Server-side utilities (database connection, API helpers)hooks.server.ts-- SvelteKit server hooks for request processing
Key Technical Choices¶
| Area | Choice | Rationale |
|---|---|---|
| Database | SQLite + Drizzle ORM | Single-file database, no external dependency, excellent for single-container deployment |
| Background Jobs | worker_threads + SQLite job queue |
No Redis needed. One job per type at a time prevents resource contention |
| Real-time Progress | Server-Sent Events (SSE) | Simpler than WebSockets for unidirectional progress streams |
| Dedup Algorithms | Pure TypeScript MinHash/LSH | No native dependencies beyond better-sqlite3. Defaults: 256 permutations, 32 bands |
| Vector Search | sqlite-vec |
SQLite extension for RAG embedding storage and similarity search |
| AI Providers | OpenAI (via Vercel AI SDK) | Optional metadata extraction and RAG conversations |
| Validation | Zod | TypeScript-first schemas for env config and API requests |
| Logging | Pino | Fast structured JSON logging |
| Telemetry | OpenTelemetry | Distributed tracing and metrics (optional) |
| Styling | Tailwind CSS 4 | Utility-first CSS via Vite plugin |
Data Flow¶
Sync Pipeline¶
sequenceDiagram
participant UI as Web UI
participant API as API Layer
participant JM as Job Manager
participant W as Worker Thread
participant P as Paperless-NGX
participant DB as SQLite
UI->>API: POST /api/v1/sync
API->>JM: Create sync job
JM->>W: Spawn worker thread
W->>P: Fetch documents (paginated)
P-->>W: Document metadata + content
W->>W: Normalize text, compute fingerprints
W->>DB: Upsert documents + content
W-->>JM: Progress events (SSE)
JM-->>UI: Real-time progress
W-->>JM: Job complete
Analysis Pipeline¶
sequenceDiagram
participant UI as Web UI
participant API as API Layer
participant W as Worker Thread
participant DB as SQLite
UI->>API: POST /api/v1/analysis
API->>W: Spawn analysis worker
Note over W: Stage 1: Generate shingles
W->>DB: Read document content
W->>W: Create word n-gram sets
Note over W: Stage 2: MinHash signatures
W->>W: Compute 256 hash permutations per doc
W->>DB: Store signatures
Note over W: Stage 3: LSH candidate detection
W->>W: Band hashing (32 bands)
W->>W: Bucket collision → candidate pairs
Note over W: Stage 4: Similarity scoring
W->>W: Jaccard + Fuzzy text matching
W->>W: Discriminative penalty applied
W->>W: 2-weight + penalty confidence score
Note over W: Stage 5: Union-find clustering
W->>W: Group connected pairs
Note over W: Stage 6: Persist results
W->>DB: Create/update duplicate groups
W-->>UI: Job complete
Review Flow¶
flowchart LR
List["Duplicate List<br/><small>sorted by confidence</small>"] --> Detail["Detail View<br/><small>side-by-side diff</small>"]
Detail --> Primary["Set Primary<br/><small>document to keep</small>"]
Detail --> FalsePositive["Set Status<br/><small>false_positive</small>"]
Detail --> Ignored["Set Status<br/><small>ignored</small>"]
Primary --> Batch["Batch Delete<br/><small>remove non-primary</small>"]
Batch --> Paperless["Paperless-NGX<br/><small>documents deleted</small>"]
style Batch fill:#ffcdd2,stroke:#f44336
style Paperless fill:#ffcdd2,stroke:#f44336
Database Schema¶
The SQLite database contains 11 tables (plus a virtual table for vector embeddings):
erDiagram
document ||--o| documentContent : "has content"
document ||--o| documentSignature : "has signature"
document ||--o{ duplicateMember : "belongs to groups"
document ||--o| aiProcessingResult : "has AI result"
document ||--o{ documentChunk : "has chunks"
duplicateGroup ||--|{ duplicateMember : "contains members"
ragConversation ||--|{ ragMessage : "contains messages"
document {
text id PK
int paperlessId UK
text title
text fingerprint
text correspondent
text documentType
text tagsJson
text createdDate
text addedDate
text modifiedDate
text processingStatus
text syncedAt
}
documentContent {
text id PK
text documentId FK_UK
text fullText
text normalizedText
int wordCount
text contentHash
}
documentSignature {
text id PK
text documentId FK_UK
blob minhashSignature
text algorithmVersion
int numPermutations
text createdAt
}
duplicateGroup {
text id PK
real confidenceScore
real jaccardSimilarity
real fuzzyTextRatio
real discriminativeScore
text algorithmVersion
text status
text createdAt
text updatedAt
}
duplicateMember {
text id PK
text groupId FK
text documentId FK
int isPrimary
}
job {
text id PK
text type
text status
real progress
real phaseProgress
text progressMessage
text startedAt
text completedAt
text errorMessage
text resultJson
text createdAt
}
appConfig {
text key PK
text value
text updatedAt
}
syncState {
text id PK
text lastSyncAt
int lastSyncDocumentCount
text lastAnalysisAt
int totalDocuments
int totalDuplicateGroups
int cumulativeGroupsActioned
int cumulativeDocumentsDeleted
}
aiProcessingResult {
text id PK
text documentId FK_UK
int paperlessId
text provider
text model
text suggestedCorrespondent
text suggestedDocumentType
text suggestedTagsJson
text confidenceJson
text appliedStatus
text appliedAt
text evidence
text failureType
int promptTokens
int completionTokens
real estimatedCostUsd
text createdAt
}
documentChunk {
text id PK
text documentId FK
int chunkIndex
text content
int tokenCount
text metadata
text contentHash
text embeddingModel
text createdAt
}
ragConversation {
text id PK
text title
text createdAt
text updatedAt
}
ragMessage {
text id PK
text conversationId FK
text role
text content
text sourcesJson
int tokenUsage
text createdAt
}
Worker Thread Architecture¶
Background jobs run in Node.js worker_threads to avoid blocking the main event loop:
- Job Manager (
packages/core/src/jobs/manager.ts): Creates job records in SQLite, spawns worker threads, monitors completion - Worker Launcher (
packages/core/src/jobs/worker-launcher.ts): Generic worker spawning and crash handling - Worker Paths (
packages/core/src/jobs/worker-paths.ts): Resolves worker module paths across dev, built, and Docker environments - Workers (
packages/core/src/jobs/workers/): Specialized workers:sync-worker-- Document sync from Paperless-NGXanalysis-worker-- MinHash/LSH dedup analysisbatch-worker-- Batch delete operationsai-processing-worker-- AI metadata extractionai-apply-worker-- Apply AI suggestions to Paperless-NGXrag-indexing-worker-- RAG document chunking and embedding
Constraints:
- Only one job per type can run at a time (enforced by the job queue)
- Workers persist progress to the
jobtable - The API polls that job state and streams it via SSE at
/api/v1/jobs/:jobId/progress - Stale jobs (from crashed workers) are recovered on startup
API Layer¶
The REST API is implemented as SvelteKit server routes at packages/web/src/routes/api/v1/:
api/v1/
├── health/ # GET
├── ready/ # GET
├── metrics/ # GET (Prometheus)
├── dashboard/ # GET
├── sync/ # POST
├── sync/status/ # GET
├── analysis/ # POST
├── analysis/status/ # GET
├── jobs/ # GET
├── jobs/:jobId # GET
├── jobs/:jobId/progress # GET (SSE)
├── jobs/:jobId/cancel # POST
├── config/ # GET, PUT
├── config/dedup # GET, PUT
├── config/test-connection # POST
├── documents/ # GET
├── documents/:id # GET
├── documents/:id/content # GET
├── documents/stats # GET
├── duplicates/ # GET
├── duplicates/:id # GET, DELETE
├── duplicates/:id/content # GET
├── duplicates/:id/status # PUT
├── duplicates/:id/primary # PUT
├── duplicates/stats # GET
├── duplicates/graph # GET
├── batch/status # POST
├── batch/delete-non-primary # POST
├── batch/purge-deleted # POST
├── export/duplicates.csv # GET
├── export/config.json # GET
├── import/config # POST
├── ai/config # GET, PUT
├── ai/models # GET
├── ai/process # POST
├── ai/stats # GET
├── ai/costs # GET
├── ai/costs/estimate # POST
├── ai/feedback/summary # GET
├── ai/results/ # GET
├── ai/results/groups # GET
├── ai/results/preflight # POST
├── ai/results/batch-apply # POST
├── ai/results/apply-all # POST
├── ai/results/batch-reject # POST
├── ai/results/reject-all # POST
├── ai/results/:id # GET, DELETE
├── ai/results/:id/apply # POST
├── ai/results/:id/reject # POST
├── ai/results/:id/revert # POST
├── ai/results/:id/feedback # POST
├── rag/config # GET, PUT
├── rag/index # POST
├── rag/stats # GET
├── rag/ask # POST
├── rag/conversations # GET, POST
├── rag/conversations/:id # GET, DELETE
└── paperless/* # Proxy/helper endpoints used by the UI
All endpoints follow a consistent response envelope pattern documented in the API Reference.