Architecture

Paperless NGX Dedupe is a pnpm monorepo with two packages: core holds the business logic, and web provides the web interface and REST API.

Monorepo Overview

graph TD
    subgraph "packages/"
        Core["core<br/><small>Business logic, algorithms, DB</small>"]
        Web["web<br/><small>SvelteKit 2 app (UI + API)</small>"]
    end

    Web -->|"imports"| Core

    Paperless["Paperless-NGX"]
    Browser["Browser"]

    Browser -->|"HTTP"| Web
    Core -->|"REST API"| Paperless

    style Core fill:#e8eaf6,stroke:#3f51b5
    style Web fill:#e8f5e9,stroke:#4caf50

packages/core

Framework-agnostic TypeScript library containing all business logic. No web framework dependencies.

Key modules:

  • dedup/ -- MinHash signatures, LSH indexing, fuzzy matching, discriminative scoring, union-find clustering
  • sync/ -- Document sync from Paperless-NGX, text normalization, fingerprinting
  • jobs/ -- Worker thread launcher and job queue manager
  • queries/ -- Database queries via Drizzle ORM (documents, duplicates, dashboard, config)
  • schema/ -- Drizzle ORM table definitions and relations
  • paperless/ -- Paperless-NGX REST API client with Zod schema validation
  • ai/ -- AI-powered metadata extraction (OpenAI), auto-apply, cost tracking, feedback
  • rag/ -- Retrieval-augmented generation: document chunking, embeddings, vector search, conversations
  • export/ -- CSV and JSON export utilities
  • telemetry/ -- OpenTelemetry tracing and metrics instrumentation
  • config.ts -- Zod-validated environment configuration
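
The union-find clustering listed under dedup/ groups documents that are connected through any chain of matching pairs. The following is a minimal sketch of that idea, not the project's code: the names UnionFind and clusterPairs are hypothetical, and document IDs are assumed to be strings.

```typescript
// Minimal union-find (disjoint-set) sketch for grouping matched pairs.
// All names are illustrative assumptions, not the project's actual API.
class UnionFind {
  private parent = new Map<string, string>();

  // Return the root of the set containing `item`, compressing the path.
  find(item: string): string {
    const p = this.parent.get(item);
    if (p === undefined) {
      this.parent.set(item, item); // first time we see this item
      return item;
    }
    if (p === item) return item;
    const root = this.find(p);
    this.parent.set(item, root);
    return root;
  }

  // Merge the sets containing `a` and `b`.
  union(a: string, b: string): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent.set(rootA, rootB);
  }
}

// Turn pairwise matches into clusters of document IDs.
function clusterPairs(pairs: Array<[string, string]>): string[][] {
  const uf = new UnionFind();
  for (const [a, b] of pairs) uf.union(a, b);

  const groups = new Map<string, string[]>();
  const seen = new Set<string>();
  for (const pair of pairs) {
    for (const id of pair) {
      if (seen.has(id)) continue;
      seen.add(id);
      const root = uf.find(id);
      const members = groups.get(root) ?? [];
      members.push(id);
      groups.set(root, members);
    }
  }
  return Array.from(groups.values()).map((g) => g.sort());
}

// a-b and b-c are linked transitively, so a, b, c end up in one cluster.
const exampleClusters = clusterPairs([
  ["a", "b"],
  ["b", "c"],
  ["x", "y"],
]);
console.log(exampleClusters);
```

Because clusters are transitive, two documents can land in the same group even if they were never directly compared.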

packages/web

SvelteKit 2 application (Svelte 5 runes) that serves both the web UI and the REST API. Uses adapter-node for Docker deployment.

Key areas:

  • routes/api/v1/ -- REST API endpoints matching the API Reference
  • routes/ -- UI pages: dashboard, documents, duplicates (detail, graph, wizard), AI processing (queue, review, history), RAG ask, settings
  • lib/components/ -- Reusable Svelte components (DocumentCompare, TextDiff, etc.)
  • lib/server/ -- Server-side utilities (database connection, API helpers)
  • hooks.server.ts -- SvelteKit server hooks for request processing

Key Technical Choices

| Area | Choice | Rationale |
| --- | --- | --- |
| Database | SQLite + Drizzle ORM | Single-file database, no external dependency, excellent for single-container deployment |
| Background Jobs | worker_threads + SQLite job queue | No Redis needed. One job per type at a time prevents resource contention |
| Real-time Progress | Server-Sent Events (SSE) | Simpler than WebSockets for unidirectional progress streams |
| Dedup Algorithms | Pure TypeScript MinHash/LSH | No native dependencies beyond better-sqlite3. Defaults: 256 permutations, 32 bands |
| Vector Search | sqlite-vec | SQLite extension for RAG embedding storage and similarity search |
| AI Providers | OpenAI (via Vercel AI SDK) | Optional metadata extraction and RAG conversations |
| Validation | Zod | TypeScript-first schemas for env config and API requests |
| Logging | Pino | Fast structured JSON logging |
| Telemetry | OpenTelemetry | Distributed tracing and metrics (optional) |
| Styling | Tailwind CSS 4 | Utility-first CSS via Vite plugin |
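
To see what the default MinHash/LSH parameters imply, the sketch below evaluates the standard LSH banding formula. It assumes the 256-value signature is split evenly across the 32 bands (8 rows per band), which is the usual banding scheme but is not stated in this document. The function names are hypothetical.

```typescript
// Probability that two documents with Jaccard similarity `s` collide in at
// least one band, for `bands` bands of `rows` MinHash values each.
function candidateProbability(s: number, bands: number, rows: number): number {
  return 1 - Math.pow(1 - Math.pow(s, rows), bands);
}

// Common approximation of the similarity where the S-curve turns upward.
function approximateThreshold(bands: number, rows: number): number {
  return Math.pow(1 / bands, 1 / rows);
}

const defaultPermutations = 256; // default from the table above
const defaultBands = 32; // default from the table above
// Assumption: the signature is divided evenly, giving 8 rows per band.
const defaultRows = defaultPermutations / defaultBands;

console.log("rows per band:", defaultRows);
console.log("approx. threshold:", approximateThreshold(defaultBands, defaultRows).toFixed(3));
for (const s of [0.4, 0.6, 0.8]) {
  const p = candidateProbability(s, defaultBands, defaultRows);
  console.log(`similarity ${s} -> candidate probability ${p.toFixed(4)}`);
}
```

More bands with fewer rows each lower the threshold (more candidates, more scoring work); fewer bands with more rows raise it.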

Data Flow

Sync Pipeline

sequenceDiagram
    participant UI as Web UI
    participant API as API Layer
    participant JM as Job Manager
    participant W as Worker Thread
    participant P as Paperless-NGX
    participant DB as SQLite

    UI->>API: POST /api/v1/sync
    API->>JM: Create sync job
    JM->>W: Spawn worker thread
    W->>P: Fetch documents (paginated)
    P-->>W: Document metadata + content
    W->>W: Normalize text, compute fingerprints
    W->>DB: Upsert documents + content
    W->>DB: Persist progress to job table
    API-->>UI: Real-time progress (SSE)
    W-->>JM: Job complete
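
The "Normalize text, compute fingerprints" step can be pictured with the sketch below. The actual normalization rules and hash function are not specified in this document, so both are assumptions: normalizeText is an ASCII-only simplification, and FNV-1a is a stand-in for whatever hash the project uses.

```typescript
// Assumed normalization: lowercase, replace every run of characters outside
// a-z and 0-9 with a single space, and trim. ASCII-only for brevity.
function normalizeText(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
// A stand-in fingerprint; a real system would likely use a stronger hash.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Fingerprint of the normalized text, so formatting differences do not matter.
function fingerprint(raw: string): string {
  return fnv1a32(normalizeText(raw));
}

const scanA = "  Invoice #42\n\nTOTAL: 10.00 ";
const scanB = "invoice 42 total 10 00";
console.log(normalizeText(scanA)); // invoice 42 total 10 00
console.log(fingerprint(scanA) === fingerprint(scanB)); // true
```

Hashing the normalized text rather than the raw text means two scans that differ only in case, punctuation, or whitespace produce the same fingerprint.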

Analysis Pipeline

sequenceDiagram
    participant UI as Web UI
    participant API as API Layer
    participant W as Worker Thread
    participant DB as SQLite

    UI->>API: POST /api/v1/analysis
    API->>W: Spawn analysis worker

    Note over W: Stage 1: Generate shingles
    W->>DB: Read document content
    W->>W: Create word n-gram sets

    Note over W: Stage 2: MinHash signatures
    W->>W: Compute 256 hash permutations per doc
    W->>DB: Store signatures

    Note over W: Stage 3: LSH candidate detection
    W->>W: Band hashing (32 bands)
    W->>W: Bucket collision → candidate pairs

    Note over W: Stage 4: Similarity scoring
    W->>W: Jaccard + Fuzzy text matching
    W->>W: Discriminative penalty applied
    W->>W: Combine into confidence score (2 weights + penalty)

    Note over W: Stage 5: Union-find clustering
    W->>W: Group connected pairs

    Note over W: Stage 6: Persist results
    W->>DB: Create/update duplicate groups
    W-->>UI: Job complete
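
Stages 1 to 3 of the diagram above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the shingle size (3 words), the hash functions (FNV-1a plus a Murmur3-style mixer), and all function names are hypothetical. Only the 256 permutations and 32 bands come from this document.

```typescript
// FNV-1a: hash a string to a 32-bit integer.
function hashString(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Murmur3-style finalizer; XOR-ing a seed before mixing gives one hash per seed.
function mix32(x: number): number {
  x ^= x >>> 16;
  x = Math.imul(x, 0x85ebca6b);
  x ^= x >>> 13;
  x = Math.imul(x, 0xc2b2ae35);
  x ^= x >>> 16;
  return x >>> 0;
}

// Stage 1: hashed word n-gram shingles.
function shingles(text: string, n: number): Set<number> {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const out = new Set<number>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(hashString(words.slice(i, i + n).join(" ")));
  }
  return out;
}

// Stage 2: keep the minimum hash value per seeded hash function.
function minhash(shingleSet: Set<number>, numPermutations: number): Uint32Array {
  const sig = new Uint32Array(numPermutations).fill(0xffffffff);
  shingleSet.forEach((s) => {
    for (let i = 0; i < numPermutations; i++) {
      const h = mix32(s ^ Math.imul(i + 1, 0x9e3779b1));
      if (h < sig[i]) sig[i] = h;
    }
  });
  return sig;
}

// Fraction of matching signature positions estimates Jaccard similarity.
function estimateJaccard(a: Uint32Array, b: Uint32Array): number {
  let equal = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) equal++;
  return equal / a.length;
}

// Stage 3: documents sharing any band bucket become candidate pairs.
function candidatePairs(sigs: Map<string, Uint32Array>, bands: number): Array<[string, string]> {
  const buckets = new Map<string, string[]>();
  sigs.forEach((sig, docId) => {
    const rows = sig.length / bands;
    for (let b = 0; b < bands; b++) {
      const key = b + ":" + Array.from(sig.subarray(b * rows, (b + 1) * rows)).join(",");
      const bucket = buckets.get(key) ?? [];
      bucket.push(docId);
      buckets.set(key, bucket);
    }
  });
  const seen = new Set<string>();
  const pairs: Array<[string, string]> = [];
  buckets.forEach((ids) => {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const pairKey = ids[i] + "|" + ids[j];
        if (!seen.has(pairKey)) {
          seen.add(pairKey);
          pairs.push([ids[i], ids[j]]);
        }
      }
    }
  });
  return pairs;
}
```

Only the candidate pairs produced by the banding step need the more expensive Stage 4 scoring, which is what keeps the analysis from comparing every document with every other document.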

Review Flow

flowchart LR
    List["Duplicate List<br/><small>sorted by confidence</small>"] --> Detail["Detail View<br/><small>side-by-side diff</small>"]
    Detail --> Primary["Set Primary<br/><small>document to keep</small>"]
    Detail --> FalsePositive["Set Status<br/><small>false_positive</small>"]
    Detail --> Ignored["Set Status<br/><small>ignored</small>"]
    Primary --> Batch["Batch Delete<br/><small>remove non-primary</small>"]
    Batch --> Paperless["Paperless-NGX<br/><small>documents deleted</small>"]

    style Batch fill:#ffcdd2,stroke:#f44336
    style Paperless fill:#ffcdd2,stroke:#f44336
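
The "Batch Delete" step removes every group member except the primary. A minimal sketch of that selection follows. The field names come from the duplicateMember table in the schema section; the 1/0 encoding of isPrimary, the rule that a group must have exactly one primary, and the function name are assumptions.

```typescript
// Shape mirrors the duplicateMember columns; isPrimary is an int in the
// schema, assumed here to be 1 for the primary document and 0 otherwise.
interface DuplicateMember {
  groupId: string;
  documentId: string;
  isPrimary: number;
}

// Return the document IDs that a batch delete would remove for one group.
// Refuses to proceed unless exactly one primary is set (an assumed safeguard).
function nonPrimaryDocumentIds(members: DuplicateMember[]): string[] {
  const primaries = members.filter((m) => m.isPrimary === 1);
  if (primaries.length !== 1) {
    throw new Error(`expected exactly one primary, found ${primaries.length}`);
  }
  return members.filter((m) => m.isPrimary !== 1).map((m) => m.documentId);
}

const exampleGroup: DuplicateMember[] = [
  { groupId: "g1", documentId: "d1", isPrimary: 1 },
  { groupId: "g1", documentId: "d2", isPrimary: 0 },
  { groupId: "g1", documentId: "d3", isPrimary: 0 },
];
console.log(nonPrimaryDocumentIds(exampleGroup)); // [ 'd2', 'd3' ]
```

Failing loudly when no primary is set is a deliberate choice in this sketch, since the deletion is destructive in Paperless-NGX.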

Database Schema

The SQLite database contains the 12 tables shown below (plus a virtual table for vector embeddings):

erDiagram
    document ||--o| documentContent : "has content"
    document ||--o| documentSignature : "has signature"
    document ||--o{ duplicateMember : "belongs to groups"
    document ||--o| aiProcessingResult : "has AI result"
    document ||--o{ documentChunk : "has chunks"
    duplicateGroup ||--|{ duplicateMember : "contains members"
    ragConversation ||--|{ ragMessage : "contains messages"

    document {
        text id PK
        int paperlessId UK
        text title
        text fingerprint
        text correspondent
        text documentType
        text tagsJson
        text createdDate
        text addedDate
        text modifiedDate
        text processingStatus
        text syncedAt
    }

    documentContent {
        text id PK
        text documentId FK_UK
        text fullText
        text normalizedText
        int wordCount
        text contentHash
    }

    documentSignature {
        text id PK
        text documentId FK_UK
        blob minhashSignature
        text algorithmVersion
        int numPermutations
        text createdAt
    }

    duplicateGroup {
        text id PK
        real confidenceScore
        real jaccardSimilarity
        real fuzzyTextRatio
        real discriminativeScore
        text algorithmVersion
        text status
        text createdAt
        text updatedAt
    }

    duplicateMember {
        text id PK
        text groupId FK
        text documentId FK
        int isPrimary
    }

    job {
        text id PK
        text type
        text status
        real progress
        real phaseProgress
        text progressMessage
        text startedAt
        text completedAt
        text errorMessage
        text resultJson
        text createdAt
    }

    appConfig {
        text key PK
        text value
        text updatedAt
    }

    syncState {
        text id PK
        text lastSyncAt
        int lastSyncDocumentCount
        text lastAnalysisAt
        int totalDocuments
        int totalDuplicateGroups
        int cumulativeGroupsActioned
        int cumulativeDocumentsDeleted
    }

    aiProcessingResult {
        text id PK
        text documentId FK_UK
        int paperlessId
        text provider
        text model
        text suggestedCorrespondent
        text suggestedDocumentType
        text suggestedTagsJson
        text confidenceJson
        text appliedStatus
        text appliedAt
        text evidence
        text failureType
        int promptTokens
        int completionTokens
        real estimatedCostUsd
        text createdAt
    }

    documentChunk {
        text id PK
        text documentId FK
        int chunkIndex
        text content
        int tokenCount
        text metadata
        text contentHash
        text embeddingModel
        text createdAt
    }

    ragConversation {
        text id PK
        text title
        text createdAt
        text updatedAt
    }

    ragMessage {
        text id PK
        text conversationId FK
        text role
        text content
        text sourcesJson
        int tokenUsage
        text createdAt
    }

Worker Thread Architecture

Background jobs run in Node.js worker_threads to avoid blocking the main event loop:

  • Job Manager (packages/core/src/jobs/manager.ts): Creates job records in SQLite, spawns worker threads, monitors completion
  • Worker Launcher (packages/core/src/jobs/worker-launcher.ts): Generic worker spawning and crash handling
  • Worker Paths (packages/core/src/jobs/worker-paths.ts): Resolves worker module paths across dev, built, and Docker environments
  • Workers (packages/core/src/jobs/workers/): Specialized workers:
    • sync-worker -- Document sync from Paperless-NGX
    • analysis-worker -- MinHash/LSH dedup analysis
    • batch-worker -- Batch delete operations
    • ai-processing-worker -- AI metadata extraction
    • ai-apply-worker -- Apply AI suggestions to Paperless-NGX
    • rag-indexing-worker -- RAG document chunking and embedding

Constraints:

  • Only one job per type can run at a time (enforced by the job queue)
  • Workers persist progress to the job table
  • The API polls that job state and streams it via SSE at /api/v1/jobs/:jobId/progress
  • Stale jobs (from crashed workers) are recovered on startup
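
The first and last constraints can be illustrated with an in-memory sketch. The real queue is backed by the SQLite job table; the class name, the status values ("running", "completed", "failed"), and the job type strings below are assumptions made for the example.

```typescript
// Assumed status values; the document only says the job table has a status column.
type JobStatus = "running" | "completed" | "failed";

interface JobRecord {
  id: string;
  type: string;
  status: JobStatus;
  errorMessage?: string;
}

// In-memory stand-in for the SQLite-backed job queue.
class InMemoryJobQueue {
  private jobs: JobRecord[] = [];

  // Reject a new job if one of the same type is still running.
  start(type: string): JobRecord {
    const active = this.jobs.some((j) => j.type === type && j.status === "running");
    if (active) throw new Error(`a ${type} job is already running`);
    const job: JobRecord = { id: `job-${this.jobs.length + 1}`, type, status: "running" };
    this.jobs.push(job);
    return job;
  }

  // Record the final state of a job.
  finish(id: string, status: "completed" | "failed"): void {
    const job = this.jobs.find((j) => j.id === id);
    if (job) job.status = status;
  }

  // On startup, any job still marked running belongs to a worker that no
  // longer exists, so mark it failed and free its type for new jobs.
  recoverStale(): number {
    let recovered = 0;
    for (const job of this.jobs) {
      if (job.status === "running") {
        job.status = "failed";
        job.errorMessage = "worker did not finish before restart";
        recovered++;
      }
    }
    return recovered;
  }
}

const demoQueue = new InMemoryJobQueue();
const demoSync = demoQueue.start("sync");
demoQueue.start("analysis"); // a different type may run concurrently
demoQueue.finish(demoSync.id, "completed");
demoQueue.start("sync"); // allowed again once the previous sync has finished
```

Without the recovery step, a crashed worker would leave its job marked as running and block every later job of the same type.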

API Layer

The REST API is implemented as SvelteKit server routes at packages/web/src/routes/api/v1/:

api/v1/
├── health/                         # GET
├── ready/                          # GET
├── metrics/                        # GET (Prometheus)
├── dashboard/                      # GET
├── sync/                           # POST
├── sync/status/                    # GET
├── analysis/                       # POST
├── analysis/status/                # GET
├── jobs/                           # GET
├── jobs/:jobId                     # GET
├── jobs/:jobId/progress            # GET (SSE)
├── jobs/:jobId/cancel              # POST
├── config/                         # GET, PUT
├── config/dedup                    # GET, PUT
├── config/test-connection          # POST
├── documents/                      # GET
├── documents/:id                   # GET
├── documents/:id/content           # GET
├── documents/stats                 # GET
├── duplicates/                     # GET
├── duplicates/:id                  # GET, DELETE
├── duplicates/:id/content          # GET
├── duplicates/:id/status           # PUT
├── duplicates/:id/primary          # PUT
├── duplicates/stats                # GET
├── duplicates/graph                # GET
├── batch/status                    # POST
├── batch/delete-non-primary        # POST
├── batch/purge-deleted             # POST
├── export/duplicates.csv           # GET
├── export/config.json              # GET
├── import/config                   # POST
├── ai/config                       # GET, PUT
├── ai/models                       # GET
├── ai/process                      # POST
├── ai/stats                        # GET
├── ai/costs                        # GET
├── ai/costs/estimate               # POST
├── ai/feedback/summary             # GET
├── ai/results/                     # GET
├── ai/results/groups               # GET
├── ai/results/preflight            # POST
├── ai/results/batch-apply          # POST
├── ai/results/apply-all            # POST
├── ai/results/batch-reject         # POST
├── ai/results/reject-all           # POST
├── ai/results/:id                  # GET, DELETE
├── ai/results/:id/apply            # POST
├── ai/results/:id/reject           # POST
├── ai/results/:id/revert           # POST
├── ai/results/:id/feedback         # POST
├── rag/config                      # GET, PUT
├── rag/index                       # POST
├── rag/stats                       # GET
├── rag/ask                         # POST
├── rag/conversations               # GET, POST
├── rag/conversations/:id           # GET, DELETE
└── paperless/*                     # Proxy/helper endpoints used by the UI

All endpoints follow a consistent response envelope pattern documented in the API Reference.
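
The jobs/:jobId/progress endpoint is the one route in the tree that streams Server-Sent Events instead of returning a single response. The sketch below shows the SSE wire format only. The payload fields reuse column names from the job table, but the actual payload shape and the event name "progress" are assumptions, as are the function names.

```typescript
// Assumed payload; field names borrowed from the job table columns.
interface ProgressPayload {
  status: string;
  progress: number;
  progressMessage: string | null;
}

// Serialize one event in SSE wire format: optional "event:" line, a "data:"
// line, then a blank line that terminates the event.
function toSseFrame(payload: ProgressPayload, eventName = "progress"): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Parse a chunk of SSE text back into events. Simplified: assumes "\n" line
// endings and complete frames, and ignores "id:" and "retry:" fields.
function parseSseFrames(text: string): Array<{ event: string; data: string }> {
  const frames: Array<{ event: string; data: string }> = [];
  for (const block of text.split("\n\n")) {
    let event = "message"; // default event type when no "event:" line is sent
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) frames.push({ event, data: dataLines.join("\n") });
  }
  return frames;
}

const wireText =
  toSseFrame({ status: "running", progress: 0.5, progressMessage: "Halfway" }) +
  toSseFrame({ status: "completed", progress: 1, progressMessage: null });

for (const frame of parseSseFrames(wireText)) {
  const payload = JSON.parse(frame.data) as ProgressPayload;
  console.log(frame.event, payload.status, payload.progress);
}
```

In a browser, the built-in EventSource API performs this parsing; the parser here only makes the format visible.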