Sorex - Architecture
System design: three-tier search, parallel build, formal verification. For wire format details, see Binary Format. For browser runtime, see Runtime.
Design Principles
Three ideas shape every technical decision:
- Precompute everything possible: Build-time work is free; query-time work is expensive. If you can compute it once, do it at index time.
- Compact binary format: Smaller indices load faster and cache better. Every byte in
.sorexfiles earns its place. - Proven correctness: Formal verification catches bugs that tests miss. The field hierarchy is mathematically proven, not just tested.
The goal: instant search in browsers without sacrificing accuracy or features. A ~150KB WASM bundle shouldn't feel like a compromise.
Visual Overview
Build and Runtime Flow
BUILD TIME RUNTIME (Browser)
────────── ─────────────────
JSON Documents .sorex file
│ │
v v
+-------------+ +-------------+
| sorex index | | WASM Module |
+-------------+ +-------------+
│ │
v v
.sorex file ───────────────────────▶ +-------------+
| Web Worker |◀── User Query
+-------------+
│
v
Streaming ResultsWeb Worker Integration
Search runs in a dedicated Web Worker to keep the main thread free. The UI stays responsive at 60fps even during complex fuzzy searches.
┌─────────────────┐ ┌─────────────────┐
│ Main Thread │ postMessage(q) │ Web Worker │
│ │ ──────────────────▶│ │
│ UI Component │ │ SorexSearcher │
│ │◀── T1: exact ──────│ │
│ │◀── T2: prefix ─────│ │
│ │◀── T3: fuzzy ──────│ │
└─────────────────┘ └─────────────────┘Why Web Workers?
- Non-blocking: Heavy computation doesn't freeze the UI
- Streaming: Results arrive progressively as each tier completes
- Isolation: WASM memory is sandboxed in the worker
System Overview
BUILD TIME RUNTIME (WASM)
-------------- --------------
JSON Payload .sorex binary
| |
v v
+---------------------------+ +----------------------------+
| Index Construction | | Search Execution |
| | | |
| 1. Tokenize + normalize | | 1. Parse query |
| 2. Build inverted index | | 2. Exact match O(1) |
| 3. Build vocab SA | | 3. Prefix match O(log k) |
| 4. Precompute Lev DFA | | 4. Fuzzy match (DFA) |
| 5. Encode to binary | | 5. Score + rank |
| | | 6. Return results |
+---------------------------+ +----------------------------+
| |
v v
.sorex file SearchResult[]
(~15% overhead with section_ids
vs raw text) for deep linkingParallel Build (MapReduce)
The sorex index CLI uses a MapReduce-style architecture for maximum throughput on multi-core machines:
INPUT MAP PHASE REDUCE PHASE
----- --------- ------------
manifest.json
|
+-----+-----+-----+
v v v v
+-------+-------+-------+---------+
| 0.json| 1.json| 2.json| N.json | Document files
+---+---+---+---+---+---+----+----+
| | | |
v v v v
+---------------------------------------------------------------+
| PHASE 1: Parallel Document Loading (Rayon par_iter) |
| |
| Thread 1 Thread 2 Thread 3 Thread N |
| -------- -------- -------- -------- |
| read JSON read JSON read JSON read JSON |
| parse parse parse parse |
| validate validate validate validate |
+---------------------------------------------------------------+
|
v
Vec<Document>
(sorted by ID)
|
v
+-----------------------------------------------------------------+
| PHASE 2: Index Construction |
| |
| +-----------------------------------------------------+ |
| | Shared: Arc<ParametricDFA> (built once, ~1.2KB) | |
| +-----------------------------------------------------+ |
| |
| All Documents |
| ------------- |
| | |
| v |
| build_fst_index() <- Verified suffix array construction |
| | |
| v |
| BinaryLayer::build <- Encode with embedded WASM |
+-----------------------------------------------------------------+
|
v
BuiltIndex
|
+-----------------------------------------------------------------+
| PHASE 3: Sequential Output (single thread) |
| |
| write index.sorex <- Index + embedded WASM |
| write sorex.js <- Extracts WASM from .sorex |
| |
| if --demo: |
| write demo.html <- Demo page |
+-----------------------------------------------------------------+
|
v
OUTPUT
------
output/
+-- index.sorex <- Self-contained (index + WASM)
+-- sorex.js <- JS loader (embedded in binary)
+-- demo.html <- (if --demo, embedded in binary)No external dependencies: The sorex binary embeds all output files at compile time:
- WASM runtime (~150KB)
- JavaScript loader
- Demo HTML template
Running sorex index extracts these embedded files to the output directory.
Why This Architecture
| Phase | Parallelization | Bottleneck |
|---|---|---|
| Document loading | Per-file | I/O bound (SSD throughput) |
| Index building | Single index | CPU bound (suffix array construction) |
| Output writing | Sequential | I/O bound (negligible) |
The Levenshtein DFA is built once (~1.2KB precomputed automaton with ~70 states). The index uses the verified build_fst_index() function that guarantees suffix array sortedness and index well-formedness.
Search Algorithms
Three-Tier Query Resolution
All three tiers execute in parallel. Results stream progressively:
Query: "auth"
|
+--------------------------+--------------------------+
| | |
v v v
+-------------------+ +-------------------+ +-------------------+
| T1: Exact | | T2: Prefix | | T3: Fuzzy |
| O(1) | | O(log k) | | O(vocabulary) |
| | | | | |
| Hash lookup in | | Binary search | | Levenshtein DFA |
| inverted index | | suffix array | | traversal |
| | | | | |
| "auth" → postings | | "auth*" matches: | | Edit distance ≤2: |
| | | authenticate | | auto (d=1) |
| | | authentication | | author (d=2) |
| | | author | | |
+-------------------+ +-------------------+ +-------------------+
| | |
| ≈5μs | ≈10μs | ≈200μs
| | |
+--------------------------+--------------------------+
|
v
Streaming Results
(arrive as each tier completes)Vocabulary Suffix Array
Rather than a suffix array over the full text (expensive), Sorex uses a suffix array over the vocabulary, the unique terms:
Vocabulary: ["apple", "application", "apply", "banana"]
Suffix Array entries:
(0, 0) -> "apple" Points to suffix of vocabulary[0]
(0, 1) -> "pple"
(0, 2) -> "ple"
(1, 5) -> "ation" Points to suffix of vocabulary[1]
(1, 0) -> "application"
...
Sorted lexicographically by the suffix string.
Binary search finds all terms with a given prefix.For a 100KB blog with 10K unique words, this is ~50K suffix entries (vs ~500K for full text). The vocabulary is typically 10-20% of the full text size.
Levenshtein DFA (Schulz-Mihov 2002)
Traditional fuzzy search computes edit distance per-term at query time: O(query_len x term_len x vocabulary_size). Sorex precomputes a universal DFA at index time:
Parametric DFA Structure:
~70 states for k=2
8 transitions per state (2^(k+1) character classes)
~1.2KB total
At query time:
1. Compute character class for each input character:
bit i = 1 if input matches query[position + i]
2. Follow DFA transitions (single array lookup per character)
3. Accept state indicates edit distance <= k
Result: O(term_len) per term, no distance computationThe DFA is query-independent. The same precomputed structure works for any query. Only character class computation depends on the actual query string.
Scoring and Ranking
Results are ranked by a combination of field type and position:
Field Type Base Scores (Lean-verified):
Title = 100.0
Heading = 10.0
Content = 1.0
Position Boost (within field):
Earlier matches get up to +0.5 bonus
MaxBoost = 0.5
Field Type Dominance (mathematically proven):
Title - MaxBoost > Heading + MaxBoost
-> 99.5 > 10.5 ✓
Heading - MaxBoost > Content + MaxBoost
-> 9.5 > 1.5 ✓
This guarantees: ANY title match outranks ANY heading match,
regardless of position. The hierarchy is absolute, not heuristic.
Fuzzy Match Penalty (T3):
penalty = 1.0 / (1.0 + distance)
distance=1: 0.5 (50% of base score)
distance=2: 0.33 (33% of base score)
Always positive, monotonically decreasing with distance.Index Types
Sorex supports multiple index configurations:
Union Index (Default)
Separate indices for titles, headings, and content:
UnionIndex
+-- docs: Vec<SearchDoc> <- Shared metadata
+-- titles: Option<HybridIndex> <- Title text only
+-- headings: Option<HybridIndex> <- Heading text only
+-- content: Option<HybridIndex> <- Body text onlyBenefits:
- Faster search: smaller indices = fewer comparisons
- Source attribution: results indicate where match was found
- Early termination: can stop after finding title matches
Hybrid Index
Each sub-index combines inverted + suffix array:
HybridIndex
+-- inverted_index: HashMap<String, PostingList> <- O(1) exact
+-- vocabulary: Vec<String> <- Sorted terms
+-- vocab_suffix_array: Vec<(term_idx, offset)> <- O(log k) prefix
+-- docs, texts, field_boundaries <- MetadataDeep Linking (Section IDs)
Search results include section IDs for navigation to specific headings:
Search result for "optimization" in document "/posts/rust-search":
{
href: "/posts/rust-search",
section_id: "performance-optimization", <- From FieldBoundary
source: "heading"
}
Frontend builds URL: /posts/rust-search#performance-optimizationSection IDs are stored in a deduplicated string table and referenced by index in postings:
Section Table: ["introduction", "setup", "performance-optimization"]
Posting entry:
doc_id: 5
section_idx: 3 <- Points to "performance-optimization"
Resolution:
section_idx 0 = None (title, no anchor)
section_idx N = section_table[N-1]Formal Verification
Three-Layer Defense
+-------------------------------------------------------------+
| LAYER 1: Lean Proofs |
| Mathematical specifications in lean/SearchVerified/ |
| 5 proven theorems, 18 axioms |
| If Lean builds, the spec is internally consistent |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| LAYER 2: Type-Level Wrappers |
| ValidatedSuffixEntry, SortedSuffixArray, WellFormedIndex |
| Compile-time enforcement via newtype pattern |
| Can't create invalid data without explicit unsafe |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| LAYER 3: Runtime Contracts |
| debug_assert! checks in contracts.rs |
| Zero cost in release builds |
| Catch invariant violations during development |
+-------------------------------------------------------------+Key Proven Properties
| Property | Lean File | Rust Enforcement |
|---|---|---|
| Title beats Heading | Scoring.lean | check_field_hierarchy |
| Heading beats Content | Scoring.lean | check_field_hierarchy |
| Suffix array sorted | SuffixArray.lean | SortedSuffixArray |
| Suffix entries valid | Types.lean | ValidatedSuffixEntry |
| Binary search correct | BinarySearch.lean | Property tests |
| Edit distance bounds | Levenshtein.lean | Property tests |
WASM Compilation
The library compiles to WebAssembly with minimal dependencies:
wasm-pack build --target web --features wasmSize Optimization
| Technique | Impact |
|---|---|
opt-level = 's' |
Optimize for size, not speed |
lto = true |
Whole-program optimization |
panic = 'abort' |
No unwinding, smaller binary |
No serde_json in WASM |
Cuts ~20KB |
| Precomputed DFA | ~1.2KB vs runtime construction |
Output: ~150KB gzipped (WASM + JS loader).
JavaScript Loader
The sorex.js is generated from TypeScript modules in tools/:
tools/
+-- loader.ts # Loader template with SorexSearcher wrapper
+-- build.ts # Bundles to target/loader/Build with bun tools/build.ts. Output goes to target/loader/sorex.js (and .map), which is embedded in the Rust CLI via include_str!.
Performance Characteristics
Time Complexity
| Operation | Complexity | Speed | Notes |
|---|---|---|---|
| Index load | O(n) | ~50ms | Linear scan, one-time cost |
| Exact match | O(1) | ~5μs | Hash table lookup |
| Prefix match | O(log k + m) | ~10μs | k = vocabulary, m = matches |
| Fuzzy match | O(k x t) | ~200μs | k = vocabulary, t = avg term length |
| Posting intersection | O(min(n, m)) | ~5μs | Using skip lists for large lists |
Space Complexity
| Component | Size | Notes |
|---|---|---|
| Vocabulary | ~15% of text | Deduplicated, length-prefixed |
| Suffix array | ~4 bytes/suffix | Compressed term indices |
| Postings | ~2 bytes/posting | Block PFOR with deltas |
| Levenshtein DFA | ~1.2KB fixed | Independent of vocabulary |
| Skip lists | ~5% of large postings | Only for terms with >1024 docs |
Total index overhead: ~15-20% on top of document metadata.
Related Documentation
- Binary Format - .sorex v12 wire format specification
- Runtime - Browser execution model, threading, streaming
- Algorithms - Suffix arrays, Levenshtein automata, Block PFOR
- TypeScript API - WASM bindings for browser search
- CLI Reference - Building indexes with
sorex index - Benchmarks - Performance comparisons
- Verification - Formal verification guide