Formally Verified Search: From Suffix Arrays to Lean Proofs
1. The Problem with “Just Test It”
Client-side search is often dismissed as a toy. Common libraries like FlexSearch or Lunr.js work well for small corpora but hit performance walls with megabytes of text. More importantly, they’re tested but not verified—edge cases hide in the long tail.
We wanted something different: a search engine where key properties aren’t just tested, but proven.
Consider the scoring system. If a title match ranks below a content match, users will complain. Testing helps, but you can’t test every query. What if someone changes a constant and breaks the hierarchy? This happens more often than you’d think. A “quick fix” to scoring constants can silently break ranking invariants.
The solution is formal verification. We specify invariants in Lean 4, a theorem prover, and prove they hold mathematically. The Rust implementation mirrors these specs, with compile-time assertions and type-level invariants that make invalid states unrepresentable.
2. Suffix Arrays: The Data Structure
We use suffix arrays rather than suffix trees.[1] Both enable fast substring search, but suffix arrays are simpler and more cache-friendly.
2.1. What is a Suffix Array?
A suffix array is a sorted array of pointers to every suffix of a text. For the string “banana”:
| Index | Offset | Suffix |
|---|---|---|
| 0 | 5 | “a” |
| 1 | 3 | “ana” |
| 2 | 1 | “anana” |
| 3 | 0 | “banana” |
| 4 | 4 | “na” |
| 5 | 2 | “nana” |
The suffixes are sorted lexicographically. To find any pattern, we binary search this array.
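A toy version of this lookup, with naive construction rather than the crate's API, fits in a few lines and reproduces the table above:

```rust
// Sketch: naive suffix array for "banana" plus binary-search lookup.
fn build_suffix_array(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = (0..text.len()).collect();
    offsets.sort_by_key(|&o| &text[o..]); // sort suffixes lexicographically
    offsets
}

/// Find the leftmost sorted suffix that starts with `pattern`, if any.
fn find(text: &str, sa: &[usize], pattern: &str) -> Option<usize> {
    let i = sa.partition_point(|&o| &text[o..] < pattern);
    sa.get(i).copied().filter(|&o| text[o..].starts_with(pattern))
}

fn main() {
    let text = "banana";
    let sa = build_suffix_array(text);
    assert_eq!(sa, vec![5, 3, 1, 0, 4, 2]); // matches the offsets in the table
    assert_eq!(find(text, &sa, "ana"), Some(3));
    assert_eq!(find(text, &sa, "nan"), Some(2));
    assert_eq!(find(text, &sa, "xyz"), None);
}
```

`partition_point` performs the binary search; everything else is bookkeeping.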
2.2. Complexity Analysis
Building the suffix array naively requires sorting n suffixes, each of length up to n characters: O(n log n) comparisons at up to O(n) work each, or O(n² log n) overall.

We use a radix-based approach that constructs the array in O(n log n) time.[2] The algorithm exploits the structure of suffix comparison—prefixes of suffixes are themselves suffixes.

Querying is logarithmic in index size but linear in query length: O(m log n) for a pattern of length m over an index of n suffixes.

For a 5MB corpus with 100-character queries, this means ~20 string comparisons per search—imperceptible latency.
2.3. LCP Arrays
Alongside the suffix array, we build a Longest Common Prefix (LCP) array. For each position i, lcp[i] stores the length of the longest common prefix between consecutive sorted suffixes:

lcp[i] = commonPrefixLength(suffix_array[i-1], suffix_array[i])

For “banana”, the LCP array is:
| i | Suffix at i | Suffix at i-1 | LCP |
|---|---|---|---|
| 0 | “a” | — | 0 |
| 1 | “ana” | “a” | 1 |
| 2 | “anana” | “ana” | 3 |
| 3 | “banana” | “anana” | 0 |
| 4 | “na” | “banana” | 0 |
| 5 | “nana” | “na” | 2 |
The LCP array enables two optimizations:
- Early termination: During binary search, if we’ve already compared k characters and lcp[mid] ≥ k, we can skip re-comparing those characters
- Range queries: Finding all matches for a prefix becomes O(m + occ) instead of O(m log n + occ) using LCP-based range minimum queries
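The LCP construction itself is a single pass over adjacent sorted suffixes; a sketch with hypothetical helper names, checked against the “banana” table above:

```rust
// Sketch: LCP array from a text and its sorted suffix offsets.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

fn lcp_array(text: &str, sa: &[usize]) -> Vec<usize> {
    let mut lcp = vec![0; sa.len()];
    for i in 1..sa.len() {
        // Compare each sorted suffix with its predecessor.
        lcp[i] = common_prefix_len(&text[sa[i - 1]..], &text[sa[i]..]);
    }
    lcp
}

fn main() {
    let sa = [5, 3, 1, 0, 4, 2]; // sorted suffix offsets from section 2.1
    assert_eq!(lcp_array("banana", &sa), vec![0, 1, 3, 0, 0, 2]);
}
```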
2.4. Parallel Construction with Rayon
The naive sort is embarrassingly parallel—each suffix comparison is independent. We use Rayon[3] for data parallelism:
#[cfg(feature = "parallel")]
use rayon::prelude::*;
pub fn build_index(docs: Vec<SearchDoc>, texts: Vec<String>) -> SearchIndex {
// Phase 1: Generate suffix entries in parallel
let suffixes: Vec<SuffixEntry> = docs
.par_iter()
.flat_map_iter(|doc| {
texts.get(doc.id).into_iter().flat_map(move |text| {
(0..text.len()).map(move |offset| SuffixEntry {
doc_id: doc.id,
offset,
})
})
})
.collect();
// Phase 2: Parallel sort (the expensive part)
let mut suffixes = suffixes;
suffixes.par_sort_by(|a, b| {
let sa = texts[a.doc_id].get(a.offset..).unwrap_or("");
let sb = texts[b.doc_id].get(b.offset..).unwrap_or("");
sa.cmp(sb)
});
// Phase 3: Compute LCP in parallel
let mut lcp = vec![0; suffixes.len()];
lcp.par_iter_mut()
.enumerate()
.skip(1)
.for_each(|(i, slot)| {
*slot = common_prefix_len(
&texts[suffixes[i-1].doc_id][suffixes[i-1].offset..],
&texts[suffixes[i].doc_id][suffixes[i].offset..],
);
});
SearchIndex { docs, texts, suffix_array: suffixes, lcp }
}

Rayon's `par_sort_by` uses a parallel merge sort that maintains stability while scaling across cores. On an 8-core M1, this cuts build time from 215ms to ~35ms for 100 posts. The key insight: parallelizing the sort provides the biggest win because it dominates build time. Suffix generation and LCP computation are linear and parallelize naturally, but the sort’s O(n log n) comparisons—each taking O(L) time for L-character suffixes—are where Rayon shines.
Note that we use flat_map_iter rather than flat_map to avoid excessive parallel overhead for the inner loop. Each document’s suffixes are generated sequentially; parallelism happens across documents.
3. The Verification Stack
Here’s the layered verification architecture:
flowchart TB
subgraph Lean["Lean 4 Specifications"]
L1[Types.lean<br/>Core data structures]
L2[Scoring.lean<br/>Field hierarchy proofs]
L3[SuffixArray.lean<br/>Sortedness invariants]
L4[BinarySearch.lean<br/>Search correctness]
L5[InvertedIndex.lean<br/>Posting list properties]
end
subgraph Rust["Rust Implementation"]
R1[types.rs]
R2[scoring.rs]
R3[index.rs]
R4[search.rs]
R5[inverted.rs]
end
subgraph Contracts["Verification Layers"]
C1[Type-level invariants<br/>Compile-time guarantees]
C2[const assertions<br/>Build-time checks]
C3[debug_assert!<br/>Runtime contracts]
C4[proptest<br/>Property tests]
end
L1 --> R1
L2 --> R2
L3 --> R3
L4 --> R4
L5 --> R5
R1 --> C1
R2 --> C2
R3 --> C3
R4 --> C4

Each layer catches different classes of bugs:
- Lean proofs: Mathematical certainty for critical invariants
- Type-level wrappers: Compile-time prevention of invalid states
- Const assertions: Build-time verification of constants
- Runtime contracts: Debug-mode invariant checks
- Property tests: Randomized exploration of edge cases
4. Lean 4 Specifications
Lean 4 is a theorem prover and programming language.[4] We use it to specify what our search index should do, then prove those properties hold.
4.1. Type Definitions
The Lean types mirror our Rust structs exactly:
structure SuffixEntry where
doc_id : Nat -- Which document
offset : Nat -- Position in document
deriving Repr, DecidableEq
def SuffixEntry.WellFormed (e : SuffixEntry) (texts : Array String) : Prop :=
e.doc_id < texts.size ∧
e.offset < texts[e.doc_id]!.length

A suffix entry is “well-formed” if its doc_id points to a valid document and its offset is within the document bounds. We use strict inequality (<) because suffix arrays index non-empty suffixes—an offset equal to the length would produce an empty string. This is a proposition—a statement that can be true or false.
4.2. The Scoring Hierarchy Proof
The most important invariant is field type dominance: title matches always outrank heading matches, which always outrank content matches, regardless of position bonuses.
def baseScore : FieldType → Nat
| .title => 1000
| .heading => 100
| .content => 10
def maxPositionBoost : Nat := 5
theorem title_beats_heading :
baseScore .title - maxPositionBoost > baseScore .heading + maxPositionBoost := by
native_decide -- 995 > 105 ✓
theorem heading_beats_content :
baseScore .heading - maxPositionBoost > baseScore .content + maxPositionBoost := by
native_decide -- 95 > 15 ✓

The native_decide tactic asks Lean to evaluate the arithmetic and verify the inequality holds. If someone changes the constants incorrectly, Lean won’t compile.
4.3. Aggregate Score Properties
We also prove properties about score aggregation:
def aggregateScores (scores : List Nat) : Nat :=
scores.foldl (· + ·) 0
theorem aggregateScores_eq_sum (scores : List Nat) :
aggregateScores scores = scores.sum := by
induction scores with
| nil => rfl
| cons x xs ih =>
simp only [List.foldl_cons, List.sum_cons]
have foldl_shift : ∀ init l,
List.foldl (· + ·) init l = init + List.foldl (· + ·) 0 l := by
intro init l
induction l generalizing init with
| nil => simp
| cons y ys ihy => simp [ihy]; omega
rw [foldl_shift]; omega

This theorem proves our aggregation function correctly sums scores. The induction tactic breaks the proof into base case (empty list) and inductive case (cons).
5. Rust Implementation
The Rust code mirrors the Lean specifications, with additional enforcement mechanisms.
5.1. Type-Level Invariants
We use wrapper types that encode invariants at the type level:
/// A validated suffix entry where bounds are guaranteed.
pub struct ValidatedSuffixEntry {
inner: SuffixEntry,
}
impl ValidatedSuffixEntry {
/// Create a validated entry. Returns Err if invariants violated.
pub fn new(entry: SuffixEntry, texts: &[String]) -> Result<Self, InvariantError> {
if entry.doc_id >= texts.len() {
return Err(InvariantError::InvalidDocId {
doc_id: entry.doc_id,
texts_len: texts.len(),
});
}
if entry.offset > texts[entry.doc_id].len() {
return Err(InvariantError::InvalidOffset {
doc_id: entry.doc_id,
offset: entry.offset,
text_len: texts[entry.doc_id].len(),
});
}
Ok(Self { inner: entry })
}
/// Get the suffix. Cannot panic—bounds were checked at construction.
pub fn suffix<'a>(&self, texts: &'a [String]) -> &'a str {
&texts[self.inner.doc_id][self.inner.offset..]
}
}

Once you have a ValidatedSuffixEntry, you cannot have an invalid pointer. The constructor is the only way to create one, and it enforces the invariant.
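Condensed into a self-contained sketch (error type simplified to String for brevity), the pattern looks like this in use:

```rust
// Simplified sketch of the constructor-enforced invariant pattern.
#[derive(Debug, Clone, Copy)]
struct SuffixEntry { doc_id: usize, offset: usize }

struct ValidatedSuffixEntry { inner: SuffixEntry }

impl ValidatedSuffixEntry {
    /// The only way to construct one: invariants checked up front.
    fn new(entry: SuffixEntry, texts: &[String]) -> Result<Self, String> {
        if entry.doc_id >= texts.len() {
            return Err(format!("doc_id {} out of range", entry.doc_id));
        }
        if entry.offset > texts[entry.doc_id].len() {
            return Err(format!("offset {} out of range", entry.offset));
        }
        Ok(Self { inner: entry })
    }
    /// Bounds were checked at construction, so indexing cannot panic.
    fn suffix<'a>(&self, texts: &'a [String]) -> &'a str {
        &texts[self.inner.doc_id][self.inner.offset..]
    }
}

fn main() {
    let texts = vec!["banana".to_string()];
    let ok = ValidatedSuffixEntry::new(SuffixEntry { doc_id: 0, offset: 2 }, &texts).unwrap();
    assert_eq!(ok.suffix(&texts), "nana");
    // Out-of-range doc_id is rejected at the constructor, not at use site.
    assert!(ValidatedSuffixEntry::new(SuffixEntry { doc_id: 9, offset: 0 }, &texts).is_err());
}
```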
5.2. Compile-Time Assertions
For constants that must satisfy mathematical relationships, we use const assertions:
/// Static assertion that field type dominance holds.
/// Evaluated at compile time—if it fails, the crate won't build.
const _: () = {
const TITLE: f64 = 100.0;
const HEADING: f64 = 10.0;
const CONTENT: f64 = 1.0;
const MAX_BOOST: f64 = 0.5;
// INVARIANT: title_beats_heading (from Scoring.lean)
const WORST_TITLE: f64 = TITLE - MAX_BOOST;
const BEST_HEADING: f64 = HEADING + MAX_BOOST;
assert!(WORST_TITLE > BEST_HEADING); // 99.5 > 10.5 ✓
// INVARIANT: heading_beats_content (from Scoring.lean)
const WORST_HEADING: f64 = HEADING - MAX_BOOST;
const BEST_CONTENT: f64 = CONTENT + MAX_BOOST;
assert!(WORST_HEADING > BEST_CONTENT); // 9.5 > 1.5 ✓
};

This code runs at compile time. If someone changes HEADING to 50.0, the build fails immediately with a clear error.
5.3. Runtime Contracts
For invariants that can’t be checked at compile time, we use debug assertions—a technique from Design by Contract:[6]
/// Check that the suffix array is sorted lexicographically.
#[inline]
pub fn check_suffix_array_sorted(texts: &[String], suffix_array: &[SuffixEntry]) {
for i in 1..suffix_array.len() {
let prev_suffix = &texts[suffix_array[i-1].doc_id][suffix_array[i-1].offset..];
let curr_suffix = &texts[suffix_array[i].doc_id][suffix_array[i].offset..];
debug_assert!(
prev_suffix <= curr_suffix,
"Contract violation: suffix array not sorted at {}: '{}' > '{}'",
i, prev_suffix, curr_suffix
);
}
}

These checks are compiled out in release builds but catch bugs during development.
6. The cargo xtask Workflow
We use the idiomatic Rust xtask pattern for verification commands:
cargo xtask verify # Full verification suite
cargo xtask test # Run all tests
cargo xtask lean # Build Lean proofs
cargo xtask check # Quick check (no Lean)

The verify command runs a 5-step pipeline:
sequenceDiagram
participant V as verify
participant I as Invariants
participant T as Tests
participant C as Clippy
participant L as Lean
participant A as Alignment
V->>I: Check INVARIANT markers
I-->>V: ✓ 5+ markers found
V->>T: Run cargo test
T-->>V: ✓ All tests pass
V->>C: Run clippy
C-->>V: ✓ No warnings
V->>L: Run lake build
L-->>V: ✓ Proofs compile
V->>A: Compare Rust/Lean constants
A-->>V: ✓ Constants aligned
V->>V: ✓ Safe to commit

The final step—constant alignment—ensures Rust and Lean use the same values. Lean uses 10× scaling (1000 instead of 100.0) to avoid float arithmetic, so we verify the relationship holds.
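A minimal sketch of that alignment check, with the constant pairs taken from the scoring sections above (the helper name is hypothetical):

```rust
// Sketch: Lean proves over 10x-scaled integers (1000/100/10, boost 5);
// Rust ships floats (100.0/10.0/1.0, 0.5). Verify the scaling relation.
fn aligned(lean_scaled: u32, rust_value: f64) -> bool {
    (lean_scaled as f64) == rust_value * 10.0
}

fn main() {
    let pairs: [(u32, f64); 4] = [
        (1000, 100.0), // title
        (100, 10.0),   // heading
        (10, 1.0),     // content
        (5, 0.5),      // max position boost
    ];
    assert!(pairs.iter().all(|&(lean, rust)| aligned(lean, rust)));
}
```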
7. Property-Based Testing
For properties that are axiomatized in Lean (not fully proven), we use proptest[7] to gain empirical confidence:
proptest! {
#[test]
fn suffix_array_is_sorted(texts in prop::collection::vec("[a-z]{1,20}", 1..5)) {
let docs = texts.iter().enumerate()
.map(|(i, _)| make_doc(i))
.collect();
let index = build_index(docs, texts.clone(), vec![]);
// Verify sortedness
for i in 1..index.suffix_array.len() {
let prev = suffix_at(&index.texts, &index.suffix_array[i-1]);
let curr = suffix_at(&index.texts, &index.suffix_array[i]);
prop_assert!(prev <= curr);
}
}
#[test]
fn binary_search_finds_all_matches(
text in "[a-z]{10,50}",
pattern in "[a-z]{1,5}"
) {
let index = build_single_doc_index(&text);
let results = search(&index, &pattern);
// Every occurrence should be found
for (i, _) in text.match_indices(&pattern) {
prop_assert!(
results.iter().any(|r| r.offset == i),
"Pattern '{}' at offset {} not found", pattern, i
);
}
}
}

These tests run thousands of random inputs, exploring edge cases that manual tests miss.
8. Hybrid Search: Inverted Index for Large Corpora
Suffix arrays excel at substring and fuzzy matching, but for large blogs (500+ posts), exact word lookups benefit from an inverted index—the classic information retrieval data structure.
8.1. The Scaling Problem
Suffix arrays have O(log n) query time, which is excellent. But for exact word matches, an inverted index provides O(1) lookup. With 1000 documents, the difference is roughly log₂(1000) ≈ 10 string comparisons per lookup versus a single hash probe.
More importantly, inverted indexes support efficient boolean queries (AND, OR) via posting list intersection.
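Posting list intersection itself is a linear merge of sorted doc-id lists; a sketch with made-up doc ids:

```rust
// Sketch: AND-query via linear merge of two doc-id lists sorted ascending.
fn intersect(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]); // doc appears in both lists
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    // Docs containing "rust" AND "search" (hypothetical ids).
    assert_eq!(intersect(&[1, 4, 7, 9], &[2, 4, 9, 12]), vec![4, 9]);
}
```

The merge is O(|a| + |b|), which is why sortedness of posting lists matters.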
8.2. Build-Time Mode Selection
Rather than always building both indexes, we select the optimal mode at build time based on content characteristics:
pub fn select_index_mode(
doc_count: usize,
total_text_bytes: usize,
needs_prefix_matching: bool,
needs_fuzzy_matching: bool,
thresholds: &IndexThresholds,
) -> IndexMode {
// Small content: suffix array is fine
if doc_count <= thresholds.suffix_only_max_docs
&& total_text_bytes <= thresholds.suffix_only_max_bytes
{
return IndexMode::SuffixArrayOnly;
}
// Large content but needs prefix/fuzzy: must have suffix array
if needs_prefix_matching || needs_fuzzy_matching {
if doc_count >= thresholds.inverted_only_min_docs {
return IndexMode::Hybrid;
}
return IndexMode::SuffixArrayOnly;
}
// Large content, exact matches only: inverted index
if doc_count >= thresholds.inverted_only_min_docs {
return IndexMode::InvertedIndexOnly;
}
IndexMode::SuffixArrayOnly
} The decision matrix:
| Docs | Text Size | Query Types | Mode |
|---|---|---|---|
| <50 | <100KB | Any | SuffixArrayOnly |
| 50-500 | Any | Exact only | SuffixArrayOnly |
| 50-500 | Any | Prefix/fuzzy | SuffixArrayOnly |
| 500+ | Any | Exact only | InvertedIndexOnly |
| 500+ | Any | Prefix/fuzzy | Hybrid |
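The matrix can be exercised against a condensed version of select_index_mode, with assumed thresholds (50 docs / 100 KB for suffix-only, 500 docs for the inverted index):

```rust
// Condensed sketch of the mode selector with assumed threshold values.
#[derive(Debug, PartialEq)]
enum IndexMode { SuffixArrayOnly, InvertedIndexOnly, Hybrid }

fn select(docs: usize, bytes: usize, prefix_or_fuzzy: bool) -> IndexMode {
    if docs <= 50 && bytes <= 100_000 {
        return IndexMode::SuffixArrayOnly; // small content: suffix array is fine
    }
    if prefix_or_fuzzy {
        if docs >= 500 { return IndexMode::Hybrid; }
        return IndexMode::SuffixArrayOnly;
    }
    if docs >= 500 { return IndexMode::InvertedIndexOnly; }
    IndexMode::SuffixArrayOnly
}

fn main() {
    assert_eq!(select(30, 50_000, true), IndexMode::SuffixArrayOnly);
    assert_eq!(select(200, 2_000_000, false), IndexMode::SuffixArrayOnly);
    assert_eq!(select(800, 8_000_000, false), IndexMode::InvertedIndexOnly);
    assert_eq!(select(800, 8_000_000, true), IndexMode::Hybrid);
}
```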
8.3. Inverted Index Structure
The inverted index maps terms to posting lists:
pub struct Posting {
pub doc_id: usize,
pub offset: usize,
pub field_type: FieldType,
}
pub struct PostingList {
pub postings: Vec<Posting>, // Sorted by (doc_id, offset)
pub doc_freq: usize, // Count of unique doc_ids
}
pub struct InvertedIndex {
pub terms: HashMap<String, PostingList>,
pub total_docs: usize,
}

Posting lists are sorted by (doc_id, offset) to enable efficient intersection. This invariant is verified by property tests.

8.4. Lean Specification
The inverted index specifications in InvertedIndex.lean define key properties:
/-- A posting is well-formed if it points to a valid location -/
def Posting.WellFormed (p : Posting) (texts : Array String) (term_len : Nat) : Prop :=
p.doc_id < texts.size ∧
p.offset + term_len ≤ (texts[p.doc_id]!).length
/-- The posting list is sorted by (doc_id, offset) -/
def PostingList.Sorted (pl : PostingList) : Prop :=
∀ i j : Nat, (hi : i < pl.postings.size) → (hj : j < pl.postings.size) → i < j →
Posting.le pl.postings[i] pl.postings[j]
/-- Document frequency matches unique doc_ids -/
def PostingList.DocFreqCorrect (pl : PostingList) : Prop :=
pl.doc_freq = (pl.postings.map (·.doc_id)).toList.eraseDups.length

The hybrid index consistency axiom ensures both indexes agree on results:
/-- Both indexes agree on document results for exact matches -/
axiom hybrid_search_consistent
(idx : HybridIndex)
(h : HybridIndex.WellFormed idx)
(word : String)
(doc_id : Nat) :
(∃ pl, idx.inverted_index.lookup word = some pl ∧ doc_id ∈ pl.docIds) →
∃ entry : SuffixEntry, entry ∈ idx.suffix_index.suffix_array.toList ∧
entry.doc_id = doc_id ∧ ...

8.5. Unified Search
The unified index dispatches queries to the appropriate data structure:
pub fn search_unified(index: &UnifiedIndex, term: &str) -> Vec<SearchDoc> {
match index.mode {
IndexMode::SuffixArrayOnly => {
// O(log n) prefix matching with fuzzy fallback
search_suffix_array(index, term)
}
IndexMode::InvertedIndexOnly => {
// O(1) exact word lookup
search_inverted(index, term)
}
IndexMode::Hybrid => {
// Try inverted index first, fall back to suffix array
let results = search_inverted(index, term);
if results.is_empty() {
search_suffix_array(index, term)
} else {
results
}
}
}
}

This gives us the best of both worlds: fast exact matches for common queries, with substring and fuzzy search as fallback.
8.6. Performance Comparison
We benchmarked all index modes using Criterion with 200 samples and 99% confidence intervals.
(Charts omitted: build time and query latency across the index modes.)
The inverted index wins decisively for exact word matching:
- 3.7× faster for single word lookups (O(1) hash vs O(log n) binary search)
- 4.9× faster for multi-word AND queries (posting list intersection vs repeated binary search)
- Suffix array uniquely supports prefix and fuzzy matching (inverted index cannot do these)
Scaling to Large Corpora
At 500 posts with ~1500 words each (750K words total), the inverted index maintains sub-millisecond query times (175 µs for multi-word, 114 µs for rare terms). The suffix array would require >10 seconds to build at this scale.
8.7. True Hybrid: Suffix Array over Vocabulary
The previous “hybrid” mode simply falls back from inverted index to suffix array. But there’s an elegant alternative: build a suffix array over just the vocabulary keys.
This approach combines the FracturedJson insight (adapt structure to content characteristics) with classic information retrieval. The key observation: for prefix search like “prog” → “programming”, we don’t need a suffix array over the full text (800K characters). We only need to search among unique terms (vocabulary).
pub struct HybridIndex {
pub inverted_index: InvertedIndex, // term → posting list
pub vocabulary: Vec<String>, // sorted unique terms
pub vocab_suffix_array: Vec<VocabSuffixEntry>, // suffix array over vocabulary
}

Space savings are dramatic:
For 100 posts (~100K words), the full text is 797K characters requiring ~800K suffix entries. The vocabulary has only 114 unique terms (755 suffix entries). That’s a 1000× reduction.
Search strategy:
- Exact word: O(1) hash lookup in inverted index
- Prefix search: O(log k) binary search on vocabulary suffix array → find matching terms → union their posting lists
- Multi-word AND: Intersect posting lists for each matched term
fn score_term(index: &HybridIndex, term: &str) -> Vec<(usize, f64)> {
// Try exact match first (O(1))
if let Some(posting_list) = index.inverted_index.terms.get(term) {
return score_posting_list(posting_list);
}
// Fall back to prefix match via vocabulary suffix array (O(log k))
let matching_terms = find_prefix_matches(index, term);
if matching_terms.is_empty() {
return Vec::new();
}
// Union posting lists for all matching terms
matching_terms.iter()
.filter_map(|t| index.inverted_index.terms.get(t))
.flat_map(|pl| score_posting_list(pl))
.collect()
}

This true hybrid gives us:
- Build time: Same as inverted index (14× faster than full suffix array)
- Exact word query: O(1) via hash lookup
- Substring search: O(log k) on vocabulary—“script” finds “typescript”, “javascript”
- Space: Vocabulary size + inverted index (not full text × 2)
The suffix array over vocabulary still enables substring search within words. Searching “script” finds all terms containing that substring: “typescript”, “javascript”, “scripting”. What we lose is cross-word substrings—“java script” (two words) won’t match a search for “ript”. For blog search, this is fine.
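Vocabulary-level substring matching can be sketched directly (naive construction, hypothetical helper names):

```rust
// Sketch: a suffix array over vocabulary terms, so a substring query like
// "script" matches every term containing it (within a single word).
fn vocab_suffix_entries(vocab: &[&str]) -> Vec<(usize, usize)> {
    let mut entries: Vec<(usize, usize)> = vocab
        .iter()
        .enumerate()
        .flat_map(|(t, term)| (0..term.len()).map(move |o| (t, o)))
        .collect();
    entries.sort_by_key(|&(t, o)| &vocab[t][o..]); // sort by suffix string
    entries
}

fn terms_containing<'a>(vocab: &[&'a str], entries: &[(usize, usize)], q: &str) -> Vec<&'a str> {
    // All suffixes starting with q form a contiguous run in sorted order.
    let start = entries.partition_point(|&(t, o)| &vocab[t][o..] < q);
    let mut hits: Vec<&str> = entries[start..]
        .iter()
        .take_while(|&&(t, o)| vocab[t][o..].starts_with(q))
        .map(|&(t, _)| vocab[t])
        .collect();
    hits.sort();
    hits.dedup();
    hits
}

fn main() {
    let vocab = ["javascript", "rust", "scripting", "typescript"];
    let entries = vocab_suffix_entries(&vocab);
    assert_eq!(
        terms_containing(&vocab, &entries, "script"),
        vec!["javascript", "scripting", "typescript"]
    );
}
```

Matched terms would then be expanded through the inverted index's posting lists.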
9. Zero-CPU Fuzzy Search: FST + Levenshtein Automata
The final piece is making fuzzy search fast. Traditional approaches iterate over the vocabulary computing Levenshtein distance for each term—O(|V| · m · ℓ) character operations per query, for |V| vocabulary terms of average length ℓ and query length m. With 114 terms, that’s ~11,400 operations per keystroke.
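The naive baseline, a full dynamic-programming Levenshtein pass per vocabulary term, can be sketched as follows (illustrative only, not the shipped code):

```rust
// Sketch of the naive baseline the DFA approach replaces: a full DP
// Levenshtein computation for every vocabulary term on every keystroke.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb); // substitution
            cur.push(sub.min(prev[j + 1] + 1).min(cur[j] + 1)); // deletion/insertion
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    let vocab = ["rust", "search", "suffix"];
    // Every query scans the whole vocabulary: O(|V| * m * l) character steps.
    let hits: Vec<&str> = vocab
        .iter()
        .filter(|t| levenshtein("rast", t) <= 1)
        .copied()
        .collect();
    assert_eq!(hits, vec!["rust"]);
}
```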
9.1. The Schulz-Mihov Insight
The key insight from Schulz & Mihov (2002)[9] is that Levenshtein automaton structure is query-independent. The transitions depend only on:
- The length of the query
- The maximum edit distance (k)
We can precompute parametric transition tables at build time that work for ANY query string. At search time, building a query-specific DFA is pure table lookups—~1μs instead of computing actual Levenshtein distances.
/// Precomputed Levenshtein DFA (query-independent)
pub struct ParametricDFA {
states: Vec<DfaState>, // ~70 states for k=2
transitions: Vec<[u16; 8]>, // 8 char classes per state
accept: Vec<u8>, // distance if accepting
}
/// Query-specific matcher (built in ~1μs via table lookups)
pub struct QueryMatcher<'a> {
dfa: &'a ParametricDFA,
query: &'a str,
}
impl<'a> QueryMatcher<'a> {
/// Check if a term matches within edit distance k
/// Returns distance if accepting, None otherwise
pub fn matches(&self, term: &str) -> Option<u8> {
let mut state = 0u16;
for c in term.chars() {
// Compute character class: which query positions match c
let char_class = self.compute_char_class(c);
// Pure table lookup - no Levenshtein computation!
state = self.dfa.transitions[state as usize][char_class];
if state == DEAD_STATE {
return None;
}
}
// Check if final state is accepting
let distance = self.dfa.accept[state as usize];
if distance != NOT_ACCEPTING {
Some(distance)
} else {
None
}
}
}

9.2. FST for Vocabulary Compression
We pair the Levenshtein automaton with an FST (Finite State Transducer) for vocabulary storage.[10] An FST is a compressed trie with suffix sharing:
use fst::{Set, SetBuilder};
/// Build FST from sorted vocabulary
pub fn build_fst_from_vocabulary(vocabulary: &[String]) -> Vec<u8> {
let mut builder = SetBuilder::memory();
for term in vocabulary {
builder.insert(term).expect("vocabulary must be sorted");
}
builder.into_inner().unwrap()
}

For 114 terms, the FST is ~500 bytes—10x smaller than raw vocabulary. More importantly, it enables efficient fuzzy traversal: we walk the FST while simulating the Levenshtein DFA, pruning branches that exceed the edit distance threshold.
9.3. Three-Tier Search Strategy
The final WASM searcher combines all techniques:
#[wasm_bindgen]
impl SiftSearcher {
pub fn search(&self, query: &str) -> Vec<SearchDoc> {
// Tier 1: Exact match - O(1) inverted index lookup
if let Some(docs) = self.exact_lookup(query) {
return docs;
}
// Tier 2: Prefix match - O(log k) vocabulary suffix array
let prefix_matches = self.prefix_search(query);
if !prefix_matches.is_empty() {
return prefix_matches;
}
// Tier 3: Fuzzy match - O(FST edges), zero Levenshtein computation
self.fuzzy_search(query, 2) // k=2 edit distance
}
}

Performance comparison:
| Operation | Before (naive) | After (FST + DFA) |
|---|---|---|
| DFA build | - | ~1μs (table lookups) |
| Fuzzy per term | ~100μs (Levenshtein) | ~8ns (DFA transition) |
| Full fuzzy query | ~10ms | ~0.1ms |
100x faster fuzzy search, with zero client CPU for Levenshtein computation.
10. Progressive Layer Loading
For the best user experience, we load search data incrementally. Users see results immediately from title matches while heavier content layers load in the background.
10.1. Layer Architecture
flowchart LR
subgraph Client["Browser"]
M[Manifest<br/>~500 bytes] --> T[Titles Layer<br/>~5KB]
T --> H[Headings Layer<br/>~20KB]
H --> C[Content Layer<br/>~200KB]
end
subgraph Search["Search Quality"]
T --> |"Instant"| R1[First Results]
H --> |"+50ms"| R2[Expanded Coverage]
C --> |"+200ms"| R3[Full Search]
end Each layer is a complete HybridIndex with its own vocabulary suffix array and inverted index:
#[wasm_bindgen]
pub struct SiftProgressiveIndex {
docs: Vec<SearchDoc>, // Always present
titles: Option<HybridIndex>, // Loaded first (~5KB)
headings: Option<HybridIndex>, // Loaded second (~20KB)
content: Option<HybridIndex>, // Loaded last (~200KB)
}

10.2. Streaming Search API
The two-phase search API enables showing results before all layers load:
impl SiftProgressiveIndex {
/// Phase 1: O(1) exact word matches from inverted index
pub fn search_exact(&self, query: &str) -> Vec<SearchResult> {
// Returns immediately with exact matches
// ~0.1ms latency
}
/// Phase 2: O(log k) prefix/fuzzy matches via suffix array
pub fn search_expanded(&self, query: &str, exclude_ids: &[usize]) -> Vec<SearchResult> {
// Returns additional matches not found in Phase 1
// Excludes already-found docs to avoid duplicates
}
}

TypeScript integration:
// Show results as they arrive
async function* streamingSearch(query: string) {
// Phase 1: Instant exact matches
const exactResults = index.search_exact(query);
yield exactResults;
// Phase 2: Expanded matches (prefix, fuzzy)
const exactIds = exactResults.map((r) => r.doc.id);
const expandedResults = index.search_expanded(query, JSON.stringify(exactIds));
yield expandedResults;
} 10.3. Layer Loading Strategy
// Load titles first for instant search
const titlesResponse = await fetch('/cdn/search/titles.sift');
index.load_layer_binary('titles', await titlesResponse.arrayBuffer());
// Load remaining layers in parallel
const [headings, content] = await Promise.all([
fetch('/cdn/search/headings.sift'),
fetch('/cdn/search/content.sift')
]);
index.load_layer_binary('headings', await headings.arrayBuffer());
index.load_layer_binary('content', await content.arrayBuffer());

Time to first result: ~50ms (titles layer only)
Time to full search: ~300ms (all layers)
11. Binary Format
The .sift binary format is 5-7x smaller than JSON and loads ~3-5x faster.
11.1. Format Layout
┌────────────────────────────────────────────────────────────┐
│ HEADER (32 bytes) │
│ magic: "SIFT" | version: u8 | flags: u8 │
│ doc_count | term_count | fst_len | sa_len | postings_len│
├────────────────────────────────────────────────────────────┤
│ FST BYTES (vocabulary as finite state transducer) │
│ ~500 bytes for typical blog vocabulary │
├────────────────────────────────────────────────────────────┤
│ VOCABULARY SUFFIX ARRAY (delta-encoded pairs) │
│ [term_idx, offset] pairs for prefix search │
├────────────────────────────────────────────────────────────┤
│ POSTINGS (Block PFOR: 128-doc blocks, bit-packed) │
│ Lucene-style variable-byte encoding │
├────────────────────────────────────────────────────────────┤
│ LEVENSHTEIN DFA (~1.2KB for k=2) │
│ Precomputed Schulz-Mihov parametric tables │
├────────────────────────────────────────────────────────────┤
│ FOOTER: CRC32 checksum + "TFIS" magic │
└────────────────────────────────────────────────────────────┘

11.2. Size Comparison
| Component | JSON | Binary |
|---|---|---|
| Vocabulary | ~3KB | ~500 bytes (FST) |
| Suffix array | ~20KB | ~8KB (delta + varint) |
| Postings | ~50KB | ~15KB (Block PFOR) |
| Levenshtein DFA | N/A | ~1.2KB |
| Total (uncompressed) | ~200KB | ~25KB |
| Brotli compressed | ~20KB | ~8KB |
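To make the header concrete, here is a hypothetical encode/decode round trip; the exact field order and padding are assumptions for illustration, not the actual .sift layout:

```rust
// Hypothetical sketch of the 32-byte header: "SIFT" magic, version, flags,
// then five u32 little-endian counts, zero-padded to 32 bytes.
fn encode_header(version: u8, flags: u8, counts: [u32; 5]) -> [u8; 32] {
    let mut h = [0u8; 32];
    h[..4].copy_from_slice(b"SIFT");
    h[4] = version;
    h[5] = flags;
    for (i, c) in counts.iter().enumerate() {
        h[6 + i * 4..10 + i * 4].copy_from_slice(&c.to_le_bytes());
    }
    h // bytes 26..32 remain zero padding
}

fn decode_counts(h: &[u8; 32]) -> Option<[u32; 5]> {
    if !h.starts_with(b"SIFT") {
        return None; // reject a bad magic
    }
    let mut counts = [0u32; 5];
    for i in 0..5 {
        counts[i] = u32::from_le_bytes(h[6 + i * 4..10 + i * 4].try_into().ok()?);
    }
    Some(counts)
}

fn main() {
    // doc_count, term_count, fst_len, sa_len, postings_len (example values)
    let counts = [100, 114, 500, 755, 15_000];
    let h = encode_header(1, 0, counts);
    assert_eq!(decode_counts(&h), Some(counts));
}
```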
12. Performance Benchmarks
We benchmark against popular search libraries using Criterion (Rust) and Tinybench (JavaScript) with 99% confidence intervals on a 100-post blog corpus.
12.1. Unified Comparison: Build Time
The suffix array’s naive sort makes build time its weakness. For 100 posts, that’s 10ms—2× slower than FlexSearch but 5× faster than lunr.js. The tradeoff is worth it for true substring search.
12.2. The Killer Features: Substring & Typo Tolerance
Substring search is why suffix arrays exist. When searching “script”, users expect to find TypeScript and JavaScript posts. But lunr.js, FlexSearch, and MiniSearch return zero results—they tokenize by words and only match prefixes. Fuse.js finds matches via fuzzy matching, but at 500ms per query—500× slower. Our suffix array finds all 100 matches in under 1 microsecond.
Typo tolerance handles real-world queries. Searching “ruts” should find “rust”. Our Levenshtein distance matching finds strings within k edits. For k=1, the suffix array finds 15 results while most inverted indexes find zero.
How edit distance works: For k=1, we search all “neighborhoods” of the query—“auts”, “buts”, … “rust”, “zutz”. The suffix array’s sorted structure makes this efficient, though at k=2 the search space explodes. We default to k=1 for typo tolerance.
12.3. Performance Tradeoffs: Speed vs Memory
Query latency: For standard word-based queries, FlexSearch leads. Our suffix array matches its speed while adding substring capability—the key differentiator.
Memory footprint reveals the tradeoff:
- fuse.js: No index, scans raw data—minimal memory but 750× slower queries
- MiniSearch: Compact inverted index—good balance of speed and memory
- FlexSearch/lunr.js: ~1MB indexes for fast tokenized search
- Suffix Array (WASM): ~6MB for 100 posts—indexes every character position
The suffix array’s O(n) memory cost is the price of true substring search. Inverted indexes only store tokenized words; suffix arrays index every position to enable queries like “async” matching within “asynchronous”.
12.4. Cold Start Analysis
WASM instantiation adds startup cost.
With instantiateStreaming, compilation overlaps with fetch. On 4G networks (~50ms fetch for 160KB), the index is ready before users focus the search input.
13. Verification Status
Current proof coverage:
| Component | Lean Status | Rust Status |
|---|---|---|
| Type definitions | ✓ Specified | ✓ Implemented |
| Field dominance | ✓ Proven | Compile-time asserted |
| Suffix sortedness | Axiom | Property tested |
| LCP correctness | Axiom | Unit tested |
| Binary search | Axiom | Property tested |
| Edit distance | Axiom | Unit tested |
| Posting well-formedness | ✓ Specified | Property tested |
| Posting list sortedness | ✓ Specified | Property tested |
| Doc freq correctness | ✓ Specified | Property tested |
| Hybrid consistency | Axiom | Property tested |
Proven theorems: 5 (including title_beats_heading, heading_beats_content, aggregateScores_eq_sum)
Axioms: 22 (empirically verified through testing)
Type-level invariants: 6 (ValidatedSuffixEntry, SortedSuffixArray, WellFormedIndex, ValidatedPosting, ValidatedPostingList, ValidatedInvertedIndex)
The axioms represent properties we believe true but haven’t formally proven—usually because they require advanced Mathlib tactics. Property tests give us high confidence while we work toward full proofs.
14. Conclusion
Formal verification isn’t just for aerospace and cryptography. For any system with critical invariants—search ranking, financial calculations, access control—the investment pays off.
The workflow we’ve built:
- Specify invariants in Lean
- Prove what we can mathematically
- Encode invariants in Rust types
- Assert at compile time where possible
- Check at runtime in debug builds
- Test properties exhaustively
This layered approach catches bugs at every stage. Type errors fail at compile time. Constant mistakes fail at build time. Invariant violations fail in debug mode. Edge cases fail in property tests. And the core ranking guarantee? That’s mathematically proven.
The next frontier is completing the Lean proofs—replacing axioms with theorems. But even partial verification is valuable. When someone asks “how do you know title matches rank higher?”, we don’t say “we tested it.” We say “we proved it.”
Bibliography
- Manber, U. & Myers, G. (1993). “Suffix Arrays: A New Method for On-Line String Searches”. SIAM Journal on Computing, 22(5), 935–948. The foundational paper introducing suffix arrays as an alternative to suffix trees with better cache locality.
- Kärkkäinen, J. & Sanders, P. (2003). “Simple Linear Work Suffix Array Construction”. Proceedings of ICALP 2003. Springer LNCS 2719. The DC3/skew algorithm that achieves linear-time construction.
- Rayon Contributors. (2024). “Rayon: A data parallelism library for Rust”. Rayon provides parallel iterators that make it trivial to convert sequential computations into parallel ones while maintaining Rust’s safety guarantees.
- de Moura, L. & Ullrich, S. (2021). “The Lean 4 Theorem Prover and Programming Language”. Proceedings of CADE-28. Springer LNCS 12699. Lean 4’s dependent type system enables both proving and programming.
- Minsky, Y. (2011). “Effective ML”. Jane Street Tech Blog. The principle of making illegal states unrepresentable using algebraic data types.
- Meyer, B. (1992). “Applying Design by Contract”. IEEE Computer, 25(10), 40–51. The foundational work on preconditions, postconditions, and invariants as contracts between callers and implementations.
- Claessen, K. & Hughes, J. (2000). “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. Proceedings of ICFP 2000. The paper that popularized property-based testing. Proptest is Rust’s implementation of these ideas.
- Schulz, K.U. & Mihov, S. (2002). “Fast String Correction with Levenshtein-Automata”. International Journal of Document Analysis and Recognition, 5(1), 67–85. The foundational paper on parametric Levenshtein automata enabling O(1) DFA construction.
- BurntSushi. (2024). “fst: Finite State Transducers”. A Rust library for fast ordered set/map operations using FSTs. Enables memory-efficient vocabulary storage with O(k) lookup time.