Formally Verified Search: From Suffix Arrays to Lean Proofs

tl;dr.
This post details building a formally verified search engine for a static site. We use suffix arrays for *O(m log n)* substring search, an inverted index for *O(1)* exact word lookup, and a hybrid mode that combines both. The implementation compiles Rust to WebAssembly for sub-millisecond queries, with correctness properties proven in Lean 4. The result: a search index where critical invariants are mathematically guaranteed, not just tested.

1. The Problem with “Just Test It”

Client-side search is often dismissed as a toy. Common libraries like FlexSearch or Lunr.js work well for small corpora but hit performance walls with megabytes of text. More importantly, they’re tested but not verified—edge cases hide in the long tail.

We wanted something different: a search engine where key properties aren’t just tested, but proven.

Consider the scoring system. If a title match ranks below a content match, users will complain. Testing helps, but you can’t test every query. What if someone changes a constant and breaks the hierarchy? This happens more often than you’d think. A “quick fix” to scoring constants can silently break ranking invariants.

The solution is formal verification. We specify invariants in Lean 4, a theorem prover, and prove they hold mathematically. The Rust implementation mirrors these specs, with compile-time assertions and type-level invariants that make invalid states unrepresentable.

2. Suffix Arrays: The Data Structure

We use suffix arrays rather than suffix trees.1 Both enable fast substring search, but suffix arrays are simpler and more cache-friendly.

2.1. What is a Suffix Array?

A suffix array is a sorted array of pointers to every suffix of a text. For the string “banana”:

| Index | Offset | Suffix   |
|-------|--------|----------|
| 0     | 5      | "a"      |
| 1     | 3      | "ana"    |
| 2     | 1      | "anana"  |
| 3     | 0      | "banana" |
| 4     | 4      | "na"     |
| 5     | 2      | "nana"   |

The suffixes are sorted lexicographically. To find any pattern, we binary search this array.
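A minimal standalone sketch of this lookup, using naive construction for a single string (the post's real index spans many documents):

```rust
/// Naive suffix array: sort all suffix offsets lexicographically.
fn suffix_array(text: &str) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by_key(|&i| &text[i..]);
    sa
}

/// Find every occurrence of `pattern` via binary search: locate the
/// first suffix >= pattern, then scan while suffixes start with it.
fn find_occurrences(text: &str, sa: &[usize], pattern: &str) -> Vec<usize> {
    let lo = sa.partition_point(|&i| &text[i..] < pattern);
    let mut occ: Vec<usize> = sa[lo..]
        .iter()
        .copied()
        .take_while(|&i| text[i..].starts_with(pattern))
        .collect();
    occ.sort_unstable();
    occ
}
```

For "banana", `suffix_array` yields `[5, 3, 1, 0, 4, 2]` (matching the table above), and searching "ana" returns offsets `[1, 3]`.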

2.2. Complexity Analysis

Building the suffix array naively requires sorting n suffixes, each of length up to n, so the naive build costs O(n² log n) in the worst case.

We use a radix-based approach that constructs the array in O(n log n) time.2 The algorithm exploits the structure of suffix comparison: prefixes of suffixes are themselves suffixes.

Querying is logarithmic in index size but linear in query length: O(m log n) for a pattern of length m against n indexed suffixes.

For a 5MB corpus with 100-character queries, this means ~20 string comparisons per search, which is imperceptible latency.

2.3. LCP Arrays

Alongside the suffix array, we build a Longest Common Prefix (LCP) array. For each position i, lcp[i] stores the length of the longest common prefix between consecutive sorted suffixes:

lcp[i] = commonPrefixLength(suffix_array[i-1], suffix_array[i])

For “banana”, the LCP array is:

| i | Suffix at i | Suffix at i−1 | LCP |
|---|-------------|---------------|-----|
| 0 | "a"         | (none)        | 0   |
| 1 | "ana"       | "a"           | 1   |
| 2 | "anana"     | "ana"         | 3   |
| 3 | "banana"    | "anana"       | 0   |
| 4 | "na"        | "banana"      | 0   |
| 5 | "nana"      | "na"          | 2   |

The LCP array enables two optimizations:

  1. Early termination: During binary search, if we’ve already compared k characters and lcp[mid] ≥ k, we can skip re-comparing those characters
  2. Range queries: Finding all matches for a prefix becomes O(m + occ) instead of O(m log n + occ) using LCP-based range minimum queries
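The LCP computation itself is simple. A standalone sketch over a naive single-string suffix array (byte-level comparison, ignoring multi-byte UTF-8 for brevity):

```rust
/// Length of the longest common prefix of two strings (byte-wise).
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

/// lcp[i] = LCP of the suffixes at sorted positions i-1 and i.
fn lcp_array(text: &str, sa: &[usize]) -> Vec<usize> {
    let mut lcp = vec![0; sa.len()];
    for i in 1..sa.len() {
        lcp[i] = common_prefix_len(&text[sa[i - 1]..], &text[sa[i]..]);
    }
    lcp
}
```

For "banana" with sorted suffix offsets `[5, 3, 1, 0, 4, 2]`, this yields `[0, 1, 3, 0, 0, 2]`, matching the table above.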

2.4. Parallel Construction with Rayon

The naive sort is embarrassingly parallel⁠—⁠each suffix comparison is independent. We use Rayon3 for data parallelism:

#[cfg(feature = "parallel")]
use rayon::prelude::*;

pub fn build_index(docs: Vec<SearchDoc>, texts: Vec<String>) -> SearchIndex {
    // Phase 1: Generate suffix entries in parallel
    let suffixes: Vec<SuffixEntry> = docs
        .par_iter()
        .flat_map_iter(|doc| {
            texts.get(doc.id).into_iter().flat_map(move |text| {
                (0..text.len()).map(move |offset| SuffixEntry {
                    doc_id: doc.id,
                    offset,
                })
            })
        })
        .collect();

    // Phase 2: Parallel sort (the expensive part)
    let mut suffixes = suffixes;
    suffixes.par_sort_by(|a, b| {
        let sa = texts[a.doc_id].get(a.offset..).unwrap_or("");
        let sb = texts[b.doc_id].get(b.offset..).unwrap_or("");
        sa.cmp(sb)
    });

    // Phase 3: Compute LCP in parallel
    let mut lcp = vec![0; suffixes.len()];
    lcp.par_iter_mut()
        .enumerate()
        .skip(1)
        .for_each(|(i, slot)| {
            *slot = common_prefix_len(
                &texts[suffixes[i-1].doc_id][suffixes[i-1].offset..],
                &texts[suffixes[i].doc_id][suffixes[i].offset..],
            );
        });

    SearchIndex { docs, texts, suffix_array: suffixes, lcp }
}

Rayon's `par_sort_by` uses a parallel merge sort that maintains stability while scaling across cores. On an 8-core M1, this cuts build time from 215ms to ~35ms for 100 posts.

The key insight: parallelizing the sort provides the biggest win because it dominates build time. Suffix generation and LCP computation are linear and parallelize naturally, but the sort's O(n log n) comparisons, each costing O(L) time for L-character suffixes, are where Rayon shines.

Note that we use flat_map_iter rather than flat_map to avoid excessive parallel overhead for the inner loop. Each document’s suffixes are generated sequentially; parallelism happens across documents.

3. The Verification Stack

Here’s the layered verification architecture:

flowchart TB
    subgraph Lean["Lean 4 Specifications"]
        L1[Types.lean<br/>Core data structures]
        L2[Scoring.lean<br/>Field hierarchy proofs]
        L3[SuffixArray.lean<br/>Sortedness invariants]
        L4[BinarySearch.lean<br/>Search correctness]
        L5[InvertedIndex.lean<br/>Posting list properties]
    end

    subgraph Rust["Rust Implementation"]
        R1[types.rs]
        R2[scoring.rs]
        R3[index.rs]
        R4[search.rs]
        R5[inverted.rs]
    end

    subgraph Contracts["Verification Layers"]
        C1[Type-level invariants<br/>Compile-time guarantees]
        C2[const assertions<br/>Build-time checks]
        C3[debug_assert!<br/>Runtime contracts]
        C4[proptest<br/>Property tests]
    end

    L1 --> R1
    L2 --> R2
    L3 --> R3
    L4 --> R4
    L5 --> R5

    R1 --> C1
    R2 --> C2
    R3 --> C3
    R4 --> C4

Each layer catches different classes of bugs:

  1. Lean proofs: Mathematical certainty for critical invariants
  2. Type-level wrappers: Compile-time prevention of invalid states
  3. Const assertions: Build-time verification of constants
  4. Runtime contracts: Debug-mode invariant checks
  5. Property tests: Randomized exploration of edge cases

4. Lean 4 Specifications

Lean 4 is a theorem prover and programming language.4 We use it to specify what our search index should do, then prove those properties hold.

4.1. Type Definitions

The Lean types mirror our Rust structs exactly:

structure SuffixEntry where
  doc_id : Nat    -- Which document
  offset : Nat    -- Position in document
  deriving Repr, DecidableEq

def SuffixEntry.WellFormed (e : SuffixEntry) (texts : Array String) : Prop :=
  e.doc_id < texts.size ∧
  e.offset < texts[e.doc_id]!.length

A suffix entry is “well-formed” if its doc_id points to a valid document and its offset is within the document bounds. We use strict inequality (<) because suffix arrays index non-empty suffixes⁠—⁠an offset equal to the length would produce an empty string. This is a proposition—a statement that can be true or false.

4.2. The Scoring Hierarchy Proof

The most important invariant is field type dominance: title matches always outrank heading matches, which always outrank content matches, regardless of position bonuses.

def baseScore : FieldType → Nat
  | .title   => 1000
  | .heading => 100
  | .content => 10

def maxPositionBoost : Nat := 5

theorem title_beats_heading :
    baseScore .title - maxPositionBoost > baseScore .heading + maxPositionBoost := by
  native_decide  -- 995 > 105

theorem heading_beats_content :
    baseScore .heading - maxPositionBoost > baseScore .content + maxPositionBoost := by
  native_decide  -- 95 > 15

The native_decide tactic asks Lean to evaluate the arithmetic and verify the inequality holds. If someone changes the constants incorrectly, Lean won’t compile.

4.3. Aggregate Score Properties

We also prove properties about score aggregation:

def aggregateScores (scores : List Nat) : Nat :=
  scores.foldl (· + ·) 0

theorem aggregateScores_eq_sum (scores : List Nat) :
    aggregateScores scores = scores.sum := by
  induction scores with
  | nil => rfl
  | cons x xs ih =>
    simp only [List.foldl_cons, List.sum_cons]
    have foldl_shift : ∀ init l,
        List.foldl (· + ·) init l = init + List.foldl (· + ·) 0 l := by
      intro init l
      induction l generalizing init with
      | nil => simp
      | cons y ys ihy => simp [ihy]; omega
    rw [foldl_shift]; omega

This theorem proves our aggregation function correctly sums scores. The induction tactic breaks the proof into base case (empty list) and inductive case (cons).

5. Rust Implementation

The Rust code mirrors the Lean specifications, with additional enforcement mechanisms.

5.1. Type-Level Invariants

We use wrapper types that encode invariants at the type level:

/// A validated suffix entry where bounds are guaranteed.
pub struct ValidatedSuffixEntry {
    inner: SuffixEntry,
}

impl ValidatedSuffixEntry {
    /// Create a validated entry. Returns Err if invariants violated.
    pub fn new(entry: SuffixEntry, texts: &[String]) -> Result<Self, InvariantError> {
        if entry.doc_id >= texts.len() {
            return Err(InvariantError::InvalidDocId {
                doc_id: entry.doc_id,
                texts_len: texts.len(),
            });
        }

        if entry.offset > texts[entry.doc_id].len() {
            return Err(InvariantError::InvalidOffset {
                doc_id: entry.doc_id,
                offset: entry.offset,
                text_len: texts[entry.doc_id].len(),
            });
        }

        Ok(Self { inner: entry })
    }

    /// Get the suffix. Cannot panic—bounds were checked at construction.
    pub fn suffix<'a>(&self, texts: &'a [String]) -> &'a str {
        &texts[self.inner.doc_id][self.inner.offset..]
    }
}

Once you have a ValidatedSuffixEntry, you cannot have an invalid pointer. The constructor is the only way to create one, and it enforces the invariant.

5.2. Compile-Time Assertions

For constants that must satisfy mathematical relationships, we use const assertions:

/// Static assertion that field type dominance holds.
/// Evaluated at compile time—if it fails, the crate won't build.
const _: () = {
    const TITLE: f64 = 100.0;
    const HEADING: f64 = 10.0;
    const CONTENT: f64 = 1.0;
    const MAX_BOOST: f64 = 0.5;

    // INVARIANT: title_beats_heading (from Scoring.lean)
    const WORST_TITLE: f64 = TITLE - MAX_BOOST;
    const BEST_HEADING: f64 = HEADING + MAX_BOOST;
    assert!(WORST_TITLE > BEST_HEADING); // 99.5 > 10.5 ✓

    // INVARIANT: heading_beats_content (from Scoring.lean)
    const WORST_HEADING: f64 = HEADING - MAX_BOOST;
    const BEST_CONTENT: f64 = CONTENT + MAX_BOOST;
    assert!(WORST_HEADING > BEST_CONTENT); // 9.5 > 1.5 ✓
};

This code runs at compile time. If someone changes HEADING to 50.0, the build fails immediately with a clear error.

5.3. Runtime Contracts

For invariants that can’t be checked at compile time, we use debug assertions⁠—⁠a technique from Design by Contract:6

/// Check that the suffix array is sorted lexicographically.
#[inline]
pub fn check_suffix_array_sorted(texts: &[String], suffix_array: &[SuffixEntry]) {
    for i in 1..suffix_array.len() {
        let prev_suffix = &texts[suffix_array[i-1].doc_id][suffix_array[i-1].offset..];
        let curr_suffix = &texts[suffix_array[i].doc_id][suffix_array[i].offset..];

        debug_assert!(
            prev_suffix <= curr_suffix,
            "Contract violation: suffix array not sorted at {}: '{}' > '{}'",
            i, prev_suffix, curr_suffix
        );
    }
}

These checks are compiled out in release builds but catch bugs during development.

6. The cargo xtask Workflow

We use the idiomatic Rust xtask pattern for verification commands:

cargo xtask verify    # Full verification suite
cargo xtask test      # Run all tests
cargo xtask lean      # Build Lean proofs
cargo xtask check     # Quick check (no Lean)

The verify command runs a 5-step pipeline:

sequenceDiagram
    participant V as verify
    participant I as Invariants
    participant T as Tests
    participant C as Clippy
    participant L as Lean
    participant A as Alignment

    V->>I: Check INVARIANT markers
    I-->>V: ✓ 5+ markers found
    V->>T: Run cargo test
    T-->>V: ✓ All tests pass
    V->>C: Run clippy
    C-->>V: ✓ No warnings
    V->>L: Run lake build
    L-->>V: ✓ Proofs compile
    V->>A: Compare Rust/Lean constants
    A-->>V: ✓ Constants aligned
    V->>V: ✓ Safe to commit

The final step⁠—⁠constant alignment⁠—⁠ensures Rust and Lean use the same values. Lean uses 10× scaling (1000 instead of 100.0) to avoid float arithmetic, so we verify the relationship holds.
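The alignment step can be sketched as follows. This is a simplified stand-in for the real xtask check, and the constant names are illustrative:

```rust
// Lean side: 10x-scaled integer scores (as in Scoring.lean).
const LEAN_TITLE: u64 = 1000;
const LEAN_HEADING: u64 = 100;
const LEAN_CONTENT: u64 = 10;

// Rust side: float scores (as in scoring.rs).
const RUST_TITLE: f64 = 100.0;
const RUST_HEADING: f64 = 10.0;
const RUST_CONTENT: f64 = 1.0;

/// Verify each Rust constant, scaled by 10, matches Lean's integer.
fn constants_aligned() -> bool {
    let scale = |x: f64| (x * 10.0).round() as u64;
    scale(RUST_TITLE) == LEAN_TITLE
        && scale(RUST_HEADING) == LEAN_HEADING
        && scale(RUST_CONTENT) == LEAN_CONTENT
}
```

If someone edits one side without the other, this check fails and the verify pipeline stops before commit.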

7. Property-Based Testing

For properties that are axiomatized in Lean (not fully proven), we use proptest7 to gain empirical confidence:

proptest! {
    #[test]
    fn suffix_array_is_sorted(texts in prop::collection::vec("[a-z]{1,20}", 1..5)) {
        let docs = texts.iter().enumerate()
            .map(|(i, _)| make_doc(i))
            .collect();
        let index = build_index(docs, texts.clone(), vec![]);

        // Verify sortedness
        for i in 1..index.suffix_array.len() {
            let prev = suffix_at(&index.texts, &index.suffix_array[i-1]);
            let curr = suffix_at(&index.texts, &index.suffix_array[i]);
            prop_assert!(prev <= curr);
        }
    }

    #[test]
    fn binary_search_finds_all_matches(
        text in "[a-z]{10,50}",
        pattern in "[a-z]{1,5}"
    ) {
        let index = build_single_doc_index(&text);
        let results = search(&index, &pattern);

        // Every occurrence should be found
        for (i, _) in text.match_indices(&pattern) {
            prop_assert!(
                results.iter().any(|r| r.offset == i),
                "Pattern '{}' at offset {} not found", pattern, i
            );
        }
    }
}

These tests run thousands of random inputs, exploring edge cases that manual tests miss.

8. Hybrid Search: Inverted Index for Large Corpora

Suffix arrays excel at substring and fuzzy matching, but for large blogs (500+ posts), exact word lookups benefit from an inverted index⁠—⁠the classic information retrieval data structure.

8.1. The Scaling Problem

Suffix arrays have O(log n) query time, which is excellent. But for exact word matches, an inverted index provides O(1) lookup. With 1000 documents, the difference is roughly 20 binary-search string comparisons per query versus a single hash probe.

More importantly, inverted indexes support efficient boolean queries (AND, OR) via posting list intersection.
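Because posting lists are kept sorted, an AND query is a single linear merge over the doc-id lists. A generic sketch (not the post's exact code):

```rust
use std::cmp::Ordering;

/// Intersect two sorted, deduplicated doc-id lists in O(|a| + |b|).
fn intersect_doc_ids(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]); // doc appears in both lists
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```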

8.2. Build-Time Mode Selection

Rather than always building both indexes, we select the optimal mode at build time based on content characteristics:

pub fn select_index_mode(
    doc_count: usize,
    total_text_bytes: usize,
    needs_prefix_matching: bool,
    needs_fuzzy_matching: bool,
    thresholds: &IndexThresholds,
) -> IndexMode {
    // Small content: suffix array is fine
    if doc_count <= thresholds.suffix_only_max_docs
        && total_text_bytes <= thresholds.suffix_only_max_bytes
    {
        return IndexMode::SuffixArrayOnly;
    }

    // Large content but needs prefix/fuzzy: must have suffix array
    if needs_prefix_matching || needs_fuzzy_matching {
        if doc_count >= thresholds.inverted_only_min_docs {
            return IndexMode::Hybrid;
        }
        return IndexMode::SuffixArrayOnly;
    }

    // Large content, exact matches only: inverted index
    if doc_count >= thresholds.inverted_only_min_docs {
        return IndexMode::InvertedIndexOnly;
    }

    IndexMode::SuffixArrayOnly
}

The decision matrix:

| Docs   | Text Size | Query Types  | Mode              |
|--------|-----------|--------------|-------------------|
| <50    | <100KB    | Any          | SuffixArrayOnly   |
| 50–500 | Any       | Exact only   | SuffixArrayOnly   |
| 50–500 | Any       | Prefix/fuzzy | SuffixArrayOnly   |
| 500+   | Any       | Exact only   | InvertedIndexOnly |
| 500+   | Any       | Prefix/fuzzy | Hybrid            |
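With threshold values matching the matrix (assumed here: 50 docs / 100KB for suffix-only, 500 docs for inverted-only), the selection function reproduces each row. The logic is repeated from above so the sketch compiles standalone:

```rust
#[derive(Debug, PartialEq)]
pub enum IndexMode {
    SuffixArrayOnly,
    InvertedIndexOnly,
    Hybrid,
}

pub struct IndexThresholds {
    pub suffix_only_max_docs: usize,
    pub suffix_only_max_bytes: usize,
    pub inverted_only_min_docs: usize,
}

// Same decision logic as select_index_mode above.
pub fn select_index_mode(
    doc_count: usize,
    total_text_bytes: usize,
    needs_prefix_matching: bool,
    needs_fuzzy_matching: bool,
    t: &IndexThresholds,
) -> IndexMode {
    if doc_count <= t.suffix_only_max_docs && total_text_bytes <= t.suffix_only_max_bytes {
        return IndexMode::SuffixArrayOnly;
    }
    if needs_prefix_matching || needs_fuzzy_matching {
        if doc_count >= t.inverted_only_min_docs {
            return IndexMode::Hybrid;
        }
        return IndexMode::SuffixArrayOnly;
    }
    if doc_count >= t.inverted_only_min_docs {
        return IndexMode::InvertedIndexOnly;
    }
    IndexMode::SuffixArrayOnly
}

/// Hypothetical thresholds matching the decision matrix.
pub const BLOG_THRESHOLDS: IndexThresholds = IndexThresholds {
    suffix_only_max_docs: 50,
    suffix_only_max_bytes: 100_000,
    inverted_only_min_docs: 500,
};
```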

8.3. Inverted Index Structure

The inverted index maps terms to posting lists:

pub struct Posting {
    pub doc_id: usize,
    pub offset: usize,
    pub field_type: FieldType,
}

pub struct PostingList {
    pub postings: Vec<Posting>,  // Sorted by (doc_id, offset)
    pub doc_freq: usize,         // Count of unique doc_ids
}

pub struct InvertedIndex {
    pub terms: HashMap<String, PostingList>,
    pub total_docs: usize,
}

Posting lists are sorted by (doc_id, offset) to enable efficient intersection. This invariant is verified by property tests.

8.4. Lean Specification

The inverted index specifications in InvertedIndex.lean define key properties:

/-- A posting is well-formed if it points to a valid location -/
def Posting.WellFormed (p : Posting) (texts : Array String) (term_len : Nat) : Prop :=
  p.doc_id < texts.size ∧
  p.offset + term_len ≤ (texts[p.doc_id]!).length

/-- The posting list is sorted by (doc_id, offset) -/
def PostingList.Sorted (pl : PostingList) : Prop :=
  ∀ i j : Nat, (hi : i < pl.postings.size) → (hj : j < pl.postings.size) → i < j →
    Posting.le pl.postings[i] pl.postings[j]

/-- Document frequency matches unique doc_ids -/
def PostingList.DocFreqCorrect (pl : PostingList) : Prop :=
  pl.doc_freq = (pl.postings.map (·.doc_id)).toList.eraseDups.length

The hybrid index consistency axiom ensures both indexes agree on results:

/-- Both indexes agree on document results for exact matches -/
axiom hybrid_search_consistent
    (idx : HybridIndex)
    (h : HybridIndex.WellFormed idx)
    (word : String)
    (doc_id : Nat) :
    (∃ pl, idx.inverted_index.lookup word = some pl ∧ doc_id ∈ pl.docIds) →
    ∃ entry : SuffixEntry, entry ∈ idx.suffix_index.suffix_array.toList ∧
      entry.doc_id = doc_id ∧ ...

8.5. Unified Query Dispatch

The unified index dispatches queries to the appropriate data structure:

pub fn search_unified(index: &UnifiedIndex, term: &str) -> Vec<SearchDoc> {
    match index.mode {
        IndexMode::SuffixArrayOnly => {
            // O(log n) prefix matching with fuzzy fallback
            search_suffix_array(index, term)
        }
        IndexMode::InvertedIndexOnly => {
            // O(1) exact word lookup
            search_inverted(index, term)
        }
        IndexMode::Hybrid => {
            // Try inverted index first, fall back to suffix array
            let results = search_inverted(index, term);
            if results.is_empty() {
                search_suffix_array(index, term)
            } else {
                results
            }
        }
    }
}

This gives us the best of both worlds: fast exact matches for common queries, with substring and fuzzy search as fallback.

8.6. Performance Comparison

We benchmarked all index modes using Criterion with 200 samples and 99% confidence intervals.

Build Time

Inverted index builds 14× faster than suffix array
Suffix array build time grows quadratically with corpus size due to suffix sorting. At 500 posts, it becomes impractical.

Query Latency

Single word lookup comparison
Multi-word AND query comparison

The inverted index wins decisively for exact word matching:

  • 3.7× faster for single word lookups (O(1) hash vs O(log n) binary search)
  • 4.9× faster for multi-word AND queries (posting list intersection vs repeated binary search)
  • Suffix array uniquely supports prefix and fuzzy matching (inverted index cannot do these)

Scaling to Large Corpora

Inverted index scales to 500 posts

At 500 posts with ~1500 words each (750K words total), the inverted index maintains sub-millisecond query times (175 µs for multi-word, 114 µs for rare terms). The suffix array would require >10 seconds to build at this scale.

8.7. True Hybrid: Suffix Array over Vocabulary

The previous “hybrid” mode simply falls back from inverted index to suffix array. But there’s an elegant alternative: build a suffix array over just the vocabulary keys.

This approach combines the FracturedJson insight (adapt structure to content characteristics) with classic information retrieval.

The key observation: for prefix search like “prog” → “programming”, we don’t need a suffix array over the full text (800K characters). We only need to search among unique terms (vocabulary).

pub struct HybridIndex {
    pub inverted_index: InvertedIndex,  // term → posting list
    pub vocabulary: Vec<String>,         // sorted unique terms
    pub vocab_suffix_array: Vec<VocabSuffixEntry>,  // suffix array over vocabulary
}

Space savings are dramatic:

Vocabulary suffix array is 1000× smaller than full text

For 100 posts (~100K words), the full text is 797K characters requiring ~800K suffix entries. The vocabulary has only 114 unique terms (755 suffix entries). That’s a 1000× reduction.

Search strategy:

  1. Exact word: O(1) hash lookup in inverted index
  2. Prefix search: O(log k) binary search on vocabulary suffix array → find matching terms → union their posting lists
  3. Multi-word AND: Intersect posting lists for each matched term
fn score_term(index: &HybridIndex, term: &str) -> Vec<(usize, f64)> {
    // Try exact match first (O(1))
    if let Some(posting_list) = index.inverted_index.terms.get(term) {
        return score_posting_list(posting_list);
    }

    // Fall back to prefix match via vocabulary suffix array (O(log k))
    let matching_terms = find_prefix_matches(index, term);
    if matching_terms.is_empty() {
        return Vec::new();
    }

    // Union posting lists for all matching terms
    matching_terms.iter()
        .filter_map(|t| index.inverted_index.terms.get(t))
        .flat_map(|pl| score_posting_list(pl))
        .collect()
}

Hybrid enables substring search at inverted index build speed

This true hybrid gives us:

  • Build time: Same as inverted index (14× faster than full suffix array)
  • Exact word query: O(1) via hash lookup
  • Substring search: O(log k) on vocabulary⁠—⁠“script” finds “typescript”, “javascript”
  • Space: Vocabulary size + inverted index (not full text × 2)

The suffix array over vocabulary still enables substring search within words. Searching “script” finds all terms containing that substring: “typescript”, “javascript”, “scripting”. What we lose is cross-word substrings⁠—⁠“java script” (two words) won’t match a search for “ript”. For blog search, this is fine.
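A standalone sketch of the vocabulary-level substring lookup (the find_prefix_matches helper in the snippet above is assumed to work along these lines; names here are illustrative):

```rust
struct VocabSuffixEntry {
    term_idx: usize,
    offset: usize,
}

/// Build a suffix array over every suffix of every vocabulary term.
fn build_vocab_suffix_array(vocab: &[String]) -> Vec<VocabSuffixEntry> {
    let mut sa = Vec::new();
    for (term_idx, term) in vocab.iter().enumerate() {
        for offset in 0..term.len() {
            sa.push(VocabSuffixEntry { term_idx, offset });
        }
    }
    sa.sort_by(|a, b| vocab[a.term_idx][a.offset..].cmp(&vocab[b.term_idx][b.offset..]));
    sa
}

/// All vocabulary terms containing `query` as a substring:
/// binary-search the lower bound, then scan matching suffixes.
fn terms_containing<'a>(
    vocab: &'a [String],
    sa: &[VocabSuffixEntry],
    query: &str,
) -> Vec<&'a str> {
    let lo = sa.partition_point(|e| &vocab[e.term_idx][e.offset..] < query);
    let mut terms: Vec<&str> = sa[lo..]
        .iter()
        .take_while(|e| vocab[e.term_idx][e.offset..].starts_with(query))
        .map(|e| vocab[e.term_idx].as_str())
        .collect();
    terms.sort_unstable();
    terms.dedup();
    terms
}
```

Searching "script" over the vocabulary ["javascript", "rust", "typescript"] returns both "javascript" and "typescript", exactly the behavior described above.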

9. Zero-CPU Fuzzy Search: FST + Levenshtein Automata

The final piece is making fuzzy search fast. Traditional approaches iterate over the vocabulary, computing Levenshtein distance for each term: O(|V| · m²) work per query for a vocabulary of |V| terms of roughly m characters. With 114 terms, that's ~11,400 operations per keystroke.
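For scale, the naive baseline that the parametric DFA replaces looks like this (textbook dynamic-programming Levenshtein, scanned over the whole vocabulary on every query):

```rust
/// Classic DP edit distance between two strings, O(|a| * |b|).
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb);
            curr.push(sub.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Naive fuzzy search: re-test every vocabulary term per query.
fn fuzzy_naive<'a>(vocab: &'a [String], query: &str, k: usize) -> Vec<&'a str> {
    vocab
        .iter()
        .filter(|term| levenshtein(term, query) <= k)
        .map(|s| s.as_str())
        .collect()
}
```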

9.1. The Schulz-Mihov Insight

The key insight from Schulz & Mihov (2002)9 is that Levenshtein automaton structure is query-independent. The transitions depend only on:

  1. The length of the query
  2. The maximum edit distance (k)

We can precompute parametric transition tables at build time that work for ANY query string. At search time, building a query-specific DFA is pure table lookups⁠—⁠~1μs instead of computing actual Levenshtein distances.

/// Precomputed Levenshtein DFA (query-independent)
pub struct ParametricDFA {
    states: Vec<DfaState>,       // ~70 states for k=2
    transitions: Vec<[u16; 8]>,  // 8 char classes per state
    accept: Vec<u8>,             // distance if accepting
}

/// Query-specific matcher (built in ~1μs via table lookups)
pub struct QueryMatcher<'a> {
    dfa: &'a ParametricDFA,
    query: &'a str,
}

impl<'a> QueryMatcher<'a> {
    /// Check if a term matches within edit distance k
    /// Returns distance if accepting, None otherwise
    pub fn matches(&self, term: &str) -> Option<u8> {
        let mut state = 0u16;
        for c in term.chars() {
            // Compute character class: which query positions match c
            let char_class = self.compute_char_class(c);
            // Pure table lookup - no Levenshtein computation!
            state = self.dfa.transitions[state as usize][char_class];
            if state == DEAD_STATE {
                return None;
            }
        }
        // Check if final state is accepting
        let distance = self.dfa.accept[state as usize];
        if distance != NOT_ACCEPTING {
            Some(distance)
        } else {
            None
        }
    }
}

9.2. FST for Vocabulary Compression

We pair the Levenshtein automaton with an FST (Finite State Transducer) for vocabulary storage.10 An FST is a compressed trie with suffix sharing:

use fst::{Set, SetBuilder};

/// Build FST from sorted vocabulary
pub fn build_fst_from_vocabulary(vocabulary: &[String]) -> Vec<u8> {
    let mut builder = SetBuilder::memory();
    for term in vocabulary {
        builder.insert(term).expect("vocabulary must be sorted");
    }
    builder.into_inner().unwrap()
}

For 114 terms, the FST is ~500 bytes, roughly 10× smaller than the raw vocabulary. More importantly, it enables efficient fuzzy traversal: we walk the FST while simulating the Levenshtein DFA, pruning branches that exceed the edit distance threshold.

9.3. Three-Tier Search Strategy

The final WASM searcher combines all techniques:

#[wasm_bindgen]
impl SiftSearcher {
    pub fn search(&self, query: &str) -> Vec<SearchDoc> {
        // Tier 1: Exact match - O(1) inverted index lookup
        if let Some(docs) = self.exact_lookup(query) {
            return docs;
        }

        // Tier 2: Prefix match - O(log k) vocabulary suffix array
        let prefix_matches = self.prefix_search(query);
        if !prefix_matches.is_empty() {
            return prefix_matches;
        }

        // Tier 3: Fuzzy match - O(FST edges), zero Levenshtein computation
        self.fuzzy_search(query, 2)  // k=2 edit distance
    }
}

Performance comparison:

| Operation        | Before (naive)        | After (FST + DFA)     |
|------------------|-----------------------|-----------------------|
| DFA build        | n/a                   | ~1μs (table lookups)  |
| Fuzzy per term   | ~100μs (Levenshtein)  | ~8ns (DFA transition) |
| Full fuzzy query | ~10ms                 | ~0.1ms                |

100× faster fuzzy search, with zero client CPU for Levenshtein computation.

10. Progressive Layer Loading

For the best user experience, we load search data incrementally. Users see results immediately from title matches while heavier content layers load in the background.

10.1. Layer Architecture

flowchart LR
    subgraph Client["Browser"]
        M[Manifest<br/>~500 bytes] --> T[Titles Layer<br/>~5KB]
        T --> H[Headings Layer<br/>~20KB]
        H --> C[Content Layer<br/>~200KB]
    end

    subgraph Search["Search Quality"]
        T --> |"Instant"| R1[First Results]
        H --> |"+50ms"| R2[Expanded Coverage]
        C --> |"+200ms"| R3[Full Search]
    end

Each layer is a complete HybridIndex with its own vocabulary suffix array and inverted index:

#[wasm_bindgen]
pub struct SiftProgressiveIndex {
    docs: Vec<SearchDoc>,           // Always present
    titles: Option<HybridIndex>,    // Loaded first (~5KB)
    headings: Option<HybridIndex>,  // Loaded second (~20KB)
    content: Option<HybridIndex>,   // Loaded last (~200KB)
}

10.2. Streaming Search API

The two-phase search API enables showing results before all layers load:

impl SiftProgressiveIndex {
    /// Phase 1: O(1) exact word matches from inverted index
    pub fn search_exact(&self, query: &str) -> Vec<SearchResult> {
        // Returns immediately with exact matches
        // ~0.1ms latency
    }

    /// Phase 2: O(log k) prefix/fuzzy matches via suffix array
    pub fn search_expanded(&self, query: &str, exclude_ids: &[usize]) -> Vec<SearchResult> {
        // Returns additional matches not found in Phase 1
        // Excludes already-found docs to avoid duplicates
    }
}

TypeScript integration:

// Show results as they arrive
async function* streamingSearch(query: string) {
	// Phase 1: Instant exact matches
	const exactResults = index.search_exact(query);
	yield exactResults;

	// Phase 2: Expanded matches (prefix, fuzzy)
	const exactIds = exactResults.map((r) => r.doc.id);
	const expandedResults = index.search_expanded(query, JSON.stringify(exactIds));
	yield expandedResults;
}

10.3. Layer Loading Strategy

// Load titles first for instant search
const titlesResponse = await fetch('/cdn/search/titles.sift');
index.load_layer_binary('titles', await titlesResponse.arrayBuffer());

// Load remaining layers in parallel
const [headings, content] = await Promise.all([
	fetch('/cdn/search/headings.sift'),
	fetch('/cdn/search/content.sift')
]);
index.load_layer_binary('headings', await headings.arrayBuffer());
index.load_layer_binary('content', await content.arrayBuffer());

Time to first result: ~50ms (titles layer only). Time to full search: ~300ms (all layers).

11. Binary Format

The .sift binary format is 5-7x smaller than JSON and loads ~3-5x faster.

11.1. Format Layout

┌────────────────────────────────────────────────────────────┐
│ HEADER (32 bytes)                                          │
│   magic: "SIFT" | version: u8 | flags: u8                 │
│   doc_count | term_count | fst_len | sa_len | postings_len│
├────────────────────────────────────────────────────────────┤
│ FST BYTES (vocabulary as finite state transducer)          │
│   ~500 bytes for typical blog vocabulary                   │
├────────────────────────────────────────────────────────────┤
│ VOCABULARY SUFFIX ARRAY (delta-encoded pairs)              │
│   [term_idx, offset] pairs for prefix search               │
├────────────────────────────────────────────────────────────┤
│ POSTINGS (Block PFOR: 128-doc blocks, bit-packed)          │
│   Lucene-style variable-byte encoding                      │
├────────────────────────────────────────────────────────────┤
│ LEVENSHTEIN DFA (~1.2KB for k=2)                           │
│   Precomputed Schulz-Mihov parametric tables               │
├────────────────────────────────────────────────────────────┤
│ FOOTER: CRC32 checksum + "TFIS" magic                      │
└────────────────────────────────────────────────────────────┘
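A reader for the fixed header might look like the sketch below. The exact field widths and byte order aren't specified in the diagram, so this assumes little-endian u32 length fields in the order shown, padded to 32 bytes:

```rust
#[derive(Debug, PartialEq)]
struct SiftHeader {
    version: u8,
    flags: u8,
    doc_count: u32,
    term_count: u32,
    fst_len: u32,
    sa_len: u32,
    postings_len: u32,
}

/// Read a little-endian u32 at byte offset `at`.
fn u32_le(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Parse the 32-byte header; None on bad magic or short input.
fn parse_sift_header(bytes: &[u8]) -> Option<SiftHeader> {
    if bytes.len() < 32 || &bytes[0..4] != b"SIFT" {
        return None;
    }
    Some(SiftHeader {
        version: bytes[4],
        flags: bytes[5],
        doc_count: u32_le(bytes, 6),
        term_count: u32_le(bytes, 10),
        fst_len: u32_le(bytes, 14),
        sa_len: u32_le(bytes, 18),
        postings_len: u32_le(bytes, 22),
    })
}
```

The section lengths in the header let the loader slice the FST, suffix array, and postings regions without scanning, and the trailing CRC32 guards against truncated downloads.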

11.2. Size Comparison

| Component            | JSON   | Binary                |
|----------------------|--------|-----------------------|
| Vocabulary           | ~3KB   | ~500 bytes (FST)      |
| Suffix array         | ~20KB  | ~8KB (delta + varint) |
| Postings             | ~50KB  | ~15KB (Block PFOR)    |
| Levenshtein DFA      | N/A    | ~1.2KB                |
| Total (uncompressed) | ~200KB | ~25KB                 |
| Brotli compressed    | ~20KB  | ~8KB                  |

12. Performance Benchmarks

We benchmark against popular search libraries using Criterion (Rust) and Tinybench (JavaScript) with 99% confidence intervals on a 100-post blog corpus.

12.1. Unified Comparison: Build Time

The suffix array’s naive sort makes build time its weakness. For 100 posts, that’s 10ms⁠—⁠2× slower than FlexSearch but 5× faster than lunr.js. The tradeoff is worth it for true substring search.

12.2. The Killer Features: Substring & Typo Tolerance

Substring search is why suffix arrays exist. When searching “script”, users expect to find TypeScript and JavaScript posts. But lunr.js, FlexSearch, and MiniSearch return zero results—they tokenize by words and only match prefixes. Fuse.js finds matches via fuzzy matching, but at 500ms per query⁠—⁠500× slower. Our suffix array finds all 100 matches in under 1 microsecond.

Typo tolerance handles real-world queries. Searching “ruts” should find “rust”. Our Levenshtein distance matching finds strings within k edits. For k=1, the suffix array finds 15 results while most inverted indexes find zero.

How edit distance works: For k=1, we search all “neighborhoods” of the query⁠—⁠“auts”, “buts”, … “rust”, “zutz”. The suffix array’s sorted structure makes this efficient, though at k=2 the search space explodes. We default to k=1 for typo tolerance.

12.3. Performance Tradeoffs: Speed vs Memory

Query latency: For standard word-based queries, FlexSearch leads. Our suffix array matches its speed while adding substring capability⁠—⁠the key differentiator.

Memory footprint reveals the tradeoff:

  • fuse.js: No index, scans raw data⁠—⁠minimal memory but 750× slower queries
  • MiniSearch: Compact inverted index⁠—⁠good balance of speed and memory
  • FlexSearch/lunr.js: ~1MB indexes for fast tokenized search
  • Suffix Array (WASM): ~6MB for 100 posts⁠—⁠indexes every character position

The suffix array’s O(n) memory cost is the price of true substring search. Inverted indexes only store tokenized words; suffix arrays index every position to enable queries like “async” matching within “asynchronous”.
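To make the tradeoff concrete, here is a minimal sketch of substring lookup over a suffix array (naive O(n² log n) construction and ASCII-only slicing, for illustration; the production index uses a far more careful build):

```rust
// Sketch only: one suffix-array entry per character position is exactly
// what lets "async" match inside "asynchronous".
fn suffix_array(text: &str) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by_key(|&i| &text[i..]); // sort positions by their suffix
    sa
}

/// True if `pattern` occurs anywhere in `text`, via binary search over
/// the sorted suffixes. Suffixes starting with `pattern` form one
/// contiguous run, so the comparator's Equal case is well-defined.
fn contains_substring(text: &str, sa: &[usize], pattern: &str) -> bool {
    sa.binary_search_by(|&i| {
        let suffix = &text[i..];
        if suffix.starts_with(pattern) {
            std::cmp::Ordering::Equal
        } else {
            suffix.cmp(pattern)
        }
    })
    .is_ok()
}

fn main() {
    let text = "asynchronous";
    let sa = suffix_array(text);
    assert!(contains_substring(text, &sa, "async"));
    assert!(contains_substring(text, &sa, "chron")); // mid-word match
    assert!(!contains_substring(text, &sa, "await"));
}
```

An inverted index could never answer the "chron" query: it only stores whole tokens, while the suffix array pays O(n) entries to index every position.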

12.4. Cold Start Analysis

WASM instantiation adds a one-time startup cost.

With instantiateStreaming, compilation overlaps with fetch. On 4G networks (~50ms fetch for 160KB), the index is ready before users focus the search input.

13. Verification Status

Current proof coverage:

| Component | Lean Status | Rust Status |
|---|---|---|
| Type definitions | ✓ Specified | ✓ Implemented |
| Field dominance | Proven | Compile-time asserted |
| Suffix sortedness | Axiom | Property tested |
| LCP correctness | Axiom | Unit tested |
| Binary search | Axiom | Property tested |
| Edit distance | Axiom | Unit tested |
| Posting well-formedness | ✓ Specified | Property tested |
| Posting list sortedness | ✓ Specified | Property tested |
| Doc freq correctness | ✓ Specified | Property tested |
| Hybrid consistency | Axiom | Property tested |

Proven theorems: 5 (including title_beats_heading, heading_beats_content, aggregateScores_eq_sum)

Axioms: 22 (empirically verified through testing)

Type-level invariants: 6 (ValidatedSuffixEntry, SortedSuffixArray, WellFormedIndex, ValidatedPosting, ValidatedPostingList, ValidatedInvertedIndex)

The axioms represent properties we believe true but haven’t formally proven⁠—⁠usually because they require advanced Mathlib tactics. Property tests give us high confidence while we work toward full proofs.
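
For flavor, a minimal sketch of what the field-dominance theorems can look like in Lean 4 (the weights and definitions here are illustrative, not the post’s actual spec):

```lean
-- Sketch only: hypothetical weights mirroring the proven
-- field-dominance theorems named above.
def titleWeight   : Nat := 100
def headingWeight : Nat := 50
def contentWeight : Nat := 10

-- Because the weights are concrete, `decide` discharges both goals
-- by computation. If anyone edits a constant and breaks the
-- hierarchy, the proof fails at build time.
theorem title_beats_heading : titleWeight > headingWeight := by decide
theorem heading_beats_content : headingWeight > contentWeight := by decide
```

The real value is the failure mode: a broken constant is a build error in the proof layer, not a silent ranking regression.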

14. Conclusion

Formal verification isn’t just for aerospace and cryptography. For any system with critical invariants⁠—⁠search ranking, financial calculations, access control⁠—⁠the investment pays off.

The workflow we’ve built:

  1. Specify invariants in Lean
  2. Prove what we can mathematically
  3. Encode invariants in Rust types
  4. Assert at compile time where possible
  5. Check at runtime in debug builds
  6. Test properties exhaustively
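
Step 4 is cheap to adopt today. A sketch with hypothetical constant names (the actual constants in our codebase differ):

```rust
// Sketch only: the scoring constants are hypothetical, but the
// mechanism is real. If a "quick fix" ever makes a content match
// outweigh a title match, compilation fails instead of ranking
// silently breaking.
const TITLE_WEIGHT: u32 = 100;
const HEADING_WEIGHT: u32 = 50;
const CONTENT_WEIGHT: u32 = 10;

// Evaluated at compile time; a violated invariant is a build error.
const _: () = assert!(TITLE_WEIGHT > HEADING_WEIGHT, "title must outrank headings");
const _: () = assert!(HEADING_WEIGHT > CONTENT_WEIGHT, "headings must outrank content");

fn main() {
    println!(
        "field dominance holds: {} > {} > {}",
        TITLE_WEIGHT, HEADING_WEIGHT, CONTENT_WEIGHT
    );
}
```

These `const` assertions mirror the Lean theorems: the prover guarantees the spec, and the compiler guarantees the implementation still matches it.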

This layered approach catches bugs at every stage. Type errors fail at compile time. Constant mistakes fail at build time. Invariant violations fail in debug mode. Edge cases fail in property tests. And the core ranking guarantee? That’s mathematically proven.

The next frontier is completing the Lean proofs⁠—⁠replacing axioms with theorems. But even partial verification is valuable. When someone asks “how do you know title matches rank higher?”, we don’t say “we tested it.” We say “we proved it.”

Bibliography

  1. Manber, U. & Myers, G. (1993). “Suffix Arrays: A New Method for On-Line String Searches”. SIAM Journal on Computing, 22(5), 935–948. The foundational paper introducing suffix arrays as an alternative to suffix trees with better cache locality.
  2. Kärkkäinen, J. & Sanders, P. (2003). “Simple Linear Work Suffix Array Construction”. Proceedings of ICALP 2003. Springer LNCS 2719. The DC3/skew algorithm that achieves linear-time construction.
  3. Rayon Contributors. (2024). “Rayon: A data parallelism library for Rust”. Rayon provides parallel iterators that make it trivial to convert sequential computations into parallel ones while maintaining Rust’s safety guarantees.
  4. de Moura, L. & Ullrich, S. (2021). “The Lean 4 Theorem Prover and Programming Language”. Proceedings of CADE-28. Springer LNCS 12699. Lean 4’s dependent type system enables both proving and programming.
  5. Minsky, Y. (2011). “Effective ML”. Jane Street Tech Blog. The principle of making illegal states unrepresentable using algebraic data types.
  6. Meyer, B. (1992). “Applying Design by Contract”. IEEE Computer, 25(10), 40–51. The foundational work on preconditions, postconditions, and invariants as contracts between callers and implementations.
  7. Claessen, K. & Hughes, J. (2000). “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. Proceedings of ICFP 2000. The paper that popularized property-based testing. Proptest is Rust’s implementation of these ideas.
  8. Schulz, K.U. & Mihov, S. (2002). “Fast String Correction with Levenshtein-Automata”. International Journal of Document Analysis and Recognition, 5(1), 67-85. The foundational paper on parametric Levenshtein automata enabling O(1) DFA construction.
  9. BurntSushi. (2024). “fst: Finite State Transducers”. A Rust library for fast ordered set/map operations using FSTs. Enables memory-efficient vocabulary storage with O(k) lookup time.