Search engine for static sites

SEARCH ENGINE FOR STATIC SITES

03 JANUARY 2026

Built search.

Requirements: match substrings, ignore case, handle 50 concurrent queries with 1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript, text browsers.

Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.

Server-side regex is viable for a personal site. But an index has clear advantages. Not that much harder to implement.

Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin, file_map.dat—built with site:

$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl

Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned integers, sorted lexicographically:

my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

32-bit offsets limit index size to 4GB—more than sufficient and, if necessary, easily expanded.

Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes (length of a typical cache line).

Search: Textbook range query with two binary searches. Fixed-width offsets enable random access to index:

seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);

Small seek/reads are fast on modern SSDs; keeps memory footprint small.

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search:

500 files:

Index size: 204.94 KB
Indexing time: 0.1475 s
Peak RAM (SA): 8828 KB
Peak RAM (Regex): 9136 KB
Search (SA): 0.0012 s
Search (Regex): 0.0407 s

1,000 files:

Index size: 410.51 KB
Indexing time: 0.3101 s
Peak RAM (SA): 8980 KB
Peak RAM (Regex): 9460 KB
Search (SA): 0.0019 s
Search (Regex): 0.0795 s

10,000 files:

Index size: 4163.44 KB
Indexing time: 10.9661 s
Peak RAM (SA): 12504 KB
Peak RAM (Regex): 12804 KB
Search (SA): 0.0161 s
Search (Regex): 0.9120 s

Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was occasionally faster (~200 µs), but used more memory—possibly OpenBSD’s VM security trade-offs. Results may vary by OS.

Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions govern access. Runs in chroot.

Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit concurrent searches. Query length (64B) and result set (20) are capped. All output is HTML-escaped to prevent XSS.

Warranty: 10,000 / 12 → 833 years.
Next release: inverted index—Anno Domini 2859.

Commit: 6da102d | Benchmarks: 8a4da68