SEARCH ENGINE (PERL + FASTCGI + SA)
03 JANUARY 2026
Article count on the website is growing. Need a way to search.
Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript.
Architecture: browser → httpd → slowcgi → Perl CGI script.
httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access governed by file system permissions–no secrets to manage. Secure by default.
2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast enough. Start to feel the O(N) pull at higher file counts.
Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can be mitigated. Tempted to stop here.
2026-01-03: Suffix array (SA) based index lookup.
Inefficiency of scanning every file on each request troubles me. Regex search depends almost entirely on hardware for speed.
Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted byte offsets), file_map.dat (metadata). Index created with the site:
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the lexicographical order:
my @sa = 0 .. (length($corpus) - 1);
{
use bytes; # Force compare 8-bit Unicode value comparisons
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
CORPUS: a s c i \0 i m x
OFFSET: 0 1 2 3 4 5 6 7
SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx <-- "i" from "asci"
[5] imx <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a cache line), reducing complexity to O(N log N).
Search: Textbook range query with two binary searches. Random access to offsets and the text is possible via the fixed-width offsets:
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy to implement. Note to self: mmap.
Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
500 files:
- Index size: 204.94 KB
- Indexing time: 0.1475 s
- Peak RAM (SA): 8828 KB
- Peak RAM (Regex): 9136 KB
- Search (SA): 0.0012 s
- Search (Regex): 0.0407 s
1,000 files:
- Index size: 410.51 KB
- Indexing time: 0.3101 s
- Peak RAM (SA): 8980 KB
- Peak RAM (Regex): 9460 KB
- Search (SA): 0.0019 s
- Search (Regex): 0.0795 s
10,000 files:
- Index size: 4163.44 KB
- Indexing time: 10.9661 s
- Peak RAM (SA): 12504 KB
- Peak RAM (Regex): 12804 KB
- Search (SA): 0.0161 s
- Search (Regex): 0.9120 s
Security: Much of it comes from its architectural simplicity. No dependencies to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.
Resource exhaustion and XSS attacks are inherent. The former is mitigated by limiting concurrent searches via lock-file semaphores and capping query length (64B) and result sets (20). All output is HTML-escaped to prevent XSS.
Verdict: Fast SA lookup; Works on every conceivable web browser.