Your friendly guide to hacking and navigating the regex library.

This guide assumes familiarity with Rust and Cargo, and at least a perusal of
the user-facing documentation for this crate.

If you're looking for background on the implementation in this library, then
you can do no better than Russ Cox's article series on implementing regular
expressions using finite automata: https://swtch.com/~rsc/regexp/


## Architecture overview

As you probably already know, this library executes regular expressions using
finite automata. In particular, a design goal is to make searching linear
with respect to both the regular expression and the text being searched.
Meeting that design goal on its own is not so hard and can be done with an
implementation of the Pike VM (similar to Thompson's NFA simulation, but with
support for capturing groups), as described in:
https://swtch.com/~rsc/regexp/regexp2.html --- This library contains such an
implementation in src/pikevm.rs.

Making it fast is harder. One of the key problems with the Pike VM is that it
can be in more than one state at any point in time, and must shuffle capture
positions between them. The Pike VM also spends a lot of time following the
same epsilon transitions over and over again. We can employ one trick to
speed up the Pike VM: extract one or more literal prefixes from the regular
expression and execute specialized code to quickly find matches of those
prefixes in the search text. The Pike VM can then be avoided for most of the
search, and instead only executed when a prefix is found. The code to find
prefixes is in the regex-syntax crate (in this repository). The code to
search for literals is in src/literals.rs. When more than one literal prefix
is found, we fall back to an Aho-Corasick DFA using the aho-corasick crate.
For one literal, we use a variant of the Boyer-Moore algorithm. Both
Aho-Corasick and Boyer-Moore use `memchr` when appropriate. The Boyer-Moore
variant in this library also uses elementary frequency analysis to choose the
right byte to run `memchr` with.
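
For a flavor of the single-literal case, here is a minimal sketch (the prefix
is chosen by hand here; in the real library it is extracted by regex-syntax
and dispatched in src/literals.rs):

```
use memchr::memchr;

/// Yield candidate starting offsets for a regex whose sole literal
/// prefix is `prefix`. A matching engine need only run at these offsets.
fn candidates<'a>(
    prefix: &'a [u8],
    haystack: &'a [u8],
) -> impl Iterator<Item = usize> + 'a {
    let mut at = 0;
    std::iter::from_fn(move || {
        while at < haystack.len() {
            // `memchr` finds the first byte of the prefix very quickly...
            let i = at + memchr(prefix[0], &haystack[at..])?;
            at = i + 1;
            // ...and we confirm the rest of the prefix before yielding.
            if haystack[i..].starts_with(prefix) {
                return Some(i);
            }
        }
        None
    })
}

fn main() {
    // For a regex like `foo(bar|baz)`, the literal prefix is `foo`.
    let hits: Vec<usize> = candidates(b"foo", b"xxfoobar foobaz").collect();
    assert_eq!(hits, vec![2, 9]);
}
```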

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach to
executing the Pike VM: backtracking, whose implementation can be found in
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
excessive shuffling of capture groups. Backtracking is, however, susceptible
to exponential runtimes, so we keep track of every state we've visited to
make sure we never visit it again. This guarantees linear time execution, but
we pay for it with the memory required to track visited states. Because of
the memory requirement, we only use this engine on small search strings *and*
small regular expressions.
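
The visited-state trick can be sketched as follows (a toy backtracker over a
hypothetical instruction set, not the real src/backtrack.rs): the visited set
is a bitmap indexed by `(instruction, position)`, so the total work is
bounded by `len(program) * len(input)`.

```
#[derive(Clone, Copy)]
enum Inst { Char(u8), Split(usize, usize), Match }

/// Does `prog` match `input` starting at position 0? Classic recursive
/// backtracking, except that each (pc, pos) pair is visited at most once.
fn backtrack(prog: &[Inst], input: &[u8]) -> bool {
    // One bit per (pc, pos) pair: this is the memory cost that restricts
    // the engine to small regexes and small inputs.
    let mut seen = vec![false; prog.len() * (input.len() + 1)];
    go(prog, input, &mut seen, 0, 0)
}

fn go(prog: &[Inst], input: &[u8], seen: &mut [bool], pc: usize, pos: usize) -> bool {
    let slot = pc * (input.len() + 1) + pos;
    if seen[slot] { return false; }
    seen[slot] = true;
    match prog[pc] {
        Inst::Char(b) => {
            pos < input.len() && input[pos] == b && go(prog, input, seen, pc + 1, pos + 1)
        }
        Inst::Split(x, y) => {
            go(prog, input, seen, x, pos) || go(prog, input, seen, y, pos)
        }
        Inst::Match => true,
    }
}

fn main() {
    // `a|b`: 0 Split(1, 3); 1 'a'; 2 Match; 3 'b'; 4 Match
    let prog = [
        Inst::Split(1, 3),
        Inst::Char(b'a'), Inst::Match,
        Inst::Char(b'b'), Inst::Match,
    ];
    assert!(backtrack(&prog, b"a") && backtrack(&prog, b"b"));
    assert!(!backtrack(&prog, b"c"));
}
```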

Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
It is distinct from the Pike VM in that the DFA is explicitly represented in
memory and is only ever in one state at a time. It is said to be "lazy"
because the DFA is computed as text is searched, where each byte in the
search text results in at most one new DFA state. It is made fast by caching
states. DFAs are susceptible to exponential state blow-up (where the worst
case is computing a new state for every input byte, regardless of what's in
the state cache). To avoid using a lot of memory, the lazy DFA uses a bounded
cache. Once the cache is full, it is wiped and state computation starts over
again. If the cache is wiped too frequently, then the DFA gives up and
searching falls back to one of the aforementioned algorithms.
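
In sketch form (hypothetical types, much simpler than the real src/dfa.rs),
the lazy construction is a cache from sets of NFA states to DFA states,
filled in one transition at a time as input bytes demand it:

```
use std::collections::HashMap;

type NfaSet = Vec<usize>; // sorted set of NFA state ids
type StateId = usize;

struct LazyDfa<F: Fn(&NfaSet, u8) -> NfaSet> {
    nfa_step: F,                        // one-byte NFA simulation
    cache: HashMap<NfaSet, StateId>,    // interned DFA states
    sets: Vec<NfaSet>,                  // StateId -> its NFA state set
    trans: Vec<[Option<StateId>; 256]>, // transitions computed so far
    max_states: usize,                  // the cache bound
}

impl<F: Fn(&NfaSet, u8) -> NfaSet> LazyDfa<F> {
    /// Follow the transition out of `from` on byte `b`, computing and
    /// caching it the first time it is taken. `None` means the bounded
    /// cache is full: the caller wipes it and starts over, or gives up
    /// and falls back to an NFA engine if wipes are too frequent.
    fn next(&mut self, from: StateId, b: u8) -> Option<StateId> {
        if let Some(id) = self.trans[from][b as usize] {
            return Some(id); // fast path: transition already computed
        }
        let set = (self.nfa_step)(&self.sets[from], b);
        let id = match self.cache.get(&set) {
            Some(&id) => id,
            None if self.sets.len() >= self.max_states => return None,
            None => {
                let id = self.sets.len();
                self.cache.insert(set.clone(), id);
                self.sets.push(set);
                self.trans.push([None; 256]);
                id
            }
        };
        self.trans[from][b as usize] = Some(id);
        Some(id)
    }
}
```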

All of the above matching engines expose precisely the same matching
semantics. This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of
the matching engines is actually used.

### Parsing

Regular expressions are parsed using the regex-syntax crate, which is
maintained in this repository. The regex-syntax crate defines an abstract
syntax and provides very detailed error messages when a parse error is
encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of
the regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
syntax for a regular expression and the output is a sequence of opcodes that
matching engines use to execute a search. (One can think of matching engines
as mini virtual machines.) The sequence of opcodes is a particular encoding
of a non-deterministic finite automaton. In particular, the opcodes
explicitly rely on epsilon transitions.

Consider a simple regular expression like `a|b`. Its compiled form looks like
this:

    000 Save(0)
    001 Split(2, 3)
    002 'a' (goto: 4)
    003 'b'
    004 Save(1)
    005 Match

The first column is the instruction pointer and the second column is the
instruction. Save instructions indicate that the current position in the
input should be stored in a captured location. Split instructions represent a
binary branch in the program (i.e., epsilon transitions). The instructions
`'a'` and `'b'` indicate that the literal bytes `'a'` or `'b'` should match.
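
To make the encoding concrete, here is a miniature Pike VM that runs exactly
the program above (toy types with no capture tracking; the real instructions
are in src/prog.rs and the real engine is in src/pikevm.rs):

```
#[derive(Clone, Copy)]
enum Inst { Save(usize), Split(usize, usize), Char(u8, usize), Match }

/// Add `pc` to the current thread list, following epsilon transitions
/// (Save/Split) immediately, as the Pike VM does.
fn add(prog: &[Inst], list: &mut Vec<usize>, pc: usize) {
    if list.contains(&pc) { return; }
    match prog[pc] {
        Inst::Split(x, y) => { add(prog, list, x); add(prog, list, y); }
        Inst::Save(_) => add(prog, list, pc + 1), // capture slots elided
        _ => list.push(pc),
    }
}

/// All live threads advance over each input byte in lockstep, which is
/// what bounds the running time by len(prog) * len(input).
fn is_match(prog: &[Inst], input: &[u8]) -> bool {
    let mut clist = Vec::new();
    add(prog, &mut clist, 0);
    for &b in input {
        let mut nlist = Vec::new();
        for &pc in &clist {
            match prog[pc] {
                Inst::Char(c, goto) if c == b => add(prog, &mut nlist, goto),
                Inst::Match => return true,
                _ => {}
            }
        }
        clist = nlist;
    }
    clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

fn main() {
    // `a|b`, encoded exactly as in the listing above. `003 'b'` has an
    // implicit goto of 4, the next instruction.
    let prog = [
        Inst::Save(0),
        Inst::Split(2, 3),
        Inst::Char(b'a', 4),
        Inst::Char(b'b', 4),
        Inst::Save(1),
        Inst::Match,
    ];
    assert!(is_match(&prog, b"b"));
    assert!(!is_match(&prog, b"z"));
}
```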

In older versions of this library, the compilation looked like this:

    000 Save(0)
    001 Split(2, 3)
    002 'a'
    003 Jump(5)
    004 'b'
    005 Save(1)
    006 Match

Notice the `Jump` instruction: it is empty in the sense that it merely moves
execution from one point in the program to another. Such instructions were
removed in favor of embedding a `goto` pointer into every instruction. This
resulted in a small performance boost for the Pike VM, because it is one
fewer epsilon transition to follow.

There exist more instructions; they are defined and documented in
src/prog.rs.

Compilation has several knobs and a few unfortunately complicated invariants.
Namely, the output of compilation can be one of two types of programs: a
program that executes on Unicode scalar values or a program that executes on
raw bytes. In the former case, the matching engine is responsible for
performing UTF-8 decoding and executing instructions using Unicode
codepoints. In the latter case, the program handles UTF-8 decoding
implicitly, so that the matching engine can execute on raw bytes. All
matching engines can execute either Unicode or byte based programs except for
the lazy DFA, which requires byte based programs. In general, both
representations were kept because (1) the lazy DFA requires byte based
programs so that states can be encoded in a memory efficient manner and (2)
the Pike VM benefits greatly from inlining Unicode character classes into
fewer instructions, as that results in fewer epsilon transitions.

N.B. UTF-8 decoding is built into the compiled program by making use of the
utf8-ranges crate. The compiler in this library factors out common suffixes
to reduce the size of huge character classes (e.g., `\pL`).
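
As a tiny, hand-computed illustration of what byte based programs buy (this
is not the utf8-ranges algorithm, which handles arbitrary ranges): the class
`[\u{400}-\u{4FF}]` spans only two-byte UTF-8 encodings, and its endpoints
happen to align with full continuation-byte ranges, so it can be matched by
two byte-range instructions with no UTF-8 decoding at all:

```
fn main() {
    // Endpoints of the class [\u{400}-\u{4FF}] (roughly, Cyrillic).
    let (mut lo, mut hi) = ([0u8; 4], [0u8; 4]);
    let lo = '\u{400}'.encode_utf8(&mut lo).as_bytes(); // [0xD0, 0x80]
    let hi = '\u{4FF}'.encode_utf8(&mut hi).as_bytes(); // [0xD3, 0xBF]
    // Because lo ends on 0x80 and hi on 0xBF (the full continuation
    // range), the whole class is exactly the byte-level pattern
    // [D0-D3][80-BF]: two range instructions in a byte based program.
    println!("[{:02X}-{:02X}][{:02X}-{:02X}]", lo[0], hi[0], lo[1], hi[1]);
}
```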

A regrettable consequence of this split in instruction sets is that we
generally need to compile two programs: one for NFA execution and one for the
lazy DFA.

In fact, it is worse than that: the lazy DFA is not capable of finding the
starting location of a match in a single scan, and must instead execute a
backwards search after finding the end location. To execute a backwards
search, we must have compiled the regular expression *in reverse*.
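
The reverse program is compiled from a transformed syntax tree, and the
essence of that transformation is just recursively reversing concatenations
(a toy AST below; regex-syntax's real abstract syntax has many more
variants):

```
enum Ast {
    Lit(char),
    Concat(Vec<Ast>),
    Alt(Vec<Ast>),
}

/// `reverse(ast)` matches exactly the reversals of the strings that
/// `ast` matches: concatenations flip, alternations are untouched.
fn reverse(ast: Ast) -> Ast {
    match ast {
        Ast::Lit(c) => Ast::Lit(c),
        Ast::Concat(v) => Ast::Concat(v.into_iter().rev().map(reverse).collect()),
        Ast::Alt(v) => Ast::Alt(v.into_iter().map(reverse).collect()),
    }
}
```

Running the reverse program over the input, scanning backwards from the end
of a match, recovers the match's starting location.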

This means that every compilation of a regular expression generally results
in three distinct programs. It would be possible to lazily compile the
Unicode program, since it is never needed if (1) the regular expression uses
no word boundary assertions and (2) the caller never asks for sub-capture
locations.

### Execution

At the time of writing, there are four matching engines in this library:

1. The Pike VM (supports captures).
2. Bounded backtracking (supports captures).
3. Literal substring or multi-substring search.
4. Lazy DFA (no support for Unicode word boundary assertions).

Only the first two matching engines are capable of executing every regular
expression program. They also happen to be the slowest, which means we need
some logic that (1) knows various facts about the regular expression and (2)
knows what the caller wants. Using this information, we can determine which
engine (or engines) to use.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values contain regular expression Programs
(defined in src/prog.rs), which contain all the necessary tidbits for
actually executing a regular expression on search text.
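
In broad strokes, the decision looks something like this sketch (hypothetical
names and predicates; the real logic on the Exec type is richer, and in
particular runs the lazy DFA first and then re-runs an NFA engine over the
matched span when captures are requested):

```
// Hypothetical facts about a compiled regex and the caller's request.
struct Facts {
    is_literal_set: bool, // regex is just one or more literals
    has_unicode_word_boundary: bool,
    wants_captures: bool,
    small_regex_and_input: bool,
}

enum Engine { Literals, LazyDfa, BoundedBacktracker, PikeVm }

fn choose(f: &Facts) -> Engine {
    if f.is_literal_set {
        Engine::Literals // pure substring / multi-substring search
    } else if !f.has_unicode_word_boundary && !f.wants_captures {
        Engine::LazyDfa // fastest, but no captures and no Unicode \b
    } else if f.small_regex_and_input {
        Engine::BoundedBacktracker // supports captures, memory-bounded
    } else {
        Engine::PikeVm // always correct: the slow but safe default
    }
}
```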

For the most part, the execution logic is straightforward and follows the
limitations of each engine described above pretty faithfully. The hairiest
part of src/exec.rs by far is the execution of the lazy DFA, since it
requires a forwards and backwards search, and then falls back to either the
Pike VM or backtracking if the caller requested capture locations.

The Exec type also contains mutable scratch space for each type of matching
engine. This scratch space is used during search (for example, for the lazy
DFA, it contains compiled states that are reused on subsequent searches).
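
In spirit (hypothetical fields; the real crate also arranges for thread-safe
reuse), that scratch space looks like:

```
use std::cell::RefCell;

struct DfaCache { /* compiled DFA states, reused across searches */ }
struct NfaCache { /* thread lists, capture slots, visited bits */ }

struct Exec {
    // Immutable, shareable compiled programs would live here.
    scratch: RefCell<(DfaCache, NfaCache)>,
}

impl Exec {
    fn is_match(&self, _text: &[u8]) -> bool {
        // Borrow the scratch space for one search; the allocations inside
        // it (and, for the DFA, its computed states) survive to speed up
        // the next search.
        let _scratch = self.scratch.borrow_mut();
        unimplemented!("run the chosen engine using the scratch space")
    }
}
```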

### Programs

A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).
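
A hypothetical shape, for orientation only (the real type in src/prog.rs
carries many more fields):

```
// Not the real Program: a sketch of what "opcodes plus facts" means.
struct Program {
    insts: Vec<Inst>,                   // the compiled opcode sequence
    is_anchored: bool,                  // must a match begin at the start?
    capture_names: Vec<Option<String>>, // names of capture groups
    prefixes: Vec<Vec<u8>>,             // literal prefixes for fast scans
}

enum Inst { /* Save, Split, Char, Bytes, Match, ... */ }
```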

### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

1. Syntax checking was done at compile time. Your Rust program wouldn't
   compile if your regex didn't compile.
2. Reduction of overhead that was proportional to the size of the regex.
   For the most part, this overhead consisted of heap allocation, which
   was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other
regex engines (DFA, bounded backtracker) and sophisticated literal
optimizations. The `regex!` macro didn't keep pace, and it therefore became
(dramatically) slower than the dynamic engines. The only reason left to use
it was for the compile time guarantee that your regex is correct.
Fortunately, Clippy (the Rust lint tool) has a lint that checks the validity
of your regular expressions, which mostly replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing

A key aspect of any mature regex library is its test suite. A subset of the
tests in this library come from Glenn Fowler's AT&T test suite (its online
presence seems gone at the time of writing). The source of the test suite is
located in src/testdata. The script scripts/regex-match-tests.py takes the
test suite in src/testdata and generates tests/matches.rs.

There are also many other manually crafted tests and regression tests in
tests/tests.rs. Some of these tests were taken from RE2.

The biggest source of complexity in the tests is related to answering this
question: how can we reuse the tests to check all of our matching engines?
One approach would have been to encode every test into some kind of format
(like the AT&T test suite) and generate tests for each matching engine from
that. The approach we use in this library is to create a Cargo.toml entry
point for each matching engine we want to test. The entry points are:

* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex.
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *arbitrary* byte based programs.
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *UTF-8* byte based programs.
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
  backtracking on every regex.
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *arbitrary* byte based programs.
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *UTF-8* byte based programs.
* `tests/test_crates_regex.rs` - tests to make sure that all of the
  backends behave in the same way against a number of quickcheck
  generated random inputs. These tests need to be enabled through
  the `RUST_REGEX_RANDOM_TEST` environment variable (see below).

The lazy DFA and pure literal engines are absent from this list because
they cannot be used on every regular expression. Instead, we rely on
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when
possible.

Since the tests are repeated several times, and because `cargo test` runs all
entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use
the `tests/test_default.rs` entry point.

The random testing takes quite a while, so it is not enabled by default. In
order to run it, set the `RUST_REGEX_RANDOM_TEST` environment variable to
anything before invoking `cargo test`. Note that this variable is inspected
at compile time, so if the tests don't seem to be running, you may need to
run `cargo clean` first.
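
For example:

    $ cargo clean                         # the variable is read at compile time
    $ RUST_REGEX_RANDOM_TEST=1 cargo test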

## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks.
Currently, there are two primary sets of benchmarks: the benchmarks that were
adopted at this library's inception (in `bench/src/misc.rs`) and a newer set
of benchmarks meant to test various optimizations. Specifically, the latter
set contains some analysis and is in `bench/src/sherlock.rs`. Also, the
latter set is all executed on the same lengthy input, whereas the former
benchmarks are executed on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that its dependencies can be managed
separately from the main regex crate.

Benchmarking follows a similarly wonky setup to the tests. There are multiple
entry points:

* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE and Oniguruma benchmarks exist as a comparison point against mature
regular expression libraries. In general, this regex library compares
favorably (there are even a few benchmarks that PCRE simply runs too slowly
on or outright can't execute at all). I would love to add other regular
expression library benchmarks (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

    $ (cd bench && ./run rust)

If you want to compare your results with older benchmarks, then try:

    $ (cd bench && ./run rust | tee old)
    $ ... make it faster
    $ (cd bench && ./run rust | tee new)
    $ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/run --help`.

## Dev Docs

When digging your teeth into the codebase for the first time, the crate
documentation can be a great resource. By default `rustdoc` will strip out
all documentation of private crate members in an effort to help consumers of
the crate focus on the *interface* without having to concern themselves with
the *implementation*. Normally this is a great thing, but if you want to
start hacking on regex internals it is not what you want. Many of the private
members of this crate are well documented with rustdoc-style comments, and it
would be a shame to miss out on them. You can generate the private docs with:

```
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

Then just point your browser at `target/doc/regex/index.html`.
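
On newer toolchains, where `--no-defaults` and those named passes have been
removed from `rustdoc`, a close equivalent (a suggestion, not part of the
original workflow) is:

    $ cargo doc --document-private-items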

See https://github.com/rust-lang/rust/issues/15347 for more info about
generating developer docs for internal use.