Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: https://docs.rs/regex

## Theory vs. Practice

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched using finite state automata. This
means that, *in theory*, the performance of this crate is much better than most
regex implementations, which typically use backtracking and therefore have
worst case exponential time.

For example, try opening a Python interpreter and typing this:

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30).span()

I'll wait.

At some point, you'll figure out that it won't terminate any time soon. ^C it.

The promise of this crate is that *this pathological behavior can't happen*.

With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
improvement. In particular, compilation can take anywhere from a few dozen
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
classes, in particular, have the largest impact on compilation performance. At
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
is because `\pL` corresponds to every letter in Unicode and compilation must
turn it into a proper automaton that decodes a subset of UTF-8 which
corresponds to those letters. Compilation also spends some cycles shrinking the
size of the automaton.

This means that in order to realize efficient regex matching, one must
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

In many programming languages, regular expressions can be conveniently defined
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

    static MY_REGEX: Regex = Regex::new("...").unwrap();

Unfortunately, this would seem to imply that one must pass `Regex` objects
around to everywhere they are used, which can be especially painful depending
on how your program is structured. Thankfully, the
[`lazy_static`](https://crates.io/crates/lazy_static)
crate provides an answer that works well:

    #[macro_use] extern crate lazy_static;
    extern crate regex;

    use regex::Regex;

    fn some_helper_function(text: &str) -> bool {
        lazy_static! {
            static ref MY_REGEX: Regex = Regex::new("...").unwrap();
        }
        MY_REGEX.is_match(text)
    }

In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
it were a global static value. What is actually happening under the covers is
that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
it's self-contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any more memory overhead than
using a `Regex` from multiple threads simultaneously. *Its only cost is
ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.

One might imagine that this is possible because a `Regex` represents a
*compiled* program, so that any allocation or mutation is already done, and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*. For example, when executing a DFA, its states are computed
lazily and reused on subsequent searches. Those states go into that mutable
scratch space.

The mutable scratch space is an implementation detail, and in general, its
mutation should not be observable from users of this crate. Therefore, it uses
interior mutability. This implies that `Regex` can either only be used from one
thread, or it must do some sort of synchronization. Either choice is
reasonable, but this crate chooses the latter, in particular because it is
ergonomic and makes use with `lazy_static!` straightforward.

Synchronization implies *some* amount of overhead. When a `Regex` is used from
a single thread, this overhead is negligible. When a `Regex` is used from
multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance. The specific cases where
contention may happen are when you are calling any of these methods repeatedly
from multiple threads simultaneously:

* shortest_match
* is_match
* find
* captures

In particular, every invocation of one of these methods must synchronize with
other threads to retrieve its mutable scratch space before searching can start.
If, however, you are using one of these methods:

* find_iter
* captures_iter

Then you may not suffer from contention, since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.

## Only ask for what you need

**Advice**: Prefer, in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
* find
* captures

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or the
end of the leftmost-first match. It can quit immediately after it knows there
is a match. For example, given the regex `a+` and the haystack `aaaaa`, the
search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to find
the start of the match. The two scans and the cost of finding the real end of
the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all because it must do what `find`
does, and then run either the bounded backtracker or the Pike VM to fill in the
capture group locations. Both of these are simulations of an NFA, which must
spend a lot of time shuffling states around. The DFA limits the performance hit
somewhat by restricting the amount of text that must be searched via an NFA
simulation.

One other method not mentioned is `shortest_match`. This method has precisely
the same performance characteristics as `is_match`, except it will return the
end location of where it discovered a match. For example, given the regex `a+`
and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
the latter being the correct end location of the leftmost-first match.

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
there's a match is the regex engine invoked at the location of the match, which
effectively permits the regex engine to skip large portions of a haystack.
If a regex is composed entirely of literals (possibly more than one), then
it's possible that the regex engine can be avoided entirely even when there's a
match.

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

This optimization is extended quite a bit in this crate. Here are
a few examples of regexes that get literal prefixes detected:

* `(foo|bar)` detects `foo` and `bar`
* `(a|b)c` detects `ac` and `bc`
* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
* `a?b` detects `a` and `b`
* `a*b` detects `a` and `b`
* `(ab){3,6}` detects `ababab`

Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
alternative matching engine that can handle Unicode word boundaries correctly.
The alternative matching engine is generally quite a bit slower (perhaps by an
order of magnitude). If necessary, this can be ameliorated in two ways.

The first way is to add some number of literal prefixes to your regular
expression. Even though the DFA may not be used, specialized routines will
still kick in to find prefix literals quickly, which limits how much work the
NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
One can use an ASCII word boundary by disabling Unicode support. That is,
instead of using `\b`, use `(?-u:\b)`. Namely, the regex `\b.+\b` can be
transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
is important to limit the scope of disabling the `u` flag, since it might lead
to a syntax error if the regex could match arbitrary bytes. For example, if one
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
any *byte* when the Unicode flag is disabled.

The second way isn't appreciably different from just using a Unicode word
boundary in the first place, since the DFA will speculatively interpret it as
an ASCII word boundary anyway. The key difference is that if an ASCII word
boundary is used explicitly, then the DFA won't quit in the presence of
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
against exponential blow up by doing two things:

1. The DFA is computed lazily. That is, a state in the DFA only exists in
   memory if it is visited. In particular, the lazy DFA guarantees that *at
   most* one state is created for every byte of input. This, on its own,
   guarantees linear time complexity.
2. Of course, creating a new state for *every* byte of input means that search
   will go incredibly slowly because of very large constant factors. On top of
   that, creating a state for every byte in a large haystack could result in
   exorbitant memory usage. To ameliorate this, the DFA bounds the number of
   states it can store. Once it reaches its limit, it flushes its cache. This
   prevents reuse of states that it already computed. If the cache is flushed
   too frequently, then the DFA will give up and execution will fall back to
   one of the NFA simulations.

In effect, this crate will detect exponential state blow up and fall back to
a search routine with fixed memory requirements. This does, however, mean that
searching will be much slower than one might expect. Regexes that rely on
counting in particular are strong aggravators of this behavior. For example,
matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable to this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.