Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: https://docs.rs/regex

## Theory vs. Practice

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched using finite state automata. This
means that, *in theory*, the performance of this crate is much better than most
regex implementations, which typically use backtracking, which has worst case
exponential time.

For example, try opening a Python interpreter and typing this:

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30).span()

I'll wait.

At some point, you'll figure out that it won't terminate any time soon. ^C it.

The promise of this crate is that *this pathological behavior can't happen*.

With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
improvement. In particular, compilation can take anywhere from a few dozen
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
classes, in particular, have the largest impact on compilation performance. At
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
is because `\pL` corresponds to every letter in Unicode and compilation must
turn it into a proper automaton that decodes a subset of UTF-8 which
corresponds to those letters. Compilation also spends some cycles shrinking the
size of the automaton.

This means that in order to realize efficient regex matching, one must
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

In many programming languages, regular expressions can be conveniently defined
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

    static MY_REGEX: Regex = Regex::new("...").unwrap();

Unfortunately, this would seem to imply that one must pass `Regex` objects
around to everywhere they are used, which can be especially painful depending
on how your program is structured. Thankfully, the
[`lazy_static`](https://crates.io/crates/lazy_static)
crate provides an answer that works well:

    #[macro_use] extern crate lazy_static;
    extern crate regex;

    use regex::Regex;

    fn some_helper_function(text: &str) -> bool {
        lazy_static! {
            static ref MY_REGEX: Regex = Regex::new("...").unwrap();
        }
        MY_REGEX.is_match(text)
    }

In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
it were a global static value. What is actually happening under the covers is
that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
it's self contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any additional memory overhead
beyond what is used when a single `Regex` is shared across multiple threads
simultaneously. *Its only cost is ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.

One might imagine that this is possible because a `Regex` represents a
*compiled* program, so that any allocation or mutation is already done, and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*. For example, when executing a DFA, its states are computed
lazily and reused on subsequent searches. Those states go into that mutable
scratch space.

The mutable scratch space is an implementation detail, and in general, its
mutation should not be observable from users of this crate. Therefore, it uses
interior mutability. This implies that `Regex` can either only be used from one
thread, or it must do some sort of synchronization. Either choice is
reasonable, but this crate chooses the latter, in particular because it is
ergonomic and makes use with `lazy_static!` straightforward.

Synchronization implies *some* amount of overhead. When a `Regex` is used from
a single thread, this overhead is negligible. When a `Regex` is used from
multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance. The specific cases where
contention may happen are if you are calling any of these methods repeatedly
from multiple threads simultaneously:

* shortest_match
* is_match
* find
* captures

In particular, every invocation of one of these methods must synchronize with
other threads to retrieve its mutable scratch space before searching can start.
If, however, you are using one of these methods:

* find_iter
* captures_iter

Then you may not suffer from contention since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.

## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
* find
* captures

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or the
end of the leftmost-first match. It can quit immediately after it knows there
is a match. For example, given the regex `a+` and the haystack `aaaaa`, the
search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to find
the start of the match. The two scans and the cost of finding the real end of
the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all because it must do what `find`
does, and then run either the bounded backtracker or the Pike VM to fill in the
capture group locations. Both of these are simulations of an NFA, which must
spend a lot of time shuffling states around. The DFA limits the performance hit
somewhat by restricting the amount of text that must be searched via an NFA
simulation.

One other method not mentioned is `shortest_match`. This method has precisely
the same performance characteristics as `is_match`, except it will return the
end location of the point at which it discovered a match. For example, given
the regex `a+` and the haystack `aaaaa`, `shortest_match` may return `1` as
opposed to `5`, the latter being the correct end location of the leftmost-first
match.

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
there's a match is the regex engine invoked at the location of the match, which
effectively permits the regex engine to skip large portions of a haystack.
If a regex is comprised entirely of literals (possibly more than one), then
it's possible that the regex engine can be avoided entirely even when there's a
match.

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

This optimization is extended quite a bit in this crate. Here are a few
examples of regexes that get literal prefixes detected:

* `(foo|bar)` detects `foo` and `bar`
* `(a|b)c` detects `ac` and `bc`
* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
* `a?b` detects `a` and `b`
* `a*b` detects `a` and `b`
* `(ab){3,6}` detects `ababab`

Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

This is a sad state of the current implementation. At the moment, the DFA will
try to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
alternative matching engine that can handle Unicode word boundaries correctly.
The alternative matching engine is generally quite a bit slower (perhaps by an
order of magnitude). If necessary, this can be ameliorated in two ways.

The first way is to add some number of literal prefixes to your regular
expression. Even though the DFA may not be used, specialized routines will
still kick in to find prefix literals quickly, which limits how much work the
NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
One can use an ASCII word boundary by disabling Unicode support. That is,
instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it
can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
is important to limit the scope of disabling the `u` flag, since it might lead
to a syntax error if the regex could match arbitrary bytes. For example, if one
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
any *byte* when the Unicode flag is disabled.

The second way isn't appreciably different than just using a Unicode word
boundary in the first place, since the DFA will speculatively interpret it as
an ASCII word boundary anyway. The key difference is that if an ASCII word
boundary is used explicitly, then the DFA won't quit in the presence of
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
against exponential blow up by doing two things:

1. The DFA is computed lazily. That is, a state in the DFA only exists in
   memory if it is visited. In particular, the lazy DFA guarantees that *at
   most* one state is created for every byte of input. This, on its own,
   guarantees linear time complexity.
2. Of course, creating a new state for *every* byte of input means that search
   will go incredibly slowly because of very large constant factors. On top of
   that, creating a state for every byte in a large haystack could result in
   exorbitant memory usage. To ameliorate this, the DFA bounds the number of
   states it can store. Once it reaches its limit, it flushes its cache. This
   prevents reuse of states that it already computed. If the cache is flushed
   too frequently, then the DFA will give up and execution will fall back to
   one of the NFA simulations.

In effect, this crate will detect exponential state blow up and fall back to
a search routine with fixed memory requirements. This does, however, mean that
searching will be much slower than one might expect. Regexes that rely on
counting in particular are strong aggravators of this behavior. For example,
try matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable for this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.