]>
Commit | Line | Data |
---|---|---|
f035d41b XL |
1 | /*! |
2 | A low level regular expression library that uses deterministic finite automata. | |
3 | It supports a rich syntax with Unicode support, has extensive options for | |
4 | configuring the best space vs time trade off for your use case and provides | |
5 | support for cheap deserialization of automata for use in `no_std` environments. | |
6 | ||
7 | # Overview | |
8 | ||
9 | This section gives a brief overview of the primary types in this crate: | |
10 | ||
11 | * A [`Regex`](struct.Regex.html) provides a way to search for matches of a | |
12 | regular expression. This includes iterating over matches with both the start | |
13 | and end positions of each match. | |
14 | * A [`RegexBuilder`](struct.RegexBuilder.html) provides a way configure many | |
15 | compilation options for a regex. | |
16 | * A [`DenseDFA`](enum.DenseDFA.html) provides low level access to a DFA that | |
17 | uses a dense representation (uses lots of space, but fast searching). | |
18 | * A [`SparseDFA`](enum.SparseDFA.html) provides the same API as a `DenseDFA`, | |
19 | but uses a sparse representation (uses less space, but slower matching). | |
20 | * A [`DFA`](trait.DFA.html) trait that defines an interface that all DFAs must | |
21 | implement. | |
22 | * Both dense DFAs and sparse DFAs support | |
23 | [serialization to raw bytes](enum.DenseDFA.html#method.to_bytes_little_endian) | |
24 | and | |
25 | [cheap deserialization](enum.DenseDFA.html#method.from_bytes). | |
26 | ||
27 | # Example: basic regex searching | |
28 | ||
29 | This example shows how to compile a regex using the default configuration | |
30 | and then use it to find matches in a byte string: | |
31 | ||
32 | ``` | |
33 | use regex_automata::Regex; | |
34 | ||
35 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); | |
36 | let text = b"2018-12-24 2016-10-08"; | |
37 | let matches: Vec<(usize, usize)> = re.find_iter(text).collect(); | |
38 | assert_eq!(matches, vec![(0, 10), (11, 21)]); | |
39 | ``` | |
40 | ||
41 | # Example: use sparse DFAs | |
42 | ||
43 | By default, compiling a regex will use dense DFAs internally. This uses more | |
44 | memory, but executes searches more quickly. If you can abide slower searches | |
45 | (somewhere around 3-5x), then sparse DFAs might make more sense since they can | |
46 | use significantly less space. | |
47 | ||
48 | Using sparse DFAs is as easy as using `Regex::new_sparse` instead of | |
49 | `Regex::new`: | |
50 | ||
51 | ``` | |
52 | use regex_automata::Regex; | |
53 | ||
54 | # fn example() -> Result<(), regex_automata::Error> { | |
55 | let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); | |
56 | let text = b"2018-12-24 2016-10-08"; | |
57 | let matches: Vec<(usize, usize)> = re.find_iter(text).collect(); | |
58 | assert_eq!(matches, vec![(0, 10), (11, 21)]); | |
59 | # Ok(()) }; example().unwrap() | |
60 | ``` | |
61 | ||
62 | If you already have dense DFAs for some reason, they can be converted to sparse | |
63 | DFAs and used to build a new `Regex`. For example: | |
64 | ||
65 | ``` | |
66 | use regex_automata::Regex; | |
67 | ||
68 | # fn example() -> Result<(), regex_automata::Error> { | |
69 | let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); | |
70 | let sparse_re = Regex::from_dfas( | |
71 | dense_re.forward().to_sparse()?, | |
72 | dense_re.reverse().to_sparse()?, | |
73 | ); | |
74 | let text = b"2018-12-24 2016-10-08"; | |
75 | let matches: Vec<(usize, usize)> = sparse_re.find_iter(text).collect(); | |
76 | assert_eq!(matches, vec![(0, 10), (11, 21)]); | |
77 | # Ok(()) }; example().unwrap() | |
78 | ``` | |
79 | ||
80 | # Example: deserialize a DFA | |
81 | ||
82 | This shows how to first serialize a DFA into raw bytes, and then deserialize | |
83 | those raw bytes back into a DFA. While this particular example is a bit | |
84 | contrived, this same technique can be used in your program to deserialize a | |
85 | DFA at start up time or by memory mapping a file. In particular, | |
86 | deserialization is guaranteed to be cheap because it will always be a constant | |
87 | time operation. | |
88 | ||
89 | ``` | |
90 | use regex_automata::{DenseDFA, Regex}; | |
91 | ||
92 | # fn example() -> Result<(), regex_automata::Error> { | |
93 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); | |
94 | // serialize both the forward and reverse DFAs, see note below | |
95 | let fwd_bytes = re1.forward().to_u16()?.to_bytes_native_endian()?; | |
96 | let rev_bytes = re1.reverse().to_u16()?.to_bytes_native_endian()?; | |
97 | // now deserialize both---we need to specify the correct type! | |
98 | let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&fwd_bytes) }; | |
99 | let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&rev_bytes) }; | |
100 | // finally, reconstruct our regex | |
101 | let re2 = Regex::from_dfas(fwd, rev); | |
102 | ||
103 | // we can use it like normal | |
104 | let text = b"2018-12-24 2016-10-08"; | |
105 | let matches: Vec<(usize, usize)> = re2.find_iter(text).collect(); | |
106 | assert_eq!(matches, vec![(0, 10), (11, 21)]); | |
107 | # Ok(()) }; example().unwrap() | |
108 | ``` | |
109 | ||
110 | There are a few points worth noting here: | |
111 | ||
112 | * We need to extract the raw DFAs used by the regex and serialize those. You | |
113 | can build the DFAs manually yourself using | |
114 | [`dense::Builder`](dense/struct.Builder.html), but using the DFAs from a | |
115 | `Regex` guarantees that the DFAs are built correctly. | |
116 | * We specifically convert the dense DFA to a representation that uses `u16` | |
117 | for its state identifiers using | |
118 | [`DenseDFA::to_u16`](enum.DenseDFA.html#method.to_u16). While this isn't | |
119 | strictly necessary, if we skipped this step, then the serialized bytes would | |
120 | use `usize` for state identifiers, which does not have a fixed size. Using | |
121 | `u16` ensures that we can deserialize this DFA even on platforms with a | |
122 | smaller pointer size. If our DFA is too big for `u16` state identifiers, then | |
123 | one can use `u32` or `u64`. | |
124 | * To convert the DFA to raw bytes, we use the `to_bytes_native_endian` | |
125 | method. In practice, you'll want to use either | |
126 | [`DenseDFA::to_bytes_little_endian`](enum.DenseDFA.html#method.to_bytes_little_endian) | |
127 | or | |
128 | [`DenseDFA::to_bytes_big_endian`](enum.DenseDFA.html#method.to_bytes_big_endian), | |
129 | depending on which platform you're deserializing your DFA from. If you intend | |
130 | to deserialize on either platform, then you'll need to serialize both and | |
131 | deserialize the right one depending on your target's endianness. | |
132 | * Deserializing a DFA requires the use of `unsafe` because the raw bytes must | |
133 | be *trusted*. In particular, while some degree of sanity checks are | |
134 | performed, nothing guarantees the integrity of the DFA's transition table | |
135 | since deserialization is a constant time operation. Since searching with a | |
136 | DFA must be able to follow transitions blindly for performance reasons, | |
137 | giving incorrect bytes to the deserialization API can result in memory | |
138 | unsafety. | |
139 | ||
140 | The same process can be achieved with sparse DFAs as well: | |
141 | ||
142 | ``` | |
143 | use regex_automata::{SparseDFA, Regex}; | |
144 | ||
145 | # fn example() -> Result<(), regex_automata::Error> { | |
146 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); | |
147 | // serialize both | |
148 | let fwd_bytes = re1.forward().to_u16()?.to_sparse()?.to_bytes_native_endian()?; | |
149 | let rev_bytes = re1.reverse().to_u16()?.to_sparse()?.to_bytes_native_endian()?; | |
150 | // now deserialize both---we need to specify the correct type! | |
151 | let fwd: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&fwd_bytes) }; | |
152 | let rev: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&rev_bytes) }; | |
153 | // finally, reconstruct our regex | |
154 | let re2 = Regex::from_dfas(fwd, rev); | |
155 | ||
156 | // we can use it like normal | |
157 | let text = b"2018-12-24 2016-10-08"; | |
158 | let matches: Vec<(usize, usize)> = re2.find_iter(text).collect(); | |
159 | assert_eq!(matches, vec![(0, 10), (11, 21)]); | |
160 | # Ok(()) }; example().unwrap() | |
161 | ``` | |
162 | ||
163 | Note that unlike dense DFAs, sparse DFAs have no alignment requirements. | |
164 | Conversely, dense DFAs must be be aligned to the same alignment as their | |
165 | state identifier representation. | |
166 | ||
167 | # Support for `no_std` | |
168 | ||
169 | This crate comes with a `std` feature that is enabled by default. When the | |
170 | `std` feature is enabled, the API of this crate will include the facilities | |
171 | necessary for compiling, serializing, deserializing and searching with regular | |
172 | expressions. When the `std` feature is disabled, the API of this crate will | |
173 | shrink such that it only includes the facilities necessary for deserializing | |
174 | and searching with regular expressions. | |
175 | ||
176 | The intended workflow for `no_std` environments is thus as follows: | |
177 | ||
178 | * Write a program with the `std` feature that compiles and serializes a | |
179 | regular expression. Serialization should only happen after first converting | |
180 | the DFAs to use a fixed size state identifier instead of the default `usize`. | |
181 | You may also need to serialize both little and big endian versions of each | |
182 | DFA. (So that's 4 DFAs in total for each regex.) | |
183 | * In your `no_std` environment, follow the examples above for deserializing | |
184 | your previously serialized DFAs into regexes. You can then search with them | |
185 | as you would any regex. | |
186 | ||
187 | Deserialization can happen anywhere. For example, with bytes embedded into a | |
188 | binary or with a file memory mapped at runtime. | |
189 | ||
190 | Note that the | |
191 | [`ucd-generate`](https://github.com/BurntSushi/ucd-generate) | |
192 | tool will do the first step for you with its `dfa` or `regex` sub-commands. | |
193 | ||
194 | # Syntax | |
195 | ||
196 | This crate supports the same syntax as the `regex` crate, since they share the | |
197 | same parser. You can find an exhaustive list of supported syntax in the | |
198 | [documentation for the `regex` crate](https://docs.rs/regex/1.1/regex/#syntax). | |
199 | ||
200 | Currently, there are a couple limitations. In general, this crate does not | |
201 | support zero-width assertions, although they may be added in the future. This | |
202 | includes: | |
203 | ||
204 | * Anchors such as `^`, `$`, `\A` and `\z`. | |
205 | * Word boundary assertions such as `\b` and `\B`. | |
206 | ||
207 | It is possible to run a search that is anchored at the beginning of the input. | |
208 | To do that, set the | |
209 | [`RegexBuilder::anchored`](struct.RegexBuilder.html#method.anchored) | |
210 | option when building a regex. By default, all searches are unanchored. | |
211 | ||
212 | # Differences with the regex crate | |
213 | ||
214 | The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a | |
215 | general purpose regular expression engine. It aims to automatically balance low | |
216 | compile times, fast search times and low memory usage, while also providing | |
217 | a convenient API for users. In contrast, this crate provides a lower level | |
218 | regular expression interface that is a bit less convenient while providing more | |
219 | explicit control over memory usage and search times. | |
220 | ||
221 | Here are some specific negative differences: | |
222 | ||
223 | * **Compilation can take an exponential amount of time and space** in the size | |
224 | of the regex pattern. While most patterns do not exhibit worst case | |
225 | exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will | |
226 | build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should | |
227 | not be compiled with this library. (In the future, the API may expose an | |
228 | option to return an error if the DFA gets too big.) | |
229 | * This crate does not support sub-match extraction, which can be achieved with | |
230 | the regex crate's "captures" API. This may be added in the future, but is | |
231 | unlikely. | |
232 | * While the regex crate doesn't necessarily sport fast compilation times, the | |
233 | regexes in this crate are almost universally slow to compile, especially when | |
234 | they contain large Unicode character classes. For example, on my system, | |
235 | compiling `\w{3}` with byte classes enabled takes just over 1 second and | |
236 | almost 5MB of memory! (Compiling a sparse regex takes about the same time | |
237 | but only uses about 500KB of memory.) Conversly, compiling the same regex | |
238 | without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and | |
239 | less than 5KB of memory. For this reason, you should only use Unicode | |
240 | character classes if you absolutely need them! | |
241 | * This crate does not support regex sets. | |
242 | * This crate does not support zero-width assertions such as `^`, `$`, `\b` or | |
243 | `\B`. | |
244 | * As a lower level crate, this library does not do literal optimizations. In | |
245 | exchange, you get predictable performance regardless of input. The | |
246 | philosophy here is that literal optimizations should be applied at a higher | |
247 | level, although there is no easy support for this in the ecosystem yet. | |
248 | * There is no `&str` API like in the regex crate. In this crate, all APIs | |
249 | operate on `&[u8]`. By default, match indices are guaranteed to fall on | |
250 | UTF-8 boundaries, unless | |
251 | [`RegexBuilder::allow_invalid_utf8`](struct.RegexBuilder.html#method.allow_invalid_utf8) | |
252 | is enabled. | |
253 | ||
254 | With some of the downsides out of the way, here are some positive differences: | |
255 | ||
256 | * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply | |
257 | deserialized. Deserialization always takes constant time since searching can | |
258 | be performed directly on the raw serialized bytes of a DFA. | |
259 | * This crate was specifically designed so that the searching phase of a DFA has | |
260 | minimal runtime requirements, and can therefore be used in `no_std` | |
261 | environments. While `no_std` environments cannot compile regexes, they can | |
262 | deserialize pre-compiled regexes. | |
263 | * Since this crate builds DFAs ahead of time, it will generally out-perform | |
264 | the `regex` crate on equivalent tasks. The performance difference is likely | |
265 | not large. However, because of a complex set of optimizations in the regex | |
266 | crate (like literal optimizations), an accurate performance comparison may be | |
267 | difficult to do. | |
268 | * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search | |
269 | performance a small amount, but uses much less storage space. Potentially | |
270 | even less than what the regex crate uses. | |
271 | * This crate exposes DFAs directly, such as | |
272 | [`DenseDFA`](enum.DenseDFA.html) and [`SparseDFA`](enum.SparseDFA.html), | |
273 | which enables one to do less work in some cases. For example, if you only | |
274 | need the end of a match and not the start of a match, then you can use a DFA | |
275 | directly without building a `Regex`, which always requires a second DFA to | |
276 | find the start of a match. | |
277 | * Aside from choosing between dense and sparse DFAs, there are several options | |
278 | for configuring the space usage vs search time trade off. These include | |
279 | things like choosing a smaller state identifier representation, to | |
280 | premultiplying state identifiers and splitting a DFA's alphabet into | |
281 | equivalence classes. Finally, DFA minimization is also provided, but can | |
282 | increase compilation times dramatically. | |
283 | */ | |
284 | ||
285 | #![deny(missing_docs)] | |
286 | #![cfg_attr(not(feature = "std"), no_std)] | |
287 | ||
288 | #[cfg(feature = "std")] | |
289 | extern crate core; | |
290 | ||
291 | #[cfg(all(test, feature = "transducer"))] | |
292 | extern crate bstr; | |
f035d41b XL |
293 | #[cfg(feature = "transducer")] |
294 | extern crate fst; | |
295 | #[cfg(feature = "std")] | |
296 | extern crate regex_syntax; | |
297 | ||
298 | pub use dense::DenseDFA; | |
299 | pub use dfa::DFA; | |
300 | #[cfg(feature = "std")] | |
301 | pub use error::{Error, ErrorKind}; | |
302 | pub use regex::Regex; | |
303 | #[cfg(feature = "std")] | |
304 | pub use regex::RegexBuilder; | |
305 | pub use sparse::SparseDFA; | |
306 | pub use state_id::StateID; | |
307 | ||
136023e0 | 308 | mod byteorder; |
f035d41b XL |
309 | mod classes; |
310 | #[path = "dense.rs"] | |
311 | mod dense_imp; | |
312 | #[cfg(feature = "std")] | |
313 | mod determinize; | |
314 | mod dfa; | |
315 | #[cfg(feature = "std")] | |
316 | mod error; | |
317 | #[cfg(feature = "std")] | |
318 | mod minimize; | |
319 | #[cfg(feature = "std")] | |
320 | #[doc(hidden)] | |
321 | pub mod nfa; | |
322 | mod regex; | |
323 | #[path = "sparse.rs"] | |
324 | mod sparse_imp; | |
325 | #[cfg(feature = "std")] | |
326 | mod sparse_set; | |
327 | mod state_id; | |
328 | #[cfg(feature = "transducer")] | |
329 | mod transducer; | |
330 | ||
331 | /// Types and routines specific to dense DFAs. | |
332 | /// | |
333 | /// This module is the home of [`DenseDFA`](enum.DenseDFA.html) and each of its | |
334 | /// corresponding variant DFA types, such as [`Standard`](struct.Standard.html) | |
335 | /// and [`ByteClass`](struct.ByteClass.html). | |
336 | /// | |
337 | /// This module also contains a [builder](struct.Builder.html) for | |
338 | /// configuring the construction of a dense DFA. | |
339 | pub mod dense { | |
340 | pub use dense_imp::*; | |
341 | } | |
342 | ||
343 | /// Types and routines specific to sparse DFAs. | |
344 | /// | |
345 | /// This module is the home of [`SparseDFA`](enum.SparseDFA.html) and each of | |
346 | /// its corresponding variant DFA types, such as | |
347 | /// [`Standard`](struct.Standard.html) and | |
348 | /// [`ByteClass`](struct.ByteClass.html). | |
349 | /// | |
350 | /// Unlike the [`dense`](../dense/index.html) module, this module does not | |
351 | /// contain a builder specific for sparse DFAs. Instead, the intended way to | |
352 | /// build a sparse DFA is either by using a default configuration with its | |
353 | /// [constructor](enum.SparseDFA.html#method.new), | |
354 | /// or by first | |
355 | /// [configuring the construction of a dense DFA](../dense/struct.Builder.html) | |
356 | /// and then calling | |
357 | /// [`DenseDFA::to_sparse`](../enum.DenseDFA.html#method.to_sparse). | |
358 | pub mod sparse { | |
359 | pub use sparse_imp::*; | |
360 | } |