]> git.proxmox.com Git - rustc.git/blob - vendor/aho-corasick/src/lib.rs
New upstream version 1.72.1+dfsg1
[rustc.git] / vendor / aho-corasick / src / lib.rs
1 /*!
2 A library for finding occurrences of many patterns at once. This library
3 provides multiple pattern search principally through an implementation of the
4 [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm),
5 which builds a fast finite state machine for executing searches in linear time.
6
7 Additionally, this library provides a number of configuration options for
8 building the automaton that permit controlling the space versus time trade
9 off. Other features include simple ASCII case insensitive matching, finding
10 overlapping matches, replacements, searching streams and even searching and
11 replacing text in streams.
12
13 Finally, unlike most other Aho-Corasick implementations, this one
14 supports enabling [leftmost-first](MatchKind::LeftmostFirst) or
15 [leftmost-longest](MatchKind::LeftmostLongest) match semantics, using a
16 (seemingly) novel alternative construction algorithm. For more details on what
17 match semantics means, see the [`MatchKind`] type.
18
19 # Overview
20
21 This section gives a brief overview of the primary types in this crate:
22
23 * [`AhoCorasick`] is the primary type and represents an Aho-Corasick automaton.
24 This is the type you use to execute searches.
25 * [`AhoCorasickBuilder`] can be used to build an Aho-Corasick automaton, and
26 supports configuring a number of options.
27 * [`Match`] represents a single match reported by an Aho-Corasick automaton.
28 Each match has two pieces of information: the pattern that matched and the
29 start and end byte offsets corresponding to the position in the haystack at
30 which it matched.
31
32 # Example: basic searching
33
34 This example shows how to search for occurrences of multiple patterns
35 simultaneously. Each match includes the pattern that matched along with the
36 byte offsets of the match.
37
38 ```
39 use aho_corasick::{AhoCorasick, PatternID};
40
41 let patterns = &["apple", "maple", "Snapple"];
42 let haystack = "Nobody likes maple in their apple flavored Snapple.";
43
44 let ac = AhoCorasick::new(patterns).unwrap();
45 let mut matches = vec![];
46 for mat in ac.find_iter(haystack) {
47 matches.push((mat.pattern(), mat.start(), mat.end()));
48 }
49 assert_eq!(matches, vec![
50 (PatternID::must(1), 13, 18),
51 (PatternID::must(0), 28, 33),
52 (PatternID::must(2), 43, 50),
53 ]);
54 ```
55
56 # Example: case insensitivity
57
58 This is like the previous example, but matches `Snapple` case insensitively
59 using `AhoCorasickBuilder`:
60
61 ```
62 use aho_corasick::{AhoCorasick, PatternID};
63
64 let patterns = &["apple", "maple", "snapple"];
65 let haystack = "Nobody likes maple in their apple flavored Snapple.";
66
67 let ac = AhoCorasick::builder()
68 .ascii_case_insensitive(true)
69 .build(patterns)
70 .unwrap();
71 let mut matches = vec![];
72 for mat in ac.find_iter(haystack) {
73 matches.push((mat.pattern(), mat.start(), mat.end()));
74 }
75 assert_eq!(matches, vec![
76 (PatternID::must(1), 13, 18),
77 (PatternID::must(0), 28, 33),
78 (PatternID::must(2), 43, 50),
79 ]);
80 ```
81
82 # Example: replacing matches in a stream
83
84 This example shows how to execute a search and replace on a stream without
85 loading the entire stream into memory first.
86
87 ```
88 # #[cfg(feature = "std")] {
89 use aho_corasick::AhoCorasick;
90
91 # fn example() -> Result<(), std::io::Error> {
92 let patterns = &["fox", "brown", "quick"];
93 let replace_with = &["sloth", "grey", "slow"];
94
95 // In a real example, these might be `std::fs::File`s instead. All you need to
96 // do is supply a pair of `std::io::Read` and `std::io::Write` implementations.
97 let rdr = "The quick brown fox.";
98 let mut wtr = vec![];
99
100 let ac = AhoCorasick::new(patterns).unwrap();
101 ac.try_stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)?;
102 assert_eq!(b"The slow grey sloth.".to_vec(), wtr);
103 # Ok(()) }; example().unwrap()
104 # }
105 ```
106
107 # Example: finding the leftmost first match
108
109 In the textbook description of Aho-Corasick, its formulation is typically
110 structured such that it reports all possible matches, even when they overlap
111 with another. In many cases, overlapping matches may not be desired, such as
112 the case of finding all successive non-overlapping matches like you might with
113 a standard regular expression.
114
115 Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do
116 this doesn't always work in the expected way, since it will report matches as
117 soon as they are seen. For example, consider matching the regex `Samwise|Sam`
118 against the text `Samwise`. Most regex engines (that are Perl-like, or
119 non-POSIX) will report `Samwise` as a match, but the standard Aho-Corasick
120 algorithm modified for reporting non-overlapping matches will report `Sam`.
121
122 A novel contribution of this library is the ability to change the match
123 semantics of Aho-Corasick (without additional search time overhead) such that
124 `Samwise` is reported instead. For example, here's the standard approach:
125
126 ```
127 use aho_corasick::AhoCorasick;
128
129 let patterns = &["Samwise", "Sam"];
130 let haystack = "Samwise";
131
132 let ac = AhoCorasick::new(patterns).unwrap();
133 let mat = ac.find(haystack).expect("should have a match");
134 assert_eq!("Sam", &haystack[mat.start()..mat.end()]);
135 ```
136
137 And now here's the leftmost-first version, which matches how a Perl-like
138 regex will work:
139
140 ```
141 use aho_corasick::{AhoCorasick, MatchKind};
142
143 let patterns = &["Samwise", "Sam"];
144 let haystack = "Samwise";
145
146 let ac = AhoCorasick::builder()
147 .match_kind(MatchKind::LeftmostFirst)
148 .build(patterns)
149 .unwrap();
150 let mat = ac.find(haystack).expect("should have a match");
151 assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);
152 ```
153
154 In addition to leftmost-first semantics, this library also supports
155 leftmost-longest semantics, which match the POSIX behavior of a regular
156 expression alternation. See [`MatchKind`] for more details.
157
158 # Prefilters
159
160 While an Aho-Corasick automaton can perform admirably when compared to more
161 naive solutions, it is generally slower than more specialized algorithms that
162 are accelerated using vector instructions such as SIMD.
163
164 For that reason, this library will internally use a "prefilter" to attempt
165 to accelerate searches when possible. Currently, this library has several
166 different algorithms it might use depending on the patterns provided. Once the
167 number of patterns gets too big, prefilters are no longer used.
168
169 While a prefilter is generally good to have on by default since it works
170 well in the common case, it can lead to less predictable or even sub-optimal
171 performance in some cases. For that reason, prefilters can be explicitly
172 disabled via [`AhoCorasickBuilder::prefilter`].
173
174 # Lower level APIs
175
176 This crate also provides several sub-modules that collectively expose many of
177 the implementation details of the main [`AhoCorasick`] type. Most users of this
178 library can completely ignore the submodules and their contents, but if you
179 needed finer grained control, some parts of them may be useful to you. Here is
180 a brief overview of each and why you might want to use them:
181
182 * The [`packed`] sub-module contains a lower level API for using fast
183 vectorized routines for finding a small number of patterns in a haystack.
184 You might want to use this API when you want to completely side-step using
185 Aho-Corasick automata. Otherwise, the fast vectorized routines are used
186 automatically as prefilters for `AhoCorasick` searches whenever possible.
187 * The [`automaton`] sub-module provides a lower level finite state
188 machine interface that the various Aho-Corasick implementations in
189 this crate implement. This sub-module's main contribution is the
190 [`Automaton`](automaton::Automaton) trait, which permits manually walking the
191 state transitions of an Aho-Corasick automaton.
192 * The [`dfa`] and [`nfa`] sub-modules provide DFA and NFA implementations of
193 the aforementioned `Automaton` trait. The main reason one might want to use
194 these sub-modules is to get access to a type that implements the `Automaton`
195 trait. (The top-level `AhoCorasick` type does not implement the `Automaton`
196 trait.)
197
198 As mentioned above, if you aren't sure whether you need these sub-modules,
199 you should be able to safely ignore them and just focus on the [`AhoCorasick`]
200 type.
201
202 # Crate features
203
204 This crate exposes a few features for controlling dependency usage and whether
205 this crate can be used without the standard library.
206
207 * **std** -
208 Enables support for the standard library. This feature is enabled by
209 default. When disabled, only `core` and `alloc` are used. At an API
210 level, enabling `std` enables `std::error::Error` trait impls for the
211 various error types, and higher level stream search routines such as
212 [`AhoCorasick::try_stream_find_iter`]. But the `std` feature is also required
213 to enable vectorized prefilters. Prefilters can greatly accelerate searches,
214 but generally only apply when the number of patterns is small (less than
215 ~100).
216 * **perf-literal** -
217 Enables support for literal prefilters that use vectorized routines from
218 external crates. This feature is enabled by default. If you're only using
219 Aho-Corasick for large numbers of patterns or otherwise can abide lower
220 throughput when searching with a small number of patterns, then it is
221 reasonable to disable this feature.
222 * **logging** -
223 Enables a dependency on the `log` crate and emits messages to aide in
224 diagnostics. This feature is disabled by default.
225 */
226
227 #![no_std]
228 #![deny(missing_docs)]
229 #![deny(rustdoc::broken_intra_doc_links)]
230 #![cfg_attr(docsrs, feature(doc_auto_cfg))]
231
232 extern crate alloc;
233 #[cfg(any(test, feature = "std"))]
234 extern crate std;
235
236 #[cfg(doctest)]
237 doc_comment::doctest!("../README.md");
238
239 #[cfg(feature = "std")]
240 pub use crate::ahocorasick::StreamFindIter;
241 pub use crate::{
242 ahocorasick::{
243 AhoCorasick, AhoCorasickBuilder, AhoCorasickKind, FindIter,
244 FindOverlappingIter,
245 },
246 util::{
247 error::{BuildError, MatchError, MatchErrorKind},
248 primitives::{PatternID, PatternIDError},
249 search::{Anchored, Input, Match, MatchKind, Span, StartKind},
250 },
251 };
252
253 #[macro_use]
254 mod macros;
255
256 mod ahocorasick;
257 pub mod automaton;
258 pub mod dfa;
259 pub mod nfa;
260 pub mod packed;
261 #[cfg(test)]
262 mod tests;
263 // I wrote out the module for implementing fst::Automaton only to later realize
264 // that this would make fst a public dependency and fst is not at 1.0 yet. I
265 // decided to just keep the code in tree, but build it only during tests.
266 //
267 // TODO: I think I've changed my mind again. I'm considering pushing it out
268 // into either a separate crate or into 'fst' directly as an optional feature.
269 // #[cfg(test)]
270 // #[allow(dead_code)]
271 // mod transducer;
272 pub(crate) mod util;
273
274 #[cfg(test)]
275 mod testoibits {
276 use std::panic::{RefUnwindSafe, UnwindSafe};
277
278 use super::*;
279
280 fn assert_all<T: Send + Sync + UnwindSafe + RefUnwindSafe>() {}
281
282 #[test]
283 fn oibits_main() {
284 assert_all::<AhoCorasick>();
285 assert_all::<AhoCorasickBuilder>();
286 assert_all::<AhoCorasickKind>();
287 assert_all::<FindIter>();
288 assert_all::<FindOverlappingIter>();
289
290 assert_all::<BuildError>();
291 assert_all::<MatchError>();
292 assert_all::<MatchErrorKind>();
293
294 assert_all::<Anchored>();
295 assert_all::<Input>();
296 assert_all::<Match>();
297 assert_all::<MatchKind>();
298 assert_all::<Span>();
299 assert_all::<StartKind>();
300 }
301
302 #[test]
303 fn oibits_automaton() {
304 use crate::{automaton, dfa::DFA};
305
306 assert_all::<automaton::FindIter<DFA>>();
307 assert_all::<automaton::FindOverlappingIter<DFA>>();
308 #[cfg(feature = "std")]
309 assert_all::<automaton::StreamFindIter<DFA, std::io::Stdin>>();
310 assert_all::<automaton::OverlappingState>();
311
312 assert_all::<automaton::Prefilter>();
313 assert_all::<automaton::Candidate>();
314 }
315
316 #[test]
317 fn oibits_packed() {
318 use crate::packed;
319
320 assert_all::<packed::Config>();
321 assert_all::<packed::Builder>();
322 assert_all::<packed::Searcher>();
323 assert_all::<packed::FindIter>();
324 assert_all::<packed::MatchKind>();
325 }
326 }