]> git.proxmox.com Git - cargo.git/blob - vendor/regex-0.2.5/src/simd_accel/teddy128.rs
New upstream version 0.24.0
[cargo.git] / vendor / regex-0.2.5 / src / simd_accel / teddy128.rs
1 /*!
2 Teddy is a simd accelerated multiple substring matching algorithm. The name
3 and the core ideas in the algorithm were learned from the [Hyperscan][1_u]
4 project.
5
6
7 Background
8 ----------
9
10 The key idea of Teddy is to do *packed* substring matching. In the literature,
11 packed substring matching is the idea of examing multiple bytes in a haystack
12 at a time to detect matches. Implementations of, for example, memchr (which
13 detects matches of a single byte) have been doing this for years. Only
14 recently, with the introduction of various SIMD instructions, has this been
15 extended to substring matching. The PCMPESTRI instruction (and its relatives),
16 for example, implements substring matching in hardware. It is, however, limited
17 to substrings of length 16 bytes or fewer, but this restriction is fine in a
18 regex engine, since we rarely care about the performance difference between
19 searching for a 16 byte literal and a 16 + N literal—16 is already long
20 enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs
21 at least, is its latency and throughput. As a result, it is often faster to do
22 substring search with a Boyer-Moore variant and a well placed memchr to quickly
23 skip through the haystack.
24
25 There are fewer results from the literature on packed substring matching,
26 and even fewer for packed multiple substring matching. Ben-Kiki et al. [2]
27 describes use of PCMPESTRI for substring matching, but is mostly theoretical
28 and hand-waves performance. There is other theoretical work done by Bille [3]
29 as well.
30
31 The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci
32 and is generally focused on multiple pattern search. Their first paper [4a]
33 introduces the concept of a fingerprint, which is computed for every block of
34 N bytes in every pattern. The haystack is then scanned N bytes at a time and
35 a fingerprint is computed in the same way it was computed for blocks in the
36 patterns. If the fingerprint corresponds to one that was found in a pattern,
37 then a verification step follows to confirm that one of the substrings with the
38 corresponding fingerprint actually matches at the current location. Various
39 implementation tricks are employed to make sure the fingerprint lookup is fast;
40 typically by truncating the fingerprint. (This may, of course, provoke more
41 steps in the verification process, so a balance must be struck.)
42
43 The main downside of [4a] is that the minimum substring length is 32 bytes,
44 presumably because of how the algorithm uses certain SIMD instructions. This
45 essentially makes it useless for general purpose regex matching, where a small
46 number of short patterns is far more likely.
47
48 Faro and Kulekci published another paper [4b] that is conceptually very similar
49 to [4a]. The key difference is that it uses the CRC32 instruction (introduced
50 as part of SSE 4.2) to compute fingerprint values. This also enables the
51 algorithm to work effectively on substrings as short as 7 bytes with 4 byte
52 windows. 7 bytes is unfortunately still too long. The window could be
53 technically shrunk to 2 bytes, thereby reducing minimum length to 3, but the
54 small window size ends up negating most performance benefits—and it's likely
55 the common case in a general purpose regex engine.
56
57 Faro and Kulekci also published [4c] that appears to be intended as a
58 replacement to using PCMPESTRI. In particular, it is specifically motivated by
59 the high throughput/latency time of PCMPESTRI and therefore chooses other SIMD
60 instructions that are faster. While this approach works for short substrings,
61 I personally couldn't see a way to generalize it to multiple substring search.
62
63 Faro and Kulekci have another paper [4d] that I haven't been able to read
64 because it is behind a paywall.
65
66
67 Teddy
68 -----
69
70 Finally, we get to Teddy. If the above literature review is complete, then it
71 appears that Teddy is a novel algorithm. More than that, in my experience, it
72 completely blows away the competition for short substrings, which is exactly
73 what we want in a general purpose regex engine. Again, the algorithm appears
74 to be developed by the authors of [Hyperscan][1_u]. Hyperscan was open sourced
75 late 2015, and no earlier history could be found. Therefore, tracking the exact
76 provenance of the algorithm with respect to the published literature seems
77 difficult.
78
79 DISCLAIMER: My understanding of Teddy is limited to reading auto-generated C
80 code, its disassembly and observing its runtime behavior.
81
82 At a high level, Teddy works somewhat similarly to the fingerprint algorithms
83 published by Faro and Kulekci, but Teddy does it in a way that scales a bit
84 better. Namely:
85
86 1. Teddy's core algorithm scans the haystack in 16 byte chunks. 16 is
87 significant because it corresponds to the number of bytes in a SIMD vector.
88 If one used AVX2 instructions, then we could scan the haystack in 32 byte
89 chunks. Similarly, if one used AVX512 instructions, we could scan the
90 haystack in 64 byte chunks. Hyperscan implements SIMD + AVX2, we only
91 implement SIMD for the moment. (The author doesn't have a CPU with AVX2
92 support... yet.)
93 2. Bitwise operations are performed on each chunk to discover if any region of
94 it matches a set of precomputed fingerprints from the patterns. If there are
95 matches, then a verification step is performed. In this implementation, our
96 verification step is naive. This can be improved upon.
97
98 The details to make this work are quite clever. First, we must choose how to
99 pick our fingerprints. In Hyperscan's implementation, I *believe* they use the
100 last N bytes of each substring, where N must be at least the minimum length of
101 any substring in the set being searched. In this implementation, we use the
102 first N bytes of each substring. (The tradeoffs between these choices aren't
103 yet clear to me.) We then must figure out how to quickly test whether an
104 occurrence of any fingerprint from the set of patterns appears in a 16 byte
105 block from the haystack. To keep things simple, let's assume N = 1 and examine
106 some examples to motivate the approach. Here are our patterns:
107
108 ```ignore
109 foo
110 bar
111 baz
112 ```
113
114 The corresponding fingerprints, for N = 1, are `f`, `b` and `b`. Now let's set
115 our 16 byte block to:
116
117 ```ignore
118 bat cat foo bump
119 xxxxxxxxxxxxxxxx
120 ```
121
122 To cut to the chase, Teddy works by using bitsets. In particular, Teddy creates
123 a mask that allows us to quickly compute membership of a fingerprint in a 16
124 byte block that also tells which pattern the fingerprint corresponds to. In
125 this case, our fingerprint is a single byte, so an appropriate abstraction is
126 a map from a single byte to a list of patterns that contain that fingerprint:
127
128 ```ignore
129 f |--> foo
130 b |--> bar, baz
131 ```
132
133 Now, all we need to do is figure out how to represent this map in vector space
134 and use normal SIMD operations to perform a lookup. The first simplification
135 we can make is to represent our patterns as bit fields occupying a single
136 byte. This is important, because a single SIMD vector can store 16 bytes.
137
138 ```ignore
139 f |--> 00000001
140 b |--> 00000010, 00000100
141 ```
142
143 How do we perform lookup though? It turns out that SSSE3 introduced a very cool
144 instruction called PSHUFB. The instruction takes two SIMD vectors, `A` and `B`,
145 and returns a third vector `C`. All vectors are treated as 16 8-bit integers.
146 `C` is formed by `C[i] = A[B[i]]`. (This is a bit of a simplification, but true
147 for the purposes of this algorithm. For full details, see [Intel's Intrinsics
148 Guide][5_u].) This essentially lets us use the values in `B` to lookup values in
149 `A`.
150
151 If we could somehow cause `B` to contain our 16 byte block from the haystack,
152 and if `A` could contain our bitmasks, then we'd end up with something like
153 this for `A`:
154
155 ```ignore
156 0x00 0x01 ... 0x62 ... 0x66 ... 0xFF
157 A = 0 0 00000110 00000001 0
158 ```
159
160 And if `B` contains our window from our haystack, we could use shuffle to take
161 the values from `B` and use them to look up our bitsets in `A`. But of course,
162 we can't do this because `A` in the above example contains 256 bytes, which
163 is much larger than the size of a SIMD vector.
164
165 Nybbles to the rescue! A nybble is 4 bits. Instead of one mask to hold all of
166 our bitsets, we can use two masks, where one mask corresponds to the lower four
167 bits of our fingerprint and the other mask corresponds to the upper four bits.
168 So our map now looks like:
169
170 ```ignore
171 'f' & 0xF = 0x6 |--> 00000001
172 'f' >> 4 = 0x6 |--> 00000111
173 'b' & 0xF = 0x2 |--> 00000110
174 'b' >> 4 = 0x6 |--> 00000111
175 ```
176
177 Notice that the bitsets for each nybble correspond to the union of all
178 fingerprints that contain that nibble. For example, both `f` and `b` have the
179 same upper 4 bits but differ on the lower 4 bits. Putting this together, we
180 have `A0`, `A1` and `B`, where `A0` is our mask for the lower nybble, `A1` is
181 our mask for the upper nybble and `B` is our 16 byte block from the haystack:
182
183 ```ignore
184 0x00 0x01 0x02 0x03 ... 0x06 ... 0xF
185 A0 = 0 0 00000110 0 00000001 0
186 A1 = 0 0 0 0 00000111 0
187 B = b a t _ t p
188 B = 0x62 0x61 0x74 0x20 0x74 0x70
189 ```
190
191 But of course, we can't use `B` with `PSHUFB` yet, since its values are 8 bits,
192 and we need indexes that are at most 4 bits (corresponding to one of 16
193 values). We can apply the same transformation to split `B` into lower and upper
194 nybbles as we did `A`. As before, `B0` corresponds to the lower nybbles and
195 `B1` corresponds to the upper nybbles:
196
197 ```ignore
198 b a t _ c a t _ f o o _ b u m p
199 B0 = 0x2 0x1 0x4 0x0 0x3 0x1 0x4 0x0 0x6 0xF 0xF 0x0 0x2 0x5 0xD 0x0
200 B1 = 0x6 0x6 0x7 0x2 0x6 0x6 0x7 0x2 0x6 0x6 0x6 0x2 0x6 0x7 0x6 0x7
201 ```
202
203 And now we have a nice correspondence. `B0` can index `A0` and `B1` can index
204 `A1`. Here's what we get when we apply `C0 = PSHUFB(A0, B0)`:
205
206 ```ignore
207 b a ... f o ... p
208 A0[0x2] A0[0x1] A0[0x6] A0[0xF] A0[0x0]
209 C0 = 00000110 0 00000001 0 0
210 ```
211
212 And `C1 = PSHUFB(A1, B1)`:
213
214 ```ignore
215 b a ... f o ... p
216 A1[0x6] A1[0x6] A1[0x6] A1[0x6] A1[0x7]
217 C1 = 00000111 00000111 00000111 00000111 0
218 ```
219
220 Notice how neither one of `C0` or `C1` is guaranteed to report fully correct
221 results all on its own. For example, `C1` claims that `b` is a fingerprint for
222 the pattern `foo` (since `A1[0x6] = 00000111`), and that `o` is a fingerprint
223 for all of our patterns. But if we combined `C0` and `C1` with an `AND`
224 operation:
225
226 ```
227 b a ... f o ... p
228 C = 00000110 0 00000001 0 0
229 ```
230
231 Then we now have that `C[i]` contains a bitset corresponding to the matching
232 fingerprints in a haystack's 16 byte block, where `i` is the `ith` byte in that
233 block.
234
235 Once we have that, we can look for the position of the least significant bit
236 in `C`. That position, modulo `8`, gives us the pattern that the fingerprint
237 matches. That position, integer divided by `8`, also gives us the byte offset
238 that the fingerprint occurs in inside the 16 byte haystack block. Using those
239 two pieces of information, we can run a verification procedure that tries
240 to match all substrings containing that fingerprint at that position in the
241 haystack.
242
243
244 Implementation notes
245 --------------------
246
247 The problem with the algorithm as described above is that it uses a single byte
248 for a fingerprint. This will work well if the fingerprints are rare in the
249 haystack (e.g., capital letters or special characters in normal English text),
250 but if the fingerprints are common, you'll wind up spending too much time in
251 the verification step, which effectively negate the performance benefits of
252 scanning 16 bytes at a time. Remember, the key to the performance of this
253 algorithm is to do as little work as possible per 16 bytes.
254
255 This algorithm can be extrapolated in a relatively straight-forward way to use
256 larger fingerprints. That is, instead of a single byte prefix, we might use a
257 three byte prefix. The implementation below implements N = {1, 2, 3} and always
258 picks the largest N possible. The rationale is that the bigger the fingerprint,
259 the fewer verification steps we'll do. Of course, if N is too large, then we'll
260 end up doing too much on each step.
261
262 The way to extend it is:
263
264 1. Add a mask for each byte in the fingerprint. (Remember that each mask is
265 composed of two SIMD vectors.) This results in a value of `C` for each byte
266 in the fingerprint while searching.
267 2. When testing each 16 byte block, each value of `C` must be shifted so that
268 they are aligned. Once aligned, they should all be `AND`'d together. This
269 will give you only the bitsets corresponding to the full match of the
270 fingerprint.
271
272 The implementation below is commented to fill in the nitty gritty details.
273
274 References
275 ----------
276
277 - **[1]** [Hyperscan on GitHub](https://github.com/01org/hyperscan),
278 [webpage](https://01.org/hyperscan)
279 - **[2a]** Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R.,
280 & Weimann, O. (2011).
281 _Optimal packed string matching_.
282 In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13).
283 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
284 DOI: 10.4230/LIPIcs.FSTTCS.2011.423.
285 [PDF](http://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf).
286 - **[2b]** Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R.,
287 & Weimann, O. (2014).
288 _Towards optimal packed string matching_.
289 Theoretical Computer Science, 525, 111-129.
290 DOI: 10.1016/j.tcs.2013.06.013.
291 [PDF](http://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf).
292 - **[3]** Bille, P. (2011).
293 _Fast searching in packed strings_.
294 Journal of Discrete Algorithms, 9(1), 49-56.
295 DOI: 10.1016/j.jda.2010.09.003.
296 [PDF](http://www.sciencedirect.com/science/article/pii/S1570866710000353).
297 - **[4a]** Faro, S., & KĂĽlekci, M. O. (2012, October).
298 _Fast multiple string matching using streaming SIMD extensions technology_.
299 In String Processing and Information Retrieval (pp. 217-228).
300 Springer Berlin Heidelberg.
301 DOI: 10.1007/978-3-642-34109-0_23.
302 [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro32.pdf).
303 - **[4b]** Faro, S., & KĂĽlekci, M. O. (2013, September).
304 _Towards a Very Fast Multiple String Matching Algorithm for Short Patterns_.
305 In Stringology (pp. 78-91).
306 [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro36.pdf).
307 - **[4c]** Faro, S., & KĂĽlekci, M. O. (2013, January).
308 _Fast packed string matching for short patterns_.
309 In Proceedings of the Meeting on Algorithm Engineering & Expermiments
310 (pp. 113-121).
311 Society for Industrial and Applied Mathematics.
312 [PDF](http://arxiv.org/pdf/1209.6449.pdf).
313 - **[4d]** Faro, S., & KĂĽlekci, M. O. (2014).
314 _Fast and flexible packed string matching_.
315 Journal of Discrete Algorithms, 28, 61-72.
316 DOI: 10.1016/j.jda.2014.07.003.
317
318 [1_u]: https://github.com/01org/hyperscan
319 [5_u]: https://software.intel.com/sites/landingpage/IntrinsicsGuide
320 */
321
322 // TODO: Extend this to use AVX2 instructions.
323 // TODO: Extend this to use AVX512 instructions.
324 // TODO: Make the inner loop do aligned loads.
325
326 use std::cmp;
327 use std::ptr;
328
329 use aho_corasick::{Automaton, AcAutomaton, FullAcAutomaton};
330 use simd::u8x16;
331 use simd::x86::sse2::Sse2Bool8ix16;
332 use simd::x86::ssse3::Ssse3U8x16;
333
334 use syntax;
335
336 /// Corresponds to the number of bytes read at a time in the haystack.
337 const BLOCK_SIZE: usize = 16;
338
339 pub fn is_teddy_128_available() -> bool {
340 true
341 }
342
343 /// Match reports match information.
344 #[derive(Debug, Clone)]
345 pub struct Match {
346 /// The index of the pattern that matched. The index is in correspondence
347 /// with the order of the patterns given at construction.
348 pub pat: usize,
349 /// The start byte offset of the match.
350 pub start: usize,
351 /// The end byte offset of the match. This is always `start + pat.len()`.
352 pub end: usize,
353 }
354
355 /// A SIMD accelerated multi substring searcher.
356 #[derive(Debug, Clone)]
357 pub struct Teddy {
358 /// A list of substrings to match.
359 pats: Vec<Vec<u8>>,
360 /// An Aho-Corasick automaton of the patterns. We use this when we need to
361 /// search pieces smaller than the Teddy block size.
362 ac: FullAcAutomaton<Vec<u8>>,
363 /// A set of 8 buckets. Each bucket corresponds to a single member of a
364 /// bitset. A bucket contains zero or more substrings. This is useful
365 /// when the number of substrings exceeds 8, since our bitsets cannot have
366 /// more than 8 members.
367 buckets: Vec<Vec<usize>>,
368 /// Our set of masks. There's one mask for each byte in the fingerprint.
369 masks: Masks,
370 }
371
372 /// A list of masks. This has length equal to the length of the fingerprint.
373 /// The length of the fingerprint is always `min(3, len(smallest_substring))`.
374 #[derive(Debug, Clone)]
375 struct Masks(Vec<Mask>);
376
377 /// A single mask.
378 #[derive(Debug, Clone, Copy)]
379 struct Mask {
380 /// Bitsets for the low nybbles in a fingerprint.
381 lo: u8x16,
382 /// Bitsets for the high nybbles in a fingerprint.
383 hi: u8x16,
384 }
385
386 impl Teddy {
387 /// Create a new `Teddy` multi substring matcher.
388 ///
389 /// If a `Teddy` matcher could not be created (e.g., `pats` is empty or has
390 /// an empty substring), then `None` is returned.
391 pub fn new(pats: &syntax::Literals) -> Option<Teddy> {
392 let pats: Vec<_> = pats.literals().iter().map(|p|p.to_vec()).collect();
393 let min_len = pats.iter().map(|p| p.len()).min().unwrap_or(0);
394 // Don't allow any empty patterns and require that we have at
395 // least one pattern.
396 if min_len < 1 {
397 return None;
398 }
399 // Pick the largest mask possible, but no larger than 3.
400 let nmasks = cmp::min(3, min_len);
401 let mut masks = Masks::new(nmasks);
402 let mut buckets = vec![vec![]; 8];
403 // Assign a substring to each bucket, and add the bucket's bitfield to
404 // the appropriate position in the mask.
405 for (pati, pat) in pats.iter().enumerate() {
406 let bucket = pati % 8;
407 buckets[bucket].push(pati);
408 masks.add(bucket as u8, pat);
409 }
410 Some(Teddy {
411 pats: pats.to_vec(),
412 ac: AcAutomaton::new(pats.to_vec()).into_full(),
413 buckets: buckets,
414 masks: masks,
415 })
416 }
417
418 /// Returns all of the substrings matched by this `Teddy`.
419 pub fn patterns(&self) -> &[Vec<u8>] {
420 &self.pats
421 }
422
423 /// Returns the number of substrings in this matcher.
424 pub fn len(&self) -> usize {
425 self.pats.len()
426 }
427
428 /// Returns the approximate size on the heap used by this matcher.
429 pub fn approximate_size(&self) -> usize {
430 self.pats.iter().fold(0, |a, b| a + b.len())
431 }
432
433 /// Searches `haystack` for the substrings in this `Teddy`. If a match was
434 /// found, then it is returned. Otherwise, `None` is returned.
435 pub fn find(&self, haystack: &[u8]) -> Option<Match> {
436 // If our haystack is smaller than the block size, then fall back to
437 // a naive brute force search.
438 if haystack.is_empty() || haystack.len() < (BLOCK_SIZE + 2) {
439 return self.slow(haystack, 0);
440 }
441 match self.masks.len() {
442 0 => None,
443 1 => self.find1(haystack),
444 2 => self.find2(haystack),
445 3 => self.find3(haystack),
446 _ => unreachable!(),
447 }
448 }
449
450 /// `find1` is used when there is only 1 mask. This is the easy case and is
451 /// pretty much as described in the module documentation.
452 #[inline(always)]
453 fn find1(&self, haystack: &[u8]) -> Option<Match> {
454 let mut pos = 0;
455 let zero = u8x16::splat(0);
456 let len = haystack.len();
457 debug_assert!(len >= BLOCK_SIZE);
458 while pos <= len - BLOCK_SIZE {
459 let h = unsafe { u8x16::load_unchecked(haystack, pos) };
460 // N.B. `res0` is our `C` in the module documentation.
461 let res0 = self.masks.members1(h);
462 // Only do expensive verification if there are any non-zero bits.
463 let bitfield = res0.ne(zero).move_mask();
464 if bitfield != 0 {
465 if let Some(m) = self.verify(haystack, pos, res0, bitfield) {
466 return Some(m);
467 }
468 }
469 pos += BLOCK_SIZE;
470 }
471 self.slow(haystack, pos)
472 }
473
474 /// `find2` is used when there are 2 masks, e.g., the fingerprint is 2 bytes
475 /// long.
476 #[inline(always)]
477 fn find2(&self, haystack: &[u8]) -> Option<Match> {
478 // This is an exotic way to right shift a SIMD vector across lanes.
479 // See below at use for more details.
480 let res0shuffle = u8x16::new(
481 0, 0, 1, 2,
482 3, 4, 5, 6,
483 7, 8, 9, 10,
484 11, 12, 13, 14,
485 );
486 let zero = u8x16::splat(0);
487 let len = haystack.len();
488 // The previous value of `C` (from the module documentation) for the
489 // *first* byte in the fingerprint. On subsequent iterations, we take
490 // the last bitset from the previous `C` and insert it into the first
491 // position of the current `C`, shifting all other bitsets to the right
492 // one lane. This causes `C` for the first byte to line up with `C` for
493 // the second byte, so that they can be `AND`'d together.
494 let mut prev0 = u8x16::splat(0xFF);
495 let mut pos = 1;
496 debug_assert!(len >= BLOCK_SIZE);
497 while pos <= len - BLOCK_SIZE {
498 let h = unsafe { u8x16::load_unchecked(haystack, pos) };
499 let (res0, res1) = self.masks.members2(h);
500
501 // The next three lines are essentially equivalent to
502 //
503 // ```rust,ignore
504 // (prev0 << 15) | (res0 >> 1)
505 // ```
506 //
507 // ... if SIMD vectors could shift across lanes. There is the
508 // `PALIGNR` instruction, but apparently LLVM doesn't expose it as
509 // a proper intrinsic. Thankfully, it appears the following
510 // sequence does indeed compile down to a `PALIGNR`.
511 let prev0byte0 = prev0.extract(15);
512 let res0shiftr8 = res0.shuffle_bytes(res0shuffle);
513 let res0prev0 = res0shiftr8.replace(0, prev0byte0);
514
515 // `AND`'s our `C` values together.
516 let res = res0prev0 & res1;
517 prev0 = res0;
518
519 let bitfield = res.ne(zero).move_mask();
520 if bitfield != 0 {
521 let pos = pos.checked_sub(1).unwrap();
522 if let Some(m) = self.verify(haystack, pos, res, bitfield) {
523 return Some(m);
524 }
525 }
526 pos += BLOCK_SIZE;
527 }
528 // The windowing above doesn't check the last byte in the last
529 // window, so start the slow search at the last byte of the last
530 // window.
531 self.slow(haystack, pos.checked_sub(1).unwrap())
532 }
533
534 /// `find3` is used when there are 3 masks, e.g., the fingerprint is 3 bytes
535 /// long.
536 ///
537 /// N.B. This is a straight-forward extrapolation of `find2`. The only
538 /// difference is that we need to keep track of two previous values of `C`,
539 /// since we now need to align for three bytes.
540 #[inline(always)]
541 fn find3(&self, haystack: &[u8]) -> Option<Match> {
542 let zero = u8x16::splat(0);
543 let len = haystack.len();
544
545 let res0shuffle = u8x16::new(
546 0, 0, 0, 1,
547 2, 3, 4, 5,
548 6, 7, 8, 9,
549 10, 11, 12, 13,
550 );
551 let res1shuffle = u8x16::new(
552 0, 0, 1, 2,
553 3, 4, 5, 6,
554 7, 8, 9, 10,
555 11, 12, 13, 14,
556 );
557 let mut prev0 = u8x16::splat(0xFF);
558 let mut prev1 = u8x16::splat(0xFF);
559 let mut pos = 2;
560 while pos <= len - BLOCK_SIZE {
561 let h = unsafe { u8x16::load_unchecked(haystack, pos) };
562 let (res0, res1, res2) = self.masks.members3(h);
563
564 let prev0byte0 = prev0.extract(14);
565 let prev0byte1 = prev0.extract(15);
566 let res0shiftr16 = res0.shuffle_bytes(res0shuffle);
567 let res0prev0 = res0shiftr16.replace(0, prev0byte0)
568 .replace(1, prev0byte1);
569
570 let prev1byte0 = prev1.extract(15);
571 let res1shiftr8 = res1.shuffle_bytes(res1shuffle);
572 let res1prev1 = res1shiftr8.replace(0, prev1byte0);
573
574 let res = res0prev0 & res1prev1 & res2;
575
576 prev0 = res0;
577 prev1 = res1;
578
579 let bitfield = res.ne(zero).move_mask();
580 if bitfield != 0 {
581 let pos = pos.checked_sub(2).unwrap();
582 if let Some(m) = self.verify(haystack, pos, res, bitfield) {
583 return Some(m);
584 }
585 }
586 pos += BLOCK_SIZE;
587 }
588 // The windowing above doesn't check the last two bytes in the last
589 // window, so start the slow search at the penultimate byte of the
590 // last window.
591 // self.slow(haystack, pos.saturating_sub(2))
592 self.slow(haystack, pos.checked_sub(2).unwrap())
593 }
594
595 /// Runs the verification procedure on `res` (i.e., `C` from the module
596 /// documentation), where the haystack block starts at `pos` in
597 /// `haystack`. `bitfield` has ones in the bit positions that `res` has
598 /// non-zero bytes.
599 ///
600 /// If a match exists, it returns the first one.
601 #[inline(always)]
602 fn verify(
603 &self,
604 haystack: &[u8],
605 pos: usize,
606 res: u8x16,
607 mut bitfield: u32,
608 ) -> Option<Match> {
609 while bitfield != 0 {
610 // The next offset, relative to pos, where some fingerprint
611 // matched.
612 let byte_pos = bitfield.trailing_zeros();
613 bitfield &= !(1 << byte_pos);
614
615 // Offset relative to the beginning of the haystack.
616 let start = pos + byte_pos as usize;
617
618 // The bitfield telling us which patterns had fingerprints that
619 // match at this starting position.
620 let mut patterns = res.extract(byte_pos);
621 while patterns != 0 {
622 let bucket = patterns.trailing_zeros() as usize;
623 patterns &= !(1 << bucket);
624
625 // Actual substring search verification.
626 if let Some(m) = self.verify_bucket(haystack, bucket, start) {
627 return Some(m);
628 }
629 }
630 }
631
632 None
633 }
634
635 /// Verifies whether any substring in the given bucket matches in haystack
636 /// at the given starting position.
637 #[inline(always)]
638 fn verify_bucket(
639 &self,
640 haystack: &[u8],
641 bucket: usize,
642 start: usize,
643 ) -> Option<Match> {
644 // This cycles through the patterns in the bucket in the order that
645 // the patterns were given. Therefore, we guarantee leftmost-first
646 // semantics.
647 for &pati in &self.buckets[bucket] {
648 let pat = &*self.pats[pati];
649 if start + pat.len() > haystack.len() {
650 continue;
651 }
652 if pat == &haystack[start..start + pat.len()] {
653 return Some(Match {
654 pat: pati,
655 start: start,
656 end: start + pat.len(),
657 });
658 }
659 }
660 None
661 }
662
663 /// Slow substring search through all patterns in this matcher.
664 ///
665 /// This is used when we don't have enough bytes in the haystack for our
666 /// block based approach.
667 fn slow(&self, haystack: &[u8], pos: usize) -> Option<Match> {
668 self.ac.find(&haystack[pos..]).next().map(|m| {
669 Match {
670 pat: m.pati,
671 start: pos + m.start,
672 end: pos + m.end,
673 }
674 })
675 }
676 }
677
678 impl Masks {
679 /// Create a new set of masks of size `n`, where `n` corresponds to the
680 /// number of bytes in a fingerprint.
681 fn new(n: usize) -> Masks {
682 Masks(vec![Mask::new(); n])
683 }
684
685 /// Returns the number of masks.
686 fn len(&self) -> usize {
687 self.0.len()
688 }
689
690 /// Adds the given pattern to the given bucket. The bucket should be a
691 /// power of `2 <= 2^7`.
692 fn add(&mut self, bucket: u8, pat: &[u8]) {
693 for (i, mask) in self.0.iter_mut().enumerate() {
694 mask.add(bucket, pat[i]);
695 }
696 }
697
698 /// Finds the fingerprints that are in the given haystack block. i.e., this
699 /// returns `C` as described in the module documentation.
700 ///
701 /// More specifically, `for i in 0..16` and `j in 0..8, C[i][j] == 1` if and
702 /// only if `haystack_block[i]` corresponds to a fingerprint that is part
703 /// of a pattern in bucket `j`.
704 #[inline(always)]
705 fn members1(&self, haystack_block: u8x16) -> u8x16 {
706 let masklo = u8x16::splat(0xF);
707 let hlo = haystack_block & masklo;
708 let hhi = (haystack_block >> 4) & masklo;
709
710 self.0[0].lo.shuffle_bytes(hlo) & self.0[0].hi.shuffle_bytes(hhi)
711 }
712
713 /// Like members1, but computes C for the first and second bytes in the
714 /// fingerprint.
715 #[inline(always)]
716 fn members2(&self, haystack_block: u8x16) -> (u8x16, u8x16) {
717 let masklo = u8x16::splat(0xF);
718 let hlo = haystack_block & masklo;
719 let hhi = (haystack_block >> 4) & masklo;
720
721 let res0 = self.0[0].lo.shuffle_bytes(hlo)
722 & self.0[0].hi.shuffle_bytes(hhi);
723 let res1 = self.0[1].lo.shuffle_bytes(hlo)
724 & self.0[1].hi.shuffle_bytes(hhi);
725 (res0, res1)
726 }
727
728 /// Like `members1`, but computes `C` for the first, second and third bytes
729 /// in the fingerprint.
730 #[inline(always)]
731 fn members3(&self, haystack_block: u8x16) -> (u8x16, u8x16, u8x16) {
732 let masklo = u8x16::splat(0xF);
733 let hlo = haystack_block & masklo;
734 let hhi = (haystack_block >> 4) & masklo;
735
736 let res0 = self.0[0].lo.shuffle_bytes(hlo)
737 & self.0[0].hi.shuffle_bytes(hhi);
738 let res1 = self.0[1].lo.shuffle_bytes(hlo)
739 & self.0[1].hi.shuffle_bytes(hhi);
740 let res2 = self.0[2].lo.shuffle_bytes(hlo)
741 & self.0[2].hi.shuffle_bytes(hhi);
742 (res0, res1, res2)
743 }
744 }
745
746 impl Mask {
747 /// Create a new mask with no members.
748 fn new() -> Mask {
749 Mask {
750 lo: u8x16::splat(0),
751 hi: u8x16::splat(0),
752 }
753 }
754
755 /// Adds the given byte to the given bucket.
756 fn add(&mut self, bucket: u8, byte: u8) {
757 // Split our byte into two nybbles, and add each nybble to our
758 // mask.
759 let byte_lo = (byte & 0xF) as u32;
760 let byte_hi = (byte >> 4) as u32;
761
762 let lo = self.lo.extract(byte_lo);
763 self.lo = self.lo.replace(byte_lo, ((1 << bucket) as u8) | lo);
764
765 let hi = self.hi.extract(byte_hi);
766 self.hi = self.hi.replace(byte_hi, ((1 << bucket) as u8) | hi);
767 }
768 }
769
770 /// UnsafeLoad permits loading data into a SIMD vector without bounds checks.
771 ///
772 /// Ideally, this would be part of the `simd` crate, or even better, we could
773 /// figure out how to do it without `unsafe` at all.
774 trait UnsafeLoad {
775 type Elem;
776
777 /// load_unchecked creates a new SIMD vector from the elements in `slice`
778 /// starting at `offset`. `slice` must have at least the number of elements
779 /// required to fill a SIMD vector.
780 unsafe fn load_unchecked(slice: &[Self::Elem], offset: usize) -> Self;
781 }
782
783 impl UnsafeLoad for u8x16 {
784 type Elem = u8;
785
786 unsafe fn load_unchecked(slice: &[u8], offset: usize) -> u8x16 {
787 let mut x = u8x16::splat(0);
788 ptr::copy_nonoverlapping(
789 slice.get_unchecked(offset),
790 &mut x as *mut u8x16 as *mut u8,
791 16);
792 x
793 }
794 }