vendor/regex-0.2.5/src/simd_accel/teddy128.rs

   1 /*!
   2 Teddy is a simd accelerated multiple substring matching algorithm. The name
   3 and the core ideas in the algorithm were learned from the [Hyperscan][1_u]
   4 project.
   5
   6
   7 Background
   8 ----------
   9
  10 The key idea of Teddy is to do *packed* substring matching. In the literature,
  11 packed substring matching is the idea of examing multiple bytes in a haystack
  12 at a time to detect matches. Implementations of, for example, memchr (which
  13 detects matches of a single byte) have been doing this for years. Only
  14 recently, with the introduction of various SIMD instructions, has this been
  15 extended to substring matching. The PCMPESTRI instruction (and its relatives),
  16 for example, implements substring matching in hardware. It is, however, limited
  17 to substrings of length 16 bytes or fewer, but this restriction is fine in a
  18 regex engine, since we rarely care about the performance difference between
  19 searching for a 16 byte literal and a 16 + N literal—16 is already long
  20 enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs
  21 at least, is its latency and throughput. As a result, it is often faster to do
  22 substring search with a Boyer-Moore variant and a well placed memchr to quickly
  23 skip through the haystack.
  24
  25 There are fewer results from the literature on packed substring matching,
  26 and even fewer for packed multiple substring matching. Ben-Kiki et al. [2]
  27 describes use of PCMPESTRI for substring matching, but is mostly theoretical
  28 and hand-waves performance. There is other theoretical work done by Bille [3]
  29 as well.
  30
  31 The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci
  32 and is generally focused on multiple pattern search. Their first paper [4a]
  33 introduces the concept of a fingerprint, which is computed for every block of
  34 N bytes in every pattern. The haystack is then scanned N bytes at a time and
  35 a fingerprint is computed in the same way it was computed for blocks in the
  36 patterns. If the fingerprint corresponds to one that was found in a pattern,
  37 then a verification step follows to confirm that one of the substrings with the
  38 corresponding fingerprint actually matches at the current location. Various
  39 implementation tricks are employed to make sure the fingerprint lookup is fast;
  40 typically by truncating the fingerprint. (This may, of course, provoke more
  41 steps in the verification process, so a balance must be struck.)
  42
  43 The main downside of [4a] is that the minimum substring length is 32 bytes,
  44 presumably because of how the algorithm uses certain SIMD instructions. This
  45 essentially makes it useless for general purpose regex matching, where a small
  46 number of short patterns is far more likely.
  47
  48 Faro and Kulekci published another paper [4b] that is conceptually very similar
  49 to [4a]. The key difference is that it uses the CRC32 instruction (introduced
  50 as part of SSE 4.2) to compute fingerprint values. This also enables the
  51 algorithm to work effectively on substrings as short as 7 bytes with 4 byte
  52 windows. 7 bytes is unfortunately still too long. The window could be
  53 technically shrunk to 2 bytes, thereby reducing minimum length to 3, but the
  54 small window size ends up negating most performance benefits—and it's likely
  55 the common case in a general purpose regex engine.
  56
  57 Faro and Kulekci also published [4c] that appears to be intended as a
  58 replacement to using PCMPESTRI. In particular, it is specifically motivated by
  59 the high throughput/latency time of PCMPESTRI and therefore chooses other SIMD
  60 instructions that are faster. While this approach works for short substrings,
  61 I personally couldn't see a way to generalize it to multiple substring search.
  62
  63 Faro and Kulekci have another paper [4d] that I haven't been able to read
  64 because it is behind a paywall.
  65
  66
  67 Teddy
  68 -----
  69
  70 Finally, we get to Teddy. If the above literature review is complete, then it
  71 appears that Teddy is a novel algorithm. More than that, in my experience, it
  72 completely blows away the competition for short substrings, which is exactly
  73 what we want in a general purpose regex engine. Again, the algorithm appears
  74 to be developed by the authors of [Hyperscan][1_u]. Hyperscan was open sourced
  75 late 2015, and no earlier history could be found. Therefore, tracking the exact
  76 provenance of the algorithm with respect to the published literature seems
  77 difficult.
  78
  79 DISCLAIMER: My understanding of Teddy is limited to reading auto-generated C
  80 code, its disassembly and observing its runtime behavior.
  81
  82 At a high level, Teddy works somewhat similarly to the fingerprint algorithms
  83 published by Faro and Kulekci, but Teddy does it in a way that scales a bit
  84 better. Namely:
  85
  86 1. Teddy's core algorithm scans the haystack in 16 byte chunks. 16 is
  87    significant because it corresponds to the number of bytes in a SIMD vector.
  88    If one used AVX2 instructions, then we could scan the haystack in 32 byte
  89    chunks. Similarly, if one used AVX512 instructions, we could scan the
  90    haystack in 64 byte chunks. Hyperscan implements SIMD + AVX2, we only
  91    implement SIMD for the moment. (The author doesn't have a CPU with AVX2
  92    support... yet.)
  93 2. Bitwise operations are performed on each chunk to discover if any region of
  94    it matches a set of precomputed fingerprints from the patterns. If there are
  95    matches, then a verification step is performed. In this implementation, our
  96    verification step is naive. This can be improved upon.
  97
  98 The details to make this work are quite clever. First, we must choose how to
  99 pick our fingerprints. In Hyperscan's implementation, I *believe* they use the
 100 last N bytes of each substring, where N must be at least the minimum length of
 101 any substring in the set being searched. In this implementation, we use the
 102 first N bytes of each substring. (The tradeoffs between these choices aren't
 103 yet clear to me.) We then must figure out how to quickly test whether an
 104 occurrence of any fingerprint from the set of patterns appears in a 16 byte
 105 block from the haystack. To keep things simple, let's assume N = 1 and examine
 106 some examples to motivate the approach. Here are our patterns:
 107
 108 ```ignore
 109 foo
 110 bar
 111 baz
 112 ```
 113
 114 The corresponding fingerprints, for N = 1, are `f`, `b` and `b`. Now let's set
 115 our 16 byte block to:
 116
 117 ```ignore
 118 bat cat foo bump
 119 xxxxxxxxxxxxxxxx
 120 ```
 121
 122 To cut to the chase, Teddy works by using bitsets. In particular, Teddy creates
 123 a mask that allows us to quickly compute membership of a fingerprint in a 16
 124 byte block that also tells which pattern the fingerprint corresponds to. In
 125 this case, our fingerprint is a single byte, so an appropriate abstraction is
 126 a map from a single byte to a list of patterns that contain that fingerprint:
 127
 128 ```ignore
 129 f |--> foo
 130 b |--> bar, baz
 131 ```
 132
 133 Now, all we need to do is figure out how to represent this map in vector space
 134 and use normal SIMD operations to perform a lookup. The first simplification
 135 we can make is to represent our patterns as bit fields occupying a single
 136 byte. This is important, because a single SIMD vector can store 16 bytes.
 137
 138 ```ignore
 139 f |--> 00000001
 140 b |--> 00000010, 00000100
 141 ```
 142
 143 How do we perform lookup though? It turns out that SSSE3 introduced a very cool
 144 instruction called PSHUFB. The instruction takes two SIMD vectors, `A` and `B`,
 145 and returns a third vector `C`. All vectors are treated as 16 8-bit integers.
 146 `C` is formed by `C[i] = A[B[i]]`. (This is a bit of a simplification, but true
 147 for the purposes of this algorithm. For full details, see [Intel's Intrinsics
 148 Guide][5_u].) This essentially lets us use the values in `B` to lookup values in
 149 `A`.
 150
 151 If we could somehow cause `B` to contain our 16 byte block from the haystack,
 152 and if `A` could contain our bitmasks, then we'd end up with something like
 153 this for `A`:
 154
 155 ```ignore
 156     0x00 0x01 ... 0x62      ... 0x66      ... 0xFF
 157 A = 0    0        00000110      00000001      0
 158 ```
 159
 160 And if `B` contains our window from our haystack, we could use shuffle to take
 161 the values from `B` and use them to look up our bitsets in `A`. But of course,
 162 we can't do this because `A` in the above example contains 256 bytes, which
 163 is much larger than the size of a SIMD vector.
 164
 165 Nybbles to the rescue! A nybble is 4 bits. Instead of one mask to hold all of
 166 our bitsets, we can use two masks, where one mask corresponds to the lower four
 167 bits of our fingerprint and the other mask corresponds to the upper four bits.
 168 So our map now looks like:
 169
 170 ```ignore
 171 'f' & 0xF = 0x6 |--> 00000001
 172 'f' >> 4  = 0x6 |--> 00000111
 173 'b' & 0xF = 0x2 |--> 00000110
 174 'b' >> 4  = 0x6 |--> 00000111
 175 ```
 176
 177 Notice that the bitsets for each nybble correspond to the union of all
 178 fingerprints that contain that nibble. For example, both `f` and `b` have the
 179 same upper 4 bits but differ on the lower 4 bits. Putting this together, we
 180 have `A0`, `A1` and `B`, where `A0` is our mask for the lower nybble, `A1` is
 181 our mask for the upper nybble and `B` is our 16 byte block from the haystack:
 182
 183 ```ignore
 184       0x00 0x01 0x02      0x03 ... 0x06      ... 0xF
 185 A0 =  0    0    00000110  0        00000001      0
 186 A1 =  0    0    0         0        00000111      0
 187 B  =  b    a    t         _        t             p
 188 B  =  0x62 0x61 0x74      0x20     0x74          0x70
 189 ```
 190
 191 But of course, we can't use `B` with `PSHUFB` yet, since its values are 8 bits,
 192 and we need indexes that are at most 4 bits (corresponding to one of 16
 193 values). We can apply the same transformation to split `B` into lower and upper
 194 nybbles as we did `A`. As before, `B0` corresponds to the lower nybbles and
 195 `B1` corresponds to the upper nybbles:
 196
 197 ```ignore
 198      b   a   t   _   c   a   t   _   f   o   o   _   b   u   m   p
 199 B0 = 0x2 0x1 0x4 0x0 0x3 0x1 0x4 0x0 0x6 0xF 0xF 0x0 0x2 0x5 0xD 0x0
 200 B1 = 0x6 0x6 0x7 0x2 0x6 0x6 0x7 0x2 0x6 0x6 0x6 0x2 0x6 0x7 0x6 0x7
 201 ```
 202
 203 And now we have a nice correspondence. `B0` can index `A0` and `B1` can index
 204 `A1`. Here's what we get when we apply `C0 = PSHUFB(A0, B0)`:
 205
 206 ```ignore
 207      b         a        ... f         o         ... p
 208      A0[0x2]   A0[0x1]      A0[0x6]   A0[0xF]       A0[0x0]
 209 C0 = 00000110  0            00000001  0             0
 210 ```
 211
 212 And `C1 = PSHUFB(A1, B1)`:
 213
 214 ```ignore
 215      b         a        ... f         o        ... p
 216      A1[0x6]   A1[0x6]      A1[0x6]   A1[0x6]      A1[0x7]
 217 C1 = 00000111  00000111     00000111  00000111     0
 218 ```
 219
 220 Notice how neither one of `C0` or `C1` is guaranteed to report fully correct
 221 results all on its own. For example, `C1` claims that `b` is a fingerprint for
 222 the pattern `foo` (since `A1[0x6] = 00000111`), and that `o` is a fingerprint
 223 for all of our patterns. But if we combined `C0` and `C1` with an `AND`
 224 operation:
 225
 226 ```
 227      b         a        ... f         o        ... p
 228 C  = 00000110  0            00000001  0            0
 229 ```
 230
 231 Then we now have that `C[i]` contains a bitset corresponding to the matching
 232 fingerprints in a haystack's 16 byte block, where `i` is the `ith` byte in that
 233 block.
 234
 235 Once we have that, we can look for the position of the least significant bit
 236 in `C`. That position, modulo `8`, gives us the pattern that the fingerprint
 237 matches. That position, integer divided by `8`, also gives us the byte offset
 238 that the fingerprint occurs in inside the 16 byte haystack block. Using those
 239 two pieces of information, we can run a verification procedure that tries
 240 to match all substrings containing that fingerprint at that position in the
 241 haystack.
 242
 243
 244 Implementation notes
 245 --------------------
 246
 247 The problem with the algorithm as described above is that it uses a single byte
 248 for a fingerprint. This will work well if the fingerprints are rare in the
 249 haystack (e.g., capital letters or special characters in normal English text),
 250 but if the fingerprints are common, you'll wind up spending too much time in
 251 the verification step, which effectively negate the performance benefits of
 252 scanning 16 bytes at a time. Remember, the key to the performance of this
 253 algorithm is to do as little work as possible per 16 bytes.
 254
 255 This algorithm can be extrapolated in a relatively straight-forward way to use
 256 larger fingerprints. That is, instead of a single byte prefix, we might use a
 257 three byte prefix. The implementation below implements N = {1, 2, 3} and always
 258 picks the largest N possible. The rationale is that the bigger the fingerprint,
 259 the fewer verification steps we'll do. Of course, if N is too large, then we'll
 260 end up doing too much on each step.
 261
 262 The way to extend it is:
 263
 264 1. Add a mask for each byte in the fingerprint. (Remember that each mask is
 265    composed of two SIMD vectors.) This results in a value of `C` for each byte
 266    in the fingerprint while searching.
 267 2. When testing each 16 byte block, each value of `C` must be shifted so that
 268    they are aligned. Once aligned, they should all be `AND`'d together. This
 269    will give you only the bitsets corresponding to the full match of the
 270    fingerprint.
 271
 272 The implementation below is commented to fill in the nitty gritty details.
 273
 274 References
 275 ----------
 276
 277 - **[1]** [Hyperscan on GitHub](https://github.com/01org/hyperscan),
 278     [webpage](https://01.org/hyperscan)
 279 - **[2a]** Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R.,
 280     & Weimann, O. (2011).
 281     _Optimal packed string matching_.
 282     In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13).
 283     Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
 284     DOI: 10.4230/LIPIcs.FSTTCS.2011.423.
 285     [PDF](http://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf).
 286 - **[2b]** Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R.,
 287     & Weimann, O. (2014).
 288     _Towards optimal packed string matching_.
 289     Theoretical Computer Science, 525, 111-129.
 290     DOI: 10.1016/j.tcs.2013.06.013.
 291     [PDF](http://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf).
 292 - **[3]** Bille, P. (2011).
 293     _Fast searching in packed strings_.
 294     Journal of Discrete Algorithms, 9(1), 49-56.
 295     DOI: 10.1016/j.jda.2010.09.003.
 296     [PDF](http://www.sciencedirect.com/science/article/pii/S1570866710000353).
 297 - **[4a]** Faro, S., & Külekci, M. O. (2012, October).
 298     _Fast multiple string matching using streaming SIMD extensions technology_.
 299     In String Processing and Information Retrieval (pp. 217-228).
 300     Springer Berlin Heidelberg.
 301     DOI: 10.1007/978-3-642-34109-0_23.
 302     [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro32.pdf).
 303 - **[4b]** Faro, S., & Külekci, M. O. (2013, September).
 304     _Towards a Very Fast Multiple String Matching Algorithm for Short Patterns_.
 305     In Stringology (pp. 78-91).
 306     [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro36.pdf).
 307 - **[4c]** Faro, S., & Külekci, M. O. (2013, January).
 308     _Fast packed string matching for short patterns_.
 309     In Proceedings of the Meeting on Algorithm Engineering & Expermiments
 310     (pp. 113-121).
 311     Society for Industrial and Applied Mathematics.
 312     [PDF](http://arxiv.org/pdf/1209.6449.pdf).
 313 - **[4d]** Faro, S., & Külekci, M. O. (2014).
 314     _Fast and flexible packed string matching_.
 315     Journal of Discrete Algorithms, 28, 61-72.
 316     DOI: 10.1016/j.jda.2014.07.003.
 317
 318 [1_u]: https://github.com/01org/hyperscan
 319 [5_u]: https://software.intel.com/sites/landingpage/IntrinsicsGuide
 320 */
 321
 322 // TODO: Extend this to use AVX2 instructions.
 323 // TODO: Extend this to use AVX512 instructions.
 324 // TODO: Make the inner loop do aligned loads.
 325
 326 use std::cmp;
 327 use std::ptr;
 328
 329 use aho_corasick::{Automaton, AcAutomaton, FullAcAutomaton};
 330 use simd::u8x16;
 331 use simd::x86::sse2::Sse2Bool8ix16;
 332 use simd::x86::ssse3::Ssse3U8x16;
 333
 334 use syntax;
 335
 336 /// Corresponds to the number of bytes read at a time in the haystack.
 337 const BLOCK_SIZE: usize = 16;
 338
 339 pub fn is_teddy_128_available() -> bool {
 340     true
 341 }
 342
 343 /// Match reports match information.
 344 #[derive(Debug, Clone)]
 345 pub struct Match {
 346     /// The index of the pattern that matched. The index is in correspondence
 347     /// with the order of the patterns given at construction.
 348     pub pat: usize,
 349     /// The start byte offset of the match.
 350     pub start: usize,
 351     /// The end byte offset of the match. This is always `start + pat.len()`.
 352     pub end: usize,
 353 }
 354
 355 /// A SIMD accelerated multi substring searcher.
 356 #[derive(Debug, Clone)]
 357 pub struct Teddy {
 358     /// A list of substrings to match.
 359     pats: Vec<Vec<u8>>,
 360     /// An Aho-Corasick automaton of the patterns. We use this when we need to
 361     /// search pieces smaller than the Teddy block size.
 362     ac: FullAcAutomaton<Vec<u8>>,
 363     /// A set of 8 buckets. Each bucket corresponds to a single member of a
 364     /// bitset. A bucket contains zero or more substrings. This is useful
 365     /// when the number of substrings exceeds 8, since our bitsets cannot have
 366     /// more than 8 members.
 367     buckets: Vec<Vec<usize>>,
 368     /// Our set of masks. There's one mask for each byte in the fingerprint.
 369     masks: Masks,
 370 }
 371
 372 /// A list of masks. This has length equal to the length of the fingerprint.
 373 /// The length of the fingerprint is always `min(3, len(smallest_substring))`.
 374 #[derive(Debug, Clone)]
 375 struct Masks(Vec<Mask>);
 376
 377 /// A single mask.
 378 #[derive(Debug, Clone, Copy)]
 379 struct Mask {
 380     /// Bitsets for the low nybbles in a fingerprint.
 381     lo: u8x16,
 382     /// Bitsets for the high nybbles in a fingerprint.
 383     hi: u8x16,
 384 }
 385
 386 impl Teddy {
 387     /// Create a new `Teddy` multi substring matcher.
 388     ///
 389     /// If a `Teddy` matcher could not be created (e.g., `pats` is empty or has
 390     /// an empty substring), then `None` is returned.
 391     pub fn new(pats: &syntax::Literals) -> Option<Teddy> {
 392         let pats: Vec<_> = pats.literals().iter().map(|p|p.to_vec()).collect();
 393         let min_len = pats.iter().map(|p| p.len()).min().unwrap_or(0);
 394         // Don't allow any empty patterns and require that we have at
 395         // least one pattern.
 396         if min_len < 1 {
 397             return None;
 398         }
 399         // Pick the largest mask possible, but no larger than 3.
 400         let nmasks = cmp::min(3, min_len);
 401         let mut masks = Masks::new(nmasks);
 402         let mut buckets = vec![vec![]; 8];
 403         // Assign a substring to each bucket, and add the bucket's bitfield to
 404         // the appropriate position in the mask.
 405         for (pati, pat) in pats.iter().enumerate() {
 406             let bucket = pati % 8;
 407             buckets[bucket].push(pati);
 408             masks.add(bucket as u8, pat);
 409         }
 410         Some(Teddy {
 411             pats: pats.to_vec(),
 412             ac: AcAutomaton::new(pats.to_vec()).into_full(),
 413             buckets: buckets,
 414             masks: masks,
 415         })
 416     }
 417
 418     /// Returns all of the substrings matched by this `Teddy`.
 419     pub fn patterns(&self) -> &[Vec<u8>] {
 420         &self.pats
 421     }
 422
 423     /// Returns the number of substrings in this matcher.
 424     pub fn len(&self) -> usize {
 425         self.pats.len()
 426     }
 427
 428     /// Returns the approximate size on the heap used by this matcher.
 429     pub fn approximate_size(&self) -> usize {
 430         self.pats.iter().fold(0, |a, b| a + b.len())
 431     }
 432
 433     /// Searches `haystack` for the substrings in this `Teddy`. If a match was
 434     /// found, then it is returned. Otherwise, `None` is returned.
 435     pub fn find(&self, haystack: &[u8]) -> Option<Match> {
 436         // If our haystack is smaller than the block size, then fall back to
 437         // a naive brute force search.
 438         if haystack.is_empty() || haystack.len() < (BLOCK_SIZE + 2) {
 439             return self.slow(haystack, 0);
 440         }
 441         match self.masks.len() {
 442             0 => None,
 443             1 => self.find1(haystack),
 444             2 => self.find2(haystack),
 445             3 => self.find3(haystack),
 446             _ => unreachable!(),
 447         }
 448     }
 449
 450     /// `find1` is used when there is only 1 mask. This is the easy case and is
 451     /// pretty much as described in the module documentation.
 452     #[inline(always)]
 453     fn find1(&self, haystack: &[u8]) -> Option<Match> {
 454         let mut pos = 0;
 455         let zero = u8x16::splat(0);
 456         let len = haystack.len();
 457         debug_assert!(len >= BLOCK_SIZE);
 458         while pos <= len - BLOCK_SIZE {
 459             let h = unsafe { u8x16::load_unchecked(haystack, pos) };
 460             // N.B. `res0` is our `C` in the module documentation.
 461             let res0 = self.masks.members1(h);
 462             // Only do expensive verification if there are any non-zero bits.
 463             let bitfield = res0.ne(zero).move_mask();
 464             if bitfield != 0 {
 465                 if let Some(m) = self.verify(haystack, pos, res0, bitfield) {
 466                     return Some(m);
 467                 }
 468             }
 469             pos += BLOCK_SIZE;
 470         }
 471         self.slow(haystack, pos)
 472     }
 473
 474     /// `find2` is used when there are 2 masks, e.g., the fingerprint is 2 bytes
 475     /// long.
 476     #[inline(always)]
 477     fn find2(&self, haystack: &[u8]) -> Option<Match> {
 478         // This is an exotic way to right shift a SIMD vector across lanes.
 479         // See below at use for more details.
 480         let res0shuffle = u8x16::new(
 481             0, 0, 1, 2,
 482             3, 4, 5, 6,
 483             7, 8, 9, 10,
 484             11, 12, 13, 14,
 485         );
 486         let zero = u8x16::splat(0);
 487         let len = haystack.len();
 488         // The previous value of `C` (from the module documentation) for the
 489         // *first* byte in the fingerprint. On subsequent iterations, we take
 490         // the last bitset from the previous `C` and insert it into the first
 491         // position of the current `C`, shifting all other bitsets to the right
 492         // one lane. This causes `C` for the first byte to line up with `C` for
 493         // the second byte, so that they can be `AND`'d together.
 494         let mut prev0 = u8x16::splat(0xFF);
 495         let mut pos = 1;
 496         debug_assert!(len >= BLOCK_SIZE);
 497         while pos <= len - BLOCK_SIZE {
 498             let h = unsafe { u8x16::load_unchecked(haystack, pos) };
 499             let (res0, res1) = self.masks.members2(h);
 500
 501             // The next three lines are essentially equivalent to
 502             //
 503             // ```rust,ignore
 504             // (prev0 << 15) | (res0 >> 1)
 505             // ```
 506             //
 507             // ... if SIMD vectors could shift across lanes. There is the
 508             // `PALIGNR` instruction, but apparently LLVM doesn't expose it as
 509             // a proper intrinsic. Thankfully, it appears the following
 510             // sequence does indeed compile down to a `PALIGNR`.
 511             let prev0byte0 = prev0.extract(15);
 512             let res0shiftr8 = res0.shuffle_bytes(res0shuffle);
 513             let res0prev0 = res0shiftr8.replace(0, prev0byte0);
 514
 515             // `AND`'s our `C` values together.
 516             let res = res0prev0 & res1;
 517             prev0 = res0;
 518
 519             let bitfield = res.ne(zero).move_mask();
 520             if bitfield != 0 {
 521                 let pos = pos.checked_sub(1).unwrap();
 522                 if let Some(m) = self.verify(haystack, pos, res, bitfield) {
 523                     return Some(m);
 524                 }
 525             }
 526             pos += BLOCK_SIZE;
 527         }
 528         // The windowing above doesn't check the last byte in the last
 529         // window, so start the slow search at the last byte of the last
 530         // window.
 531         self.slow(haystack, pos.checked_sub(1).unwrap())
 532     }
 533
 534     /// `find3` is used when there are 3 masks, e.g., the fingerprint is 3 bytes
 535     /// long.
 536     ///
 537     /// N.B. This is a straight-forward extrapolation of `find2`. The only
 538     /// difference is that we need to keep track of two previous values of `C`,
 539     /// since we now need to align for three bytes.
 540     #[inline(always)]
 541     fn find3(&self, haystack: &[u8]) -> Option<Match> {
 542         let zero = u8x16::splat(0);
 543         let len = haystack.len();
 544
 545         let res0shuffle = u8x16::new(
 546             0, 0, 0, 1,
 547             2, 3, 4, 5,
 548             6, 7, 8, 9,
 549             10, 11, 12, 13,
 550         );
 551         let res1shuffle = u8x16::new(
 552             0, 0, 1, 2,
 553             3, 4, 5, 6,
 554             7, 8, 9, 10,
 555             11, 12, 13, 14,
 556         );
 557         let mut prev0 = u8x16::splat(0xFF);
 558         let mut prev1 = u8x16::splat(0xFF);
 559         let mut pos = 2;
 560         while pos <= len - BLOCK_SIZE {
 561             let h = unsafe { u8x16::load_unchecked(haystack, pos) };
 562             let (res0, res1, res2) = self.masks.members3(h);
 563
 564             let prev0byte0 = prev0.extract(14);
 565             let prev0byte1 = prev0.extract(15);
 566             let res0shiftr16 = res0.shuffle_bytes(res0shuffle);
 567             let res0prev0 = res0shiftr16.replace(0, prev0byte0)
 568                                         .replace(1, prev0byte1);
 569
 570             let prev1byte0 = prev1.extract(15);
 571             let res1shiftr8 = res1.shuffle_bytes(res1shuffle);
 572             let res1prev1 = res1shiftr8.replace(0, prev1byte0);
 573
 574             let res = res0prev0 & res1prev1 & res2;
 575
 576             prev0 = res0;
 577             prev1 = res1;
 578
 579             let bitfield = res.ne(zero).move_mask();
 580             if bitfield != 0 {
 581                 let pos = pos.checked_sub(2).unwrap();
 582                 if let Some(m) = self.verify(haystack, pos, res, bitfield) {
 583                     return Some(m);
 584                 }
 585             }
 586             pos += BLOCK_SIZE;
 587         }
 588         // The windowing above doesn't check the last two bytes in the last
 589         // window, so start the slow search at the penultimate byte of the
 590         // last window.
 591         // self.slow(haystack, pos.saturating_sub(2))
 592         self.slow(haystack, pos.checked_sub(2).unwrap())
 593     }
 594
 595     /// Runs the verification procedure on `res` (i.e., `C` from the module
 596     /// documentation), where the haystack block starts at `pos` in
 597     /// `haystack`. `bitfield` has ones in the bit positions that `res` has
 598     /// non-zero bytes.
 599     ///
 600     /// If a match exists, it returns the first one.
 601     #[inline(always)]
 602     fn verify(
 603         &self,
 604         haystack: &[u8],
 605         pos: usize,
 606         res: u8x16,
 607         mut bitfield: u32,
 608     ) -> Option<Match> {
 609         while bitfield != 0 {
 610             // The next offset, relative to pos, where some fingerprint
 611             // matched.
 612             let byte_pos = bitfield.trailing_zeros();
 613             bitfield &= !(1 << byte_pos);
 614
 615             // Offset relative to the beginning of the haystack.
 616             let start = pos + byte_pos as usize;
 617
 618             // The bitfield telling us which patterns had fingerprints that
 619             // match at this starting position.
 620             let mut patterns = res.extract(byte_pos);
 621             while patterns != 0 {
 622                 let bucket = patterns.trailing_zeros() as usize;
 623                 patterns &= !(1 << bucket);
 624
 625                 // Actual substring search verification.
 626                 if let Some(m) = self.verify_bucket(haystack, bucket, start) {
 627                     return Some(m);
 628                 }
 629             }
 630         }
 631
 632         None
 633     }
 634
 635     /// Verifies whether any substring in the given bucket matches in haystack
 636     /// at the given starting position.
 637     #[inline(always)]
 638     fn verify_bucket(
 639         &self,
 640         haystack: &[u8],
 641         bucket: usize,
 642         start: usize,
 643     ) -> Option<Match> {
 644         // This cycles through the patterns in the bucket in the order that
 645         // the patterns were given. Therefore, we guarantee leftmost-first
 646         // semantics.
 647         for &pati in &self.buckets[bucket] {
 648             let pat = &*self.pats[pati];
 649             if start + pat.len() > haystack.len() {
 650                 continue;
 651             }
 652             if pat == &haystack[start..start + pat.len()] {
 653                 return Some(Match {
 654                     pat: pati,
 655                     start: start,
 656                     end: start + pat.len(),
 657                 });
 658             }
 659         }
 660         None
 661     }
 662
 663     /// Slow substring search through all patterns in this matcher.
 664     ///
 665     /// This is used when we don't have enough bytes in the haystack for our
 666     /// block based approach.
 667     fn slow(&self, haystack: &[u8], pos: usize) -> Option<Match> {
 668         self.ac.find(&haystack[pos..]).next().map(|m| {
 669             Match {
 670                 pat: m.pati,
 671                 start: pos + m.start,
 672                 end: pos + m.end,
 673             }
 674         })
 675     }
 676 }
 677
 678 impl Masks {
 679     /// Create a new set of masks of size `n`, where `n` corresponds to the
 680     /// number of bytes in a fingerprint.
 681     fn new(n: usize) -> Masks {
 682         Masks(vec![Mask::new(); n])
 683     }
 684
 685     /// Returns the number of masks.
 686     fn len(&self) -> usize {
 687         self.0.len()
 688     }
 689
 690     /// Adds the given pattern to the given bucket. The bucket should be a
 691     /// power of `2 <= 2^7`.
 692     fn add(&mut self, bucket: u8, pat: &[u8]) {
 693         for (i, mask) in self.0.iter_mut().enumerate() {
 694             mask.add(bucket, pat[i]);
 695         }
 696     }
 697
 698     /// Finds the fingerprints that are in the given haystack block. i.e., this
 699     /// returns `C` as described in the module documentation.
 700     ///
 701     /// More specifically, `for i in 0..16` and `j in 0..8, C[i][j] == 1` if and
 702     /// only if `haystack_block[i]` corresponds to a fingerprint that is part
 703     /// of a pattern in bucket `j`.
 704     #[inline(always)]
 705     fn members1(&self, haystack_block: u8x16) -> u8x16 {
 706         let masklo = u8x16::splat(0xF);
 707         let hlo = haystack_block & masklo;
 708         let hhi = (haystack_block >> 4) & masklo;
 709
 710         self.0[0].lo.shuffle_bytes(hlo) & self.0[0].hi.shuffle_bytes(hhi)
 711     }
 712
 713     /// Like members1, but computes C for the first and second bytes in the
 714     /// fingerprint.
 715     #[inline(always)]
 716     fn members2(&self, haystack_block: u8x16) -> (u8x16, u8x16) {
 717         let masklo = u8x16::splat(0xF);
 718         let hlo = haystack_block & masklo;
 719         let hhi = (haystack_block >> 4) & masklo;
 720
 721         let res0 = self.0[0].lo.shuffle_bytes(hlo)
 722                    & self.0[0].hi.shuffle_bytes(hhi);
 723         let res1 = self.0[1].lo.shuffle_bytes(hlo)
 724                    & self.0[1].hi.shuffle_bytes(hhi);
 725         (res0, res1)
 726     }
 727
 728     /// Like `members1`, but computes `C` for the first, second and third bytes
 729     /// in the fingerprint.
 730     #[inline(always)]
 731     fn members3(&self, haystack_block: u8x16) -> (u8x16, u8x16, u8x16) {
 732         let masklo = u8x16::splat(0xF);
 733         let hlo = haystack_block & masklo;
 734         let hhi = (haystack_block >> 4) & masklo;
 735
 736         let res0 = self.0[0].lo.shuffle_bytes(hlo)
 737                    & self.0[0].hi.shuffle_bytes(hhi);
 738         let res1 = self.0[1].lo.shuffle_bytes(hlo)
 739                    & self.0[1].hi.shuffle_bytes(hhi);
 740         let res2 = self.0[2].lo.shuffle_bytes(hlo)
 741                    & self.0[2].hi.shuffle_bytes(hhi);
 742         (res0, res1, res2)
 743     }
 744 }
 745
 746 impl Mask {
 747     /// Create a new mask with no members.
 748     fn new() -> Mask {
 749         Mask {
 750             lo: u8x16::splat(0),
 751             hi: u8x16::splat(0),
 752         }
 753     }
 754
 755     /// Adds the given byte to the given bucket.
 756     fn add(&mut self, bucket: u8, byte: u8) {
 757         // Split our byte into two nybbles, and add each nybble to our
 758         // mask.
 759         let byte_lo = (byte & 0xF) as u32;
 760         let byte_hi = (byte >> 4) as u32;
 761
 762         let lo = self.lo.extract(byte_lo);
 763         self.lo = self.lo.replace(byte_lo, ((1 << bucket) as u8) | lo);
 764
 765         let hi = self.hi.extract(byte_hi);
 766         self.hi = self.hi.replace(byte_hi, ((1 << bucket) as u8) | hi);
 767     }
 768 }
 769
 770 /// UnsafeLoad permits loading data into a SIMD vector without bounds checks.
 771 ///
 772 /// Ideally, this would be part of the `simd` crate, or even better, we could
 773 /// figure out how to do it without `unsafe` at all.
 774 trait UnsafeLoad {
 775     type Elem;
 776
 777     /// load_unchecked creates a new SIMD vector from the elements in `slice`
 778     /// starting at `offset`. `slice` must have at least the number of elements
 779     /// required to fill a SIMD vector.
 780     unsafe fn load_unchecked(slice: &[Self::Elem], offset: usize) -> Self;
 781 }
 782
 783 impl UnsafeLoad for u8x16 {
 784     type Elem = u8;
 785
 786     unsafe fn load_unchecked(slice: &[u8], offset: usize) -> u8x16 {
 787         let mut x = u8x16::splat(0);
 788         ptr::copy_nonoverlapping(
 789             slice.get_unchecked(offset),
 790             &mut x as *mut u8x16 as *mut u8,
 791             16);
 792         x
 793     }
 794 }