// I've called the primary data structure in this module a "range trie." As far
// as I can tell, there is no prior art on a data structure like this, however,
// it's likely someone somewhere has built something like it. Searching for
// "range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
// but it does not appear relevant.
//
// The range trie is just like a trie in that it is a special case of a
// deterministic finite state machine. It has states and each state has a set
// of transitions to other states. It is acyclic, and, like a normal trie,
// it makes no attempt to reuse common suffixes among its elements. The key
// difference between a normal trie and a range trie below is that a range trie
// operates on *contiguous sequences* of bytes instead of singleton bytes.
// One could say that our alphabet is ranges of bytes instead of bytes
// themselves, except a key part of range trie construction is splitting ranges
// apart to ensure there is at most one transition that can be taken for any
// byte in a given state.
//
// I've tried to explain the details of how the range trie works below, so
// for now, we are left with trying to understand what problem we're trying to
// solve. Which is itself fairly involved!
//
// At the highest level, here's what we want to do. We want to convert a
// sequence of Unicode codepoints into a finite state machine whose transitions
// are over *bytes* and *not* Unicode codepoints. We want this because it makes
// said finite state machines much smaller and much faster to execute. As a
// simple example, consider a byte oriented automaton for all Unicode scalar
// values (0x00 through 0x10FFFF, not including surrogate codepoints):
//
//     [00-7F]
//     [C2-DF][80-BF]
//     [E0-E0][A0-BF][80-BF]
//     [E1-EC][80-BF][80-BF]
//     [ED-ED][80-9F][80-BF]
//     [EE-EF][80-BF][80-BF]
//     [F0-F0][90-BF][80-BF][80-BF]
//     [F1-F3][80-BF][80-BF][80-BF]
//     [F4-F4][80-8F][80-BF][80-BF]
//
// (These byte ranges are generated via the regex-syntax::utf8 module, which
// was based on Russ Cox's code in RE2, which was in turn based on Ken
// Thompson's implementation of the same idea in his Plan9 implementation of
// grep.)
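//
// For reference, the table above can be reproduced with a small program along
// these lines (the `Utf8Sequences` iterator lives in regex-syntax; this module
// itself only consumes the resulting byte ranges):
//
//     use regex_syntax::utf8::Utf8Sequences;
//
//     fn main() {
//         // Prints one sequence of byte ranges per line, roughly in the
//         // notation used above, e.g., "[0-7F]".
//         for seq in Utf8Sequences::new('\u{0}', '\u{10FFFF}') {
//             println!("{:?}", seq);
//         }
//     }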
//
// It should be fairly straightforward to see how one could compile this into
// a DFA. The sequences are sorted and non-overlapping. Essentially, you could
// build a trie from this fairly easily. The problem comes when your initial
// range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
// represented by '\w' contains only a tenth of the codepoints that
// 0x00-0x10FFFF contains, but if we were to write out the byte-based ranges
// as we did above, the list would stretch to 892 entries! This turns into
// quite a large NFA with a few thousand states. Turning this beast into a DFA
// takes quite a bit of time. We are thus left with trying to trim down the
// number of states we produce as early as possible.
//
// One approach (used by RE2 and still by the regex crate, at time of writing)
// is to try to find common suffixes while building NFA states for the above
// and reuse them. This is very cheap to do and one can control precisely how
// much extra memory to use for the cache.
//
// Another approach, however, is to reuse an algorithm for constructing a
// *minimal* DFA from a sorted sequence of inputs. I don't want to go into
// the full details here, but I explain it in more depth in my blog post on
// FSTs[1]. Note that the algorithm was not invented by me, but was published
// in a paper by Daciuk et al. in 2000 called "Incremental Construction of
// Minimal Acyclic Finite-State Automata."[2] Like the suffix cache approach
// above, it is also possible to control the amount of extra memory one uses,
// although this usually comes with the cost of sacrificing true minimality.
// (But it's typically close enough with a reasonably sized cache of states.)
//
// The catch is that Daciuk's algorithm only works if you add your keys in
// lexicographic ascending order. In our case, since we're dealing with ranges,
// we also need the additional requirement that ranges are either equivalent
// or do not overlap at all. For example, if one were given the following byte
// ranges:
//
//     [BC-BF][80-BF]
//     [BC-BF][90-BF]
//
// Then Daciuk's algorithm also would not work, since there is nothing to
// handle the fact that the ranges overlap. They would need to be split apart.
// Thankfully, Thompson's algorithm for producing byte ranges for Unicode
// codepoint ranges meets both of our requirements.
//
// ... however, we would also like to be able to compile UTF-8 automata in
// reverse. We want this because in order to find the starting location of a
// match using a DFA, we need to run a second DFA---a reversed version of the
// forward DFA---backwards to discover the match location. Unfortunately, if
// we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that can
// overlap, even if they are sorted:
//
//     [00-7F]
//     [80-BF][80-9F][ED-ED]
//     [80-BF][80-BF][80-8F][F4-F4]
//     [80-BF][80-BF][80-BF][F1-F3]
//     [80-BF][80-BF][90-BF][F0-F0]
//     [80-BF][80-BF][E1-EC]
//     [80-BF][80-BF][EE-EF]
//     [80-BF][A0-BF][E0-E0]
//     [80-BF][C2-DF]
//
// For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
// overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
// simple way to apply Daciuk's algorithm.
//
// And thus, the range trie was born. The range trie's only purpose is to take
// sequences of byte ranges like the ones above, collect them into a trie and
// then spit them out in a sorted fashion with no overlapping ranges. For
// example, 0x00-0x10FFFF gets translated to:
//
//     [0-7F]
//     [80-BF][80-9F][80-8F][F1-F3]
//     [80-BF][80-9F][80-8F][F4]
//     [80-BF][80-9F][90-BF][F0]
//     [80-BF][80-9F][90-BF][F1-F3]
//     [80-BF][80-9F][E1-EC]
//     [80-BF][80-9F][ED]
//     [80-BF][80-9F][EE-EF]
//     [80-BF][A0-BF][80-8F][F1-F3]
//     [80-BF][A0-BF][80-8F][F4]
//     [80-BF][A0-BF][90-BF][F0]
//     [80-BF][A0-BF][90-BF][F1-F3]
//     [80-BF][A0-BF][E0]
//     [80-BF][A0-BF][E1-EC]
//     [80-BF][A0-BF][EE-EF]
//     [80-BF][C2-DF]
//
// We've thus satisfied our requirements for running Daciuk's algorithm. All
// sequences of ranges are sorted, and any corresponding ranges are either
// exactly equivalent or non-overlapping.
//
// In effect, a range trie is building a DFA from a sequence of arbitrary
// byte ranges. But it uses an algorithm custom tailored to its input, so it
// is not as costly as traditional DFA construction. While it is still quite
// a bit more costly than the forward case (which only needs Daciuk's
// algorithm), it winds up saving a substantial amount of time if one is doing
// a full DFA powerset construction later by virtue of producing a much, much
// smaller NFA.
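//
// To make the interface concrete, here is a rough sketch of how this
// (crate-internal) type is meant to be driven. The specific ranges are chosen
// only for illustration:
//
//     let mut trie = RangeTrie::new();
//     trie.insert(&[Utf8Range { start: 0x80, end: 0xBF }]);
//     trie.insert(&[Utf8Range { start: 0xA0, end: 0xBF }]);
//     // Iteration now yields the sorted, non-overlapping sequences
//     // [80-9F] and then [A0-BF].
//     trie.iter(|ranges| println!("{:?}", ranges));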
//
// [1] - https://blog.burntsushi.net/transducers/
// [2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601

use std::cell::RefCell;
use std::fmt;
use std::mem;
use std::ops::RangeInclusive;
use std::u32;

use regex_syntax::utf8::Utf8Range;

/// The ID of a state in a range trie.
///
/// A smaller state ID means more effective use of the CPU cache and less
/// time spent copying. The implementation below will panic if the state ID
/// space is exhausted, but in order for that to happen, the range trie itself
/// would use well over 100GB of memory. Moreover, it's likely impossible
/// for the state ID space to get that big. In fact, it's likely that even a
/// u16 would be good enough here. But it's not quite clear how to prove this.
type StateID = u32;

/// There is only one final state in this trie. Every sequence of byte ranges
/// added shares the same final state.
const FINAL: StateID = 0;

/// The root state of the trie.
const ROOT: StateID = 1;

/// A range trie represents an ordered set of sequences of bytes.
///
/// A range trie accepts as input a sequence of byte ranges and merges
/// them into the existing set such that the trie can produce a sorted
/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
/// precisely to the sequence of bytes matched by the given keys, although the
/// byte ranges themselves may be split at different boundaries.
///
/// The time complexity of this data structure seems difficult to analyze.
/// If the size of a byte is held as a constant, then insertion is clearly
/// O(n) where n is the number of byte ranges in the input key. However, if
/// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
/// particular, it seems possible for pathological inputs to cause insertion
/// to do a lot of work. However, for what we use this data structure for,
/// there should be no pathological inputs since the ultimate source is always
/// a sorted set of Unicode scalar value ranges.
///
/// Internally, this trie is set up like a finite state machine. Note though
/// that it is acyclic.
#[derive(Clone)]
pub struct RangeTrie {
    /// The states in this trie. The first is always the shared final state.
    /// The second is always the root state. Otherwise, there is no
    /// particular order.
    states: Vec<State>,
    /// A free-list of states. When a range trie is cleared, all of its states
    /// are added to this list. Creating a new state reuses states from this
    /// list before allocating a new one.
    free: Vec<State>,
    /// A stack for traversing this trie to yield sequences of byte ranges in
    /// lexicographic order.
    iter_stack: RefCell<Vec<NextIter>>,
    /// A buffer that stores the current sequence during iteration.
    iter_ranges: RefCell<Vec<Utf8Range>>,
    /// A stack used for traversing the trie in order to (deeply) duplicate
    /// a state.
    dupe_stack: Vec<NextDupe>,
    /// A stack used for traversing the trie during insertion of a new
    /// sequence of byte ranges.
    insert_stack: Vec<NextInsert>,
}

/// A single state in this trie.
#[derive(Clone)]
struct State {
    /// A sorted sequence of non-overlapping transitions to other states. Each
    /// transition corresponds to a single range of bytes.
    transitions: Vec<Transition>,
}

/// A transition is a single range of bytes. If a particular byte is in this
/// range, then the corresponding machine may transition to the state pointed
/// to by `next_id`.
#[derive(Clone)]
struct Transition {
    /// The byte range.
    range: Utf8Range,
    /// The next state to transition to.
    next_id: StateID,
}

impl RangeTrie {
    /// Create a new empty range trie.
    pub fn new() -> RangeTrie {
        let mut trie = RangeTrie {
            states: vec![],
            free: vec![],
            iter_stack: RefCell::new(vec![]),
            iter_ranges: RefCell::new(vec![]),
            dupe_stack: vec![],
            insert_stack: vec![],
        };
        trie.clear();
        trie
    }

    /// Clear this range trie such that it is empty. Clearing a range trie
    /// and reusing it can be beneficial because this may reuse allocations.
    pub fn clear(&mut self) {
        self.free.extend(self.states.drain(..));
        self.add_empty(); // final
        self.add_empty(); // root
    }

    /// Iterate over all of the sequences of byte ranges in this trie, and
    /// call the provided function for each sequence. Iteration occurs in
    /// lexicographic order.
    pub fn iter<F: FnMut(&[Utf8Range])>(&self, mut f: F) {
        let mut stack = self.iter_stack.borrow_mut();
        stack.clear();
        let mut ranges = self.iter_ranges.borrow_mut();
        ranges.clear();

        // We do iteration in a way that permits us to use a single buffer
        // for our keys. We iterate in a depth first fashion, while being
        // careful to expand our frontier as we move deeper in the trie.
        stack.push(NextIter { state_id: ROOT, tidx: 0 });
        while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
            // This could be implemented more simply without an inner loop
            // here, but at the cost of more stack pushes.
            loop {
                let state = self.state(state_id);
                // If we've visited all transitions in this state, then pop
                // back to the parent state.
                if tidx >= state.transitions.len() {
                    ranges.pop();
                    break;
                }

                let t = &state.transitions[tidx];
                ranges.push(t.range);
                if t.next_id == FINAL {
                    f(&ranges);
                    ranges.pop();
                    tidx += 1;
                } else {
                    // Expand our frontier. Once we come back to this state
                    // via the stack, start in on the next transition.
                    stack.push(NextIter { state_id, tidx: tidx + 1 });
                    // Otherwise, move to the first transition of the next
                    // state.
                    state_id = t.next_id;
                    tidx = 0;
                }
            }
        }
    }

    /// Inserts a new sequence of ranges into this trie.
    ///
    /// The sequence given must be non-empty and must not have a length
    /// exceeding 4.
    pub fn insert(&mut self, ranges: &[Utf8Range]) {
        assert!(!ranges.is_empty());
        assert!(ranges.len() <= 4);

        let mut stack = mem::replace(&mut self.insert_stack, vec![]);
        stack.clear();

        stack.push(NextInsert::new(ROOT, ranges));
        while let Some(next) = stack.pop() {
            let (state_id, ranges) = (next.state_id(), next.ranges());
            assert!(!ranges.is_empty());

            let (mut new, rest) = (ranges[0], &ranges[1..]);

            // i corresponds to the position of the existing transition on
            // which we are operating. Typically, the result is to remove the
            // transition and replace it with two or more new transitions
            // corresponding to the partitions generated by splitting the
            // 'new' with the ith transition's range.
            let mut i = self.state(state_id).find(new);

            // In this case, there is no overlap *and* the new range is greater
            // than all existing ranges. So we can just add it to the end.
            if i == self.state(state_id).transitions.len() {
                let next_id = NextInsert::push(self, &mut stack, rest);
                self.add_transition(state_id, new, next_id);
                continue;
            }

            // The need for this loop is a bit subtle, but basically, after
            // we've handled the partitions from our initial split, it's
            // possible that there will be a partition leftover that overlaps
            // with a subsequent transition. If so, then we have to repeat
            // the split process again with the leftovers and that subsequent
            // transition.
            'OUTER: loop {
                let old = self.state(state_id).transitions[i].clone();
                let split = match Split::new(old.range, new) {
                    Some(split) => split,
                    None => {
                        let next_id = NextInsert::push(self, &mut stack, rest);
                        self.add_transition_at(i, state_id, new, next_id);
                        continue;
                    }
                };
                let splits = split.as_slice();
                // If we only have one partition, then the ranges must be
                // equivalent. There's nothing to do here for this state, so
                // just move on to the next one.
                if splits.len() == 1 {
                    // ... but only if we have anything left to do.
                    if !rest.is_empty() {
                        stack.push(NextInsert::new(old.next_id, rest));
                    }
                    break;
                }
                // At this point, we know that 'split' is non-empty and there
                // must be some overlap AND that the two ranges are not
                // equivalent. Therefore, the existing range MUST be removed
                // and split up somehow. Instead of actually doing the removal
                // and then a subsequent insertion---with all the memory
                // shuffling that entails---we simply overwrite the transition
                // at position `i` for the first new transition we want to
                // insert. After that, we're forced to do expensive inserts.
                let mut first = true;
                let mut add_trans =
                    |trie: &mut RangeTrie, pos, from, range, to| {
                        if first {
                            trie.set_transition_at(pos, from, range, to);
                            first = false;
                        } else {
                            trie.add_transition_at(pos, from, range, to);
                        }
                    };
                for (j, &srange) in splits.iter().enumerate() {
                    match srange {
                        SplitRange::Old(r) => {
                            // Deep clone the state pointed to by the ith
                            // transition. This is always necessary since 'old'
                            // is always coupled with at least a 'both'
                            // partition. We don't want any new changes made
                            // via the 'both' partition to impact the part of
                            // the transition that doesn't overlap with the
                            // new range.
                            let dup_id = self.duplicate(old.next_id);
                            add_trans(self, i, state_id, r, dup_id);
                        }
                        SplitRange::New(r) => {
                            // This is a bit subtle, but if this happens to be
                            // the last partition in our split, it is possible
                            // that this overlaps with a subsequent transition.
                            // If it does, then we must repeat the whole
                            // splitting process over again with `r` and the
                            // subsequent transition.
                            {
                                let trans = &self.state(state_id).transitions;
                                if j + 1 == splits.len()
                                    && i < trans.len()
                                    && intersects(r, trans[i].range)
                                {
                                    new = r;
                                    continue 'OUTER;
                                }
                            }

                            // ... otherwise, set up exploration for a new
                            // empty state and add a brand new transition for
                            // this new range.
                            let next_id =
                                NextInsert::push(self, &mut stack, rest);
                            add_trans(self, i, state_id, r, next_id);
                        }
                        SplitRange::Both(r) => {
                            // Continue adding the remaining ranges on this
                            // path and update the transition with the new
                            // range.
                            if !rest.is_empty() {
                                stack.push(NextInsert::new(old.next_id, rest));
                            }
                            add_trans(self, i, state_id, r, old.next_id);
                        }
                    }
                    i += 1;
                }
                // If we've reached this point, then we know that there are
                // no subsequent transitions with any overlap. Therefore, we
                // can stop processing this range and move on to the next one.
                break;
            }
        }
        self.insert_stack = stack;
    }

    pub fn add_empty(&mut self) -> StateID {
        if self.states.len() as u64 > u32::MAX as u64 {
            // This generally should not happen since a range trie is only
            // ever used to compile a single sequence of Unicode scalar values.
            // If we ever got to this point, we would, at *minimum*, be using
            // 96GB in just the range trie alone.
            panic!("too many sequences added to range trie");
        }
        let id = self.states.len() as StateID;
        // If we have some free states available, then use them to avoid
        // more allocations.
        if let Some(mut state) = self.free.pop() {
            state.clear();
            self.states.push(state);
        } else {
            self.states.push(State { transitions: vec![] });
        }
        id
    }

    /// Performs a deep clone of the given state and returns the duplicate's
    /// state ID.
    ///
    /// A "deep clone" in this context means that the state given along with
    /// recursively all states that it points to are copied. Once complete,
    /// the given state ID and the returned state ID share nothing.
    ///
    /// This is useful during range trie insertion when a new range overlaps
    /// with an existing range that is bigger than the new one. The part of
    /// the existing range that does *not* overlap with the new one is
    /// duplicated so that adding the new range to the overlap doesn't disturb
    /// the non-overlapping portion.
    ///
    /// There's one exception: if old_id is the final state, then it is not
    /// duplicated and the same final state is returned. This is because all
    /// final states in this trie are equivalent.
    fn duplicate(&mut self, old_id: StateID) -> StateID {
        if old_id == FINAL {
            return FINAL;
        }

        let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
        stack.clear();

        let new_id = self.add_empty();
        // old_id is the state we're cloning and new_id is the ID of the
        // duplicated state for old_id.
        stack.push(NextDupe { old_id, new_id });
        while let Some(NextDupe { old_id, new_id }) = stack.pop() {
            for i in 0..self.state(old_id).transitions.len() {
                let t = self.state(old_id).transitions[i].clone();
                if t.next_id == FINAL {
                    // All final states are the same, so there's no need to
                    // duplicate it.
                    self.add_transition(new_id, t.range, FINAL);
                    continue;
                }

                let new_child_id = self.add_empty();
                self.add_transition(new_id, t.range, new_child_id);
                stack.push(NextDupe {
                    old_id: t.next_id,
                    new_id: new_child_id,
                });
            }
        }
        self.dupe_stack = stack;
        new_id
    }

    /// Adds the given transition to the given state.
    ///
    /// Callers must ensure that all previous transitions in this state
    /// are lexicographically smaller than the given range.
    fn add_transition(
        &mut self,
        from_id: StateID,
        range: Utf8Range,
        next_id: StateID,
    ) {
        self.state_mut(from_id)
            .transitions
            .push(Transition { range, next_id });
    }

    /// Like `add_transition`, except this inserts the transition just before
    /// the ith transition.
    fn add_transition_at(
        &mut self,
        i: usize,
        from_id: StateID,
        range: Utf8Range,
        next_id: StateID,
    ) {
        self.state_mut(from_id)
            .transitions
            .insert(i, Transition { range, next_id });
    }

    /// Overwrites the transition at position i with the given transition.
    fn set_transition_at(
        &mut self,
        i: usize,
        from_id: StateID,
        range: Utf8Range,
        next_id: StateID,
    ) {
        self.state_mut(from_id).transitions[i] = Transition { range, next_id };
    }

    /// Return an immutable borrow for the state with the given ID.
    fn state(&self, id: StateID) -> &State {
        &self.states[id as usize]
    }

    /// Return a mutable borrow for the state with the given ID.
    fn state_mut(&mut self, id: StateID) -> &mut State {
        &mut self.states[id as usize]
    }
}

impl State {
    /// Find the position at which the given range should be inserted in this
    /// state.
    ///
    /// The position returned is always in the inclusive range
    /// [0, transitions.len()]. If 'transitions.len()' is returned, then the
    /// given range overlaps with no other range in this state *and* is greater
    /// than all of them.
    ///
    /// For all other possible positions, the given range either overlaps
    /// with the transition at that position or is otherwise less than it
    /// with no overlap (and is greater than the previous transition). In the
    /// former case, careful attention must be paid to inserting this range
    /// as a new transition. In the latter case, the range can be inserted as
    /// a new transition at the given position without disrupting any other
    /// transitions.
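    ///
    /// For example, in a state with transitions `[80-9F]` and `[A0-BF]`, a
    /// call with `[90-BF]` returns position 0 (it overlaps the first
    /// transition), while a call with `[C0-C1]` returns 2 (no overlap and
    /// greater than everything).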
    fn find(&self, range: Utf8Range) -> usize {
        /// Returns the position `i` at which `pred(xs[i])` first returns true
        /// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
        /// returns true, then `xs.len()` is returned.
        ///
        /// We roll our own binary search because it doesn't seem like the
        /// standard library's binary search can be used here. Namely, if
        /// there is an overlapping range, then we want to find the first such
        /// occurrence, but there may be many. Or at least, it's not quite
        /// clear to me how to do it.
        fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
        where
            F: FnMut(&T) -> bool,
        {
            let (mut left, mut right) = (0, xs.len());
            while left < right {
                // Overflow is impossible because xs.len() <= 256.
                let mid = (left + right) / 2;
                if pred(&xs[mid]) {
                    right = mid;
                } else {
                    left = mid + 1;
                }
            }
            left
        }

        // Benchmarks suggest that binary search is just a bit faster than
        // straight linear search. Specifically when using the debug tool:
        //
        //     hyperfine "regex-automata-debug debug -acqr '\w{40} ecurB'"
        binary_search(&self.transitions, |t| range.start <= t.range.end)
    }

    /// Clear this state such that it has zero transitions.
    fn clear(&mut self) {
        self.transitions.clear();
    }
}

/// The next state to process during duplication.
#[derive(Clone, Debug)]
struct NextDupe {
    /// The state we want to duplicate.
    old_id: StateID,
    /// The ID of the new state that is a duplicate of old_id.
    new_id: StateID,
}

/// The next state (and its corresponding transition) that we want to visit
/// during iteration in lexicographic order.
#[derive(Clone, Debug)]
struct NextIter {
    state_id: StateID,
    tidx: usize,
}

/// The next state to process during insertion and any remaining ranges that we
/// want to add for a particular sequence of ranges. The first such instance
/// is always the root state along with all ranges given.
#[derive(Clone, Debug)]
struct NextInsert {
    /// The next state to begin inserting ranges. This state should be the
    /// state at which `ranges[0]` should be inserted.
    state_id: StateID,
    /// The ranges to insert. We use a fixed-size array here to avoid an
    /// allocation.
    ranges: [Utf8Range; 4],
    /// The number of valid ranges in the above array.
    len: u8,
}

impl NextInsert {
    /// Create the next item to visit. The given state ID should correspond
    /// to the state at which the first range in the given slice should be
    /// inserted. The slice given must not be empty and it must be no longer
    /// than 4.
    fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
        let len = ranges.len();
        assert!(len > 0);
        assert!(len <= 4);

        let mut tmp = [Utf8Range { start: 0, end: 0 }; 4];
        tmp[..len].copy_from_slice(ranges);
        NextInsert { state_id, ranges: tmp, len: len as u8 }
    }

    /// Push a new empty state to visit along with any remaining ranges that
    /// still need to be inserted. The ID of the new empty state is returned.
    ///
    /// If ranges is empty, then no new state is created and FINAL is returned.
    fn push(
        trie: &mut RangeTrie,
        stack: &mut Vec<NextInsert>,
        ranges: &[Utf8Range],
    ) -> StateID {
        if ranges.is_empty() {
            FINAL
        } else {
            let next_id = trie.add_empty();
            stack.push(NextInsert::new(next_id, ranges));
            next_id
        }
    }

    /// Return the ID of the state to visit.
    fn state_id(&self) -> StateID {
        self.state_id
    }

    /// Return the remaining ranges to insert.
    fn ranges(&self) -> &[Utf8Range] {
        &self.ranges[..self.len as usize]
    }
}

/// Split represents a partitioning of two ranges into one or more ranges. This
/// is the secret sauce that makes a range trie work, as it's what tells us
/// how to deal with two overlapping but unequal ranges during insertion.
///
/// Essentially, either two ranges overlap or they don't. If they don't, then
/// handling insertion is easy: just insert the new range into its
/// lexicographically correct position. Since it does not overlap with anything
/// else, no other transitions are impacted by the new range.
///
/// If they do overlap though, there are generally three possible cases to
/// handle:
///
/// 1. The part where the two ranges actually overlap. i.e., The intersection.
/// 2. The part of the existing range that is not in the new range.
/// 3. The part of the new range that is not in the old range.
///
/// (1) is guaranteed to always occur since all overlapping ranges have a
/// non-empty intersection. If the two ranges are not equivalent, then at
/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
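///
/// As a concrete instance of that "all three" case, splitting an existing
/// range [0-4] against a new range [4-9] (case 10 below) produces the
/// partitions Old([0-3]), Both([4-4]) and New([5-9]).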
///
/// This `Split` type is responsible for providing (1), (2) and (3) for any
/// possible pair of byte ranges.
///
/// As for insertion, for the overlap in (1), the remaining ranges to insert
/// should be added by following the corresponding transition. However, this
/// should only be done for the overlapping parts of the range. If there was
/// a part of the existing range that was not in the new range, then that
/// existing part must be split off from the transition and duplicated. The
/// remaining parts of the overlap can then be added to using the new ranges
/// without disturbing the existing range.
///
/// Handling the case for the part of a new range that is not in an existing
/// range is seemingly easy. Just treat it as if it were a non-overlapping
/// range. The problem here is that if this new non-overlapping range occurs
/// after both (1) and (2), then it's possible that it can overlap with the
/// next transition in the current state. If it does, then the whole process
/// must be repeated!
///
/// # Details of the 3 cases
///
/// The following details the various cases that are implemented in code
/// below. It's plausible that the number of cases is not actually minimal,
/// but it's important for this code to remain at least somewhat readable.
///
/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
/// the following distinct relationships where at least one must apply. The
/// order of these matters, since multiple can match. The first to match
/// applies.
///
/// 1. b < x <=> [a,b] < [x,y]
/// 2. y < a <=> [x,y] < [a,b]
///
/// In the case of (1) and (2), these are the only cases where there is no
/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
/// intersection in all of the following cases is non-empty.
///
/// 3. a = x && b = y <=> [a,b] == [x,y]
/// 4. a = x && b < y <=> [x,y] right-extends [a,b]
/// 5. b = y && a > x <=> [x,y] left-extends [a,b]
/// 6. x = a && y < b <=> [a,b] right-extends [x,y]
/// 7. y = b && x > a <=> [a,b] left-extends [x,y]
/// 8. a > x && b < y <=> [x,y] covers [a,b]
/// 9. x > a && y < b <=> [a,b] covers [x,y]
/// 10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
/// 11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
/// 12. b > x && b < y <=> [a,b] left-overlaps [x,y]
/// 13. y > a && y < b <=> [x,y] left-overlaps [a,b]
///
/// In cases 3-13, we can form rules that partition the ranges into a
/// non-overlapping ordered sequence of ranges:
///
/// 3. [a,b]
/// 4. [a,b], [b+1,y]
/// 5. [x,a-1], [a,b]
/// 6. [x,y], [y+1,b]
/// 7. [a,x-1], [x,y]
/// 8. [x,a-1], [a,b], [b+1,y]
/// 9. [a,x-1], [x,y], [y+1,b]
/// 10. [a,b-1], [b,b], [b+1,y]
/// 11. [x,y-1], [y,y], [y+1,b]
/// 12. [a,x-1], [x,b], [b+1,y]
/// 13. [x,a-1], [a,y], [y+1,b]
///
/// In the code below, we go a step further and identify each of the above
/// outputs as belonging either to the overlap of the two ranges or to one
/// of [a,b] or [x,y] exclusively.
#[derive(Clone, Debug, Eq, PartialEq)]
struct Split {
    partitions: [SplitRange; 3],
    len: usize,
}

/// A tagged range indicating how it was derived from a pair of ranges.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum SplitRange {
    Old(Utf8Range),
    New(Utf8Range),
    Both(Utf8Range),
}

impl Split {
    /// Create a partitioning of the given ranges.
    ///
    /// If the given ranges have an empty intersection, then None is returned.
    fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
        let range = |r: RangeInclusive<u8>| Utf8Range {
            start: *r.start(),
            end: *r.end(),
        };
        let old = |r| SplitRange::Old(range(r));
        let new = |r| SplitRange::New(range(r));
        let both = |r| SplitRange::Both(range(r));

        // Use same names as the comment above to make it easier to compare.
        let (a, b, x, y) = (o.start, o.end, n.start, n.end);

        if b < x || y < a {
            // case 1, case 2
            None
        } else if a == x && b == y {
            // case 3
            Some(Split::parts1(both(a..=b)))
        } else if a == x && b < y {
            // case 4
            Some(Split::parts2(both(a..=b), new(b + 1..=y)))
        } else if b == y && a > x {
            // case 5
            Some(Split::parts2(new(x..=a - 1), both(a..=b)))
        } else if x == a && y < b {
            // case 6
            Some(Split::parts2(both(x..=y), old(y + 1..=b)))
        } else if y == b && x > a {
            // case 7
            Some(Split::parts2(old(a..=x - 1), both(x..=y)))
        } else if a > x && b < y {
            // case 8
            Some(Split::parts3(new(x..=a - 1), both(a..=b), new(b + 1..=y)))
        } else if x > a && y < b {
            // case 9
            Some(Split::parts3(old(a..=x - 1), both(x..=y), old(y + 1..=b)))
        } else if b == x && a < y {
            // case 10
            Some(Split::parts3(old(a..=b - 1), both(b..=b), new(b + 1..=y)))
        } else if y == a && x < b {
            // case 11
            Some(Split::parts3(new(x..=y - 1), both(y..=y), old(y + 1..=b)))
        } else if b > x && b < y {
            // case 12
            Some(Split::parts3(old(a..=x - 1), both(x..=b), new(b + 1..=y)))
        } else if y > a && y < b {
            // case 13
            Some(Split::parts3(new(x..=a - 1), both(a..=y), old(y + 1..=b)))
        } else {
            unreachable!()
        }
    }

    /// Create a new split with a single partition. This only occurs when two
    /// ranges are equivalent.
    fn parts1(r1: SplitRange) -> Split {
        // This value doesn't matter since it is never accessed.
        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
        Split { partitions: [r1, nada, nada], len: 1 }
    }

    /// Create a new split with two partitions.
    fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
        // This value doesn't matter since it is never accessed.
        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
        Split { partitions: [r1, r2, nada], len: 2 }
    }

    /// Create a new split with three partitions.
    fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
        Split { partitions: [r1, r2, r3], len: 3 }
    }

    /// Return the partitions in this split as a slice.
    fn as_slice(&self) -> &[SplitRange] {
        &self.partitions[..self.len]
    }
}

impl fmt::Debug for RangeTrie {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        writeln!(f, "")?;
        for (i, state) in self.states.iter().enumerate() {
            let status = if i == FINAL as usize { '*' } else { ' ' };
            writeln!(f, "{}{:06}: {:?}", status, i, state)?;
        }
        Ok(())
    }
}

impl fmt::Debug for State {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let rs = self
            .transitions
            .iter()
            .map(|t| format!("{:?}", t))
            .collect::<Vec<String>>()
            .join(", ");
        write!(f, "{}", rs)
    }
}

impl fmt::Debug for Transition {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        if self.range.start == self.range.end {
            write!(f, "{:02X} => {:02X}", self.range.start, self.next_id)
        } else {
            write!(
                f,
                "{:02X}-{:02X} => {:02X}",
                self.range.start, self.range.end, self.next_id
            )
        }
    }
}

/// Returns true if and only if the given ranges intersect.
fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
    !(r1.end < r2.start || r2.end < r1.start)
}

#[cfg(test)]
mod tests {
    use std::ops::RangeInclusive;

    use regex_syntax::utf8::Utf8Range;

    use super::*;

    fn r(range: RangeInclusive<u8>) -> Utf8Range {
        Utf8Range { start: *range.start(), end: *range.end() }
    }

    fn split_maybe(
        old: RangeInclusive<u8>,
        new: RangeInclusive<u8>,
    ) -> Option<Split> {
        Split::new(r(old), r(new))
    }

    fn split(
        old: RangeInclusive<u8>,
        new: RangeInclusive<u8>,
    ) -> Vec<SplitRange> {
        split_maybe(old, new).unwrap().as_slice().to_vec()
    }

    #[test]
    fn no_splits() {
        // case 1
        assert_eq!(None, split_maybe(0..=1, 2..=3));
        // case 2
        assert_eq!(None, split_maybe(2..=3, 0..=1));
    }

    #[test]
    fn splits() {
        let range = |r: RangeInclusive<u8>| Utf8Range {
            start: *r.start(),
            end: *r.end(),
        };
        let old = |r| SplitRange::Old(range(r));
        let new = |r| SplitRange::New(range(r));
        let both = |r| SplitRange::Both(range(r));

        // case 3
        assert_eq!(split(0..=0, 0..=0), vec![both(0..=0)]);
        assert_eq!(split(9..=9, 9..=9), vec![both(9..=9)]);

        // case 4
        assert_eq!(split(0..=5, 0..=6), vec![both(0..=5), new(6..=6)]);
        assert_eq!(split(0..=5, 0..=8), vec![both(0..=5), new(6..=8)]);
        assert_eq!(split(5..=5, 5..=8), vec![both(5..=5), new(6..=8)]);

        // case 5
        assert_eq!(split(1..=5, 0..=5), vec![new(0..=0), both(1..=5)]);
        assert_eq!(split(3..=5, 0..=5), vec![new(0..=2), both(3..=5)]);
        assert_eq!(split(5..=5, 0..=5), vec![new(0..=4), both(5..=5)]);

        // case 6
        assert_eq!(split(0..=6, 0..=5), vec![both(0..=5), old(6..=6)]);
        assert_eq!(split(0..=8, 0..=5), vec![both(0..=5), old(6..=8)]);
        assert_eq!(split(5..=8, 5..=5), vec![both(5..=5), old(6..=8)]);

        // case 7
        assert_eq!(split(0..=5, 1..=5), vec![old(0..=0), both(1..=5)]);
        assert_eq!(split(0..=5, 3..=5), vec![old(0..=2), both(3..=5)]);
        assert_eq!(split(0..=5, 5..=5), vec![old(0..=4), both(5..=5)]);

        // case 8
        assert_eq!(
            split(3..=6, 2..=7),
            vec![new(2..=2), both(3..=6), new(7..=7)],
        );
        assert_eq!(
            split(3..=6, 1..=8),
            vec![new(1..=2), both(3..=6), new(7..=8)],
        );

        // case 9
        assert_eq!(
            split(2..=7, 3..=6),
            vec![old(2..=2), both(3..=6), old(7..=7)],
        );
        assert_eq!(
            split(1..=8, 3..=6),
            vec![old(1..=2), both(3..=6), old(7..=8)],
        );

        // case 10
        assert_eq!(
            split(3..=6, 6..=7),
            vec![old(3..=5), both(6..=6), new(7..=7)],
        );
        assert_eq!(
            split(3..=6, 6..=8),
            vec![old(3..=5), both(6..=6), new(7..=8)],
        );
        assert_eq!(
            split(5..=6, 6..=7),
            vec![old(5..=5), both(6..=6), new(7..=7)],
        );

        // case 11
        assert_eq!(
            split(6..=7, 3..=6),
            vec![new(3..=5), both(6..=6), old(7..=7)],
        );
        assert_eq!(
            split(6..=8, 3..=6),
            vec![new(3..=5), both(6..=6), old(7..=8)],
        );
        assert_eq!(
            split(6..=7, 5..=6),
            vec![new(5..=5), both(6..=6), old(7..=7)],
        );

        // case 12
        assert_eq!(
            split(3..=7, 5..=9),
            vec![old(3..=4), both(5..=7), new(8..=9)],
        );
        assert_eq!(
            split(3..=5, 4..=6),
            vec![old(3..=3), both(4..=5), new(6..=6)],
        );

        // case 13
        assert_eq!(
            split(5..=9, 3..=7),
            vec![new(3..=4), both(5..=7), old(8..=9)],
        );
        assert_eq!(
            split(4..=6, 3..=5),
            vec![new(3..=3), both(4..=5), old(6..=6)],
        );
    }

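    // A sanity check of the trie itself: overlapping inserted sequences
    // should come back out of iteration as sorted, non-overlapping sequences.
    // The specific ranges are chosen only for illustration; per case 7 of the
    // splitting rules above, [80-BF] against [A0-BF] splits into [80-9F] and
    // [A0-BF].
    #[test]
    fn insert_splits_overlapping_ranges() {
        let mut trie = RangeTrie::new();
        // Something like a reversed UTF-8 sequence: [80-BF][E1-EC].
        trie.insert(&[r(0x80..=0xBF), r(0xE1..=0xEC)]);
        // And an overlapping one: [A0-BF][E0-E0].
        trie.insert(&[r(0xA0..=0xBF), r(0xE0..=0xE0)]);

        let mut got = vec![];
        trie.iter(|ranges| got.push(ranges.to_vec()));
        let expected = vec![
            vec![r(0x80..=0x9F), r(0xE1..=0xEC)],
            vec![r(0xA0..=0xBF), r(0xE0..=0xE0)],
            vec![r(0xA0..=0xBF), r(0xE1..=0xEC)],
        ];
        assert_eq!(expected, got);
    }
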
    // Arguably there should be more tests here, but in practice, this data
    // structure is well covered by the huge number of regex tests.
}