1 // I've called the primary data structure in this module a "range trie." As far
2 // as I can tell, there is no prior art on a data structure like this, however,
3 // it's likely someone somewhere has built something like it. Searching for
4 // "range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
5 // but it does not appear relevant.
7 // The range trie is just like a trie in that it is a special case of a
8 // deterministic finite state machine. It has states and each state has a set
9 // of transitions to other states. It is acyclic, and, like a normal trie,
10 // it makes no attempt to reuse common suffixes among its elements. The key
11 // difference between a normal trie and a range trie below is that a range trie
12 // operates on *contiguous sequences* of bytes instead of singleton bytes.
13 // One could say say that our alphabet is ranges of bytes instead of bytes
14 // themselves, except a key part of range trie construction is splitting ranges
15 // apart to ensure there is at most one transition that can be taken for any
16 // byte in a given state.
18 // I've tried to explain the details of how the range trie works below, so
19 // for now, we are left with trying to understand what problem we're trying to
20 // solve. Which is itself fairly involved!
22 // At the highest level, here's what we want to do. We want to convert a
23 // sequence of Unicode codepoints into a finite state machine whose transitions
24 // are over *bytes* and *not* Unicode codepoints. We want this because it makes
25 // said finite state machines much smaller and much faster to execute. As a
26 // simple example, consider a byte oriented automaton for all Unicode scalar
27 // values (0x00 through 0x10FFFF, not including surrogate codepoints):
31 // [E0-E0][A0-BF][80-BF]
32 // [E1-EC][80-BF][80-BF]
33 // [ED-ED][80-9F][80-BF]
34 // [EE-EF][80-BF][80-BF]
35 // [F0-F0][90-BF][80-BF][80-BF]
36 // [F1-F3][80-BF][80-BF][80-BF]
37 // [F4-F4][80-8F][80-BF][80-BF]
39 // (These byte ranges are generated via the regex-syntax::utf8 module, which
40 // was based on Russ Cox's code in RE2, which was in turn based on Ken
41 // Thompson's implementation of the same idea in his Plan9 implementation of
44 // It should be fairly straight-forward to see how one could compile this into
45 // a DFA. The sequences are sorted and non-overlapping. Essentially, you could
46 // build a trie from this fairly easy. The problem comes when your initial
47 // range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
48 // represented by '\w' contains only a tenth of the codepoints that
49 // 0x00-0x10FFFF contains, but if we were to write out the byte based ranges
50 // as we did above, the list would stretch to 892 entries! This turns into
51 // quite a large NFA with a few thousand states. Turning this beast into a DFA
52 // takes quite a bit of time. We are thus left with trying to trim down the
53 // number of states we produce as early as possible.
55 // One approach (used by RE2 and still by the regex crate, at time of writing)
56 // is to try to find common suffixes while building NFA states for the above
57 // and reuse them. This is very cheap to do and one can control precisely how
58 // much extra memory you want to use for the cache.
60 // Another approach, however, is to reuse an algorithm for constructing a
61 // *minimal* DFA from a sorted sequence of inputs. I don't want to go into
62 // the full details here, but I explain it in more depth in my blog post on
63 // FSTs[1]. Note that the algorithm not invented by me, but was published
64 // in paper by Daciuk et al. in 2000 called "Incremental Construction of
65 // MinimalAcyclic Finite-State Automata." Like the suffix cache approach above,
66 // it is also possible to control the amount of extra memory one uses, although
67 // this usually comes with the cost of sacrificing true minimality. (But it's
68 // typically close enough with a reasonably sized cache of states.)
70 // The catch is that Daciuk's algorithm only works if you add your keys in
71 // lexicographic ascending order. In our case, since we're dealing with ranges,
72 // we also need the additional requirement that ranges are either equivalent
73 // or do not overlap at all. For example, if one were given the following byte
79 // Then Daciuk's algorithm also would not work, since there is nothing to
80 // handle the fact that the ranges overlap. They would need to be split apart.
81 // Thankfully, Thompson's algorithm for producing byte ranges for Unicode
82 // codepoint ranges meets both of our requirements.
84 // ... however, we would also like to be able to compile UTF-8 automata in
85 // reverse. We want this because in order to find the starting location of a
86 // match using a DFA, we need to run a second DFA---a reversed version of the
87 // forward DFA---backwards to discover the match location. Unfortunately, if
88 // we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that are
89 // can overlap, even if they are sorted:
92 // [80-BF][80-9F][ED-ED]
93 // [80-BF][80-BF][80-8F][F4-F4]
94 // [80-BF][80-BF][80-BF][F1-F3]
95 // [80-BF][80-BF][90-BF][F0-F0]
96 // [80-BF][80-BF][E1-EC]
97 // [80-BF][80-BF][EE-EF]
98 // [80-BF][A0-BF][E0-E0]
101 // For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
102 // overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
103 // simple way to apply Daciuk's algorithm.
105 // And thus, the range trie was born. The range trie's only purpose is to take
106 // sequences of byte ranges like the ones above, collect them into a trie and
107 // then spit them in a sorted fashion with no overlapping ranges. For example,
108 // 0x00-0x10FFFF gets translated to:
111 // [80-BF][80-9F][80-8F][F1-F3]
112 // [80-BF][80-9F][80-8F][F4]
113 // [80-BF][80-9F][90-BF][F0]
114 // [80-BF][80-9F][90-BF][F1-F3]
115 // [80-BF][80-9F][E1-EC]
116 // [80-BF][80-9F][ED]
117 // [80-BF][80-9F][EE-EF]
118 // [80-BF][A0-BF][80-8F][F1-F3]
119 // [80-BF][A0-BF][80-8F][F4]
120 // [80-BF][A0-BF][90-BF][F0]
121 // [80-BF][A0-BF][90-BF][F1-F3]
122 // [80-BF][A0-BF][E0]
123 // [80-BF][A0-BF][E1-EC]
124 // [80-BF][A0-BF][EE-EF]
127 // We've thus satisfied our requirements for running Daciuk's algorithm. All
128 // sequences of ranges are sorted, and any corresponding ranges are either
129 // exactly equivalent or non-overlapping.
131 // In effect, a range trie is building a DFA from a sequence of arbitrary
132 // byte ranges. But it uses an algoritm custom tailored to its input, so it
133 // is not as costly as traditional DFA construction. While it is still quite
134 // a bit more costly than the forward's case (which only needs Daciuk's
135 // algorithm), it winds up saving a substantial amount of time if one is doing
136 // a full DFA powerset construction later by virtue of producing a much much
139 // [1] - https://blog.burntsushi.net/transducers/
140 // [2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
142 use std
::cell
::RefCell
;
145 use std
::ops
::RangeInclusive
;
148 use regex_syntax
::utf8
::Utf8Range
;
150 /// A smaller state ID means more effective use of the CPU cache and less
151 /// time spent copying. The implementation below will panic if the state ID
152 /// space is exhausted, but in order for that to happen, the range trie itself
153 /// would use well over 100GB of memory. Moreover, it's likely impossible
154 /// for the state ID space to get that big. In fact, it's likely that even a
155 /// u16 would be good enough here. But it's not quite clear how to prove this.
158 /// There is only one final state in this trie. Every sequence of byte ranges
159 /// added shares the same final state.
160 const FINAL
: StateID
= 0;
162 /// The root state of the trie.
163 const ROOT
: StateID
= 1;
165 /// A range trie represents an ordered set of sequences of bytes.
167 /// A range trie accepts as input a sequence of byte ranges and merges
168 /// them into the existing set such that the trie can produce a sorted
169 /// non-overlapping sequence of byte ranges. The sequence emitted corresponds
170 /// precisely to the sequence of bytes matched by the given keys, although the
171 /// byte ranges themselves may be split at different boundaries.
173 /// The order complexity of this data structure seems difficult to analyze.
174 /// If the size of a byte is held as a constant, then insertion is clearly
175 /// O(n) where n is the number of byte ranges in the input key. However, if
176 /// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
177 /// particular it seems possible for pathological inputs to cause insertion
178 /// to do a lot of work. However, for what we use this data structure for,
179 /// there should be no pathological inputs since the ultimate source is always
180 /// a sorted set of Unicode scalar value ranges.
182 /// Internally, this trie is setup like a finite state machine. Note though
183 /// that it is acyclic.
185 pub struct RangeTrie
{
186 /// The states in this trie. The first is always the shared final state.
187 /// The second is always the root state. Otherwise, there is no
188 /// particular order.
190 /// A free-list of states. When a range trie is cleared, all of its states
191 /// are added to list. Creating a new state reuses states from this list
192 /// before allocating a new one.
194 /// A stack for traversing this trie to yield sequences of byte ranges in
195 /// lexicographic order.
196 iter_stack
: RefCell
<Vec
<NextIter
>>,
197 /// A bufer that stores the current sequence during iteration.
198 iter_ranges
: RefCell
<Vec
<Utf8Range
>>,
199 /// A stack used for traversing the trie in order to (deeply) duplicate
201 dupe_stack
: Vec
<NextDupe
>,
202 /// A stack used for traversing the trie during insertion of a new
203 /// sequence of byte ranges.
204 insert_stack
: Vec
<NextInsert
>,
207 /// A single state in this trie.
210 /// A sorted sequence of non-overlapping transitions to other states. Each
211 /// transition corresponds to a single range of bytes.
212 transitions
: Vec
<Transition
>,
215 /// A transition is a single range of bytes. If a particular byte is in this
216 /// range, then the corresponding machine may transition to the state pointed
222 /// The next state to transition to.
227 /// Create a new empty range trie.
228 pub fn new() -> RangeTrie
{
229 let mut trie
= RangeTrie
{
232 iter_stack
: RefCell
::new(vec
![]),
233 iter_ranges
: RefCell
::new(vec
![]),
235 insert_stack
: vec
![],
241 /// Clear this range trie such that it is empty. Clearing a range trie
242 /// and reusing it can beneficial because this may reuse allocations.
243 pub fn clear(&mut self) {
244 self.free
.extend(self.states
.drain(..));
245 self.add_empty(); // final
246 self.add_empty(); // root
249 /// Iterate over all of the sequences of byte ranges in this trie, and
250 /// call the provided function for each sequence. Iteration occurs in
251 /// lexicographic order.
252 pub fn iter
<F
: FnMut(&[Utf8Range
])>(&self, mut f
: F
) {
253 let mut stack
= self.iter_stack
.borrow_mut();
255 let mut ranges
= self.iter_ranges
.borrow_mut();
258 // We do iteration in a way that permits us to use a single buffer
259 // for our keys. We iterate in a depth first fashion, while being
260 // careful to expand our frontier as we move deeper in the trie.
261 stack
.push(NextIter { state_id: ROOT, tidx: 0 }
);
262 while let Some(NextIter { mut state_id, mut tidx }
) = stack
.pop() {
263 // This could be implemented more simply without an inner loop
264 // here, but at the cost of more stack pushes.
266 let state
= self.state(state_id
);
267 // If we're visited all transitions in this state, then pop
268 // back to the parent state.
269 if tidx
>= state
.transitions
.len() {
274 let t
= &state
.transitions
[tidx
];
275 ranges
.push(t
.range
);
276 if t
.next_id
== FINAL
{
281 // Expand our frontier. Once we come back to this state
282 // via the stack, start in on the next transition.
283 stack
.push(NextIter { state_id, tidx: tidx + 1 }
);
284 // Otherwise, move to the first transition of the next
286 state_id
= t
.next_id
;
293 /// Inserts a new sequence of ranges into this trie.
295 /// The sequence given must be non-empty and must not have a length
297 pub fn insert(&mut self, ranges
: &[Utf8Range
]) {
298 assert
!(!ranges
.is_empty());
299 assert
!(ranges
.len() <= 4);
301 let mut stack
= mem
::replace(&mut self.insert_stack
, vec
![]);
304 stack
.push(NextInsert
::new(ROOT
, ranges
));
305 while let Some(next
) = stack
.pop() {
306 let (state_id
, ranges
) = (next
.state_id(), next
.ranges());
307 assert
!(!ranges
.is_empty());
309 let (mut new
, rest
) = (ranges
[0], &ranges
[1..]);
311 // i corresponds to the position of the existing transition on
312 // which we are operating. Typically, the result is to remove the
313 // transition and replace it with two or more new transitions
314 // corresponding to the partitions generated by splitting the
315 // 'new' with the ith transition's range.
316 let mut i
= self.state(state_id
).find(new
);
318 // In this case, there is no overlap *and* the new range is greater
319 // than all existing ranges. So we can just add it to the end.
320 if i
== self.state(state_id
).transitions
.len() {
321 let next_id
= NextInsert
::push(self, &mut stack
, rest
);
322 self.add_transition(state_id
, new
, next_id
);
326 // The need for this loop is a bit subtle, buf basically, after
327 // we've handled the partitions from our initial split, it's
328 // possible that there will be a partition leftover that overlaps
329 // with a subsequent transition. If so, then we have to repeat
330 // the split process again with the leftovers and that subsequent
333 let old
= self.state(state_id
).transitions
[i
].clone();
334 let split
= match Split
::new(old
.range
, new
) {
335 Some(split
) => split
,
337 let next_id
= NextInsert
::push(self, &mut stack
, rest
);
338 self.add_transition_at(i
, state_id
, new
, next_id
);
342 let splits
= split
.as_slice();
343 // If we only have one partition, then the ranges must be
344 // equivalent. There's nothing to do here for this state, so
345 // just move on to the next one.
346 if splits
.len() == 1 {
347 // ... but only if we have anything left to do.
348 if !rest
.is_empty() {
349 stack
.push(NextInsert
::new(old
.next_id
, rest
));
353 // At this point, we know that 'split' is non-empty and there
354 // must be some overlap AND that the two ranges are not
355 // equivalent. Therefore, the existing range MUST be removed
356 // and split up somehow. Instead of actually doing the removal
357 // and then a subsequent insertion---with all the memory
358 // shuffling that entails---we simply overwrite the transition
359 // at position `i` for the first new transition we want to
360 // insert. After that, we're forced to do expensive inserts.
361 let mut first
= true;
363 |trie
: &mut RangeTrie
, pos
, from
, range
, to
| {
365 trie
.set_transition_at(pos
, from
, range
, to
);
368 trie
.add_transition_at(pos
, from
, range
, to
);
371 for (j
, &srange
) in splits
.iter().enumerate() {
373 SplitRange
::Old(r
) => {
374 // Deep clone the state pointed to by the ith
375 // transition. This is always necessary since 'old'
376 // is always coupled with at least a 'both'
377 // partition. We don't want any new changes made
378 // via the 'both' partition to impact the part of
379 // the transition that doesn't overlap with the
381 let dup_id
= self.duplicate(old
.next_id
);
382 add_trans(self, i
, state_id
, r
, dup_id
);
384 SplitRange
::New(r
) => {
385 // This is a bit subtle, but if this happens to be
386 // the last partition in our split, it is possible
387 // that this overlaps with a subsequent transition.
388 // If it does, then we must repeat the whole
389 // splitting process over again with `r` and the
390 // subsequent transition.
392 let trans
= &self.state(state_id
).transitions
;
393 if j
+ 1 == splits
.len()
395 && intersects(r
, trans
[i
].range
)
402 // ... otherwise, setup exploration for a new
403 // empty state and add a brand new transition for
406 NextInsert
::push(self, &mut stack
, rest
);
407 add_trans(self, i
, state_id
, r
, next_id
);
409 SplitRange
::Both(r
) => {
410 // Continue adding the remaining ranges on this
411 // path and update the transition with the new
413 if !rest
.is_empty() {
414 stack
.push(NextInsert
::new(old
.next_id
, rest
));
416 add_trans(self, i
, state_id
, r
, old
.next_id
);
421 // If we've reached this point, then we know that there are
422 // no subsequent transitions with any overlap. Therefore, we
423 // can stop processing this range and move on to the next one.
427 self.insert_stack
= stack
;
430 pub fn add_empty(&mut self) -> StateID
{
431 if self.states
.len() as u64 > u32::MAX
as u64 {
432 // This generally should not happen since a range trie is only
433 // ever used to compile a single sequence of Unicode scalar values.
434 // If we ever got to this point, we would, at *minimum*, be using
435 // 96GB in just the range trie alone.
436 panic
!("too many sequences added to range trie");
438 let id
= self.states
.len() as StateID
;
439 // If we have some free states available, then use them to avoid
441 if let Some(mut state
) = self.free
.pop() {
443 self.states
.push(state
);
445 self.states
.push(State { transitions: vec![] }
);
450 /// Performs a deep clone of the given state and returns the duplicate's
453 /// A "deep clone" in this context means that the state given along with
454 /// recursively all states that it points to are copied. Once complete,
455 /// the given state ID and the returned state ID share nothing.
457 /// This is useful during range trie insertion when a new range overlaps
458 /// with an existing range that is bigger than the new one. The part of
459 /// the existing range that does *not* overlap with the new one is that
460 /// duplicated so that adding the new range to the overlap doesn't disturb
461 /// the non-overlapping portion.
463 /// There's one exception: if old_id is the final state, then it is not
464 /// duplicated and the same final state is returned. This is because all
465 /// final states in this trie are equivalent.
466 fn duplicate(&mut self, old_id
: StateID
) -> StateID
{
471 let mut stack
= mem
::replace(&mut self.dupe_stack
, vec
![]);
474 let new_id
= self.add_empty();
475 // old_id is the state we're cloning and new_id is the ID of the
476 // duplicated state for old_id.
477 stack
.push(NextDupe { old_id, new_id }
);
478 while let Some(NextDupe { old_id, new_id }
) = stack
.pop() {
479 for i
in 0..self.state(old_id
).transitions
.len() {
480 let t
= self.state(old_id
).transitions
[i
].clone();
481 if t
.next_id
== FINAL
{
482 // All final states are the same, so there's no need to
484 self.add_transition(new_id
, t
.range
, FINAL
);
488 let new_child_id
= self.add_empty();
489 self.add_transition(new_id
, t
.range
, new_child_id
);
490 stack
.push(NextDupe
{
492 new_id
: new_child_id
,
496 self.dupe_stack
= stack
;
500 /// Adds the given transition to the given state.
502 /// Callers must ensure that all previous transitions in this state
503 /// are lexicographically smaller than the given range.
510 self.state_mut(from_id
)
512 .push(Transition { range, next_id }
);
515 /// Like `add_transition`, except this inserts the transition just before
516 /// the ith transition.
517 fn add_transition_at(
524 self.state_mut(from_id
)
526 .insert(i
, Transition { range, next_id }
);
529 /// Overwrites the transition at position i with the given transition.
530 fn set_transition_at(
537 self.state_mut(from_id
).transitions
[i
] = Transition { range, next_id }
;
540 /// Return an immutable borrow for the state with the given ID.
541 fn state(&self, id
: StateID
) -> &State
{
542 &self.states
[id
as usize]
545 /// Return a mutable borrow for the state with the given ID.
546 fn state_mut(&mut self, id
: StateID
) -> &mut State
{
547 &mut self.states
[id
as usize]
552 /// Find the position at which the given range should be inserted in this
555 /// The position returned is always in the inclusive range
556 /// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557 /// given range overlaps with no other range in this state *and* is greater
558 /// than all of them.
560 /// For all other possible positions, the given range either overlaps
561 /// with the transition at that position or is otherwise less than it
562 /// with no overlap (and is greater than the previous transition). In the
563 /// former case, careful attention must be paid to inserting this range
564 /// as a new transition. In the latter case, the range can be inserted as
565 /// a new transition at the given position without disrupting any other
567 fn find(&self, range
: Utf8Range
) -> usize {
568 /// Returns the position `i` at which `pred(xs[i])` first returns true
569 /// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570 /// returns true, then `xs.len()` is returned.
572 /// We roll our own binary search because it doesn't seem like the
573 /// standard library's binary search can be used here. Namely, if
574 /// there is an overlapping range, then we want to find the first such
575 /// occurrence, but there may be many. Or at least, it's not quite
576 /// clear to me how to do it.
577 fn binary_search
<T
, F
>(xs
: &[T
], mut pred
: F
) -> usize
579 F
: FnMut(&T
) -> bool
,
581 let (mut left
, mut right
) = (0, xs
.len());
583 // Overflow is impossible because xs.len() <= 256.
584 let mid
= (left
+ right
) / 2;
594 // Benchmarks suggest that binary search is just a bit faster than
595 // straight linear search. Specifically when using the debug tool:
597 // hyperfine "regex-automata-debug debug -acqr '\w{40} ecurB'"
598 binary_search(&self.transitions
, |t
| range
.start
<= t
.range
.end
)
601 /// Clear this state such that it has zero transitions.
602 fn clear(&mut self) {
603 self.transitions
.clear();
607 /// The next state to process during duplication.
608 #[derive(Clone, Debug)]
610 /// The state we want to duplicate.
612 /// The ID of the new state that is a duplicate of old_id.
616 /// The next state (and its corresponding transition) that we want to visit
617 /// during iteration in lexicographic order.
618 #[derive(Clone, Debug)]
624 /// The next state to process during insertion and any remaining ranges that we
625 /// want to add for a partcular sequence of ranges. The first such instance
626 /// is always the root state along with all ranges given.
627 #[derive(Clone, Debug)]
629 /// The next state to begin inserting ranges. This state should be the
630 /// state at which `ranges[0]` should be inserted.
632 /// The ranges to insert. We used a fixed-size array here to avoid an
634 ranges
: [Utf8Range
; 4],
635 /// The number of valid ranges in the above array.
640 /// Create the next item to visit. The given state ID should correspond
641 /// to the state at which the first range in the given slice should be
642 /// inserted. The slice given must not be empty and it must be no longer
644 fn new(state_id
: StateID
, ranges
: &[Utf8Range
]) -> NextInsert
{
645 let len
= ranges
.len();
649 let mut tmp
= [Utf8Range { start: 0, end: 0 }
; 4];
650 tmp
[..len
].copy_from_slice(ranges
);
651 NextInsert { state_id, ranges: tmp, len: len as u8 }
654 /// Push a new empty state to visit along with any remaining ranges that
655 /// still need to be inserted. The ID of the new empty state is returned.
657 /// If ranges is empty, then no new state is created and FINAL is returned.
659 trie
: &mut RangeTrie
,
660 stack
: &mut Vec
<NextInsert
>,
661 ranges
: &[Utf8Range
],
663 if ranges
.is_empty() {
666 let next_id
= trie
.add_empty();
667 stack
.push(NextInsert
::new(next_id
, ranges
));
672 /// Return the ID of the state to visit.
673 fn state_id(&self) -> StateID
{
677 /// Return the remaining ranges to insert.
678 fn ranges(&self) -> &[Utf8Range
] {
679 &self.ranges
[..self.len
as usize]
683 /// Split represents a partitioning of two ranges into one or more ranges. This
684 /// is the secret sauce that makes a range trie work, as it's what tells us
685 /// how to deal with two overlapping but unequal ranges during insertion.
687 /// Essentially, either two ranges overlap or they don't. If they don't, then
688 /// handling insertion is easy: just insert the new range into its
689 /// lexicographically correct position. Since it does not overlap with anything
690 /// else, no other transitions are impacted by the new range.
692 /// If they do overlap though, there are generally three possible cases to
695 /// 1. The part where the two ranges actually overlap. i.e., The intersection.
696 /// 2. The part of the existing range that is not in the the new range.
697 /// 3. The part of the new range that is not in the old range.
699 /// (1) is guaranteed to always occur since all overlapping ranges have a
700 /// non-empty intersection. If the two ranges are not equivalent, then at
701 /// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702 /// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
704 /// This `Split` type is responsible for providing (1), (2) and (3) for any
705 /// possible pair of byte ranges.
707 /// As for insertion, for the overlap in (1), the remaining ranges to insert
708 /// should be added by following the corresponding transition. However, this
709 /// should only be done for the overlapping parts of the range. If there was
710 /// a part of the existing range that was not in the new range, then that
711 /// existing part must be split off from the transition and duplicated. The
712 /// remaining parts of the overlap can then be added to using the new ranges
713 /// without disturbing the existing range.
715 /// Handling the case for the part of a new range that is not in an existing
716 /// range is seemingly easy. Just treat it as if it were a non-overlapping
717 /// range. The problem here is that if this new non-overlapping range occurs
718 /// after both (1) and (2), then it's possible that it can overlap with the
719 /// next transition in the current state. If it does, then the whole process
720 /// must be repeated!
722 /// # Details of the 3 cases
724 /// The following details the various cases that are implemented in code
725 /// below. It's plausible that the number of cases is not actually minimal,
726 /// but it's important for this code to remain at least somewhat readable.
728 /// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729 /// the follow distinct relationships where at least one must apply. The order
730 /// of these matters, since multiple can match. The first to match applies.
732 /// 1. b < x <=> [a,b] < [x,y]
733 /// 2. y < a <=> [x,y] < [a,b]
735 /// In the case of (1) and (2), these are the only cases where there is no
736 /// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737 /// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738 /// intersection in all of the following cases is non-empty.
740 /// 3. a = x && b = y <=> [a,b] == [x,y]
741 /// 4. a = x && b < y <=> [x,y] right-extends [a,b]
742 /// 5. b = y && a > x <=> [x,y] left-extends [a,b]
743 /// 6. x = a && y < b <=> [a,b] right-extends [x,y]
744 /// 7. y = b && x > a <=> [a,b] left-extends [x,y]
745 /// 8. a > x && b < y <=> [x,y] covers [a,b]
746 /// 9. x > a && y < b <=> [a,b] covers [x,y]
747 /// 10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748 /// 11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749 /// 12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750 /// 13. y > a && y < b <=> [x,y] left-overlaps [a,b]
752 /// In cases 3-13, we can form rules that partition the ranges into a
753 /// non-overlapping ordered sequence of ranges:
756 /// 4. [a,b], [b+1,y]
757 /// 5. [x,a-1], [a,b]
758 /// 6. [x,y], [y+1,b]
759 /// 7. [a,x-1], [x,y]
760 /// 8. [x,a-1], [a,b], [b+1,y]
761 /// 9. [a,x-1], [x,y], [y+1,b]
762 /// 10. [a,b-1], [b,b], [b+1,y]
763 /// 11. [x,y-1], [y,y], [y+1,b]
764 /// 12. [a,x-1], [x,b], [b+1,y]
765 /// 13. [x,a-1], [a,y], [y+1,b]
767 /// In the code below, we go a step further and identify each of the above
768 /// outputs as belonging either to the overlap of the two ranges or to one
769 /// of [a,b] or [x,y] exclusively.
770 #[derive(Clone, Debug, Eq, PartialEq)]
772 partitions
: [SplitRange
; 3],
776 /// A tagged range indicating how it was derived from a pair of ranges.
777 #[derive(Clone, Copy, Debug, Eq, PartialEq)]
785 /// Create a partitioning of the given ranges.
787 /// If the given ranges have an empty intersection, then None is returned.
788 fn new(o
: Utf8Range
, n
: Utf8Range
) -> Option
<Split
> {
789 let range
= |r
: RangeInclusive
<u8>| Utf8Range
{
793 let old
= |r
| SplitRange
::Old(range(r
));
794 let new
= |r
| SplitRange
::New(range(r
));
795 let both
= |r
| SplitRange
::Both(range(r
));
797 // Use same names as the comment above to make it easier to compare.
798 let (a
, b
, x
, y
) = (o
.start
, o
.end
, n
.start
, n
.end
);
803 } else if a
== x
&& b
== y
{
805 Some(Split
::parts1(both(a
..=b
)))
806 } else if a
== x
&& b
< y
{
808 Some(Split
::parts2(both(a
..=b
), new(b
+ 1..=y
)))
809 } else if b
== y
&& a
> x
{
811 Some(Split
::parts2(new(x
..=a
- 1), both(a
..=b
)))
812 } else if x
== a
&& y
< b
{
814 Some(Split
::parts2(both(x
..=y
), old(y
+ 1..=b
)))
815 } else if y
== b
&& x
> a
{
817 Some(Split
::parts2(old(a
..=x
- 1), both(x
..=y
)))
818 } else if a
> x
&& b
< y
{
820 Some(Split
::parts3(new(x
..=a
- 1), both(a
..=b
), new(b
+ 1..=y
)))
821 } else if x
> a
&& y
< b
{
823 Some(Split
::parts3(old(a
..=x
- 1), both(x
..=y
), old(y
+ 1..=b
)))
824 } else if b
== x
&& a
< y
{
826 Some(Split
::parts3(old(a
..=b
- 1), both(b
..=b
), new(b
+ 1..=y
)))
827 } else if y
== a
&& x
< b
{
829 Some(Split
::parts3(new(x
..=y
- 1), both(y
..=y
), old(y
+ 1..=b
)))
830 } else if b
> x
&& b
< y
{
832 Some(Split
::parts3(old(a
..=x
- 1), both(x
..=b
), new(b
+ 1..=y
)))
833 } else if y
> a
&& y
< b
{
835 Some(Split
::parts3(new(x
..=a
- 1), both(a
..=y
), old(y
+ 1..=b
)))
841 /// Create a new split with a single partition. This only occurs when two
842 /// ranges are equivalent.
843 fn parts1(r1
: SplitRange
) -> Split
{
844 // This value doesn't matter since it is never accessed.
845 let nada
= SplitRange
::Old(Utf8Range { start: 0, end: 0 }
);
846 Split { partitions: [r1, nada, nada], len: 1 }
849 /// Create a new split with two partitions.
850 fn parts2(r1
: SplitRange
, r2
: SplitRange
) -> Split
{
851 // This value doesn't matter since it is never accessed.
852 let nada
= SplitRange
::Old(Utf8Range { start: 0, end: 0 }
);
853 Split { partitions: [r1, r2, nada], len: 2 }
856 /// Create a new split with three partitions.
857 fn parts3(r1
: SplitRange
, r2
: SplitRange
, r3
: SplitRange
) -> Split
{
858 Split { partitions: [r1, r2, r3], len: 3 }
861 /// Return the partitions in this split as a slice.
862 fn as_slice(&self) -> &[SplitRange
] {
863 &self.partitions
[..self.len
]
867 impl fmt
::Debug
for RangeTrie
{
868 fn fmt(&self, f
: &mut fmt
::Formatter
) -> fmt
::Result
{
870 for (i
, state
) in self.states
.iter().enumerate() {
871 let status
= if i
== FINAL
as usize { '*' }
else { ' ' }
;
872 writeln
!(f
, "{}{:06}: {:?}", status
, i
, state
)?
;
878 impl fmt
::Debug
for State
{
879 fn fmt(&self, f
: &mut fmt
::Formatter
) -> fmt
::Result
{
883 .map(|t
| format
!("{:?}", t
))
884 .collect
::<Vec
<String
>>()
890 impl fmt
::Debug
for Transition
{
891 fn fmt(&self, f
: &mut fmt
::Formatter
) -> fmt
::Result
{
892 if self.range
.start
== self.range
.end
{
893 write
!(f
, "{:02X} => {:02X}", self.range
.start
, self.next_id
)
897 "{:02X}-{:02X} => {:02X}",
898 self.range
.start
, self.range
.end
, self.next_id
904 /// Returns true if and only if the given ranges intersect.
905 fn intersects(r1
: Utf8Range
, r2
: Utf8Range
) -> bool
{
906 !(r1
.end
< r2
.start
|| r2
.end
< r1
.start
)
911 use std
::ops
::RangeInclusive
;
913 use regex_syntax
::utf8
::Utf8Range
;
917 fn r(range
: RangeInclusive
<u8>) -> Utf8Range
{
918 Utf8Range { start: *range.start(), end: *range.end() }
922 old
: RangeInclusive
<u8>,
923 new
: RangeInclusive
<u8>,
925 Split
::new(r(old
), r(new
))
929 old
: RangeInclusive
<u8>,
930 new
: RangeInclusive
<u8>,
931 ) -> Vec
<SplitRange
> {
932 split_maybe(old
, new
).unwrap().as_slice().to_vec()
938 assert_eq
!(None
, split_maybe(0..=1, 2..=3));
940 assert_eq
!(None
, split_maybe(2..=3, 0..=1));
945 let range
= |r
: RangeInclusive
<u8>| Utf8Range
{
949 let old
= |r
| SplitRange
::Old(range(r
));
950 let new
= |r
| SplitRange
::New(range(r
));
951 let both
= |r
| SplitRange
::Both(range(r
));
954 assert_eq
!(split(0..=0, 0..=0), vec
![both(0..=0)]);
955 assert_eq
!(split(9..=9, 9..=9), vec
![both(9..=9)]);
958 assert_eq
!(split(0..=5, 0..=6), vec
![both(0..=5), new(6..=6)]);
959 assert_eq
!(split(0..=5, 0..=8), vec
![both(0..=5), new(6..=8)]);
960 assert_eq
!(split(5..=5, 5..=8), vec
![both(5..=5), new(6..=8)]);
963 assert_eq
!(split(1..=5, 0..=5), vec
![new(0..=0), both(1..=5)]);
964 assert_eq
!(split(3..=5, 0..=5), vec
![new(0..=2), both(3..=5)]);
965 assert_eq
!(split(5..=5, 0..=5), vec
![new(0..=4), both(5..=5)]);
968 assert_eq
!(split(0..=6, 0..=5), vec
![both(0..=5), old(6..=6)]);
969 assert_eq
!(split(0..=8, 0..=5), vec
![both(0..=5), old(6..=8)]);
970 assert_eq
!(split(5..=8, 5..=5), vec
![both(5..=5), old(6..=8)]);
973 assert_eq
!(split(0..=5, 1..=5), vec
![old(0..=0), both(1..=5)]);
974 assert_eq
!(split(0..=5, 3..=5), vec
![old(0..=2), both(3..=5)]);
975 assert_eq
!(split(0..=5, 5..=5), vec
![old(0..=4), both(5..=5)]);
980 vec
![new(2..=2), both(3..=6), new(7..=7)],
984 vec
![new(1..=2), both(3..=6), new(7..=8)],
990 vec
![old(2..=2), both(3..=6), old(7..=7)],
994 vec
![old(1..=2), both(3..=6), old(7..=8)],
1000 vec
![old(3..=5), both(6..=6), new(7..=7)],
1003 split(3..=6, 6..=8),
1004 vec
![old(3..=5), both(6..=6), new(7..=8)],
1007 split(5..=6, 6..=7),
1008 vec
![old(5..=5), both(6..=6), new(7..=7)],
1013 split(6..=7, 3..=6),
1014 vec
![new(3..=5), both(6..=6), old(7..=7)],
1017 split(6..=8, 3..=6),
1018 vec
![new(3..=5), both(6..=6), old(7..=8)],
1021 split(6..=7, 5..=6),
1022 vec
![new(5..=5), both(6..=6), old(7..=7)],
1027 split(3..=7, 5..=9),
1028 vec
![old(3..=4), both(5..=7), new(8..=9)],
1031 split(3..=5, 4..=6),
1032 vec
![old(3..=3), both(4..=5), new(6..=6)],
1037 split(5..=9, 3..=7),
1038 vec
![new(3..=4), both(5..=7), old(8..=9)],
1041 split(4..=6, 3..=5),
1042 vec
![new(3..=3), both(4..=5), old(6..=6)],
1046 // Arguably there should be more tests here, but in practice, this data
1047 // structure is well covered by the huge number of regex tests.