vendor/bstr/src/lib.rs

   1 /*!
   2 A byte string library.
   3
   4 Byte strings are just like standard Unicode strings with one very important
   5 difference: byte strings are only *conventionally* UTF-8 while Rust's standard
   6 Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
   7 byte strings is for handling arbitrary bytes that are mostly UTF-8.
   8
   9 # Overview
  10
  11 This crate provides two important traits that provide string oriented methods
  12 on `&[u8]` and `Vec<u8>` types:
  13
  14 * [`ByteSlice`](trait.ByteSlice.html) extends the `[u8]` type with additional
  15   string oriented methods.
  16 * [`ByteVec`](trait.ByteVec.html) extends the `Vec<u8>` type with additional
  17   string oriented methods.
  18
  19 Additionally, this crate provides two concrete byte string types that deref to
  20 `[u8]` and `Vec<u8>`. These are useful for storing byte string types, and come
  21 with convenient `std::fmt::Debug` implementations:
  22
  23 * [`BStr`](struct.BStr.html) is a byte string slice, analogous to `str`.
  24 * [`BString`](struct.BString.html) is an owned growable byte string buffer,
  25   analogous to `String`.
  26
  27 Additionally, the free function [`B`](fn.B.html) serves as a convenient short
  28 hand for writing byte string literals.
  29
  30 # Quick examples
  31
  32 Byte strings build on the existing APIs for `Vec<u8>` and `&[u8]`, with
  33 additional string oriented methods. Operations such as iterating over
  34 graphemes, searching for substrings, replacing substrings, trimming and case
  35 conversion are examples of things not provided on the standard library `&[u8]`
  36 APIs but are provided by this crate. For example, this code iterates over all
  37 occurrences of a substring:
  38
  39 ```
  40 use bstr::ByteSlice;
  41
  42 let s = b"foo bar foo foo quux foo";
  43
  44 let mut matches = vec![];
  45 for start in s.find_iter("foo") {
  46     matches.push(start);
  47 }
  48 assert_eq!(matches, [0, 8, 12, 21]);
  49 ```
  50
  51 Here's another example showing how to do a search and replace (and also showing
  52 use of the `B` function):
  53
  54 ```
  55 # #[cfg(feature = "alloc")] {
  56 use bstr::{B, ByteSlice};
  57
  58 let old = B("foo ☃☃☃ foo foo quux foo");
  59 let new = old.replace("foo", "hello");
  60 assert_eq!(new, B("hello ☃☃☃ hello hello quux hello"));
  61 # }
  62 ```
  63
  64 And here's an example that shows case conversion, even in the presence of
  65 invalid UTF-8:
  66
  67 ```
  68 # #[cfg(all(feature = "alloc", feature = "unicode"))] {
  69 use bstr::{ByteSlice, ByteVec};
  70
  71 let mut lower = Vec::from("hello β");
  72 lower[0] = b'\xFF';
  73 // lowercase β is uppercased to Β
  74 assert_eq!(lower.to_uppercase(), b"\xFFELLO \xCE\x92");
  75 # }
  76 ```
  77
  78 # Convenient debug representation
  79
  80 When working with byte strings, it is often useful to be able to print them
  81 as if they were byte strings and not sequences of integers. While this crate
  82 cannot affect the `std::fmt::Debug` implementations for `[u8]` and `Vec<u8>`,
  83 this crate does provide the `BStr` and `BString` types which have convenient
  84 `std::fmt::Debug` implementations.
  85
  86 For example, this
  87
  88 ```
  89 use bstr::ByteSlice;
  90
  91 let mut bytes = Vec::from("hello β");
  92 bytes[0] = b'\xFF';
  93
  94 println!("{:?}", bytes.as_bstr());
  95 ```
  96
  97 will output `"\xFFello β"`.
  98
  99 This example works because the
 100 [`ByteSlice::as_bstr`](trait.ByteSlice.html#method.as_bstr)
 101 method converts any `&[u8]` to a `&BStr`.
 102
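To make the idea of a string-like rendering concrete, here is a minimal
sketch that uses only the standard library. It is not this crate's `Debug`
implementation: `debug_bytes` is a hypothetical helper, and it deliberately
skips details such as escaping quotes and control characters. It prints valid
UTF-8 as text and each invalid byte as a `\xNN` escape.

```rust
use std::fmt::Write;
use std::str;

/// Hypothetical helper: renders bytes as a quoted string, keeping valid
/// UTF-8 as-is and writing every invalid byte as a `\xNN` escape.
fn debug_bytes(mut bytes: &[u8]) -> String {
    let mut out = String::from("\"");
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to` is known to be valid UTF-8.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).unwrap());
                // `error_len` is `None` when the input ends mid-sequence, in
                // which case all remaining bytes are treated as invalid.
                let bad = err.error_len().unwrap_or(rest.len());
                for b in &rest[..bad] {
                    write!(out, "\\x{:02X}", b).unwrap();
                }
                bytes = &rest[bad..];
            }
        }
    }
    out.push('"');
    out
}
```

Under these assumptions, `debug_bytes(b"\xFFello \xCE\xB2")` returns the text
`"\xFFello β"`, quotes included.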
 103 # When should I use byte strings?
 104
 105 This library reflects my belief that UTF-8 by convention is a better trade
 106 off in some circumstances than guaranteed UTF-8.
 107
 108 The first time this idea hit me was in the implementation of Rust's regex
 109 engine. In particular, very little of the internal implementation cares at all
 110 about searching valid UTF-8 encoded strings. Indeed, internally, the
 111 implementation converts `&str` from the API to `&[u8]` fairly quickly and
 112 just deals with raw bytes. UTF-8 match boundaries are then guaranteed by the
 113 finite state machine itself rather than any specific string type. This makes it
 114 possible to not only run regexes on `&str` values, but also on `&[u8]` values.
 115
 116 Why would you ever want to run a regex on a `&[u8]` though? Well, `&[u8]` is
 117 the fundamental way in which one reads data from all sorts of streams, via the
 118 standard library's [`Read`](https://doc.rust-lang.org/std/io/trait.Read.html)
 119 trait. In particular, there is no platform independent way to determine whether
 120 what you're reading from is some binary file or a human readable text file.
 121 Therefore, if you're writing a program to search files, you probably need to
 122 deal with `&[u8]` directly unless you're okay with first converting it to a
 123 `&str` and dropping any bytes that aren't valid UTF-8. (Or otherwise determine
 124 the encoding---which is often impractical---and perform a transcoding step.)
 125 Often, the simplest and most robust way to approach this is to simply treat the
 126 contents of a file as if it were mostly valid UTF-8 and pass through invalid
 127 UTF-8 untouched. This may not be the most correct approach though!
 128
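As an illustration of passing invalid UTF-8 through untouched, the following
sketch reads everything from a `Read` implementation and counts occurrences of
a byte needle without ever converting to `&str`. It uses only the standard
library and a naive scan; `count_matches` is a hypothetical name, and with this
crate one would more likely reach for `ByteSlice::find_iter`.

```rust
use std::io::{self, Read};

/// Hypothetical helper: counts non-overlapping occurrences of `needle` in
/// everything read from `rdr`. No UTF-8 validation is performed.
fn count_matches<R: Read>(mut rdr: R, needle: &[u8]) -> io::Result<usize> {
    let mut haystack = Vec::new();
    rdr.read_to_end(&mut haystack)?;
    if needle.is_empty() {
        return Ok(0);
    }
    let mut count = 0;
    let mut pos = 0;
    // Naive quadratic scan, kept simple for illustration only.
    while pos + needle.len() <= haystack.len() {
        if &haystack[pos..pos + needle.len()] == needle {
            count += 1;
            pos += needle.len();
        } else {
            pos += 1;
        }
    }
    Ok(count)
}
```

Because `&[u8]` implements `Read`, a call such as
`count_matches(&b"foo \xFF bar foo"[..], b"foo")` works even though the
haystack contains a byte that can never appear in valid UTF-8.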
 129 One case in particular exacerbates these issues, and that's memory mapping
 130 a file. When you memory map a file, that file may be gigabytes big, but all
 131 you get is a `&[u8]`. Converting that to a `&str` all in one go is generally
 132 not a good idea because of the costs associated with doing so, and also
 133 because it generally causes one to do two passes over the data instead of
 134 one, which is quite undesirable. It is of course usually possible to do it in an
 135 incremental way by only parsing chunks at a time, but this is often complex to
 136 do or impractical. For example, many regex engines only accept one contiguous
 137 sequence of bytes at a time with no way to perform incremental matching.
 138
 139 # `bstr` in public APIs
 140
 141 This library is past version `1` and is expected to remain at version `1` for
 142 the foreseeable future. Therefore, it is encouraged to put types from `bstr`
 143 (like `BStr` and `BString`) in your public API if that makes sense for your
 144 crate.
 145
 146 With that said, in general, it should be possible to avoid putting anything
 147 in this crate into your public APIs. Namely, you should never need to use the
 148 `ByteSlice` or `ByteVec` traits as bounds on public APIs, since their only
 149 purpose is to extend the methods on the concrete types `[u8]` and `Vec<u8>`,
 150 respectively. Similarly, it should not be necessary to put either the `BStr` or
 151 `BString` types into public APIs. If you want to use them internally, then they
 152 can be converted to/from `[u8]`/`Vec<u8>` as needed. The conversions are free.
 153
 154 So while it shouldn't ever be 100% necessary to make `bstr` a public
 155 dependency, there may be cases where it is convenient to do so. This is an
 156 explicitly supported use case of `bstr`, and as such, major version releases
 157 should be exceptionally rare.
 158
 159
 160 # Differences with standard strings
 161
 162 The primary difference between `[u8]` and `str` is that the former is
 163 conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
 164 "conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
 165 a valid UTF-8 sequence, but operations defined on the type in this crate are
 166 generally most useful on valid UTF-8 sequences. For example, iterating over
 167 Unicode codepoints or grapheme clusters is an operation that is only defined
 168 on valid UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode
 169 replacement codepoint is substituted. Thus, a byte string that is not UTF-8 at
 170 all is of limited utility when using this crate.
 171
 172 However, not all operations on byte strings are specifically Unicode aware. For
 173 example, substring search has no specific Unicode semantics ascribed to it. It
 174 works just as well for byte strings that are completely valid UTF-8 as for byte
 175 strings that contain no valid UTF-8 at all. Similarly for replacements and
 176 various other operations that do not need any Unicode specific tailoring.
 177
 178 Aside from the difference in how UTF-8 is handled, the APIs between `[u8]` and
 179 `str` (and `Vec<u8>` and `String`) are intentionally very similar, including
 180 maintaining the same behavior for corner cases in things like substring
 181 splitting. There are, however, some differences:
 182
 183 * Substring search is not done with `matches`, but instead, `find_iter`.
 184   In general, this crate does not define any generic
 185   [`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
 186   infrastructure, and instead prefers adding new methods for different
 187   argument types. For example, `matches` can search by a `char` or a `&str`,
 188   whereas `find_iter` can only search by a byte string. `find_char` can be
 189   used for searching by a `char`.
 190 * Since `SliceConcatExt` in the standard library is unstable, it is not
 191   possible to reuse that to implement `join` and `concat` methods. Instead,
 192   [`join`](fn.join.html) and [`concat`](fn.concat.html) are provided as free
 193   functions that perform a similar task.
 194 * This library bundles in a few more Unicode operations, such as grapheme,
 195   word and sentence iterators. More operations, such as normalization and
 196   case folding, may be provided in the future.
 197 * Some `String`/`str` APIs will panic if a particular index was not on a valid
 198   UTF-8 code unit sequence boundary. Conversely, no such checking is performed
 199   in this crate, as is consistent with treating byte strings as a sequence of
 200   bytes. This means callers are responsible for maintaining a UTF-8 invariant
 201   if that's important.
 202 * Some routines provided by this crate, such as `starts_with_str`, have a
 203   `_str` suffix to differentiate them from similar routines already defined
 204   on the `[u8]` type. The difference is that `starts_with` requires its
 205   parameter to be a `&[u8]`, whereas `starts_with_str` permits its parameter
 206   to be anything that implements `AsRef<[u8]>`, which is more flexible. This
 207   means you can write `bytes.starts_with_str("☃")` instead of
 208   `bytes.starts_with("☃".as_bytes())`.
 209
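The point about index checking can be seen with the standard library alone.
In this sketch, `str_prefix` and `byte_prefix` are hypothetical helpers: the
first refuses an offset that falls inside a multi-byte codepoint (direct
indexing such as `&s[..1]` would panic there), while the second only checks
that the offset is in bounds.

```rust
/// Hypothetical helper: returns `None` if `len` is out of bounds or does not
/// fall on a UTF-8 boundary of `s`.
fn str_prefix(s: &str, len: usize) -> Option<&str> {
    s.get(..len)
}

/// Hypothetical helper: returns `None` only if `len` is out of bounds. No
/// UTF-8 boundary check is made, so the result may split a codepoint.
fn byte_prefix(bytes: &[u8], len: usize) -> Option<&[u8]> {
    bytes.get(..len)
}
```

For the input `"☃!"`, whose first codepoint is encoded as the three bytes
`\xE2\x98\x83`, `str_prefix("☃!", 1)` is `None`, whereas
`byte_prefix("☃!".as_bytes(), 1)` yields the single byte `\xE2`.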
 210 Otherwise, you should find most of the APIs between this crate and the standard
 211 library string APIs to be very similar, if not identical.
 212
 213 # Handling of invalid UTF-8
 214
 215 Since byte strings are only *conventionally* UTF-8, there is no guarantee
 216 that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
 217 byte string to contain arbitrary bytes. However, since this library defines
 218 a *string* type, it provides many operations specified by Unicode. These
 219 operations are typically only defined over codepoints, and thus have no real
 220 meaning on bytes that are invalid UTF-8 because they do not map to a particular
 221 codepoint.
 222
 223 For this reason, whenever operations defined only on codepoints are used, this
 224 library will automatically convert invalid UTF-8 to the Unicode replacement
 225 codepoint, `U+FFFD`, which looks like this: `�`. For example, an
 226 [iterator over codepoints](struct.Chars.html) will yield a Unicode
 227 replacement codepoint whenever it comes across bytes that are not valid UTF-8:
 228
 229 ```
 230 use bstr::ByteSlice;
 231
 232 let bs = b"a\xFF\xFFz";
 233 let chars: Vec<char> = bs.chars().collect();
 234 assert_eq!(vec!['a', '\u{FFFD}', '\u{FFFD}', 'z'], chars);
 235 ```
 236
 237 There are a few ways in which invalid bytes can be substituted with a Unicode
 238 replacement codepoint. One way, not used by this crate, is to replace every
 239 individual invalid byte with a single replacement codepoint. In contrast, the
 240 approach this crate uses is called the "substitution of maximal subparts," as
 241 specified by the Unicode Standard (Chapter 3, Section 9). (This approach is
 242 also used by [W3C's Encoding Standard](https://www.w3.org/TR/encoding/).) In
 243 this strategy, a replacement codepoint is inserted whenever a byte is found
 244 that cannot possibly lead to a valid UTF-8 code unit sequence. If there were
 245 previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
 246 sequence, then all of those bytes (up to 3) are substituted with a single
 247 replacement codepoint. For example:
 248
 249 ```
 250 use bstr::ByteSlice;
 251
 252 let bs = b"a\xF0\x9F\x87z";
 253 let chars: Vec<char> = bs.chars().collect();
 254 // The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
 255 // on their own are invalid. Only one replacement codepoint is substituted,
 256 // which demonstrates the "substitution of maximal subparts" strategy.
 257 assert_eq!(vec!['a', '\u{FFFD}', 'z'], chars);
 258 ```
 259
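To contrast the two strategies, here is a sketch built only on the standard
library. It assumes that `String::from_utf8_lossy` follows the same "maximal
subparts" approach, which matches its documented behavior. `lossy_per_byte` is
a hypothetical helper implementing the alternative that this crate does not
use, namely one replacement codepoint per invalid byte.

```rust
use std::str;

/// One U+FFFD per maximal invalid subpart, via the standard library.
fn lossy_maximal_subparts(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Hypothetical alternative: one U+FFFD for every invalid byte.
fn lossy_per_byte(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Copy the valid prefix, then replace each bad byte in turn.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).unwrap());
                // `None` means the input ended in the middle of a sequence.
                let bad = err.error_len().unwrap_or(rest.len());
                for _ in 0..bad {
                    out.push('\u{FFFD}');
                }
                bytes = &rest[bad..];
            }
        }
    }
}
```

For the input `b"a\xF0\x9F\x87z"` from the example above, the first function
produces a single replacement codepoint between `a` and `z`, while the second
produces three.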
 260 If you do need to access the raw bytes for some reason in an iterator like
 261 `Chars`, then you should use the iterator's "indices" variant, which gives
 262 the byte offsets containing the invalid UTF-8 bytes that were substituted with
 263 the replacement codepoint. For example:
 264
 265 ```
 266 use bstr::{B, ByteSlice};
 267
 268 let bs = b"a\xE2\x98z";
 269 let chars: Vec<(usize, usize, char)> = bs.char_indices().collect();
 270 // Even though the replacement codepoint is encoded as 3 bytes itself, the
 271 // byte range given here is only two bytes, corresponding to the original
 272 // raw bytes.
 273 assert_eq!(vec![(0, 1, 'a'), (1, 3, '\u{FFFD}'), (3, 4, 'z')], chars);
 274
 275 // Thus, getting the original raw bytes is as simple as slicing the original
 276 // byte string:
 277 let chars: Vec<&[u8]> = bs.char_indices().map(|(s, e, _)| &bs[s..e]).collect();
 278 assert_eq!(vec![B("a"), B(b"\xE2\x98"), B("z")], chars);
 279 ```
 280
 281 # File paths and OS strings
 282
 283 One of the premier features of Rust's standard library is how it handles file
 284 paths. In particular, it makes it very hard to write incorrect code while
 285 simultaneously providing a correct cross platform abstraction for manipulating
 286 file paths. The key challenge that one faces with file paths across platforms
 287 is derived from the following observations:
 288
 289 * On most Unix-like systems, file paths are an arbitrary sequence of bytes.
 290 * On Windows, file paths are an arbitrary sequence of 16-bit integers.
 291
 292 (In both cases, certain sequences aren't allowed. For example a `NUL` byte is
 293 not allowed in either case. But we can ignore this for the purposes of this
 294 section.)
 295
 296 Byte strings, like the ones provided in this crate, line up really well with
 297 file paths on Unix like systems, which are themselves just arbitrary sequences
 298 of bytes. It turns out that if you treat them as "mostly UTF-8," then things
 299 work out pretty well. In contrast, byte strings _don't_ really work
 300 that well on Windows because it's not possible to correctly roundtrip file
 301 paths between 16-bit integers and something that looks like UTF-8 _without_
 302 explicitly defining an encoding to do this for you, which is anathema to byte
 303 strings, which are just bytes.
 304
 305 Rust's standard library elegantly solves this problem by specifying an
 306 internal encoding for file paths that's only used on Windows called
 307 [WTF-8](https://simonsapin.github.io/wtf-8/). Its key properties are that it
 308 permits losslessly roundtripping file paths on Windows by extending UTF-8 to
 309 support an encoding of surrogate codepoints, while simultaneously supporting
 310 zero-cost conversion from Rust's Unicode strings to file paths. (Since UTF-8 is
 311 a proper subset of WTF-8.)
 312
 313 The fundamental point at which the above strategy fails is when you want to
 314 treat file paths as things that look like strings in a zero cost way. In most
 315 cases, this is actually the wrong thing to do, but some cases call for it,
 316 for example, glob or regex matching on file paths. This is because WTF-8 is
 317 treated as an internal implementation detail, and there is no way to access
 318 those bytes via a public API. Therefore, such consumers are limited in what
 319 they can do:
 320
 321 1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
 322    by accessing their underlying 16-bit integer representation. Unfortunately,
 323    this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
 324    not clear this is a good thing to do, since WTF-8 should ideally remain an
 325    internal implementation detail. This is roughly the approach taken by the
 326    [`os_str_bytes`](https://crates.io/crates/os_str_bytes) crate.
 327 2. One could instead declare that they will not handle paths on Windows that
 328    are not valid UTF-16, and return an error when one is encountered.
 329 3. Like (2), but instead of returning an error, lossily decode the file path
 330    on Windows that isn't valid UTF-16 into UTF-16 by replacing invalid code units
 331    with the Unicode replacement codepoint.
 332
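Options (2) and (3) can be sketched with the standard library's UTF-16
decoders. This is not how this crate implements its Windows conversions; the
two hypothetical helpers below only show the difference between rejecting and
lossily decoding a sequence of 16-bit integers that is not valid UTF-16.

```rust
use std::string::FromUtf16Error;

/// Option (2): report an error when the input is not valid UTF-16.
fn decode_strict(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

/// Option (3): replace each invalid code unit with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Given `[0x0066, 0xD800, 0x006F]`, which is `f`, an unpaired surrogate and `o`,
`decode_strict` returns an error while `decode_lossy` returns `"f\u{FFFD}o"`.
The lossy result can no longer be converted back to the original sequence.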
 333 While this library may provide facilities for (1) in the future, currently,
 334 this library only provides facilities for (2) and (3). In particular, a suite
 335 of conversion functions are provided that permit converting between byte
 336 strings, OS strings and file paths. For owned byte strings, they are:
 337
 338 * [`ByteVec::from_os_string`](trait.ByteVec.html#method.from_os_string)
 339 * [`ByteVec::from_os_str_lossy`](trait.ByteVec.html#method.from_os_str_lossy)
 340 * [`ByteVec::from_path_buf`](trait.ByteVec.html#method.from_path_buf)
 341 * [`ByteVec::from_path_lossy`](trait.ByteVec.html#method.from_path_lossy)
 342 * [`ByteVec::into_os_string`](trait.ByteVec.html#method.into_os_string)
 343 * [`ByteVec::into_os_string_lossy`](trait.ByteVec.html#method.into_os_string_lossy)
 344 * [`ByteVec::into_path_buf`](trait.ByteVec.html#method.into_path_buf)
 345 * [`ByteVec::into_path_buf_lossy`](trait.ByteVec.html#method.into_path_buf_lossy)
 346
 347 For byte string slices, they are:
 348
 349 * [`ByteSlice::from_os_str`](trait.ByteSlice.html#method.from_os_str)
 350 * [`ByteSlice::from_path`](trait.ByteSlice.html#method.from_path)
 351 * [`ByteSlice::to_os_str`](trait.ByteSlice.html#method.to_os_str)
 352 * [`ByteSlice::to_os_str_lossy`](trait.ByteSlice.html#method.to_os_str_lossy)
 353 * [`ByteSlice::to_path`](trait.ByteSlice.html#method.to_path)
 354 * [`ByteSlice::to_path_lossy`](trait.ByteSlice.html#method.to_path_lossy)
 355
 356 On Unix, all of these conversions are rigorously zero cost, which gives one
 357 a way to ergonomically deal with raw file paths exactly as they are using
 358 normal string-related functions. On Windows, these conversion routines perform
 359 a UTF-8 check and either return an error or lossily decode the file path
 360 into valid UTF-8, depending on which function you use. This means that you
 361 cannot roundtrip all file paths on Windows correctly using these conversion
 362 routines. However, this may be an acceptable downside since such file paths
 363 are exceptionally rare. Moreover, roundtripping isn't always necessary, for
 364 example, if all you're doing is filtering based on file paths.
 365
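The platform split described above can be approximated with the standard
library alone. In this sketch, `path_bytes` and `bytes_to_path` are
hypothetical helpers, not this crate's API. On Unix they reinterpret the bytes
through `OsStrExt` without copying or validating; elsewhere they fall back to
a UTF-8 check and return `None` on failure.

```rust
use std::path::Path;

/// Hypothetical helper: on Unix, exposes the raw bytes of a path as-is.
#[cfg(unix)]
fn path_bytes(path: &Path) -> Option<&[u8]> {
    use std::os::unix::ffi::OsStrExt;
    Some(path.as_os_str().as_bytes())
}

/// Hypothetical helper: elsewhere, succeeds only if the path is valid UTF-8.
#[cfg(not(unix))]
fn path_bytes(path: &Path) -> Option<&[u8]> {
    path.to_str().map(|s| s.as_bytes())
}

/// Hypothetical helper: on Unix, any byte sequence can be viewed as a path.
#[cfg(unix)]
fn bytes_to_path(bytes: &[u8]) -> Option<&Path> {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    Some(Path::new(OsStr::from_bytes(bytes)))
}

/// Hypothetical helper: elsewhere, the bytes must first be valid UTF-8.
#[cfg(not(unix))]
fn bytes_to_path(bytes: &[u8]) -> Option<&Path> {
    std::str::from_utf8(bytes).ok().map(|s| Path::new(s))
}
```

On Unix, a path built from `b"a\xFFb"` roundtrips through these helpers byte
for byte, while the fallback implementations reject it.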
 366 The reason why using byte strings for this is potentially superior to the
 367 standard library's approach is that a lot of Rust code is already lossily
 368 converting file paths to Rust's Unicode strings, which are required to be valid
 369 UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are
 370 not terribly uncommon. If you instead use byte strings, then you're guaranteed
 371 to write correct code for Unix, at the cost of getting a corner case wrong on
 372 Windows.
 373
 374 # Cargo features
 375
 376 This crate comes with a few features that control standard library, serde
 377 and Unicode support.
 378
 379 * `std` - **Enabled** by default. This provides APIs that require the standard
 380   library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
 381   the `alloc` feature and any other relevant `std` features for dependencies.
 382 * `alloc` - **Enabled** by default. This provides APIs that require allocations
 383   via the `alloc` crate, such as `Vec<u8>`.
 384 * `unicode` - **Enabled** by default. This provides APIs that require sizable
 385   Unicode data compiled into the binary. This includes, but is not limited to,
 386   grapheme/word/sentence segmenters. When this is disabled, basic support such
 387   as UTF-8 decoding is still included. Note that currently, enabling this
 388   feature also requires enabling the `std` feature. It is expected that this
 389   limitation will be lifted at some point.
 390 * `serde` - Enables implementations of serde traits for `BStr`, and also
 391   `BString` when `alloc` is enabled.
 392 */
 393
 394 #![cfg_attr(not(any(feature = "std", test)), no_std)]
 395 #![cfg_attr(docsrs, feature(doc_auto_cfg))]
 396
 397 // Why do we do this? Well, in order for us to use once_cell's 'Lazy' type to
 398 // load DFAs, it requires enabling its 'std' feature. Yet, there is really
 399 // nothing about our 'unicode' feature that requires 'std'. We could declare
 400 // that 'unicode = [std, ...]', which would be fine, but once regex-automata
 401 // 0.3 is a thing, I believe we can drop once_cell altogether and thus drop
 402 // the need for 'std' to be enabled when 'unicode' is enabled. But if we make
 403 // 'unicode' also enable 'std', then it would be a breaking change to remove
 404 // 'std' from that list.
 405 //
 406 // So, for right now, we force folks to explicitly say they want 'std' if they
 407 // want 'unicode'. In the future, we should be able to relax this.
 408 #[cfg(all(feature = "unicode", not(feature = "std")))]
 409 compile_error!("enabling 'unicode' requires enabling 'std'");
 410
 411 #[cfg(feature = "alloc")]
 412 extern crate alloc;
 413
 414 pub use crate::bstr::BStr;
 415 #[cfg(feature = "alloc")]
 416 pub use crate::bstring::BString;
 417 #[cfg(feature = "unicode")]
 418 pub use crate::ext_slice::Fields;
 419 pub use crate::ext_slice::{
 420     ByteSlice, Bytes, FieldsWith, Find, FindReverse, Finder, FinderReverse,
 421     Lines, LinesWithTerminator, Split, SplitN, SplitNReverse, SplitReverse, B,
 422 };
 423 #[cfg(feature = "alloc")]
 424 pub use crate::ext_vec::{concat, join, ByteVec, DrainBytes, FromUtf8Error};
 425 #[cfg(feature = "unicode")]
 426 pub use crate::unicode::{
 427     GraphemeIndices, Graphemes, SentenceIndices, Sentences, WordIndices,
 428     Words, WordsWithBreakIndices, WordsWithBreaks,
 429 };
 430 pub use crate::utf8::{
 431     decode as decode_utf8, decode_last as decode_last_utf8, CharIndices,
 432     Chars, Utf8Chunk, Utf8Chunks, Utf8Error,
 433 };
 434
 435 mod ascii;
 436 mod bstr;
 437 #[cfg(feature = "alloc")]
 438 mod bstring;
 439 mod byteset;
 440 mod ext_slice;
 441 #[cfg(feature = "alloc")]
 442 mod ext_vec;
 443 mod impls;
 444 #[cfg(feature = "std")]
 445 pub mod io;
 446 #[cfg(all(test, feature = "std"))]
 447 mod tests;
 448 #[cfg(feature = "unicode")]
 449 mod unicode;
 450 mod utf8;
 451
 452 #[cfg(all(test, feature = "std"))]
 453 mod apitests {
 454     use crate::{
 455         bstr::BStr,
 456         bstring::BString,
 457         ext_slice::{Finder, FinderReverse},
 458     };
 459
 460     #[test]
 461     fn oibits() {
 462         use std::panic::{RefUnwindSafe, UnwindSafe};
 463
 464         fn assert_send<T: Send>() {}
 465         fn assert_sync<T: Sync>() {}
 466         fn assert_unwind_safe<T: RefUnwindSafe + UnwindSafe>() {}
 467
 468         assert_send::<&BStr>();
 469         assert_sync::<&BStr>();
 470         assert_unwind_safe::<&BStr>();
 471         assert_send::<BString>();
 472         assert_sync::<BString>();
 473         assert_unwind_safe::<BString>();
 474
 475         assert_send::<Finder<'_>>();
 476         assert_sync::<Finder<'_>>();
 477         assert_unwind_safe::<Finder<'_>>();
 478         assert_send::<FinderReverse<'_>>();
 479         assert_sync::<FinderReverse<'_>>();
 480         assert_unwind_safe::<FinderReverse<'_>>();
 481     }
 482 }