bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that they are
not required to be valid UTF-8, but may be fully or partially valid UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](http://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.0/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.
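
As a rough illustration of that point, here is a minimal sketch that uses only the standard library (not `bstr` itself). The file name and the `extension` helper are hypothetical, chosen only for this example. It shows a byte sequence that cannot become a `&str`, that is altered by lossy conversion, and that can still be processed directly as bytes:

```rust
use std::str;

/// Hypothetical helper: returns the bytes after the last `.` in `name`,
/// without requiring `name` to be valid UTF-8.
fn extension(name: &[u8]) -> Option<&[u8]> {
    name.iter().rposition(|&b| b == b'.').map(|i| &name[i + 1..])
}

fn main() {
    // "café.txt" encoded as Latin-1: the byte 0xE9 is not valid UTF-8 here.
    let raw: &[u8] = b"caf\xE9.txt";

    // Requiring valid UTF-8 rejects the input outright.
    assert!(str::from_utf8(raw).is_err());

    // Lossy conversion succeeds, but the original bytes are not preserved.
    let lossy = String::from_utf8_lossy(raw);
    assert_ne!(lossy.as_bytes(), raw);

    // Working on the bytes directly needs no validation at all.
    assert_eq!(extension(raw), Some(&b"txt"[..]));
}
```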


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde,
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default. Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.
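
As a sketch of how these features might be selected, the following `Cargo.toml` fragment keeps `std` but opts out of `unicode`. This particular combination is only an illustration, not a recommendation:

```toml
[dependencies]
# Turn off the default feature set (`std` and `unicode`), then re-enable `std`.
# Unicode-data-backed APIs such as the segmenters would not be available.
bstr = { version = "0.2", default-features = false, features = ["std"] }
```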


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.28.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be mature.
The main differences from the standard library are in how the various substring
search routines work. The standard library provides generic infrastructure for
supporting different types of searches with a single method, whereas this
library prefers to define new methods for each type of search and drop the
generic infrastructure.
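
To make the contrast concrete, here is a small standard-library-only sketch of the "single generic method" approach: `str::find` accepts a substring, a `char`, or a closure. The `find_three_ways` helper is hypothetical and exists only for this example. In `bstr`, each kind of search instead gets its own method (see the `ByteSlice` documentation for methods such as `find`, `find_byte` and `find_char`).

```rust
/// Hypothetical helper: runs three kinds of search through the one generic
/// `str::find` method and returns the byte offset found by each.
fn find_three_ways(haystack: &str) -> (Option<usize>, Option<usize>, Option<usize>) {
    (
        // Substring pattern.
        haystack.find("bar"),
        // Single `char` pattern.
        haystack.find('b'),
        // Closure pattern.
        haystack.find(|c: char| c.is_ascii_digit()),
    )
}

fn main() {
    let (by_str, by_char, by_closure) = find_three_ways("foo bar 42");
    println!("{:?} {:?} {:?}", by_str, by_char, by_closure);
}
```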

Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](http://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate is an experiment in bringing lots of
related APIs together into a single crate while simultaneously attempting to
keep the total number of dependencies low. Indeed, every dependency of `bstr`,
except for `memchr`, is optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway/0.2.0/twoway/) crate can be used for
  fast substring searching on `&[u8]`.
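
For instance, the first point can be sketched with the standard library alone. The `lossy_decode` function below is a hypothetical name for this example. It uses `Utf8Error::valid_up_to` and `Utf8Error::error_len` to decode the valid parts of a `&[u8]` incrementally and substitute U+FFFD for each invalid sequence. It is a minimal sketch, not how `bstr` implements decoding:

```rust
use std::str;

/// Hypothetical helper: decodes `bytes` lossily, replacing each invalid
/// UTF-8 sequence with the Unicode replacement codepoint (U+FFFD).
fn lossy_decode(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Bytes before `valid_up_to` are known to be valid UTF-8.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("validated prefix"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    println!("{}", lossy_decode(b"foo\xFFbar"));
}
```

Even this small loop is glue code that every caller would otherwise have to write and test.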

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.
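
As a rough idea of what such glue code looks like, here is a sketch of search and replace on raw bytes. It assumes a deliberately naive `windows`-based scan in place of `twoway` (or `bstr`'s own search), and the `find` and `replace_all` helpers are hypothetical names for this example:

```rust
/// Naive substring search: offset of the first occurrence of `needle`.
/// This stands in for a fast searcher and runs in O(n * m) time.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, so handle the empty needle explicitly.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Replaces every non-overlapping occurrence of `needle` with `with`.
/// An empty needle leaves the input unchanged.
fn replace_all(haystack: &[u8], needle: &[u8], with: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(haystack.len());
    if needle.is_empty() {
        out.extend_from_slice(haystack);
        return out;
    }
    let mut rest = haystack;
    while let Some(i) = find(rest, needle) {
        // Copy the bytes before the match, then the replacement.
        out.extend_from_slice(&rest[..i]);
        out.extend_from_slice(with);
        rest = &rest[i + needle.len()..];
    }
    out.extend_from_slice(rest);
    out
}

fn main() {
    // Works on arbitrary bytes, including invalid UTF-8 such as 0xFF.
    let replaced = replace_all(b"a\xFFb\xFFc", b"\xFF", b"?");
    assert_eq!(replaced, b"a?b?c".to_vec());
}
```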

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. It's not clear to me whether
this experiment will be successful or not, but it is definitely a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative, whereas `twoway`
is not a dependency at all: we instead prefer to implement our own substring
search. In service of this philosophy, currently, the only required dependency
of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   http://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](http://www.unicode.org/copyright.html#License)), although
this data is only used in tests.