bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that byte
strings are not required to be valid UTF-8, but may be fully or partially valid
UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](http://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.0/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.
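
As a rough illustration of that short story, the sketch below uses only the standard library (not `bstr` itself). The `contains_bytes` helper is hypothetical and deliberately naive; it is there only to show that data which fails UTF-8 validation can still be searched as bytes.

```rust
use std::str;

/// Hypothetical helper: returns true if `needle` occurs in `haystack`.
/// This is a naive scan, not the search algorithm used by any crate.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // Mostly text, but 0xFF can never appear in valid UTF-8.
    let line: &[u8] = b"name=caf\xFF.txt";

    // Conversion to `&str` fails, so `str` methods are not directly usable.
    assert!(str::from_utf8(line).is_err());

    // Treated as a byte string, the same data can still be searched.
    assert!(contains_bytes(line, b".txt"));
    assert!(!contains_bytes(line, b".rs"));
}
```

File names, log lines and other data of uncertain encoding tend to have this shape: text by convention, but with no guarantee of validity.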


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues on to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // One buffer is reused for every line, so allocation is amortized.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Byte offset just past the last of (at most) the first 10 graphemes.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default. Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.
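
As a sketch of how these features might be selected, the fragment below turns off the default feature set and re-enables only `std`, which drops the Unicode data tables. It assumes the feature names listed above and Cargo's standard `default-features` key; adjust the feature list to whatever your build needs.

```toml
[dependencies]
# Assumption: `std` without `unicode`; segmenters would be unavailable.
bstr = { version = "0.2", default-features = false, features = ["std"] }
```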


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.28.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be mature.
The main differences from the standard library are in how the various substring
search routines work. The standard library provides generic infrastructure for
supporting different types of searches with a single method, whereas this
library prefers to define new methods for each type of search and drop the
generic infrastructure.
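
To make the standard library side of that contrast concrete, here is a small sketch in which the single method `str::find` accepts four different kinds of pattern. The `find_demo` function is a hypothetical name used only for illustration. In `bstr`, each kind of search is instead its own method on the byte string (for example a dedicated byte search or char search; see the API documentation for the exact set).

```rust
/// Hypothetical demo: four kinds of search through one generic method.
fn find_demo(haystack: &str) -> [Option<usize>; 4] {
    [
        haystack.find("bar"),               // substring
        haystack.find('z'),                 // single char
        haystack.find(char::is_whitespace), // predicate on chars
        haystack.find(&['r', 'z'][..]),     // any char from a set
    ]
}

fn main() {
    let hits = find_demo("foo bar baz");
    assert_eq!(hits, [Some(4), Some(10), Some(3), Some(6)]);
}
```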

Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](http://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate is an experiment in bringing lots of
related APIs together into a single crate while simultaneously attempting to
keep the total number of dependencies low. Indeed, every dependency of `bstr`,
except for `memchr`, is optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway/0.2.0/twoway/) crate can be used for
  fast substring searching on `&[u8]`.
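
For the first item in the list above, a minimal sketch of incremental lossy decoding with `Utf8Error` might look like the following. It uses only the standard library; `decode_lossy` is a hypothetical helper written for this illustration and is not how `bstr` implements its own decoding.

```rust
use std::str;

/// Hypothetical helper: decodes `bytes`, substituting U+FFFD for each
/// invalid sequence reported by `Utf8Error`.
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just reported as valid, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    assert_eq!(decode_lossy(b"foo\xFFbar"), "foo\u{FFFD}bar");
    assert_eq!(decode_lossy(b"a\xE2\x98"), "a\u{FFFD}");
}
```

Even this small loop has to handle the valid prefix, the invalid sequence and the truncated-input case separately, which is the kind of glue code the rest of this section is about.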

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.
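
To give a feel for that glue code, here is a sketch of search and replace over bytes using only the standard library. The `replace_bytes` helper is hypothetical, and its naive scan stands in for a fast substring search such as the one `twoway` provides; it is not the implementation used by `bstr`.

```rust
/// Hypothetical helper: replaces every non-overlapping occurrence of
/// `needle` in `haystack` with `replacement`, using a naive scan.
fn replace_bytes(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle has no obvious meaning here; return a plain copy.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut i = 0;
    while i < haystack.len() {
        if haystack[i..].starts_with(needle) {
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            out.push(haystack[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // The invalid byte 0xFF is carried through untouched.
    let replaced = replace_bytes(b"foo\xFFfoo", b"foo", b"quux");
    assert_eq!(replaced, b"quux\xFFquux".to_vec());
}
```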

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. It's not clear to me whether
this experiment will be successful or not, but it is definitely a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative, but `twoway` is
not, where we instead prefer to implement our own substring search. In service
of this philosophy, currently, the only required dependency of `bstr` is
`memchr`.


### License

This project is licensed under either of

* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
  http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or
  http://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](http://www.unicode.org/copyright.html#License)), although
this data is only used in tests.