## Storing UTF-8 Encoded Text with Strings

We talked about strings in Chapter 4, but we’ll look at them in more depth now.
New Rustaceans commonly get stuck on strings for a combination of three
reasons: Rust’s propensity for exposing possible errors, strings being a more
complicated data structure than many programmers give them credit for, and
UTF-8. These factors combine in a way that can seem difficult when you’re
coming from other programming languages.

We discuss strings in the context of collections because strings are
implemented as a collection of bytes, plus some methods to provide useful
functionality when those bytes are interpreted as text. In this section, we’ll
talk about the operations on `String` that every collection type has, such as
creating, updating, and reading. We’ll also discuss the ways in which `String`
is different from the other collections, namely how indexing into a `String` is
complicated by the differences between how people and computers interpret
UTF-8 data.

### What Is a String?

We’ll first define what we mean by the term *string*. Rust has only one string
type in the core language, which is the string slice `str` that is usually seen
in its borrowed form `&str`. In Chapter 4, we talked about *string slices*,
which are references to some UTF-8 encoded string data stored elsewhere. String
literals, for example, are stored in the program’s binary and are therefore
string slices.

The `String` type, which is provided by Rust’s standard library rather than
coded into the core language, is a growable, mutable, owned, UTF-8 encoded
string type. When Rustaceans refer to “strings” in Rust, they might be
referring to either the `String` or the string slice `&str` types, not just one
of those types. Although this section is largely about `String`, both types are
used heavily in Rust’s standard library, and both `String` and string slices
are UTF-8 encoded.

### Creating a New String

Many of the same operations available with `Vec<T>` are available with `String`
as well, because `String` is actually implemented as a wrapper around a vector
of bytes with some extra guarantees, restrictions, and capabilities. An example
of a function that works the same way with `Vec<T>` and `String` is the `new`
function to create an instance, shown in Listing 8-11.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-11/src/main.rs:here}}
```

<span class="caption">Listing 8-11: Creating a new, empty `String`</span>

This line creates a new empty string called `s`, which we can then load data
into. Often, we’ll have some initial data that we want to start the string
with. For that, we use the `to_string` method, which is available on any type
that implements the `Display` trait, as string literals do. Listing 8-12 shows
two examples.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-12/src/main.rs:here}}
```

<span class="caption">Listing 8-12: Using the `to_string` method to create a
`String` from a string literal</span>

This code creates a string containing `initial contents`.

We can also use the function `String::from` to create a `String` from a string
literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12
that uses `to_string`.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-13/src/main.rs:here}}
```

<span class="caption">Listing 8-13: Using the `String::from` function to create
a `String` from a string literal</span>

Because strings are used for so many things, we can use many different generic
APIs for strings, providing us with a lot of options. Some of them can seem
redundant, but they all have their place! In this case, `String::from` and
`to_string` do the same thing, so which you choose is a matter of style and
readability.

Remember that strings are UTF-8 encoded, so we can include any properly encoded
data in them, as shown in Listing 8-14.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:here}}
```

<span class="caption">Listing 8-14: Storing greetings in different languages in
strings</span>

All of these are valid `String` values.

### Updating a String

A `String` can grow in size and its contents can change, just like the contents
of a `Vec<T>`, if you push more data into it. In addition, you can conveniently
use the `+` operator or the `format!` macro to concatenate `String` values.

#### Appending to a String with `push_str` and `push`

We can grow a `String` by using the `push_str` method to append a string slice,
as shown in Listing 8-15.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-15/src/main.rs:here}}
```

<span class="caption">Listing 8-15: Appending a string slice to a `String`
using the `push_str` method</span>

After these two lines, `s` will contain `foobar`. The `push_str` method takes a
string slice because we don’t necessarily want to take ownership of the
parameter. For example, in the code in Listing 8-16, we want to be able to use
`s2` after appending its contents to `s1`.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-16/src/main.rs:here}}
```

<span class="caption">Listing 8-16: Using a string slice after appending its
contents to a `String`</span>

If the `push_str` method took ownership of `s2`, we wouldn’t be able to print
its value on the last line. However, this code works as we’d expect!

The `push` method takes a single character as a parameter and adds it to the
`String`. Listing 8-17 adds the letter “l” to a `String` using the `push`
method.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-17/src/main.rs:here}}
```

<span class="caption">Listing 8-17: Adding one character to a `String` value
using `push`</span>

As a result, `s` will contain `lol`.

#### Concatenation with the `+` Operator or the `format!` Macro

Often, you’ll want to combine two existing strings. One way to do so is to use
the `+` operator, as shown in Listing 8-18.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-18/src/main.rs:here}}
```

<span class="caption">Listing 8-18: Using the `+` operator to combine two
`String` values into a new `String` value</span>

The string `s3` will contain `Hello, world!`. The reason `s1` is no longer
valid after the addition, and the reason we used a reference to `s2`, has to do
with the signature of the method that’s called when we use the `+` operator.
The `+` operator uses the `add` method, whose signature looks something like
this:

```rust,ignore
fn add(self, s: &str) -> String {
```

In the standard library, you’ll see `add` defined using generics and associated
types. Here, we’ve substituted in concrete types, which is what happens when we
call this method with `String` values. We’ll discuss generics in Chapter 10.
This signature gives us the clues we need to understand the tricky bits of the
`+` operator.

First, `s2` has an `&`, meaning that we’re adding a *reference* to the second
string to the first string. This is because of the `s` parameter in the `add`
function: we can only add a `&str` to a `String`; we can’t add two `String`
values together. But wait: the type of `&s2` is `&String`, not `&str`, as
specified in the second parameter to `add`. So why does Listing 8-18 compile?

The reason we’re able to use `&s2` in the call to `add` is that the compiler
can *coerce* the `&String` argument into a `&str`. When we call the `add`
method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`.
We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does
not take ownership of the `s` parameter, `s2` will still be a valid `String`
after this operation.
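
To see the coercion outside of the `+` operator, here is a minimal sketch. The
`shout` function is a hypothetical helper we made up for illustration; like the
`s` parameter of `add`, it accepts only a `&str`.

```rust
// Hypothetical helper for illustration: like the `s` parameter of `add`,
// it accepts only a string slice.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // `&s2` has type `&String`; deref coercion lets it stand in for `&str`.
    let coerced = shout(&s2);
    // Taking the full slice explicitly gives the same result.
    let explicit = shout(&s2[..]);
    assert_eq!(coerced, explicit);

    // `s1` is moved into `add`; `s2` is only borrowed.
    let s3 = s1 + &s2;
    println!("{s3}");
    println!("{s2}"); // `s2` is still usable here
}
```

Trying to print `s1` after the `+` line would fail to compile, because `s1` has
been moved.
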

Second, we can see in the signature that `add` takes ownership of `self`,
because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be
moved into the `add` call and will no longer be valid after that. So although
`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one,
this statement actually takes ownership of `s1`, appends a copy of the contents
of `s2`, and then returns ownership of the result. In other words, it looks
like it’s making a lot of copies but isn’t; the implementation is more
efficient than copying.
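
One way to picture this is to write the same steps by hand. The `append`
function below is a hypothetical stand-in of our own, not the standard
library’s code, and the real implementation may differ in detail; it only
mirrors the take-ownership, append, and return steps described above.

```rust
// Hypothetical stand-in for what `s1 + &s2` does: take ownership of the
// first string, append a copy of the second, and hand the result back.
fn append(mut owned: String, extra: &str) -> String {
    owned.push_str(extra);
    owned
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // Clone `s1` only so we can compare both approaches in one program.
    let by_hand = append(s1.clone(), &s2);
    let with_plus = s1 + &s2;

    assert_eq!(by_hand, with_plus);
    println!("{with_plus}");
}
```
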

If we need to concatenate multiple strings, the behavior of the `+` operator
gets unwieldy:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-01-concat-multiple-strings/src/main.rs:here}}
```

At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"`
characters, it’s difficult to see what’s going on. For more complicated string
combining, we can instead use the `format!` macro:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-02-format/src/main.rs:here}}
```

This code also sets `s` to `tic-tac-toe`. The `format!` macro works like
`println!`, but instead of printing the output to the screen, it returns a
`String` with the contents. The version of the code using `format!` is much
easier to read, and the code generated by the `format!` macro uses references
so that this call doesn’t take ownership of any of its parameters.
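
As a small sketch of that last point, the following program keeps using all
three inputs after calling `format!`. The `join_with_hyphens` function is a
hypothetical helper introduced here only to make the borrowing explicit.

```rust
// Hypothetical helper: builds a new `String` while only borrowing its inputs.
fn join_with_hyphens(a: &str, b: &str, c: &str) -> String {
    format!("{a}-{b}-{c}")
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = join_with_hyphens(&s1, &s2, &s3);
    println!("{s}");

    // None of the inputs were moved, so all three are still usable.
    println!("{s1} {s2} {s3}");
}
```
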

### Indexing into Strings

In many other programming languages, accessing individual characters in a
string by referencing them by index is a valid and common operation. However,
if you try to access parts of a `String` using indexing syntax in Rust, you’ll
get an error. Consider the invalid code in Listing 8-19.

```rust,ignore,does_not_compile
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-19/src/main.rs:here}}
```

<span class="caption">Listing 8-19: Attempting to use indexing syntax with a
String</span>

This code will result in the following error:

```console
{{#include ../listings/ch08-common-collections/listing-08-19/output.txt}}
```

The error and the note tell the story: Rust strings don’t support indexing. But
why not? To answer that question, we need to discuss how Rust stores strings in
memory.

#### Internal Representation

A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly
encoded UTF-8 example strings from Listing 8-14. First, this one:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:spanish}}
```

In this case, `len` will be 4, which means the vector storing the string “Hola”
is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The
following line, however, may surprise you. (Note that this string begins with
the capital Cyrillic letter Ze, not the Arabic number 3.)

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:russian}}
```

Asked how long the string is, you might say 12. In fact, Rust’s answer is 24:
that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because
each Unicode scalar value in that string takes 2 bytes of storage. Therefore,
an index into the string’s bytes will not always correlate to a valid Unicode
scalar value. To demonstrate, consider this invalid Rust code:

```rust,ignore,does_not_compile
let hello = "Здравствуйте";
let answer = &hello[0];
```

You already know that `answer` will not be `З`, the first letter. When encoded
in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would
seem that `answer` should in fact be `208`, but `208` is not a valid character
on its own. Returning `208` is likely not what a user would want if they asked
for the first letter of this string; however, that’s the only data that Rust
has at byte index 0. Users generally don’t want the byte value returned, even
if the string contains only Latin letters: if `&"hello"[0]` were valid code
that returned the byte value, it would return `104`, not `h`.

The answer, then, is that to avoid returning an unexpected value and causing
bugs that might not be discovered immediately, Rust doesn’t compile this code
at all and prevents misunderstandings early in the development process.
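
If we do want byte-level information, we can ask for it explicitly. The sketch
below compares byte lengths with `char` counts for the two greetings and reads
the first byte through `as_bytes`. The `byte_and_char_counts` function is a
hypothetical helper of our own.

```rust
// Hypothetical helper: returns (length in bytes, number of `char` values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    for word in ["Hola", "Здравствуйте"] {
        let (bytes, chars) = byte_and_char_counts(word);
        println!("{word}: {bytes} bytes, {chars} chars");
    }

    // Indexing the byte slice is allowed because its element type is
    // unambiguous: each element is a `u8`.
    let first_byte = "Здравствуйте".as_bytes()[0];
    println!("first byte: {first_byte}");
}
```

Here the request for bytes is spelled out in the code, so there is no ambiguity
about what kind of value comes back.
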

#### Bytes and Scalar Values and Grapheme Clusters! Oh My!

Another point about UTF-8 is that there are actually three relevant ways to
look at strings from Rust’s perspective: as bytes, scalar values, and grapheme
clusters (the closest thing to what we would call *letters*).

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
stored as a vector of `u8` values that looks like this:

```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
```

That’s 18 bytes and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust’s `char` type is, those
bytes look like this:

```text
['न', 'म', 'स', '्', 'त', 'े']
```

There are six `char` values here, but the fourth and sixth are not letters:
they’re diacritics that don’t make sense on their own. Finally, if we look at
them as grapheme clusters, we’d get what a person would call the four letters
that make up the Hindi word:

```text
["न", "म", "स्", "ते"]
```

Rust provides different ways of interpreting the raw string data that computers
store so that each program can choose the interpretation it needs, no matter
what human language the data is in.
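
The first two of these views can be produced with the standard library alone.
The sketch below collects both for the Hindi word; the `views` function is a
hypothetical helper of our own. The grapheme cluster view is left out because
the standard library does not provide it.

```rust
// Hypothetical helper: returns the byte view and the scalar-value view.
fn views(s: &str) -> (Vec<u8>, Vec<char>) {
    (s.bytes().collect(), s.chars().collect())
}

fn main() {
    let (bytes, chars) = views("नमस्ते");

    println!("{} bytes: {:?}", bytes.len(), bytes);
    println!("{} chars: {:?}", chars.len(), chars);
}
```
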

A final reason Rust doesn’t allow us to index into a `String` to get a
character is that indexing operations are expected to always take constant time
(O(1)). But it isn’t possible to guarantee that performance with a `String`,
because Rust would have to walk through the contents from the beginning to the
index to determine how many valid characters there were.
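
When we really do need the character at a given position, we can make that walk
explicit. This is a minimal sketch using the iterator returned by `chars`; the
`nth_char` function is a hypothetical helper of our own, and its cost grows
with the position requested.

```rust
// Hypothetical helper: finds the `n`th `char` (counting from 0) by walking
// the string from the beginning, so this is not a constant-time lookup.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let hello = "Здравствуйте";

    println!("{:?}", nth_char(hello, 0));
    // Asking for a position past the end yields `None` rather than a panic.
    println!("{:?}", nth_char(hello, 100));
}
```
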

### Slicing Strings

Indexing into a string is often a bad idea because it’s not clear what the
return type of the string-indexing operation should be: a byte value, a
character, a grapheme cluster, or a string slice. If you really need to use
indices to create string slices, therefore, Rust asks you to be more specific.

Rather than indexing using `[]` with a single number, you can use `[]` with a
range to create a string slice containing particular bytes:

```rust
let hello = "Здравствуйте";

let s = &hello[0..4];
```

Here, `s` will be a `&str` that contains the first 4 bytes of the string.
Earlier, we mentioned that each of these characters was 2 bytes, which means
`s` will be `Зд`.

If we were to try to slice only part of a character’s bytes with something like
`&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid
index were accessed in a vector:

```console
{{#include ../listings/ch08-common-collections/output-only-01-not-char-boundary/output.txt}}
```

You should use ranges to create string slices with caution, because doing so
can crash your program.
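
One way to avoid the panic, sketched below, is the `get` method on `str`, which
returns `None` when a range is out of bounds or does not fall on character
boundaries. The `prefix_bytes` function is a hypothetical helper of our own.

```rust
// Hypothetical helper: returns the first `n` bytes as a string slice, or
// `None` if `n` is out of bounds or splits a character.
fn prefix_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(0..n)
}

fn main() {
    let hello = "Здравствуйте";

    println!("{:?}", prefix_bytes(hello, 4)); // ends on a character boundary
    println!("{:?}", prefix_bytes(hello, 1)); // would split the first letter

    // `is_char_boundary` lets us check a byte offset before slicing.
    println!("{}", hello.is_char_boundary(1));
    println!("{}", hello.is_char_boundary(2));
}
```

Whether to use `get` or a plain range depends on whether a bad offset is a bug
that should stop the program or a case the program is expected to handle.
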

### Methods for Iterating Over Strings

The best way to operate on pieces of strings is to be explicit about whether
you want characters or bytes. For individual Unicode scalar values, use the
`chars` method. Calling `chars` on “Зд” separates out and returns two values
of type `char`, and you can iterate over the result to access each element:

```rust
for c in "Зд".chars() {
    println!("{c}");
}
```

This code will print `З` and `д`, each on its own line.

Alternatively, the `bytes` method returns each raw byte, which might be
appropriate for your domain:

```rust
for b in "Зд".bytes() {
    println!("{b}");
}
```

This code will print the four bytes that make up this string, each on its own
line: `208`, `151`, `208`, and `180`.

But be sure to remember that valid Unicode scalar values may be made up of more
than 1 byte.
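
To see how many bytes each scalar value occupies, we can pair `char_indices`,
which yields each `char` with its starting byte offset, with `len_utf8`. This
is only a sketch; the `char_widths` function is a hypothetical helper of our
own.

```rust
// Hypothetical helper: for each `char`, returns its starting byte offset,
// the `char` itself, and how many bytes it takes up in UTF-8.
fn char_widths(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}

fn main() {
    for (offset, c, width) in char_widths("Зд") {
        println!("{c} starts at byte {offset} and is {width} bytes long");
    }
}
```
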

Getting grapheme clusters from strings as with the Devanagari script is
complex, so this functionality is not provided by the standard library. Crates
are available on [crates.io](https://crates.io/)<!-- ignore --> if this is the
functionality you need.

### Strings Are Not So Simple

To summarize, strings are complicated. Different programming languages make
different choices about how to present this complexity to the programmer. Rust
has chosen to make the correct handling of `String` data the default behavior
for all Rust programs, which means programmers have to put more thought into
handling UTF-8 data upfront. This trade-off exposes more of the complexity of
strings than is apparent in other programming languages, but it prevents you
from having to handle errors involving non-ASCII characters later in your
development life cycle.

The good news is that the standard library offers a lot of functionality built
off the `String` and `&str` types to help handle these complex situations
correctly. Be sure to check out the documentation for useful methods like
`contains` for searching in a string and `replace` for substituting parts of a
string with another string.
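
As a quick taste of those two methods, here is a short sketch. The
`hyphens_to_spaces` function is a hypothetical helper of our own, built on
`replace`.

```rust
// Hypothetical helper: returns a new `String` with each hyphen replaced by
// a space. The input is only borrowed and is left unchanged.
fn hyphens_to_spaces(s: &str) -> String {
    s.replace('-', " ")
}

fn main() {
    let s = String::from("tic-tac-toe");

    // `contains` reports whether the pattern appears anywhere in the string.
    println!("{}", s.contains("tac"));

    let spaced = hyphens_to_spaces(&s);
    println!("{spaced}");
    println!("{s}"); // the original string is still intact
}
```
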

Let’s switch to something a bit less complex: hash maps!