Commit | Line | Data |
---|---|---|
13cf67c4 XL |
1 | ## Storing UTF-8 Encoded Text with Strings |
2 | ||
3 | We talked about strings in Chapter 4, but we’ll look at them in more depth now. | |
69743fb6 | 4 | New Rustaceans commonly get stuck on strings for a combination of three |
13cf67c4 XL |
5 | reasons: Rust’s propensity for exposing possible errors, strings being a more |
6 | complicated data structure than many programmers give them credit for, and | |
7 | UTF-8. These factors combine in a way that can seem difficult when you’re | |
8 | coming from other programming languages. | |
9 | ||
10 | It’s useful to discuss strings in the context of collections because strings | |
11 | are implemented as a collection of bytes, plus some methods to provide useful | |
12 | functionality when those bytes are interpreted as text. In this section, we’ll | |
13 | talk about the operations on `String` that every collection type has, such as | |
14 | creating, updating, and reading. We’ll also discuss the ways in which `String` | |
15 | is different from the other collections, namely how indexing into a `String` is | |
16 | complicated by the differences between how people and computers interpret | |
17 | `String` data. | |
18 | ||
19 | ### What Is a String? | |
20 | ||
21 | We’ll first define what we mean by the term *string*. Rust has only one string | |
22 | type in the core language, which is the string slice `str` that is usually seen | |
23 | in its borrowed form `&str`. In Chapter 4, we talked about *string slices*, | |
24 | which are references to some UTF-8 encoded string data stored elsewhere. String | |
532ac7d7 XL |
25 | literals, for example, are stored in the program’s binary and are therefore |
26 | string slices. | |
13cf67c4 XL |
27 | |
28 | The `String` type, which is provided by Rust’s standard library rather than | |
29 | coded into the core language, is a growable, mutable, owned, UTF-8 encoded | |
30 | string type. When Rustaceans refer to “strings” in Rust, they usually mean the | |
31 | `String` and the string slice `&str` types, not just one of those types. | |
32 | Although this section is largely about `String`, both types are used heavily in | |
33 | Rust’s standard library, and both `String` and string slices are UTF-8 encoded. | |
34 | ||
35 | Rust’s standard library also includes a number of other string types, such as | |
36 | `OsString`, `OsStr`, `CString`, and `CStr`. Library crates can provide even | |
37 | more options for storing string data. See how those names all end in `String` | |
38 | or `Str`? They refer to owned and borrowed variants, just like the `String` and | |
39 | `str` types you’ve seen previously. These string types can store text in | |
40 | different encodings or be represented in memory in a different way, for | |
41 | example. We won’t discuss these other string types in this chapter; see their | |
42 | API documentation for more about how to use them and when each is appropriate. | |
43 | ||
44 | ### Creating a New String | |
45 | ||
46 | Many of the same operations available with `Vec<T>` are available with `String` | |
47 | as well, starting with the `new` function to create a string, shown in Listing | |
69743fb6 | 48 | 8-11. |
13cf67c4 XL |
49 | |
50 | ```rust | |
51 | let mut s = String::new(); | |
52 | ``` | |
53 | ||
54 | <span class="caption">Listing 8-11: Creating a new, empty `String`</span> | |
55 | ||
56 | This line creates a new empty string called `s`, which we can then load data | |
57 | into. Often, we’ll have some initial data that we want to start the string | |
58 | with. For that, we use the `to_string` method, which is available on any type | |
59 | that implements the `Display` trait, as string literals do. Listing 8-12 shows | |
69743fb6 | 60 | two examples. |
13cf67c4 XL |
61 | |
62 | ```rust | |
63 | let data = "initial contents"; | |
64 | ||
65 | let s = data.to_string(); | |
66 | ||
67 | // the method also works on a literal directly: | |
68 | let s = "initial contents".to_string(); | |
69 | ``` | |
70 | ||
71 | <span class="caption">Listing 8-12: Using the `to_string` method to create a | |
72 | `String` from a string literal</span> | |
73 | ||
74 | This code creates a string containing `initial contents`. | |
75 | ||
76 | We can also use the function `String::from` to create a `String` from a string | |
77 | literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12 | |
69743fb6 | 78 | that uses `to_string`. |
13cf67c4 XL |
79 | |
80 | ```rust | |
81 | let s = String::from("initial contents"); | |
82 | ``` | |
83 | ||
84 | <span class="caption">Listing 8-13: Using the `String::from` function to create | |
85 | a `String` from a string literal</span> | |
86 | ||
87 | Because strings are used for so many things, we can use many different generic | |
88 | APIs for strings, providing us with a lot of options. Some of them can seem | |
89 | redundant, but they all have their place! In this case, `String::from` and | |
90 | `to_string` do the same thing, so which you choose is a matter of style. | |
91 | ||
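To make the equivalence concrete, here is a minimal sketch that builds the same `String` in several ways. The function name `build_three_ways` is hypothetical, and the `into` call is an addition not covered in this section: it relies on the standard library's `From<&str>` implementation for `String`.

```rust
/// Builds the same `String` three ways so the results can be compared.
/// (`build_three_ways` is an illustrative name, not a standard API.)
fn build_three_ways(text: &str) -> (String, String, String) {
    let a = text.to_string(); // via the `Display`-based `to_string` method
    let b = String::from(text); // via the `From<&str>` implementation
    let c: String = text.into(); // `Into` is the mirror image of `From`
    (a, b, c)
}
```

Calling `build_three_ways("initial contents")` should yield three equal `String` values, which is why the choice between these spellings is mostly stylistic.
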
92 | Remember that strings are UTF-8 encoded, so we can include any properly encoded | |
69743fb6 | 93 | data in them, as shown in Listing 8-14. |
13cf67c4 XL |
94 | |
95 | ```rust | |
96 | let hello = String::from("السلام عليكم"); | |
97 | let hello = String::from("Dobrý den"); | |
98 | let hello = String::from("Hello"); | |
99 | let hello = String::from("שָׁלוֹם"); | |
100 | let hello = String::from("नमस्ते"); | |
101 | let hello = String::from("こんにちは"); | |
102 | let hello = String::from("안녕하세요"); | |
103 | let hello = String::from("你好"); | |
104 | let hello = String::from("Olá"); | |
105 | let hello = String::from("Здравствуйте"); | |
106 | let hello = String::from("Hola"); | |
107 | ``` | |
108 | ||
109 | <span class="caption">Listing 8-14: Storing greetings in different languages in | |
110 | strings</span> | |
111 | ||
112 | All of these are valid `String` values. | |
113 | ||
114 | ### Updating a String | |
115 | ||
116 | A `String` can grow in size and its contents can change, just like the contents | |
117 | of a `Vec<T>`, if you push more data into it. In addition, you can conveniently | |
118 | use the `+` operator or the `format!` macro to concatenate `String` values. | |
119 | ||
120 | #### Appending to a String with `push_str` and `push` | |
121 | ||
122 | We can grow a `String` by using the `push_str` method to append a string slice, | |
69743fb6 | 123 | as shown in Listing 8-15. |
13cf67c4 XL |
124 | |
125 | ```rust | |
126 | let mut s = String::from("foo"); | |
127 | s.push_str("bar"); | |
128 | ``` | |
129 | ||
130 | <span class="caption">Listing 8-15: Appending a string slice to a `String` | |
131 | using the `push_str` method</span> | |
132 | ||
133 | After these two lines, `s` will contain `foobar`. The `push_str` method takes a | |
134 | string slice because we don’t necessarily want to take ownership of the | |
135 | parameter. For example, the code in Listing 8-16 shows that it would be | |
69743fb6 | 136 | unfortunate if we weren’t able to use `s2` after appending its contents to `s1`. |
13cf67c4 XL |
137 | |
138 | ```rust | |
139 | let mut s1 = String::from("foo"); | |
140 | let s2 = "bar"; | |
141 | s1.push_str(s2); | |
142 | println!("s2 is {}", s2); | |
143 | ``` | |
144 | ||
145 | <span class="caption">Listing 8-16: Using a string slice after appending its | |
146 | contents to a `String`</span> | |
147 | ||
148 | If the `push_str` method took ownership of `s2`, we wouldn’t be able to print | |
149 | its value on the last line. However, this code works as we’d expect! | |
150 | ||
151 | The `push` method takes a single character as a parameter and adds it to the | |
69743fb6 XL |
152 | `String`. Listing 8-17 shows code that adds the letter *l* to a `String` using |
153 | the `push` method. | |
13cf67c4 XL |
154 | |
155 | ```rust | |
156 | let mut s = String::from("lo"); | |
157 | s.push('l'); | |
158 | ``` | |
159 | ||
160 | <span class="caption">Listing 8-17: Adding one character to a `String` value | |
161 | using `push`</span> | |
162 | ||
163 | As a result of this code, `s` will contain `lol`. | |
164 | ||
165 | #### Concatenation with the `+` Operator or the `format!` Macro | |
166 | ||
167 | Often, you’ll want to combine two existing strings. One way is to use the `+` | |
69743fb6 | 168 | operator, as shown in Listing 8-18. |
13cf67c4 XL |
169 | |
170 | ```rust | |
171 | let s1 = String::from("Hello, "); | |
172 | let s2 = String::from("world!"); | |
69743fb6 | 173 | let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used |
13cf67c4 XL |
174 | ``` |
175 | ||
176 | <span class="caption">Listing 8-18: Using the `+` operator to combine two | |
177 | `String` values into a new `String` value</span> | |
178 | ||
179 | The string `s3` will contain `Hello, world!` as a result of this code. The | |
180 | reason `s1` is no longer valid after the addition and the reason we used a | |
181 | reference to `s2` has to do with the signature of the method that gets called | |
182 | when we use the `+` operator. The `+` operator uses the `add` method, whose | |
183 | signature looks something like this: | |
184 | ||
185 | ```rust,ignore | |
186 | fn add(self, s: &str) -> String { | |
187 | ``` | |
188 | ||
189 | This isn’t the exact signature that’s in the standard library: in the standard | |
190 | library, `add` is defined using generics. Here, we’re looking at the signature | |
191 | of `add` with concrete types substituted for the generic ones, which is what | |
192 | happens when we call this method with `String` values. We’ll discuss generics | |
193 | in Chapter 10. This signature gives us the clues we need to understand the | |
194 | tricky bits of the `+` operator. | |
195 | ||
196 | First, `s2` has an `&`, meaning that we pass `add` a *reference* to the second | |
197 | string rather than the string itself. This follows from the `s` parameter of `add`: | |
198 | we can only add a `&str` to a `String`; we can’t add two `String` values | |
199 | together. But wait—the type of `&s2` is `&String`, not `&str`, as specified in | |
200 | the second parameter to `add`. So why does Listing 8-18 compile? | |
201 | ||
202 | The reason we’re able to use `&s2` in the call to `add` is that the compiler | |
203 | can *coerce* the `&String` argument into a `&str`. When we call the `add` | |
204 | method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`. | |
205 | We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does | |
206 | not take ownership of the `s` parameter, `s2` will still be a valid `String` | |
207 | after this operation. | |
208 | ||
209 | Second, we can see in the signature that `add` takes ownership of `self`, | |
210 | because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be | |
211 | moved into the `add` call and no longer be valid after that. So although `let | |
212 | s3 = s1 + &s2;` looks like it will copy both strings and create a new one, this | |
213 | statement actually takes ownership of `s1`, appends a copy of the contents of | |
214 | `s2`, and then returns ownership of the result. In other words, it looks like | |
215 | it’s making a lot of copies but isn’t; the implementation is more efficient | |
216 | than copying. | |
217 | ||
218 | If we need to concatenate multiple strings, the behavior of the `+` operator | |
219 | gets unwieldy: | |
220 | ||
221 | ```rust | |
222 | let s1 = String::from("tic"); | |
223 | let s2 = String::from("tac"); | |
224 | let s3 = String::from("toe"); | |
225 | ||
226 | let s = s1 + "-" + &s2 + "-" + &s3; | |
227 | ``` | |
228 | ||
229 | At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"` | |
230 | characters, it’s difficult to see what’s going on. For more complicated string | |
231 | combining, we can use the `format!` macro: | |
232 | ||
233 | ```rust | |
234 | let s1 = String::from("tic"); | |
235 | let s2 = String::from("tac"); | |
236 | let s3 = String::from("toe"); | |
237 | ||
238 | let s = format!("{}-{}-{}", s1, s2, s3); | |
239 | ``` | |
240 | ||
241 | This code also sets `s` to `tic-tac-toe`. The `format!` macro works in the same | |
242 | way as `println!`, but instead of printing the output to the screen, it returns | |
243 | a `String` with the contents. The version of the code using `format!` is much | |
244 | easier to read and doesn’t take ownership of any of its parameters. | |
245 | ||
246 | ### Indexing into Strings | |
247 | ||
248 | In many other programming languages, accessing individual characters in a | |
249 | string by referencing them by index is a valid and common operation. However, | |
250 | if you try to access parts of a `String` using indexing syntax in Rust, you’ll | |
69743fb6 | 251 | get an error. Consider the invalid code in Listing 8-19. |
13cf67c4 XL |
252 | |
253 | ```rust,ignore,does_not_compile | |
254 | let s1 = String::from("hello"); | |
255 | let h = s1[0]; | |
256 | ``` | |
257 | ||
258 | <span class="caption">Listing 8-19: Attempting to use indexing syntax with a | |
259 | `String`</span> | |
260 | ||
261 | This code will result in the following error: | |
262 | ||
263 | ```text | |
264 | error[E0277]: the trait bound `std::string::String: std::ops::Index<{integer}>` is not satisfied | |
265 | --> | |
266 | | | |
267 | 3 | let h = s1[0]; | |
268 | | ^^^^^ the type `std::string::String` cannot be indexed by `{integer}` | |
269 | | | |
270 | = help: the trait `std::ops::Index<{integer}>` is not implemented for `std::string::String` | |
271 | ``` | |
272 | ||
273 | The error and the help text tell the story: Rust strings don’t support indexing. But | |
274 | why not? To answer that question, we need to discuss how Rust stores strings in | |
275 | memory. | |
276 | ||
277 | #### Internal Representation | |
278 | ||
279 | A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly | |
280 | encoded UTF-8 example strings from Listing 8-14. First, this one: | |
281 | ||
282 | ```rust | |
283 | let len = String::from("Hola").len(); | |
284 | ``` | |
285 | ||
286 | In this case, `len` will be 4, which means the vector storing the string “Hola” | |
287 | is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. But | |
69743fb6 | 288 | what about the following line? (Note that this string begins with the capital |
13cf67c4 XL |
289 | Cyrillic letter Ze, not the Arabic number 3.) |
290 | ||
291 | ```rust | |
292 | let len = String::from("Здравствуйте").len(); | |
293 | ``` | |
294 | ||
295 | Asked how long the string is, you might say 12. However, Rust’s answer is 24: | |
296 | that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because | |
297 | each Unicode scalar value in that string takes 2 bytes of storage. Therefore, | |
298 | an index into the string’s bytes will not always correlate to a valid Unicode | |
299 | scalar value. To demonstrate, consider this invalid Rust code: | |
300 | ||
301 | ```rust,ignore,does_not_compile | |
302 | let hello = "Здравствуйте"; | |
303 | let answer = &hello[0]; | |
304 | ``` | |
305 | ||
306 | What should the value of `answer` be? Should it be `З`, the first letter? When | |
307 | encoded in UTF-8, the first byte of `З` is `208` and the second is `151`, so | |
308 | `answer` should in fact be `208`, but `208` is not a valid character on its | |
309 | own. Returning `208` is likely not what a user would want if they asked for the | |
310 | first letter of this string; however, that’s the only data that Rust has at | |
311 | byte index 0. Users generally don’t want the byte value returned, even if the | |
312 | string contains only Latin letters: if `&"hello"[0]` were valid code that | |
313 | returned the byte value, it would return `104`, not `h`. To avoid returning an | |
314 | unexpected value and causing bugs that might not be discovered immediately, | |
315 | Rust doesn’t compile this code at all and prevents misunderstandings early in | |
316 | the development process. | |
317 | ||
318 | #### Bytes and Scalar Values and Grapheme Clusters! Oh My! | |
319 | ||
320 | Another point about UTF-8 is that there are actually three relevant ways to | |
321 | look at strings from Rust’s perspective: as bytes, scalar values, and grapheme | |
322 | clusters (the closest thing to what we would call *letters*). | |
323 | ||
324 | If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is | |
325 | stored as a vector of `u8` values that looks like this: | |
326 | ||
327 | ```text | |
328 | [224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, | |
329 | 224, 165, 135] | |
330 | ``` | |
331 | ||
332 | That’s 18 bytes and is how computers ultimately store this data. If we look at | |
333 | them as Unicode scalar values, which are what Rust’s `char` type is, those | |
334 | bytes look like this: | |
335 | ||
336 | ```text | |
337 | ['न', 'म', 'स', '्', 'त', 'े'] | |
338 | ``` | |
339 | ||
340 | There are six `char` values here, but the fourth and sixth are not letters: | |
341 | they’re diacritics that don’t make sense on their own. Finally, if we look at | |
342 | them as grapheme clusters, we’d get what a person would call the four letters | |
343 | that make up the Hindi word: | |
344 | ||
345 | ```text | |
346 | ["न", "म", "स्", "ते"] | |
347 | ``` | |
348 | ||
349 | Rust provides different ways of interpreting the raw string data that computers | |
350 | store so that each program can choose the interpretation it needs, no matter | |
351 | what human language the data is in. | |
352 | ||
353 | A final reason Rust doesn’t allow us to index into a `String` to get a | |
354 | character is that indexing operations are expected to always take constant time | |
355 | (O(1)). But it isn’t possible to guarantee that performance with a `String`, | |
356 | because Rust would have to walk through the contents from the beginning to the | |
357 | index to determine how many valid characters there were. | |
358 | ||
359 | ### Slicing Strings | |
360 | ||
361 | Indexing into a string is often a bad idea because it’s not clear what the | |
362 | return type of the string-indexing operation should be: a byte value, a | |
363 | character, a grapheme cluster, or a string slice. If you really need to | |
364 | use indices to create string slices, Rust therefore asks you to be more | |
365 | specific: rather than indexing using `[]` with a single number, you can | |
366 | use `[]` with a range to create a string slice containing particular | |
367 | bytes: | |
368 | ||
369 | ```rust | |
370 | let hello = "Здравствуйте"; | |
371 | ||
372 | let s = &hello[0..4]; | |
373 | ``` | |
374 | ||
375 | Here, `s` will be a `&str` that contains the first 4 bytes of the string. | |
376 | Earlier, we mentioned that each of these characters was 2 bytes, which means | |
377 | `s` will be `Зд`. | |
378 | ||
379 | What would happen if we used `&hello[0..1]`? The answer: Rust would panic at | |
380 | runtime in the same way as if an invalid index were accessed in a vector: | |
381 | ||
382 | ```text | |
383 | thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'З' (bytes 0..2) of `Здравствуйте`', src/libcore/str/mod.rs:2188:4 | |
384 | ``` | |
385 | ||
386 | You should use ranges to create string slices with caution, because doing so | |
387 | can crash your program. | |
388 | ||
389 | ### Methods for Iterating Over Strings | |
390 | ||
391 | Fortunately, you can access elements in a string in other ways. | |
392 | ||
393 | If you need to perform operations on individual Unicode scalar values, the best | |
394 | way to do so is to use the `chars` method. Calling `chars` on “नमस्ते” separates | |
395 | out and returns six values of type `char`, and you can iterate over the result | |
69743fb6 | 396 | to access each element: |
13cf67c4 XL |
397 | |
398 | ```rust | |
399 | for c in "नमस्ते".chars() { | |
400 | println!("{}", c); | |
401 | } | |
402 | ``` | |
403 | ||
404 | This code will print the following: | |
405 | ||
406 | ```text | |
407 | न | |
408 | म | |
409 | स | |
410 | ् | |
411 | त | |
412 | े | |
413 | ``` | |
414 | ||
415 | The `bytes` method returns each raw byte, which might be appropriate for your | |
416 | domain: | |
417 | ||
418 | ```rust | |
419 | for b in "नमस्ते".bytes() { | |
420 | println!("{}", b); | |
421 | } | |
422 | ``` | |
423 | ||
424 | This code will print the 18 bytes that make up this string: | |
425 | ||
426 | ```text | |
427 | 224 | |
428 | 164 | |
429 | // --snip-- | |
430 | 165 | |
431 | 135 | |
432 | ``` | |
433 | ||
434 | But be sure to remember that valid Unicode scalar values may be made up of more | |
435 | than 1 byte. | |
436 | ||
437 | Getting grapheme clusters from strings is complex, so this functionality is not | |
438 | provided by the standard library. Crates are available on | |
dc9dc135 | 439 | [crates.io](https://crates.io/) if this is the functionality you need. |
13cf67c4 XL |
440 | |
441 | ### Strings Are Not So Simple | |
442 | ||
443 | To summarize, strings are complicated. Different programming languages make | |
444 | different choices about how to present this complexity to the programmer. Rust | |
445 | has chosen to make the correct handling of `String` data the default behavior | |
446 | for all Rust programs, which means programmers have to put more thought into | |
447 | handling UTF-8 data upfront. This trade-off exposes more of the complexity of | |
448 | strings than is apparent in other programming languages, but it prevents you | |
449 | from having to handle errors involving non-ASCII characters later in your | |
450 | development life cycle. | |
451 | ||
452 | Let’s switch to something a bit less complex: hash maps! |