New upstream version 1.63.0+dfsg1
## Storing UTF-8 Encoded Text with Strings

We talked about strings in Chapter 4, but we’ll look at them in more depth now.
New Rustaceans commonly get stuck on strings for a combination of three
reasons: Rust’s propensity for exposing possible errors, strings being a more
complicated data structure than many programmers give them credit for, and
UTF-8. These factors combine in a way that can seem difficult when you’re
coming from other programming languages.

We discuss strings in the context of collections because strings are
implemented as a collection of bytes, plus some methods to provide useful
functionality when those bytes are interpreted as text. In this section, we’ll
talk about the operations on `String` that every collection type has, such as
creating, updating, and reading. We’ll also discuss the ways in which `String`
is different from the other collections, namely how indexing into a `String` is
complicated by the differences between how people and computers interpret
`String` data.

### What Is a String?

We’ll first define what we mean by the term *string*. Rust has only one string
type in the core language, which is the string slice `str` that is usually seen
in its borrowed form `&str`. In Chapter 4, we talked about *string slices*,
which are references to some UTF-8 encoded string data stored elsewhere. String
literals, for example, are stored in the program’s binary and are therefore
string slices.

The `String` type, which is provided by Rust’s standard library rather than
coded into the core language, is a growable, mutable, owned, UTF-8 encoded
string type. When Rustaceans refer to “strings” in Rust, they might be
referring to either the `String` or the string slice `&str` types, not just one
of those types. Although this section is largely about `String`, both types are
used heavily in Rust’s standard library, and both `String` and string slices
are UTF-8 encoded.
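
To make the distinction concrete, here is a minimal sketch that puts a string literal (`&str`) next to an owned `String`. The helper `byte_len` is a hypothetical name introduced only for this illustration; it takes a `&str` so that it can accept either kind of string.

```rust
// Hypothetical helper for illustration: accepts any borrowed string data.
fn byte_len(text: &str) -> usize {
    text.len()
}

fn main() {
    // A string literal is a string slice that points into the program's binary.
    let literal: &str = "hello";

    // A `String` owns its UTF-8 data and can grow.
    let mut owned: String = String::from("hello");
    owned.push_str(", world");

    // `&owned` is a `&String`, which Rust coerces to `&str` at the call site.
    println!("{}", byte_len(literal));
    println!("{}", byte_len(&owned));
}
```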

### Creating a New String

Many of the same operations available with `Vec<T>` are available with `String`
as well, because `String` is actually implemented as a wrapper around a vector
of bytes with some extra guarantees, restrictions, and capabilities. An example
of a function that works the same way with `Vec<T>` and `String` is the `new`
function to create an instance, shown in Listing 8-11.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-11/src/main.rs:here}}
```

<span class="caption">Listing 8-11: Creating a new, empty `String`</span>

This line creates a new empty string called `s`, which we can then load data
into. Often, we’ll have some initial data that we want to start the string
with. For that, we use the `to_string` method, which is available on any type
that implements the `Display` trait, as string literals do. Listing 8-12 shows
two examples.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-12/src/main.rs:here}}
```

<span class="caption">Listing 8-12: Using the `to_string` method to create a
`String` from a string literal</span>

This code creates a string containing `initial contents`.

We can also use the function `String::from` to create a `String` from a string
literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12
that uses `to_string`.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-13/src/main.rs:here}}
```

<span class="caption">Listing 8-13: Using the `String::from` function to create
a `String` from a string literal</span>

Because strings are used for so many things, we can use many different generic
APIs for strings, providing us with a lot of options. Some of them can seem
redundant, but they all have their place! In this case, `String::from` and
`to_string` do the same thing, so which you choose is a matter of style and
readability.

Remember that strings are UTF-8 encoded, so we can include any properly encoded
data in them, as shown in Listing 8-14.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:here}}
```

<span class="caption">Listing 8-14: Storing greetings in different languages in
strings</span>

All of these are valid `String` values.
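
The three ways of creating a `String` described in this section can be sketched together as follows. This is an illustrative sketch, not the text of the listings, and `make_strings` is a hypothetical helper name.

```rust
// Hypothetical helper for illustration: builds one string per constructor.
fn make_strings() -> (String, String, String) {
    let empty = String::new(); // new, empty string
    let via_to_string = "initial contents".to_string(); // from a `Display` type
    let via_from = String::from("initial contents"); // from a string literal
    (empty, via_to_string, via_from)
}

fn main() {
    let (empty, a, b) = make_strings();
    println!("empty? {}", empty.is_empty());
    println!("same contents? {}", a == b);

    // Any properly encoded UTF-8 text is a valid `String` value.
    let greeting = String::from("Здравствуйте");
    println!("{}", greeting);
}
```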

### Updating a String

A `String` can grow in size and its contents can change, just like the contents
of a `Vec<T>`, if you push more data into it. In addition, you can conveniently
use the `+` operator or the `format!` macro to concatenate `String` values.

#### Appending to a String with `push_str` and `push`

We can grow a `String` by using the `push_str` method to append a string slice,
as shown in Listing 8-15.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-15/src/main.rs:here}}
```

<span class="caption">Listing 8-15: Appending a string slice to a `String`
using the `push_str` method</span>

After these two lines, `s` will contain `foobar`. The `push_str` method takes a
string slice because we don’t necessarily want to take ownership of the
parameter. For example, in the code in Listing 8-16, we want to be able to use
`s2` after appending its contents to `s1`.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-16/src/main.rs:here}}
```

<span class="caption">Listing 8-16: Using a string slice after appending its
contents to a `String`</span>

If the `push_str` method took ownership of `s2`, we wouldn’t be able to print
its value on the last line. However, this code works as we’d expect!

The `push` method takes a single character as a parameter and adds it to the
`String`. Listing 8-17 adds the letter “l” to a `String` using the `push`
method.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-17/src/main.rs:here}}
```

<span class="caption">Listing 8-17: Adding one character to a `String` value
using `push`</span>

As a result, `s` will contain `lol`.
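
As a rough sketch of both methods together (the helper name `build` and the sample values are assumptions made for this illustration), note that `push_str` only borrows its argument, so the second string is still usable afterward:

```rust
// Hypothetical helper for illustration: returns both strings to show
// that the appended one was borrowed, not moved.
fn build() -> (String, String) {
    let mut s1 = String::from("foo");
    let s2 = String::from("bar");

    s1.push_str(&s2); // appends a string slice; `s2` is only borrowed
    s1.push('!'); // appends a single `char`

    (s1, s2)
}

fn main() {
    let (s1, s2) = build();
    println!("s1 is {}, s2 is {}", s1, s2);
}
```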

#### Concatenation with the `+` Operator or the `format!` Macro

Often, you’ll want to combine two existing strings. One way to do so is to use
the `+` operator, as shown in Listing 8-18.

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-18/src/main.rs:here}}
```

<span class="caption">Listing 8-18: Using the `+` operator to combine two
`String` values into a new `String` value</span>

The string `s3` will contain `Hello, world!`. The reason `s1` is no longer
valid after the addition, and the reason we used a reference to `s2`, has to do
with the signature of the method that’s called when we use the `+` operator.
The `+` operator uses the `add` method, whose signature looks something like
this:

```rust,ignore
fn add(self, s: &str) -> String {
```

In the standard library, you’ll see `add` defined using generics and associated
types. Here, we’ve substituted in concrete types, which is what happens when we
call this method with `String` values. We’ll discuss generics in Chapter 10.
This signature gives us the clues we need to understand the tricky bits of the
`+` operator.

First, `s2` has an `&`, meaning that what we add to the first string is a
*reference* to the second string. This is because of the `s` parameter in the
`add` function: we can only add a `&str` to a `String`; we can’t add two
`String` values together. But wait: the type of `&s2` is `&String`, not `&str`,
as specified in the second parameter to `add`. So why does Listing 8-18 compile?

The reason we’re able to use `&s2` in the call to `add` is that the compiler
can *coerce* the `&String` argument into a `&str`. When we call the `add`
method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`.
We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does
not take ownership of the `s` parameter, `s2` will still be a valid `String`
after this operation.

Second, we can see in the signature that `add` takes ownership of `self`,
because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be
moved into the `add` call and will no longer be valid after that. So although
`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one,
this statement actually takes ownership of `s1`, appends a copy of the contents
of `s2`, and then returns ownership of the result. In other words, it looks
like it’s making a lot of copies but isn’t; the implementation is more
efficient than copying.

If we need to concatenate multiple strings, the behavior of the `+` operator
gets unwieldy:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-01-concat-multiple-strings/src/main.rs:here}}
```

At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"`
characters, it’s difficult to see what’s going on. For more complicated string
combining, we can instead use the `format!` macro:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-02-format/src/main.rs:here}}
```

This code also sets `s` to `tic-tac-toe`. The `format!` macro works like
`println!`, but instead of printing the output to the screen, it returns a
`String` with the contents. The version of the code using `format!` is much
easier to read, and the code generated by the `format!` macro uses references
so that this call doesn’t take ownership of any of its parameters.
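
The ownership difference between `+` and `format!` can be sketched as follows. The helpers `concat` and `hyphenate` are hypothetical names used only for this illustration, and their parameter types are chosen to mirror the simplified `add` signature shown earlier.

```rust
// Hypothetical helper: `+` consumes the left operand and borrows the right.
fn concat(left: String, right: &str) -> String {
    left + right
}

// Hypothetical helper: `format!` only borrows its arguments.
fn hyphenate(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // `s1` is moved into `concat`; `&s2` is coerced from `&String` to `&str`.
    let s3 = concat(s1, &s2);
    println!("{}", s3);
    println!("{}", s2); // still valid: it was only borrowed
    // println!("{}", s1); // would not compile: `s1` was moved

    let tic = String::from("tic");
    let tac = String::from("tac");
    let toe = String::from("toe");

    // All three inputs remain usable after the call.
    let s = hyphenate(&tic, &tac, &toe);
    println!("{} built from {}, {}, {}", s, tic, tac, toe);
}
```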

### Indexing into Strings

In many other programming languages, accessing individual characters in a
string by referencing them by index is a valid and common operation. However,
if you try to access parts of a `String` using indexing syntax in Rust, you’ll
get an error. Consider the invalid code in Listing 8-19.

```rust,ignore,does_not_compile
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-19/src/main.rs:here}}
```

<span class="caption">Listing 8-19: Attempting to use indexing syntax with a
`String`</span>

This code will result in the following error:

```console
{{#include ../listings/ch08-common-collections/listing-08-19/output.txt}}
```

The error and the note tell the story: Rust strings don’t support indexing. But
why not? To answer that question, we need to discuss how Rust stores strings in
memory.

#### Internal Representation

A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly
encoded UTF-8 example strings from Listing 8-14. First, this one:

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:spanish}}
```

In this case, `len` will be 4, which means the vector storing the string “Hola”
is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The
following line, however, may surprise you. (Note that this string begins with
the capital Cyrillic letter Ze, not the Arabic numeral 3.)

```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:russian}}
```

Asked how long the string is, you might say 12. In fact, Rust’s answer is 24:
that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because
each Unicode scalar value in that string takes 2 bytes of storage. Therefore,
an index into the string’s bytes will not always correlate to a valid Unicode
scalar value. To demonstrate, consider this invalid Rust code:

```rust,ignore,does_not_compile
let hello = "Здравствуйте";
let answer = &hello[0];
```

You already know that `answer` will not be `З`, the first letter. When encoded
in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would
seem that `answer` should in fact be `208`, but `208` is not a valid character
on its own. Returning `208` is likely not what a user would want if they asked
for the first letter of this string; however, that’s the only data that Rust
has at byte index 0. Users generally don’t want the byte value returned, even
if the string contains only Latin letters: if `&"hello"[0]` were valid code
that returned the byte value, it would return `104`, not `h`.

The answer, then, is that to avoid returning an unexpected value and causing
bugs that might not be discovered immediately, Rust doesn’t compile this code
at all and prevents misunderstandings early in the development process.
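
The byte counts and byte values quoted above can be checked with a short sketch. Asking for the bytes explicitly with `as_bytes` is allowed, because it leaves no doubt about what is being returned; `first_byte` is a hypothetical helper name used only here.

```rust
// Hypothetical helper for illustration: the first byte of the UTF-8
// encoding, or `None` for an empty string.
fn first_byte(text: &str) -> Option<u8> {
    text.as_bytes().first().copied()
}

fn main() {
    // `len` counts bytes, not letters.
    println!("{}", "Hola".len()); // 4
    println!("{}", "Здравствуйте".len()); // 24

    // The first byte of `З` is 208, which is not a character on its own.
    println!("{:?}", first_byte("Здравствуйте")); // Some(208)

    // Even for Latin letters, the byte value is a number, not a letter.
    println!("{:?}", first_byte("hello")); // Some(104)
}
```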

#### Bytes and Scalar Values and Grapheme Clusters! Oh My!

Another point about UTF-8 is that there are actually three relevant ways to
look at strings from Rust’s perspective: as bytes, scalar values, and grapheme
clusters (the closest thing to what we would call *letters*).

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
stored as a vector of `u8` values that looks like this:

```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
```

That’s 18 bytes and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust’s `char` type is, those
bytes look like this:

```text
['न', 'म', 'स', '्', 'त', 'े']
```

There are six `char` values here, but the fourth and sixth are not letters:
they’re diacritics that don’t make sense on their own. Finally, if we look at
them as grapheme clusters, we’d get what a person would call the four letters
that make up the Hindi word:

```text
["न", "म", "स्", "ते"]
```

Rust provides different ways of interpreting the raw string data that computers
store so that each program can choose the interpretation it needs, no matter
what human language the data is in.

A final reason Rust doesn’t allow us to index into a `String` to get a
character is that indexing operations are expected to always take constant time
(O(1)). But it isn’t possible to guarantee that performance with a `String`,
because Rust would have to walk through the contents from the beginning to the
index to determine how many valid characters there were.
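
A minimal sketch of the first two views, using only the standard library, is shown below. The helper names `byte_and_char_counts` and `nth_scalar` are assumptions made for this illustration. Note that `chars().nth(n)` has to walk the string from the start, which is the linear cost that constant-time indexing could not hide.

```rust
// Hypothetical helper: the same text measured in bytes and in `char`s.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

// Hypothetical helper: the n-th Unicode scalar value, found by decoding
// from the beginning of the string (linear time, not constant time).
fn nth_scalar(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let word = "नमस्ते";

    let (bytes, chars) = byte_and_char_counts(word);
    println!("{} bytes, {} chars", bytes, chars); // 18 bytes, 6 chars

    println!("{:?}", nth_scalar(word, 1)); // Some('म')
    println!("{:?}", nth_scalar(word, 10)); // None: past the end
}
```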

### Slicing Strings

Indexing into a string is often a bad idea because it’s not clear what the
return type of the string-indexing operation should be: a byte value, a
character, a grapheme cluster, or a string slice. If you really need to use
indices to create string slices, therefore, Rust asks you to be more specific.

Rather than indexing using `[]` with a single number, you can use `[]` with a
range to create a string slice containing particular bytes:

```rust
let hello = "Здравствуйте";

let s = &hello[0..4];
```

Here, `s` will be a `&str` that contains the first 4 bytes of the string.
Earlier, we mentioned that each of these characters was 2 bytes, which means
`s` will be `Зд`.

If we were to try to slice only part of a character’s bytes with something like
`&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid
index were accessed in a vector:

```console
{{#include ../listings/ch08-common-collections/output-only-01-not-char-boundary/output.txt}}
```

You should use ranges to create string slices with caution, because doing so
can crash your program.
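
When the byte offsets are not known to be safe, one option is to check them instead of risking a panic. The sketch below uses the standard library methods `str::get` and `str::is_char_boundary`; the wrapper `checked_slice` is a hypothetical name introduced for this illustration.

```rust
// Hypothetical helper: returns `None` instead of panicking when the range
// is out of bounds or does not fall on character boundaries.
fn checked_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";

    // Bytes 0..4 cover two whole 2-byte characters.
    println!("{:?}", checked_slice(hello, 0, 4)); // Some("Зд")

    // Bytes 0..1 would split `З`, so we get `None` rather than a panic.
    println!("{:?}", checked_slice(hello, 0, 1)); // None

    // A boundary can also be tested directly.
    println!("{}", hello.is_char_boundary(1)); // false
    println!("{}", hello.is_char_boundary(2)); // true
}
```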

### Methods for Iterating Over Strings

The best way to operate on pieces of strings is to be explicit about whether
you want characters or bytes. For individual Unicode scalar values, use the
`chars` method. Calling `chars` on “Зд” separates out and returns two values
of type `char`, and you can iterate over the result to access each element:

```rust
for c in "Зд".chars() {
    println!("{}", c);
}
```

This code will print the following:

```text
З
д
```

Alternatively, the `bytes` method returns each raw byte, which might be
appropriate for your domain:

```rust
for b in "Зд".bytes() {
    println!("{}", b);
}
```

This code will print the four bytes that make up this string:

```text
208
151
208
180
```

But be sure to remember that valid Unicode scalar values may be made up of more
than 1 byte.

Getting grapheme clusters from strings as with the Devanagari script is
complex, so this functionality is not provided by the standard library. Crates
are available on [crates.io](https://crates.io/)<!-- ignore --> if this is the
functionality you need.
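
Staying within the standard library, the `char_indices` method pairs each Unicode scalar value with the byte offset where it starts, which ties together the `chars` and `bytes` views shown above. The following is a minimal sketch; `scalar_offsets` and `first_scalars` are hypothetical helper names, and both operate on scalar values, not on grapheme clusters.

```rust
// Hypothetical helper: each `char` together with its starting byte offset.
fn scalar_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

// Hypothetical helper: the prefix holding the first `n` scalar values,
// sliced at an offset that is guaranteed to be a character boundary.
fn first_scalars(text: &str, n: usize) -> &str {
    match text.char_indices().nth(n) {
        Some((offset, _)) => &text[..offset],
        None => text, // fewer than `n` + 1 scalar values: keep everything
    }
}

fn main() {
    println!("{:?}", scalar_offsets("Зд")); // [(0, 'З'), (2, 'д')]
    println!("{}", first_scalars("Здравствуйте", 2)); // Зд
}
```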

### Strings Are Not So Simple

To summarize, strings are complicated. Different programming languages make
different choices about how to present this complexity to the programmer. Rust
has chosen to make the correct handling of `String` data the default behavior
for all Rust programs, which means programmers have to put more thought into
handling UTF-8 data upfront. This trade-off exposes more of the complexity of
strings than is apparent in other programming languages, but it prevents you
from having to handle errors involving non-ASCII characters later in your
development life cycle.

The good news is that the standard library offers a lot of functionality built
off the `String` and `&str` types to help handle these complex situations
correctly. Be sure to check out the documentation for useful methods like
`contains` for searching in a string and `replace` for substituting parts of a
string with another string.
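
As a quick sketch of those two methods (the helper name `spaced` and the sample text are assumptions made for this illustration), note that `replace` returns a new `String` and leaves the original untouched:

```rust
// Hypothetical helper: swaps every hyphen for a space.
fn spaced(text: &str) -> String {
    text.replace('-', " ")
}

fn main() {
    let s = String::from("tic-tac-toe");

    // `contains` searches for a substring.
    println!("{}", s.contains("tac")); // true

    // `replace` builds a new string; `s` itself is unchanged.
    println!("{}", spaced(&s)); // tic tac toe
    println!("{}", s); // tic-tac-toe
}
```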

Let’s switch to something a bit less complex: hash maps!