## Storing UTF-8 Encoded Text with Strings

We talked about strings in Chapter 4, but we’ll look at them in more depth now.
New Rustaceans commonly get stuck on strings for a combination of three
reasons: Rust’s propensity for exposing possible errors, strings being a more
complicated data structure than many programmers give them credit for, and
UTF-8. These factors combine in a way that can seem difficult when you’re
coming from other programming languages.

It’s useful to discuss strings in the context of collections because strings
are implemented as a collection of bytes, plus some methods to provide useful
functionality when those bytes are interpreted as text. In this section, we’ll
talk about the operations on `String` that every collection type has, such as
creating, updating, and reading. We’ll also discuss the ways in which `String`
is different from the other collections, namely how indexing into a `String` is
complicated by the differences between how people and computers interpret
`String` data.

### What Is a String?

We’ll first define what we mean by the term *string*. Rust has only one string
type in the core language, which is the string slice `str` that is usually seen
in its borrowed form `&str`. In Chapter 4, we talked about *string slices*,
which are references to some UTF-8 encoded string data stored elsewhere. String
literals, for example, are stored in the program’s binary and are therefore
string slices.

The `String` type, which is provided by Rust’s standard library rather than
coded into the core language, is a growable, mutable, owned, UTF-8 encoded
string type. When Rustaceans refer to “strings” in Rust, they usually mean the
`String` and the string slice `&str` types, not just one of those types.
Although this section is largely about `String`, both types are used heavily in
Rust’s standard library, and both `String` and string slices are UTF-8 encoded.
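
As a minimal sketch of how the two types relate, the following example passes
both a string literal and a `String` to the same function. The `shout` helper
is a hypothetical name used only for illustration; `as_str` is the standard
library method that borrows a `String` as a `&str`.

```rust
// Hypothetical helper: it takes a borrowed string slice, so callers can pass
// data that lives in the binary or in a heap-allocated `String`.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let literal: &str = "hello"; // string slice pointing into the binary
    let owned: String = String::from("hello"); // owned, growable string

    println!("{}", shout(literal));
    println!("{}", shout(owned.as_str())); // borrow the `String` as a `&str`
}
```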

Rust’s standard library also includes a number of other string types, such as
`OsString`, `OsStr`, `CString`, and `CStr`. Library crates can provide even
more options for storing string data. See how those names all end in `String`
or `Str`? They refer to owned and borrowed variants, just like the `String` and
`str` types you’ve seen previously. These string types can store text in
different encodings or be represented in memory in a different way, for
example. We won’t discuss these other string types in this chapter; see their
API documentation for more about how to use them and when each is appropriate.

### Creating a New String

Many of the same operations available with `Vec<T>` are available with `String`
as well, starting with the `new` function to create a string, shown in Listing
8-11.

```rust
let mut s = String::new();
```

<span class="caption">Listing 8-11: Creating a new, empty `String`</span>

This line creates a new empty string called `s`, which we can then load data
into. Often, we’ll have some initial data that we want to start the string
with. For that, we use the `to_string` method, which is available on any type
that implements the `Display` trait, as string literals do. Listing 8-12 shows
two examples.

```rust
let data = "initial contents";

let s = data.to_string();

// the method also works on a literal directly:
let s = "initial contents".to_string();
```

<span class="caption">Listing 8-12: Using the `to_string` method to create a
`String` from a string literal</span>

This code creates a string containing `initial contents`.

We can also use the function `String::from` to create a `String` from a string
literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12
that uses `to_string`.

```rust
let s = String::from("initial contents");
```

<span class="caption">Listing 8-13: Using the `String::from` function to create
a `String` from a string literal</span>

Because strings are used for so many things, we can use many different generic
APIs for strings, providing us with a lot of options. Some of them can seem
redundant, but they all have their place! In this case, `String::from` and
`to_string` do the same thing, so which you choose is a matter of style.

Remember that strings are UTF-8 encoded, so we can include any properly encoded
data in them, as shown in Listing 8-14.

```rust
let hello = String::from("السلام عليكم");
let hello = String::from("Dobrý den");
let hello = String::from("Hello");
let hello = String::from("שָׁלוֹם");
let hello = String::from("नमस्ते");
let hello = String::from("こんにちは");
let hello = String::from("안녕하세요");
let hello = String::from("你好");
let hello = String::from("Olá");
let hello = String::from("Здравствуйте");
let hello = String::from("Hola");
```

<span class="caption">Listing 8-14: Storing greetings in different languages in
strings</span>

All of these are valid `String` values.
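
The reverse also holds: a `String` cannot hold bytes that are not valid UTF-8.
As a small sketch, the standard library’s `String::from_utf8` checks a byte
vector before wrapping it. The `is_valid_utf8` helper is a hypothetical name
introduced here for illustration.

```rust
// Hypothetical helper: reports whether the bytes could become a `String`.
fn is_valid_utf8(bytes: Vec<u8>) -> bool {
    String::from_utf8(bytes).is_ok()
}

fn main() {
    // The four ASCII bytes of "Hola" are valid UTF-8.
    println!("{}", is_valid_utf8(vec![72, 111, 108, 97]));

    // 208 starts a two-byte sequence, so on its own it is incomplete.
    println!("{}", is_valid_utf8(vec![208]));

    // Together with 151 it encodes the Cyrillic letter Ze.
    println!("{}", is_valid_utf8(vec![208, 151]));
}
```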

### Updating a String

A `String` can grow in size and its contents can change, just like the contents
of a `Vec<T>`, if you push more data into it. In addition, you can conveniently
use the `+` operator or the `format!` macro to concatenate `String` values.

#### Appending to a String with `push_str` and `push`

We can grow a `String` by using the `push_str` method to append a string slice,
as shown in Listing 8-15.

```rust
let mut s = String::from("foo");
s.push_str("bar");
```

<span class="caption">Listing 8-15: Appending a string slice to a `String`
using the `push_str` method</span>

After these two lines, `s` will contain `foobar`. The `push_str` method takes a
string slice because we don’t necessarily want to take ownership of the
parameter. For example, the code in Listing 8-16 shows that it would be
unfortunate if we weren’t able to use `s2` after appending its contents to `s1`.

```rust
let mut s1 = String::from("foo");
let s2 = "bar";
s1.push_str(s2);
println!("s2 is {}", s2);
```

<span class="caption">Listing 8-16: Using a string slice after appending its
contents to a `String`</span>

If the `push_str` method took ownership of `s2`, we wouldn’t be able to print
its value on the last line. However, this code works as we’d expect!

The `push` method takes a single character as a parameter and adds it to the
`String`. Listing 8-17 shows code that adds the letter *l* to a `String` using
the `push` method.

```rust
let mut s = String::from("lo");
s.push('l');
```

<span class="caption">Listing 8-17: Adding one character to a `String` value
using `push`</span>

As a result of this code, `s` will contain `lol`.
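
The two methods are often used together. As a rough sketch, the hypothetical
`join_with` function below (not part of the standard library) builds one
`String` from several slices, using `push` for the single-character separator
and `push_str` for each word.

```rust
// Hypothetical helper: joins `words`, placing `separator` between them.
fn join_with(words: &[&str], separator: char) -> String {
    let mut result = String::new();
    for (i, word) in words.iter().enumerate() {
        if i > 0 {
            result.push(separator); // one `char`
        }
        result.push_str(word); // a whole string slice
    }
    result
}

fn main() {
    let joined = join_with(&["tic", "tac", "toe"], '-');
    println!("{}", joined);
}
```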

#### Concatenation with the `+` Operator or the `format!` Macro

Often, you’ll want to combine two existing strings. One way is to use the `+`
operator, as shown in Listing 8-18.

```rust
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used
```

<span class="caption">Listing 8-18: Using the `+` operator to combine two
`String` values into a new `String` value</span>

The string `s3` will contain `Hello, world!` as a result of this code. The
reason `s1` is no longer valid after the addition and the reason we used a
reference to `s2` has to do with the signature of the method that gets called
when we use the `+` operator. The `+` operator uses the `add` method, whose
signature looks something like this:

```rust,ignore
fn add(self, s: &str) -> String {
```

This isn’t the exact signature that’s in the standard library: in the standard
library, `add` is defined using generics. Here, we’re looking at the signature
of `add` with concrete types substituted for the generic ones, which is what
happens when we call this method with `String` values. We’ll discuss generics
in Chapter 10. This signature gives us the clues we need to understand the
tricky bits of the `+` operator.

First, `s2` has an `&`, meaning that we’re adding a *reference* to the second
string to the first string because of the `s` parameter in the `add` function:
we can only add a `&str` to a `String`; we can’t add two `String` values
together. But wait: the type of `&s2` is `&String`, not `&str`, as specified in
the second parameter to `add`. So why does Listing 8-18 compile?

The reason we’re able to use `&s2` in the call to `add` is that the compiler
can *coerce* the `&String` argument into a `&str`. When we call the `add`
method, Rust uses a *deref coercion*, which here turns `&s2` into `&s2[..]`.
We’ll discuss deref coercion in more depth in Chapter 15. Because `add` does
not take ownership of the `s` parameter, `s2` will still be a valid `String`
after this operation.

Second, we can see in the signature that `add` takes ownership of `self`,
because `self` does *not* have an `&`. This means `s1` in Listing 8-18 will be
moved into the `add` call and no longer be valid after that. So although `let
s3 = s1 + &s2;` looks like it will copy both strings and create a new one, this
statement actually takes ownership of `s1`, appends a copy of the contents of
`s2`, and then returns ownership of the result. In other words, it looks like
it’s making a lot of copies but isn’t; the implementation is more efficient
than copying.
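
To make the ownership rules concrete, here is a small sketch. The `append`
function is a hypothetical helper with the same shape as the simplified `add`
signature shown earlier: its first argument is moved in and its second is only
borrowed.

```rust
// Hypothetical helper mirroring the simplified `add` signature.
fn append(owned: String, borrowed: &str) -> String {
    owned + borrowed
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    let s3 = append(s1, &s2); // `&s2` is coerced from `&String` to `&str`
    // println!("{}", s1); // would not compile: `s1` was moved into `append`

    println!("{}", s2); // still valid, because it was only borrowed
    println!("{}", s3);
}
```

If the first string is still needed afterward, one option is to pass
`s1.clone()` instead, at the cost of an extra allocation.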

If we need to concatenate multiple strings, the behavior of the `+` operator
gets unwieldy:

```rust
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");

let s = s1 + "-" + &s2 + "-" + &s3;
```

At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"`
characters, it’s difficult to see what’s going on. For more complicated string
combining, we can use the `format!` macro:

```rust
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");

let s = format!("{}-{}-{}", s1, s2, s3);
```

This code also sets `s` to `tic-tac-toe`. The `format!` macro works in the same
way as `println!`, but instead of printing the output to the screen, it returns
a `String` with the contents. The version of the code using `format!` is much
easier to read and doesn’t take ownership of any of its parameters.
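
A short sketch of that last point: after `format!` runs, all three inputs can
still be used. The `dashed` function is a hypothetical wrapper added only so
the behavior can be exercised with borrowed arguments.

```rust
// Hypothetical wrapper: borrows its inputs and returns a new `String`.
fn dashed(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = format!("{}-{}-{}", s1, s2, s3);
    let same = dashed(&s1, &s2, &s3);

    println!("{} {}", s, same);
    println!("{} {} {}", s1, s2, s3); // none of the inputs were moved
}
```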

### Indexing into Strings

In many other programming languages, accessing individual characters in a
string by referencing them by index is a valid and common operation. However,
if you try to access parts of a `String` using indexing syntax in Rust, you’ll
get an error. Consider the invalid code in Listing 8-19.

```rust,ignore,does_not_compile
let s1 = String::from("hello");
let h = s1[0];
```

<span class="caption">Listing 8-19: Attempting to use indexing syntax with a
String</span>

This code will result in the following error:

```text
error[E0277]: the trait bound `std::string::String: std::ops::Index<{integer}>` is not satisfied
 -->
  |
3 |     let h = s1[0];
  |             ^^^^^ the type `std::string::String` cannot be indexed by `{integer}`
  |
  = help: the trait `std::ops::Index<{integer}>` is not implemented for `std::string::String`
```

The error and the help message tell the story: Rust strings don’t support
indexing. But why not? To answer that question, we need to discuss how Rust
stores strings in memory.

#### Internal Representation

A `String` is a wrapper over a `Vec<u8>`. Let’s look at some of our properly
encoded UTF-8 example strings from Listing 8-14. First, this one:

```rust
let len = String::from("Hola").len();
```

In this case, `len` will be 4, which means the vector storing the string “Hola”
is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. But
what about the following line? (Note that this string begins with the capital
Cyrillic letter Ze, not the Arabic number 3.)

```rust
let len = String::from("Здравствуйте").len();
```

Asked how long the string is, you might say 12. However, Rust’s answer is 24:
that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because
each Unicode scalar value in that string takes 2 bytes of storage. Therefore,
an index into the string’s bytes will not always correlate to a valid Unicode
scalar value. To demonstrate, consider this invalid Rust code:

```rust,ignore,does_not_compile
let hello = "Здравствуйте";
let answer = &hello[0];
```

What should the value of `answer` be? Should it be `З`, the first letter? When
encoded in UTF-8, the first byte of `З` is `208` and the second is `151`, so
`answer` should in fact be `208`, but `208` is not a valid character on its
own. Returning `208` is likely not what a user would want if they asked for the
first letter of this string; however, that’s the only data that Rust has at
byte index 0. Users generally don’t want the byte value returned, even if the
string contains only Latin letters: if `&"hello"[0]` were valid code that
returned the byte value, it would return `104`, not `h`. To avoid returning an
unexpected value and causing bugs that might not be discovered immediately,
Rust doesn’t compile this code at all and prevents misunderstandings early in
the development process.
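
The byte values mentioned above can be checked directly. This sketch uses the
standard `len`, `bytes`, and `as_bytes` methods; `first_bytes` is a
hypothetical helper named here for illustration.

```rust
// Hypothetical helper: collects the first `n` raw bytes of a string slice.
fn first_bytes(text: &str, n: usize) -> Vec<u8> {
    text.bytes().take(n).collect()
}

fn main() {
    let hello = "Здравствуйте";

    println!("{}", hello.len()); // 24 bytes, not 12
    println!("{:?}", first_bytes(hello, 2)); // [208, 151], the two bytes of 'З'

    // Asking for bytes explicitly is allowed; this is the byte for 'h'.
    println!("{}", "hello".as_bytes()[0]); // 104
}
```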

#### Bytes and Scalar Values and Grapheme Clusters! Oh My!

Another point about UTF-8 is that there are actually three relevant ways to
look at strings from Rust’s perspective: as bytes, scalar values, and grapheme
clusters (the closest thing to what we would call *letters*).

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
stored as a vector of `u8` values that looks like this:

```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
```

That’s 18 bytes and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust’s `char` type is, those
bytes look like this:

```text
['न', 'म', 'स', '्', 'त', 'े']
```

There are six `char` values here, but the fourth and sixth are not letters:
they’re diacritics that don’t make sense on their own. Finally, if we look at
them as grapheme clusters, we’d get what a person would call the four letters
that make up the Hindi word:

```text
["न", "म", "स्", "ते"]
```

Rust provides different ways of interpreting the raw string data that computers
store so that each program can choose the interpretation it needs, no matter
what human language the data is in.
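
The first two views can be counted with the standard library alone. In this
sketch, `byte_and_char_counts` is a hypothetical helper; counting grapheme
clusters is left out because the standard library does not provide it.

```rust
// Hypothetical helper: returns (number of bytes, number of `char` values).
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    let (bytes, chars) = byte_and_char_counts("नमस्ते");
    println!("{} bytes, {} scalar values", bytes, chars); // 18 bytes, 6 scalar values
}
```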

A final reason Rust doesn’t allow us to index into a `String` to get a
character is that indexing operations are expected to always take constant time
(O(1)). But it isn’t possible to guarantee that performance with a `String`,
because Rust would have to walk through the contents from the beginning to the
index to determine how many valid characters there were.
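
When the *n*th scalar value really is needed, the walk has to be written out.
As a sketch, the hypothetical `nth_char` helper below uses the standard
`chars().nth(n)` call, which steps through the string from the start and so
takes time proportional to the position requested.

```rust
// Hypothetical helper: returns the scalar value at position `n`, if any.
// This walks the string from the beginning, so it is O(n), not O(1).
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    println!("{:?}", nth_char("Здравствуйте", 0)); // Some('З')
    println!("{:?}", nth_char("hi", 5)); // None
}
```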

### Slicing Strings

Indexing into a string is often a bad idea because it’s not clear what the
return type of the string-indexing operation should be: a byte value, a
character, a grapheme cluster, or a string slice. Therefore, Rust asks you to
be more specific if you really need to use indices to create string slices. To
be more specific in your indexing and indicate that you want a string slice,
rather than indexing using `[]` with a single number, you can use `[]` with a
range to create a string slice containing particular bytes:

```rust
let hello = "Здравствуйте";

let s = &hello[0..4];
```

Here, `s` will be a `&str` that contains the first 4 bytes of the string.
Earlier, we mentioned that each of these characters was 2 bytes, which means
`s` will be `Зд`.

What would happen if we used `&hello[0..1]`? The answer: Rust would panic at
runtime in the same way as if an invalid index were accessed in a vector:

```text
thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'З' (bytes 0..2) of `Здравствуйте`', src/libcore/str/mod.rs:2188:4
```

You should use ranges to create string slices with caution, because doing so
can crash your program.
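
One way to slice without risking a panic is the standard `get` method on
`str`, which returns `None` when a range does not fall on character
boundaries. The sketch below wraps it in a hypothetical `safe_prefix` helper
and also shows `is_char_boundary`.

```rust
// Hypothetical helper: returns the first `end` bytes as a slice, or `None`
// if `end` is out of range or falls inside a multi-byte character.
fn safe_prefix(text: &str, end: usize) -> Option<&str> {
    text.get(0..end)
}

fn main() {
    let hello = "Здравствуйте";

    println!("{:?}", safe_prefix(hello, 4)); // Some("Зд")
    println!("{:?}", safe_prefix(hello, 1)); // None, instead of a panic

    println!("{}", hello.is_char_boundary(1)); // false
    println!("{}", hello.is_char_boundary(2)); // true
}
```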

### Methods for Iterating Over Strings

Fortunately, you can access elements in a string in other ways.

If you need to perform operations on individual Unicode scalar values, the best
way to do so is to use the `chars` method. Calling `chars` on “नमस्ते” separates
out and returns six values of type `char`, and you can iterate over the result
to access each element:

```rust
for c in "नमस्ते".chars() {
    println!("{}", c);
}
```

This code will print the following:

```text
न
म
स
्
त
े
```

The `bytes` method returns each raw byte, which might be appropriate for your
domain:

```rust
for b in "नमस्ते".bytes() {
    println!("{}", b);
}
```

This code will print the 18 bytes that make up this string:

```text
224
164
// --snip--
165
135
```

But be sure to remember that valid Unicode scalar values may be made up of more
than 1 byte.
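
If both the scalar value and its byte position are needed, the standard
`char_indices` method yields them together. The `char_offsets` helper in this
sketch is a hypothetical name used for illustration.

```rust
// Hypothetical helper: pairs each `char` with the byte offset where it starts.
fn char_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    // Each Cyrillic letter here takes 2 bytes, so the offsets advance by 2.
    for (offset, c) in char_offsets("Зд") {
        println!("{} starts at byte {}", c, offset);
    }
}
```

The offsets returned this way always fall on character boundaries, so they are
safe to use as the ends of a slicing range.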

Getting grapheme clusters from strings is complex, so this functionality is not
provided by the standard library. Crates are available on
[crates.io](https://crates.io/) if this is the functionality you need.

### Strings Are Not So Simple

To summarize, strings are complicated. Different programming languages make
different choices about how to present this complexity to the programmer. Rust
has chosen to make the correct handling of `String` data the default behavior
for all Rust programs, which means programmers have to put more thought into
handling UTF-8 data upfront. This trade-off exposes more of the complexity of
strings than is apparent in other programming languages, but it prevents you
from having to handle errors involving non-ASCII characters later in your
development life cycle.

Let’s switch to something a bit less complex: hash maps!