# Nom Recipes

These are short recipes for accomplishing common tasks with nom.

* [Whitespace](#whitespace)
  + [Wrapper combinators that eat whitespace before and after a parser](#wrapper-combinators-that-eat-whitespace-before-and-after-a-parser)
* [Comments](#comments)
  + [`// C++/EOL-style comments`](#-ceol-style-comments)
  + [`/* C-style comments */`](#-c-style-comments-)
* [Identifiers](#identifiers)
  + [`Rust-Style Identifiers`](#rust-style-identifiers)
* [Literal Values](#literal-values)
  + [Escaped Strings](#escaped-strings)
  + [Integers](#integers)
    - [Hexadecimal](#hexadecimal)
    - [Octal](#octal)
    - [Binary](#binary)
    - [Decimal](#decimal)
  + [Floating Point Numbers](#floating-point-numbers)

## Whitespace

### Wrapper combinators that eat whitespace before and after a parser

```rust
use nom::{
    IResult,
    error::ParseError,
    sequence::delimited,
    character::complete::multispace0,
};

/// A combinator that takes a parser `inner` and produces a parser that also consumes both leading and
/// trailing whitespace, returning the output of `inner`.
fn ws<'a, F: 'a, O, E: ParseError<&'a str>>(inner: F) -> impl FnMut(&'a str) -> IResult<&'a str, O, E>
where
    F: Fn(&'a str) -> IResult<&'a str, O, E>,
{
    delimited(
        multispace0,
        inner,
        multispace0
    )
}
```

To eat only trailing whitespace, replace `delimited(...)` with `terminated(inner, multispace0)`.
Likewise, to eat only leading whitespace, replace `delimited(...)` with `preceded(multispace0,
inner)`. In both cases, import the combinator from `nom::sequence`. You can use your own parser
instead of `multispace0` if you want to skip a different set of lexemes.

## Comments

### `// C++/EOL-style comments`

This version uses `%` to start a comment, does not consume the newline character, and returns an
output of `()`.

```rust
use nom::{
    IResult,
    error::ParseError,
    combinator::value,
    sequence::pair,
    bytes::complete::is_not,
    character::complete::char,
};

pub fn peol_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E>
{
    value(
        (), // Output is thrown away.
        pair(char('%'), is_not("\n\r"))
    )(i)
}
```

### `/* C-style comments */`

Inline comments surrounded with sentinel tags `(*` and `*)`. This version returns an output of `()`
and does not handle nested comments.

```rust
use nom::{
    IResult,
    error::ParseError,
    combinator::value,
    sequence::tuple,
    bytes::complete::{tag, take_until},
};

pub fn pinline_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E> {
    value(
        (), // Output is thrown away.
        tuple((
            tag("(*"),
            take_until("*)"),
            tag("*)")
        ))
    )(i)
}
```

## Identifiers

### `Rust-Style Identifiers`

Identifiers that start with a letter or an underscore and may contain underscores, letters and
digits can be parsed like this:

```rust
use nom::{
    IResult,
    branch::alt,
    multi::many0_count,
    combinator::recognize,
    sequence::pair,
    character::complete::{alpha1, alphanumeric1},
    bytes::complete::tag,
};

pub fn identifier(input: &str) -> IResult<&str, &str> {
    recognize(
        pair(
            alt((alpha1, tag("_"))),
            many0_count(alt((alphanumeric1, tag("_"))))
        )
    )(input)
}
```

Let's say we apply this to the identifier `hello_world123abc`. The first `alt` parser recognizes
`hello`, because `alpha1` consumes one or more letters. The `pair` combinator then passes the
rest, `_world123abc`, to the `many0_count` parser, which applies `alt((alphanumeric1, tag("_")))`
repeatedly until every remaining character is consumed. On its own, `pair` would return a tuple of
the results of its sub-parsers. The `recognize` combinator discards that tuple and instead
produces a `&str` of the input text that was parsed, which in this case is the entire `&str`
`hello_world123abc`.

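To make the walkthrough above concrete, here is a minimal hand-written sketch of the same matching rule that uses only the standard library. The name `identifier_by_hand` is hypothetical and not part of nom, and the sketch assumes that only ASCII letters and digits count as identifier characters. It returns the pair in the same `(remaining, matched)` order that nom's `IResult` uses on success, and the final slice expression plays the role of `recognize`.

```rust
/// Hand-written stand-in for the `identifier` recipe (hypothetical helper, not part of nom).
/// Returns `(remaining, matched)` on success, or `None` if the input does not start
/// with an identifier. Assumption: only ASCII letters and digits are accepted.
fn identifier_by_hand(input: &str) -> Option<(&str, &str)> {
    let mut chars = input.char_indices();

    // First character: an ASCII letter or an underscore.
    match chars.next() {
        Some((_, c)) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return None,
    }

    // Following characters: ASCII letters, digits or underscores.
    // `end` is the byte offset of the first character that does not fit.
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map(|(idx, _)| idx)
        .unwrap_or(input.len());

    // Like `recognize`: hand back the consumed slice of the original input.
    Some((&input[end..], &input[..end]))
}
```

Under these assumptions, `identifier_by_hand("_x1 = 2")` yields `Some((" = 2", "_x1"))`, while an input that starts with a digit yields `None`.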
## Literal Values

### Escaped Strings

This is [one of the examples](https://github.com/Geal/nom/blob/main/examples/string.rs) in the
examples directory.

### Integers

The following recipes all return string slices rather than integer values. How to obtain an
integer value instead is demonstrated for hexadecimal integers. The others are similar.

The parsers allow the grouping character `_`, which allows one to group the digits by byte, for
example: `0xA4_3F_11_28`. If you prefer to exclude the `_` character, the lambda to convert from a
string slice to an integer value is slightly simpler. You can also strip the `_` from the string
slice that is returned, which is demonstrated in the second hexadecimal number parser.

If you wish to limit the number of digits in a valid integer literal, replace `many1` with
`many_m_n` in the recipes.

#### Hexadecimal

The parser outputs the string slice of the digits without the leading `0x`/`0X`.

```rust
use nom::{
    IResult,
    branch::alt,
    multi::{many0, many1},
    combinator::recognize,
    sequence::{preceded, terminated},
    character::complete::{char, one_of},
    bytes::complete::tag,
};

fn hexadecimal(input: &str) -> IResult<&str, &str> {
    preceded(
        alt((tag("0x"), tag("0X"))),
        recognize(
            many1(
                terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
            )
        )
    )(input)
}
```

If you want it to return the integer value instead, use `map_res`:

```rust
use nom::{
    IResult,
    branch::alt,
    multi::{many0, many1},
    combinator::{map_res, recognize},
    sequence::{preceded, terminated},
    character::complete::{char, one_of},
    bytes::complete::tag,
};

fn hexadecimal_value(input: &str) -> IResult<&str, i64> {
    map_res(
        preceded(
            alt((tag("0x"), tag("0X"))),
            recognize(
                many1(
                    terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
                )
            )
        ),
        // Strip the grouping underscores before converting the digits.
        |out: &str| i64::from_str_radix(&out.replace('_', ""), 16)
    )(input)
}
```

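Since the other integer recipes are said to be similar, the conversion step can be sketched once for any radix. The helper name `grouped_digits_to_i64` is hypothetical and not part of nom. It uses only the standard library, and it assumes the digit slice has already had its prefix (`0x`, `0o`, `0b`) removed, as the recipes in this section do.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of nom): converts a digit slice, such as the ones
/// returned by the integer recipes in this section, into an `i64`.
/// Grouping underscores are removed first, because `from_str_radix` rejects them.
fn grouped_digits_to_i64(digits: &str, radix: u32) -> Result<i64, ParseIntError> {
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    i64::from_str_radix(&cleaned, radix)
}
```

A closure such as `|s: &str| grouped_digits_to_i64(s, 8)` could then be passed as the second argument of `map_res`, in the same position as the closure in `hexadecimal_value`. If a recipe does not allow `_` at all, the filtering step can be dropped and the slice passed to `from_str_radix` directly.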
#### Octal

```rust
use nom::{
    IResult,
    branch::alt,
    multi::{many0, many1},
    combinator::recognize,
    sequence::{preceded, terminated},
    character::complete::{char, one_of},
    bytes::complete::tag,
};

fn octal(input: &str) -> IResult<&str, &str> {
    preceded(
        alt((tag("0o"), tag("0O"))),
        recognize(
            many1(
                terminated(one_of("01234567"), many0(char('_')))
            )
        )
    )(input)
}
```

#### Binary

```rust
use nom::{
    IResult,
    branch::alt,
    multi::{many0, many1},
    combinator::recognize,
    sequence::{preceded, terminated},
    character::complete::{char, one_of},
    bytes::complete::tag,
};

fn binary(input: &str) -> IResult<&str, &str> {
    preceded(
        alt((tag("0b"), tag("0B"))),
        recognize(
            many1(
                terminated(one_of("01"), many0(char('_')))
            )
        )
    )(input)
}
```

#### Decimal

```rust
use nom::{
    IResult,
    multi::{many0, many1},
    combinator::recognize,
    sequence::terminated,
    character::complete::{char, one_of},
};

fn decimal(input: &str) -> IResult<&str, &str> {
    recognize(
        many1(
            terminated(one_of("0123456789"), many0(char('_')))
        )
    )(input)
}
```

### Floating Point Numbers

The following is adapted from [the Python parser by Valentin Lorentz (ProgVal)](https://github.com/ProgVal/rust-python-parser/blob/master/src/numbers.rs).

```rust
use nom::{
    IResult,
    branch::alt,
    multi::{many0, many1},
    combinator::{opt, recognize},
    sequence::{preceded, terminated, tuple},
    character::complete::{char, one_of},
};

fn float(input: &str) -> IResult<&str, &str> {
    alt((
        // Case one: .42
        recognize(
            tuple((
                char('.'),
                decimal,
                opt(tuple((
                    one_of("eE"),
                    opt(one_of("+-")),
                    decimal
                )))
            ))
        ),
        // Case two: 42e42 and 42.42e42
        recognize(
            tuple((
                decimal,
                opt(preceded(
                    char('.'),
                    decimal,
                )),
                one_of("eE"),
                opt(one_of("+-")),
                decimal
            ))
        ),
        // Case three: 42. and 42.42
        recognize(
            tuple((
                decimal,
                char('.'),
                opt(decimal)
            ))
        )
    ))(input)
}

fn decimal(input: &str) -> IResult<&str, &str> {
    recognize(
        many1(
            terminated(one_of("0123456789"), many0(char('_')))
        )
    )(input)
}
```

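Like the integer recipes, `float` returns a string slice rather than a number. One possible way to obtain an `f64` from that slice is sketched below with the standard library only. The helper name `float_slice_to_f64` is hypothetical and not part of nom, and the sketch assumes the slice has one of the three shapes accepted by `float` above.

```rust
use std::num::ParseFloatError;

/// Hypothetical helper (not part of nom): converts a slice recognized by `float`
/// into an `f64`. Underscores are stripped first because `f64`'s `FromStr`
/// implementation does not accept them.
fn float_slice_to_f64(text: &str) -> Result<f64, ParseFloatError> {
    text.replace('_', "").parse::<f64>()
}
```

The standard library accepts a missing integer part (`.42`) and a missing fractional part (`42.`), so all three cases of the recipe should convert without further rewriting of the slice.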
# Implementing FromStr

The [FromStr trait](https://doc.rust-lang.org/std/str/trait.FromStr.html) provides
a common interface to parse from a string.

```rust
use nom::{
    IResult, Finish, error::Error,
    bytes::complete::{tag, take_while},
};
use std::str::FromStr;

// will recognize the name in "Hello, name!"
fn parse_name(input: &str) -> IResult<&str, &str> {
    let (i, _) = tag("Hello, ")(input)?;
    let (i, name) = take_while(|c: char| c.is_alphabetic())(i)?;
    let (i, _) = tag("!")(i)?;

    Ok((i, name))
}

// with FromStr, the result cannot be a reference to the input, it must be owned
#[derive(Debug)]
pub struct Name(pub String);

impl FromStr for Name {
    // the error must be owned as well
    type Err = Error<String>;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match parse_name(s).finish() {
            Ok((_remaining, name)) => Ok(Name(name.to_string())),
            Err(Error { input, code }) => Err(Error {
                input: input.to_string(),
                code,
            })
        }
    }
}

fn main() {
    // parsed: Ok(Name("nom"))
    println!("parsed: {:?}", "Hello, nom!".parse::<Name>());

    // parsed: Err(Error { input: "123!", code: Tag })
    println!("parsed: {:?}", "Hello, 123!".parse::<Name>());
}
```