# arrow

[![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)

**[Apache Arrow](https://arrow.apache.org/) is a cross-language
development platform for in-memory data.** It specifies a standardized
language-independent columnar memory format for flat and hierarchical
data, organized for efficient analytic operations on modern hardware. It
also provides computational libraries and zero-copy streaming messaging
and interprocess communication.

**The `arrow` package exposes an interface to the Arrow C++ library,
enabling access to many of its features in R.** It provides low-level
access to the Arrow C++ library API and higher-level access through a
`dplyr` backend and familiar R functions.

## What can the `arrow` package do?

- Read and write **Parquet files** (`read_parquet()`,
  `write_parquet()`), an efficient and widely used columnar format
- Read and write **Feather files** (`read_feather()`,
  `write_feather()`), a format optimized for speed and
  interoperability
- Analyze, process, and write **multi-file, larger-than-memory
  datasets** (`open_dataset()`, `write_dataset()`)
- Read **large CSV and JSON files** with excellent **speed and
  efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
- Write CSV files (`write_csv_arrow()`)
- Manipulate and analyze Arrow data with **`dplyr` verbs**
- Read and write files in **Amazon S3** buckets with no additional
  function calls
- Exercise **fine control over column types** for seamless
  interoperability with databases and data warehouse systems
- Use **compression codecs** including Snappy, gzip, Brotli,
  Zstandard, LZ4, LZO, and bzip2 for reading and writing data
- Enable **zero-copy data sharing** between **R and Python**
- Connect to **Arrow Flight** RPC servers to send and receive large
  datasets over networks
- Access and manipulate Arrow objects through **low-level bindings**
  to the C++ library
- Provide a **toolkit for building connectors** to other applications
  and services that use Arrow

## Installation

### Installing the latest release version

Install the latest release of `arrow` from CRAN with

``` r
install.packages("arrow")
```

Conda users can install `arrow` from conda-forge with

``` shell
conda install -c conda-forge --strict-channel-priority r-arrow
```

Installing a released version of the `arrow` package requires no
additional system dependencies. For macOS and Windows, CRAN hosts binary
packages that contain the Arrow C++ library. On Linux, source package
installation will also build necessary C++ dependencies. For a faster,
more complete installation, set the environment variable
`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
details.

For Windows users of R 3.6 and earlier, note that AWS S3 support is not
available, and the 32-bit build does not support Arrow Datasets. These
features require the `rtools40` toolchain on Windows and are therefore
only available in R >= 4.0.

### Installing a development version

Development versions of the package (binary and source) are built
nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
install from there:

``` r
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
```

Conda users can install `arrow` nightly builds with

``` shell
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
```

If you already have a version of `arrow` installed, you can switch to
the latest nightly development version with

``` r
arrow::install_arrow(nightly = TRUE)
```

These nightly package builds are not official Apache releases and are
not recommended for production use. They may be useful for testing bug
fixes and new features under active development.

## Usage

Among the many applications of the `arrow` package, two of the most
accessible are:

- High-performance reading and writing of data files with multiple
  file formats and compression codecs, including built-in support for
  cloud storage
- Analyzing and manipulating bigger-than-memory data with `dplyr`
  verbs

The sections below describe these two uses and illustrate them with
basic examples. They refer to two Arrow data structures:

- `Table`: a tabular, column-oriented data structure capable of
  storing and processing large amounts of data more efficiently than
  R’s built-in `data.frame` and with SQL-like column data types that
  afford better interoperability with databases and data warehouse
  systems
- `Dataset`: a data structure functionally similar to `Table` but with
  the capability to work on larger-than-memory data partitioned across
  multiple files

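As a minimal sketch of the first structure (assuming a release of `arrow` that provides `arrow_table()`; older releases expose the same functionality as `Table$create()`), a `Table` can be built directly from an R `data.frame` and carries an explicit schema:

``` r
library(arrow, warn.conflicts = FALSE)

# Build an Arrow Table from an ordinary R data.frame
tbl <- arrow_table(data.frame(x = 1:3, y = c("a", "b", "c")))

# The Table carries an explicit schema with Arrow column types
print(tbl$schema)   # e.g. x: int32, y: string
tbl$num_rows
```
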
### Reading and writing data files with `arrow`

The `arrow` package provides functions for reading single data files in
several common formats. By default, calling any of these functions
returns an R `data.frame`. To return an Arrow `Table`, set argument
`as_data_frame = FALSE`.

- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in Feather format (the Apache Arrow
  IPC format)
- `read_delim_arrow()`: read a delimited text file (default delimiter
  is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file

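For instance, the CSV reader can return either form. A minimal sketch (the temporary file and its columns are made up for illustration):

``` r
library(arrow, warn.conflicts = FALSE)

# Write a small CSV to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)),
          csv_path, row.names = FALSE)

# Default: returns an R data.frame (tibble)
df <- read_csv_arrow(csv_path)

# With as_data_frame = FALSE: returns an Arrow Table instead
tab <- read_csv_arrow(csv_path, as_data_frame = FALSE)
```
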
For writing data to single files, the `arrow` package provides the
functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
These can be used with R `data.frame` and Arrow `Table` objects.

For example, let’s write the Star Wars characters data that’s included
in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
choice for storing analytic data; it is optimized for reduced file sizes
and fast read performance, especially for column-based access patterns.
Parquet is widely supported by many tools and platforms.

First load the `arrow` and `dplyr` packages:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Then write the `data.frame` named `starwars` to a Parquet file at
`file_path`:

``` r
file_path <- tempfile()
write_parquet(starwars, file_path)
```

Then read the Parquet file into an R `data.frame` named `sw`:

``` r
sw <- read_parquet(file_path)
```

R object attributes are preserved when writing data to Parquet or
Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R `data.frame`s
with `haven::labelled` columns, and `data.frame`s with other custom
attributes.

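A minimal sketch of this round trip, using a made-up `units` attribute on a column:

``` r
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3)
attr(df$x, "units") <- "metres"   # an arbitrary custom attribute

path <- tempfile(fileext = ".parquet")
write_parquet(df, path)
df2 <- read_parquet(path)

attr(df2$x, "units")   # the attribute survives the round trip
```
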
For reading and writing larger files or sets of multiple files, `arrow`
defines `Dataset` objects and provides the functions `open_dataset()`
and `write_dataset()`, which enable analysis and processing of
bigger-than-memory data, including the ability to partition data into
smaller chunks without loading the full data into memory. For examples
of these functions, see `vignette("dataset", package = "arrow")`.

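As a brief sketch of that workflow (using the built-in `mtcars` data rather than anything truly larger than memory):

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write mtcars as a partitioned Dataset: one subdirectory per value of `cyl`
dir_path <- tempfile()
write_dataset(mtcars, dir_path, partitioning = "cyl")

# Open the directory as a single Dataset; files are scanned lazily on demand
ds <- open_dataset(dir_path)

ds %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  collect()
```
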
All these functions can read and write files in the local filesystem or
in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
details, see `vignette("fs", package = "arrow")`.

### Using `dplyr` with `arrow`

The `arrow` package provides a `dplyr` backend enabling manipulation of
Arrow tabular data with `dplyr` verbs. To use it, first load both
packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
`Dataset` object. For example, read the Parquet file written in the
previous example into an Arrow `Table` named `sw`:

``` r
sw <- read_parquet(file_path, as_data_frame = FALSE)
```

Next, pipe on `dplyr` verbs:

``` r
result <- sw %>%
  filter(homeworld == "Tatooine") %>%
  rename(height_cm = height, mass_kg = mass) %>%
  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
  arrange(desc(birth_year)) %>%
  select(name, height_in, mass_lbs)
```

The `arrow` package uses lazy evaluation to delay computation until the
result is required. This speeds up processing by enabling the Arrow C++
library to perform multiple computations in one operation. `result` is
an object with class `arrow_dplyr_query` which represents all the
computations to be performed:

``` r
result
#> Table (query)
#> name: string
#> height_in: expr
#> mass_lbs: expr
#>
#> * Filter: equal(homeworld, "Tatooine")
#> * Sorted by birth_year [desc]
#> See $.data for the source Arrow object
```

To perform these computations and materialize the result, call
`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
suitable for passing to other `arrow` or `dplyr` functions:

``` r
result %>% compute()
#> Table
#> 10 rows x 3 columns
#> $name <string>
#> $height_in <double>
#> $mass_lbs <double>
```

`collect()` returns an R `data.frame`, suitable for viewing or passing
to other R functions for analysis or visualization:

``` r
result %>% collect()
#> # A tibble: 10 x 3
#>    name               height_in mass_lbs
#>    <chr>                  <dbl>    <dbl>
#>  1 C-3PO                   65.7    165.
#>  2 Cliegg Lars             72.0     NA
#>  3 Shmi Skywalker          64.2     NA
#>  4 Owen Lars               70.1    265.
#>  5 Beru Whitesun lars      65.0    165.
#>  6 Darth Vader             79.5    300.
#>  7 Anakin Skywalker        74.0    185.
#>  8 Biggs Darklighter       72.0    185.
#>  9 Luke Skywalker          67.7    170.
#> 10 R5-D4                   38.2     70.5
```

The `arrow` package works with most single-table `dplyr` verbs,
including those that compute aggregates.

``` r
sw %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  collect()
```

Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are
supported for joining multiple tables.

``` r
jedi <- data.frame(
  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
  jedi = c(FALSE, TRUE, TRUE)
)

sw %>%
  select(1:11) %>%
  right_join(jedi) %>%
  collect()
```
286 | ||
287 | Window functions (e.g. `ntile()`) are not yet | |
288 | supported. Inside `dplyr` verbs, Arrow offers support for many functions and | |
289 | operators, with common functions mapped to their base R and tidyverse | |
290 | equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html) | |
291 | lists many of them. If there are additional functions you would like to see | |
292 | implemented, please file an issue as described in the [Getting | |
293 | help](#getting-help) section below. | |
294 | ||
295 | For `dplyr` queries on `Table` objects, if the `arrow` package detects | |
296 | an unimplemented function within a `dplyr` verb, it automatically calls | |
297 | `collect()` to return the data as an R `data.frame` before processing | |
298 | that `dplyr` verb. For queries on `Dataset` objects (which can be larger | |
299 | than memory), it raises an error if the function is unimplemented; | |
300 | you need to explicitly tell it to `collect()`. | |
301 | ||
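One common pattern, sketched below with the built-in `mtcars` data and the `ntile()` window function mentioned above, is to reduce the data with Arrow-supported verbs first, then `collect()` and apply the unsupported function in R:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

dir_path <- tempfile()
write_dataset(mtcars, dir_path, partitioning = "cyl")
ds <- open_dataset(dir_path)

result <- ds %>%
  filter(cyl == 4) %>%                  # evaluated by Arrow before loading data
  select(mpg, wt) %>%
  collect() %>%                         # materialize the reduced result in R
  mutate(mpg_quartile = ntile(mpg, 4))  # window function runs in R, not Arrow
```
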
### Additional features

Other applications of `arrow` are described in the following vignettes:

- `vignette("python", package = "arrow")`: use `arrow` and
  `reticulate` to pass data between R and Python
- `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
  servers to send and receive data
- `vignette("arrow", package = "arrow")`: access and manipulate Arrow
  objects through low-level bindings to the C++ library

## Getting help

If you encounter a bug, please file an issue with a minimal reproducible
example on the [Apache Jira issue
tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
an account or log in, then click **Create** to file an issue. Select the
project **Apache Arrow (ARROW)**, select the component **R**, and begin
the issue summary with **`[R]`** followed by a space. For more
information, see the **Report bugs and propose features** section of the
[Contributing to Apache
Arrow](https://arrow.apache.org/docs/developers/contributing.html) page
in the Arrow developer documentation.

We welcome questions, discussion, and contributions from users of the
`arrow` package. For information about mailing lists and other venues
for engaging with the Arrow developer and user communities, please see
the [Apache Arrow Community](https://arrow.apache.org/community/) page.

------------------------------------------------------------------------

All participation in the Apache Arrow project is governed by the Apache
Software Foundation’s [code of
conduct](https://www.apache.org/foundation/policies/conduct.html).