# arrow

[![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)

**[Apache Arrow](https://arrow.apache.org/) is a cross-language
development platform for in-memory data.** It specifies a standardized
language-independent columnar memory format for flat and hierarchical
data, organized for efficient analytic operations on modern hardware. It
also provides computational libraries and zero-copy streaming messaging
and interprocess communication.

**The `arrow` package exposes an interface to the Arrow C++ library,
enabling access to many of its features in R.** It provides low-level
access to the Arrow C++ library API and higher-level access through a
`dplyr` backend and familiar R functions.

## What can the `arrow` package do?

- Read and write **Parquet files** (`read_parquet()`,
  `write_parquet()`), an efficient and widely used columnar format
- Read and write **Feather files** (`read_feather()`,
  `write_feather()`), a format optimized for speed and
  interoperability
- Analyze, process, and write **multi-file, larger-than-memory
  datasets** (`open_dataset()`, `write_dataset()`)
- Read **large CSV and JSON files** with excellent **speed and
  efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
- Write CSV files (`write_csv_arrow()`)
- Manipulate and analyze Arrow data with **`dplyr` verbs**
- Read and write files in **Amazon S3** buckets with no additional
  function calls
- Exercise **fine control over column types** for seamless
  interoperability with databases and data warehouse systems
- Use **compression codecs** including Snappy, gzip, Brotli,
  Zstandard, LZ4, LZO, and bzip2 for reading and writing data
- Enable **zero-copy data sharing** between **R and Python**
- Connect to **Arrow Flight** RPC servers to send and receive large
  datasets over networks
- Access and manipulate Arrow objects through **low-level bindings**
  to the C++ library
- Provide a **toolkit for building connectors** to other applications
  and services that use Arrow
## Installation

### Installing the latest release version

Install the latest release of `arrow` from CRAN with

``` r
install.packages("arrow")
```

Conda users can install `arrow` from conda-forge with

``` shell
conda install -c conda-forge --strict-channel-priority r-arrow
```

Installing a released version of the `arrow` package requires no
additional system dependencies. For macOS and Windows, CRAN hosts binary
packages that contain the Arrow C++ library. On Linux, source package
installation will also build necessary C++ dependencies. For a faster,
more complete installation, set the environment variable
`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
details.

For Windows users of R 3.6 and earlier, note that support for AWS S3 is not
available, and the 32-bit version does not support Arrow Datasets.
These features are only supported by the `rtools40` toolchain on Windows
and thus are only available in R >= 4.0.
74
### Installing a development version

Development versions of the package (binary and source) are built
nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
install from there:

``` r
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
```

Conda users can install `arrow` nightly builds with

``` shell
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
```

If you already have a version of `arrow` installed, you can switch to
the latest nightly development version with

``` r
arrow::install_arrow(nightly = TRUE)
```

These nightly package builds are not official Apache releases and are
not recommended for production use. They may be useful for testing bug
fixes and new features under active development.
101
## Usage

Among the many applications of the `arrow` package, two of the most accessible are:

- High-performance reading and writing of data files with multiple
  file formats and compression codecs, including built-in support for
  cloud storage
- Analyzing and manipulating bigger-than-memory data with `dplyr`
  verbs

The sections below describe these two uses and illustrate them with
basic examples, which mention two Arrow data structures:

- `Table`: a tabular, column-oriented data structure capable of
  storing and processing large amounts of data more efficiently than
  R’s built-in `data.frame` and with SQL-like column data types that
  afford better interoperability with databases and data warehouse
  systems
- `Dataset`: a data structure functionally similar to `Table` but with
  the capability to work on larger-than-memory data partitioned across
  multiple files
### Reading and writing data files with `arrow`

The `arrow` package provides functions for reading single data files in
several common formats. By default, calling any of these functions
returns an R `data.frame`. To return an Arrow `Table` instead, set the
argument `as_data_frame = FALSE`.

- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in Feather format (the Apache Arrow
  IPC format)
- `read_delim_arrow()`: read a delimited text file (default delimiter
  is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file

For writing data to single files, the `arrow` package provides the
functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
These can be used with R `data.frame` and Arrow `Table` objects.

For example, let’s write the Star Wars characters data that’s included
in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
choice for storing analytic data; it is optimized for reduced file sizes
and fast read performance, especially for column-based access patterns.
Parquet is widely supported by many tools and platforms.

First load the `arrow` and `dplyr` packages:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Then write the `data.frame` named `starwars` to a Parquet file at
`file_path`:

``` r
file_path <- tempfile()
write_parquet(starwars, file_path)
```

Then read the Parquet file into an R `data.frame` named `sw`:

``` r
sw <- read_parquet(file_path)
```
R object attributes are preserved when writing data to Parquet or
Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R `data.frame`s
with `haven::labelled` columns, and `data.frame`s with other custom
attributes.
176
For reading and writing larger files or sets of multiple files, `arrow`
defines `Dataset` objects and provides the functions `open_dataset()`
and `write_dataset()`, which enable analysis and processing of
bigger-than-memory data, including the ability to partition data into
smaller chunks without loading the full data into memory. For examples
of these functions, see `vignette("dataset", package = "arrow")`.
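As a minimal sketch of those two functions (using the `starwars` data from
`dplyr` and a temporary directory; the column choice here is only for
illustration), a `data.frame` can be written as a partitioned `Dataset` and
then queried lazily:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write a few columns of starwars as a Dataset partitioned by gender;
# each partition value becomes a Hive-style subdirectory of Parquet files
dataset_path <- tempfile()
starwars %>%
  select(name, height, mass, gender) %>%
  write_dataset(dataset_path, partitioning = "gender")

# open_dataset() scans the directory lazily; rows are only read
# when the query is collected
open_dataset(dataset_path) %>%
  filter(gender == "feminine") %>%
  collect()
```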
183
All these functions can read and write files in the local filesystem or
in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
details, see `vignette("fs", package = "arrow")`.
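For example (with a placeholder bucket name; the same functions work
unchanged once the URI points at a bucket you can access):

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# "my-bucket" is a placeholder; substitute an S3 bucket you can write to
write_parquet(starwars, "s3://my-bucket/starwars.parquet")
sw_s3 <- read_parquet("s3://my-bucket/starwars.parquet")
```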
187
### Using `dplyr` with `arrow`

The `arrow` package provides a `dplyr` backend enabling manipulation of
Arrow tabular data with `dplyr` verbs. To use it, first load both the
`arrow` and `dplyr` packages. Then load data into an Arrow `Table` or
`Dataset` object. For example, read the Parquet file written in the
previous example into an Arrow `Table` named `sw`:

``` r
sw <- read_parquet(file_path, as_data_frame = FALSE)
```

Next, pipe on `dplyr` verbs:

``` r
result <- sw %>%
  filter(homeworld == "Tatooine") %>%
  rename(height_cm = height, mass_kg = mass) %>%
  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
  arrange(desc(birth_year)) %>%
  select(name, height_in, mass_lbs)
```
210
The `arrow` package uses lazy evaluation to delay computation until the
result is required. This speeds up processing by enabling the Arrow C++
library to perform multiple computations in one operation. `result` is
an object with class `arrow_dplyr_query` which represents all the
computations to be performed:

``` r
result
#> Table (query)
#> name: string
#> height_in: expr
#> mass_lbs: expr
#>
#> * Filter: equal(homeworld, "Tatooine")
#> * Sorted by birth_year [desc]
#> See $.data for the source Arrow object
```

To perform these computations and materialize the result, call
`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
suitable for passing to other `arrow` or `dplyr` functions:

``` r
result %>% compute()
#> Table
#> 10 rows x 3 columns
#> $name <string>
#> $height_in <double>
#> $mass_lbs <double>
```

`collect()` returns an R `data.frame`, suitable for viewing or passing
to other R functions for analysis or visualization:

``` r
result %>% collect()
#> # A tibble: 10 x 3
#>    name               height_in mass_lbs
#>    <chr>                  <dbl>    <dbl>
#>  1 C-3PO                   65.7    165.
#>  2 Cliegg Lars             72.0     NA
#>  3 Shmi Skywalker          64.2     NA
#>  4 Owen Lars               70.1    265.
#>  5 Beru Whitesun lars      65.0    165.
#>  6 Darth Vader             79.5    300.
#>  7 Anakin Skywalker        74.0    185.
#>  8 Biggs Darklighter       72.0    185.
#>  9 Luke Skywalker          67.7    170.
#> 10 R5-D4                   38.2     70.5
```
261
The `arrow` package works with most single-table `dplyr` verbs, including those
that compute aggregates.

```r
sw %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  collect()
```

Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
for joining multiple tables.

```r
jedi <- data.frame(
  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
  jedi = c(FALSE, TRUE, TRUE)
)

sw %>%
  select(1:11) %>%
  right_join(jedi) %>%
  collect()
```

Window functions (e.g. `ntile()`) are not yet
supported. Inside `dplyr` verbs, Arrow offers support for many functions and
operators, with common functions mapped to their base R and tidyverse
equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
lists many of them. If there are additional functions you would like to see
implemented, please file an issue as described in the [Getting
help](#getting-help) section below.
294
295For `dplyr` queries on `Table` objects, if the `arrow` package detects
296an unimplemented function within a `dplyr` verb, it automatically calls
297`collect()` to return the data as an R `data.frame` before processing
298that `dplyr` verb. For queries on `Dataset` objects (which can be larger
299than memory), it raises an error if the function is unimplemented;
300you need to explicitly tell it to `collect()`.
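As a sketch of that explicit pattern (the helper `first_word()` is a made-up
plain R function standing in for anything Arrow cannot translate):

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# An ordinary R function with no Arrow translation (illustrative only)
first_word <- function(x) vapply(strsplit(x, " "), `[`, character(1), 1)

ds_path <- tempfile()
write_dataset(select(starwars, name, height), ds_path)

open_dataset(ds_path) %>%
  collect() %>%                      # materialize in R first
  mutate(given = first_word(name))   # then apply the R-only function
```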
301
### Additional features

Other applications of `arrow` are described in the following vignettes:

- `vignette("python", package = "arrow")`: use `arrow` and
  `reticulate` to pass data between R and Python
- `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
  servers to send and receive data
- `vignette("arrow", package = "arrow")`: access and manipulate Arrow
  objects through low-level bindings to the C++ library
312
## Getting help

If you encounter a bug, please file an issue with a minimal reproducible
example on the [Apache Jira issue
tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
an account or log in, then click **Create** to file an issue. Select the
project **Apache Arrow (ARROW)**, select the component **R**, and begin
the issue summary with **`[R]`** followed by a space. For more
information, see the **Report bugs and propose features** section of the
[Contributing to Apache
Arrow](https://arrow.apache.org/docs/developers/contributing.html) page
in the Arrow developer documentation.

We welcome questions, discussion, and contributions from users of the
`arrow` package. For information about mailing lists and other venues
for engaging with the Arrow developer and user communities, please see
the [Apache Arrow Community](https://arrow.apache.org/community/) page.
330
------------------------------------------------------------------------

All participation in the Apache Arrow project is governed by the Apache
Software Foundation’s [code of
conduct](https://www.apache.org/foundation/policies/conduct.html).