Commit | Line | Data |
---|---|---|
1d09f67e TL |
1 | % Generated by roxygen2: do not edit by hand |
2 | % Please edit documentation in R/dataset-scan.R | |
3 | \name{Scanner} | |
4 | \alias{Scanner} | |
5 | \alias{ScannerBuilder} | |
6 | \title{Scan the contents of a dataset} | |
7 | \description{ | |
8 | A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data | |
9 | according to the given row filter and column projection. A \code{ScannerBuilder} | |
10 | can help create one. | |
11 | } | |
12 | \section{Factory}{ | |
13 | ||
14 | \code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}. | |
15 | It takes the following arguments: | |
16 | \itemize{ | |
17 | \item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the | |
18 | \code{dplyr} methods on \code{Dataset}. | |
19 | \item \code{projection}: A character vector of column names to select columns or a | |
20 | named list of expressions | |
21 | \item \code{filter}: An \code{Expression} to filter the scanned rows by, or \code{TRUE} (default) | |
22 | to keep all rows. | |
23 | \item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE} | |
24 | \item \code{use_async}: logical: should the async scanner (performs better on | |
25 | high-latency/highly parallel filesystems like S3) be used? Default \code{FALSE} | |
26 | \item \code{...}: Additional arguments, currently ignored | |
27 | } | |
28 | } | |
29 | ||
30 | \section{Methods}{ | |
31 | ||
32 | \code{ScannerBuilder} has the following methods: | |
33 | \itemize{ | |
34 | \item \verb{$Project(cols)}: Indicate that the scan should only return columns given | |
35 | by \code{cols}, a character vector of column names | |
36 | \item \verb{$Filter(expr)}: Filter rows by an \link{Expression}. | |
37 | \item \verb{$UseThreads(threads)}: logical: should the scan use multithreading? | |
38 | The method's default input is \code{TRUE}, but you must call the method to enable | |
39 | multithreading because the scanner default is \code{FALSE}. | |
40 | \item \verb{$UseAsync(use_async)}: logical: should the async scanner be used? | |
41 | \item \verb{$BatchSize(batch_size)}: integer: Maximum row count of scanned record | |
42 | batches, default is 32K. If scanned record batches are overflowing memory | |
43 | then this method can be called to reduce their size. | |
44 | \item \verb{$schema}: Active binding, returns the \link{Schema} of the Dataset | |
45 | \item \verb{$Finish()}: Returns a \code{Scanner} | |
46 | } | |
47 | ||
48 | \code{Scanner} currently has a single method, \verb{$ToTable()}, which evaluates the | |
49 | query and returns an Arrow \link{Table}. | |
50 | } | |
51 |
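
The two routes documented above, the step-by-step `ScannerBuilder` and the `Scanner$create()` factory, can be sketched as follows. This is a minimal illustration rather than part of the generated documentation. It assumes an installed `arrow` build with Parquet support, and it assumes that a builder is obtained from `Dataset$NewScan()`, an entry point not described in this file. The `mtcars` data, the temporary directory, and the variable names are illustrative only.

```r
library(arrow)

# Write a small example dataset to a temporary directory (illustrative data).
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf)
ds <- open_dataset(tf)

# Builder route: configure the scan step by step, then call $Finish().
# Assumption: Dataset$NewScan() returns a ScannerBuilder.
builder <- ds$NewScan()
builder$Project(c("mpg", "cyl"))                  # column projection
builder$Filter(Expression$field_ref("cyl") == 4)  # row filter
builder$UseThreads(TRUE)                          # opt in to multithreading
builder$BatchSize(1000L)                          # cap rows per record batch
scanner <- builder$Finish()

# $ToTable() evaluates the query and returns an Arrow Table.
df_builder <- as.data.frame(scanner$ToTable())

# Factory route: the same projection and filter in a single call.
scanner2 <- Scanner$create(
  ds,
  projection = c("mpg", "cyl"),
  filter = Expression$field_ref("cyl") == 4,
  use_threads = TRUE
)
df_factory <- as.data.frame(scanner2$ToTable())

# Clean up the temporary dataset.
unlink(tf, recursive = TRUE)
```

Both routes should yield the same columns and the same set of rows. Row order is not guaranteed when multithreading is enabled, so compare contents rather than positions.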