% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset-scan.R
\name{Scanner}
\alias{Scanner}
\alias{ScannerBuilder}
\title{Scan the contents of a dataset}
\description{
A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data
according to given row filtering and column projection. A \code{ScannerBuilder}
can help create one.
}
\section{Factory}{

\code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}.
It takes the following arguments:
\itemize{
\item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
\code{dplyr} methods on \code{Dataset}.
\item \code{projection}: A character vector of column names to select, or a
named list of expressions.
\item \code{filter}: An \code{Expression} to filter the scanned rows by, or \code{TRUE} (default)
to keep all rows.
\item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}.
\item \code{use_async}: logical: should the async scanner (which performs better on
high-latency or highly parallel filesystems such as S3) be used? Default \code{FALSE}.
\item \code{...}: Additional arguments, currently ignored.
}
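
As a minimal sketch of the arguments above, a scan that keeps two columns and
filters on one of them might look like the following. The temporary directory,
the use of \code{mtcars}, and the helpers \code{write_dataset()},
\code{open_dataset()} and \code{Expression$field_ref()} are illustrative
assumptions drawn from the wider arrow package, not from this page.
\preformatted{library(arrow)

# Illustrative data: write mtcars to a temporary directory as a dataset.
tmp <- tempfile()
write_dataset(mtcars, tmp)
ds <- open_dataset(tmp)

# Keep two columns and only the rows where hp is greater than 100.
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "hp"),
  filter = Expression$field_ref("hp") > 100
)

# Evaluate the query and collect the result as an Arrow Table.
tab <- scanner$ToTable()
}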
}

\section{Methods}{

\code{ScannerBuilder} has the following methods:
\itemize{
\item \verb{$Project(cols)}: Indicate that the scan should only return the columns given
by \code{cols}, a character vector of column names.
\item \verb{$Filter(expr)}: Filter rows by an \link{Expression}.
\item \verb{$UseThreads(threads)}: logical: should the scan use multithreading?
The method's default input is \code{TRUE}, but you must call the method to enable
multithreading because the scanner default is \code{FALSE}.
\item \verb{$UseAsync(use_async)}: logical: should the async scanner be used?
\item \verb{$BatchSize(batch_size)}: integer: maximum row count of scanned record
batches; the default is 32K. If scanned record batches are overflowing memory,
call this method to reduce their size.
\item \verb{$schema}: Active binding; returns the \link{Schema} of the Dataset.
\item \verb{$Finish()}: Returns a \code{Scanner}.
}

\code{Scanner} currently has a single method, \verb{$ToTable()}, which evaluates the
query and returns an Arrow \link{Table}.
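
The builder methods listed above can be combined as in the sketch below. It
assumes a \code{ScannerBuilder} is obtained from \code{Dataset$NewScan()}, which
is documented with \link{Dataset} rather than on this page; the data, the
temporary path and the batch size of 1000 are likewise illustrative.
\preformatted{library(arrow)

# Illustrative data: write mtcars to a temporary directory as a dataset.
tmp <- tempfile()
write_dataset(mtcars, tmp)
ds <- open_dataset(tmp)

# Configure the scan step by step on a ScannerBuilder.
builder <- ds$NewScan()
builder$Project(c("mpg", "hp"))
builder$Filter(Expression$field_ref("hp") > 100)
builder$UseThreads(TRUE)   # must be called to enable multithreading
builder$BatchSize(1000L)   # cap the rows per scanned record batch

# Finish() returns the Scanner; ToTable() runs the query.
scanner <- builder$Finish()
tab <- scanner$ToTable()
}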
}
