% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset-scan.R
\name{Scanner}
\alias{Scanner}
\alias{ScannerBuilder}
\title{Scan the contents of a dataset}
\description{
A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data
according to given row filtering and column projection. A \code{ScannerBuilder}
can help create one.
}
\section{Factory}{

\code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}.
It takes the following arguments:
\itemize{
\item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
\code{dplyr} methods on \code{Dataset}.
\item \code{projection}: A character vector of column names to select, or a
named list of expressions.
\item \code{filter}: An \code{Expression} to filter the scanned rows by, or \code{TRUE} (default)
to keep all rows.
\item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}.
\item \code{use_async}: logical: should the async scanner (which performs better on
high-latency or highly parallel filesystems such as S3) be used? Default \code{FALSE}.
\item \code{...}: Additional arguments, currently ignored.
}
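
As a hedged illustration of the arguments above, the following sketch builds a
\code{Scanner} over a small dataset. The temporary Parquet directory, the use of
\code{mtcars}, and the \code{Expression$create("greater", ...)} call are
assumptions made for this example and are not described on this page.

\preformatted{
library(arrow)

# Assumption: write a small example dataset to a temporary directory
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# Row filter: keep rows where cyl > 4
keep <- Expression$create(
  "greater",
  Expression$field_ref("cyl"),
  Expression$scalar(4)
)

# Select two columns and apply the row filter
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "cyl"),
  filter = keep,
  use_threads = TRUE
)
}

Leaving \code{filter} at its default of \code{TRUE} would scan every row.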
}

\section{Methods}{

\code{ScannerBuilder} has the following methods:
\itemize{
\item \verb{$Project(cols)}: Indicate that the scan should only return columns given
by \code{cols}, a character vector of column names.
\item \verb{$Filter(expr)}: Filter rows by an \link{Expression}.
\item \verb{$UseThreads(threads)}: logical: should the scan use multithreading?
The method's default input is \code{TRUE}, but you must call the method to enable
multithreading because the scanner default is \code{FALSE}.
\item \verb{$UseAsync(use_async)}: logical: should the async scanner be used?
\item \verb{$BatchSize(batch_size)}: integer: Maximum row count of scanned record
batches; the default is 32K. If scanned record batches are overflowing memory,
call this method to reduce their size.
\item \verb{$schema}: Active binding, returns the \link{Schema} of the Dataset.
\item \verb{$Finish()}: Returns a \code{Scanner}.
}
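
The builder methods listed above could be combined roughly as follows. This is a
minimal sketch, not a definitive recipe: it assumes a builder is obtained with
\code{Dataset$NewScan()} (not documented on this page), and the example dataset
and batch size are arbitrary choices.

\preformatted{
library(arrow)

# Assumption: a small example dataset in a temporary directory
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# Assumption: Dataset$NewScan() returns a ScannerBuilder
builder <- ds$NewScan()

# Each call configures the builder in place
builder$Project(c("mpg", "cyl"))
builder$Filter(
  Expression$create(
    "greater",
    Expression$field_ref("cyl"),
    Expression$scalar(4)
  )
)
builder$UseThreads(TRUE)   # must be called to enable multithreading
builder$BatchSize(1000L)   # cap rows per record batch to limit memory use

scanner <- builder$Finish()
}

The methods are called one at a time here rather than chained, so the sketch
does not depend on what each method returns.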

\code{Scanner} currently has a single method, \verb{$ToTable()}, which evaluates the
query and returns an Arrow \link{Table}.
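
For instance, a \code{Scanner} created from an \code{arrow_dplyr_query} might be
evaluated as sketched below. The example assumes the \code{dplyr} package is
installed; the dataset and column names are illustrative only.

\preformatted{
library(arrow)

# Assumption: a small example dataset in a temporary directory
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# The dplyr methods on a Dataset return an arrow_dplyr_query
query <- dplyr::select(dplyr::filter(ds, cyl > 4), mpg, cyl)

# Evaluate the scan and collect the result as an Arrow Table
tab <- Scanner$create(query)$ToTable()

# Convert to a base R data frame if needed
df <- as.data.frame(tab)
}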
}