% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset-scan.R
\name{Scanner}
\alias{Scanner}
\alias{ScannerBuilder}
\title{Scan the contents of a dataset}
\description{
A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data
according to given row filtering and column projection. A \code{ScannerBuilder}
can help create one.
}
\section{Factory}{

\code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}.
It takes the following arguments:
\itemize{
\item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
\code{dplyr} methods on \code{Dataset}.
\item \code{projection}: Either a character vector of column names to select, or a
named list of expressions.
\item \code{filter}: An \code{Expression} by which to filter the scanned rows, or
\code{TRUE} (default) to keep all rows.
\item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}.
\item \code{use_async}: logical: should the async scanner be used? It performs better
on high-latency or highly parallel filesystems such as S3. Default \code{FALSE}.
\item \code{...}: Additional arguments, currently ignored.
}
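
As a minimal sketch of the arguments above, the following assumes the
\pkg{arrow} package was built with dataset and Parquet support. The
\code{mtcars} data written to a temporary directory is hypothetical example
data, not part of this interface.

\preformatted{
library(arrow)

# Hypothetical example data: write mtcars to a temporary Parquet dataset
tmp <- tempfile()
write_dataset(mtcars, tmp)
ds <- open_dataset(tmp)

# Select two columns and keep only the rows where cyl > 4
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "cyl"),
  filter = Expression$field_ref("cyl") > 4
)
tab <- scanner$ToTable()
}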
}

\section{Methods}{

\code{ScannerBuilder} has the following methods:
\itemize{
\item \verb{$Project(cols)}: Indicate that the scan should only return the columns
given by \code{cols}, a character vector of column names.
\item \verb{$Filter(expr)}: Filter rows by an \link{Expression}.
\item \verb{$UseThreads(threads)}: logical: should the scan use multithreading?
The method's default input is \code{TRUE}, but you must call the method to enable
multithreading because the scanner default is \code{FALSE}.
\item \verb{$UseAsync(use_async)}: logical: should the async scanner be used?
\item \verb{$BatchSize(batch_size)}: integer: Maximum row count of scanned record
batches. The default is 32K. If scanned record batches are overflowing memory,
call this method to reduce their size.
\item \verb{$schema}: Active binding, returns the \link{Schema} of the Dataset.
\item \verb{$Finish()}: Returns a \code{Scanner}.
}

\code{Scanner} currently has a single method, \verb{$ToTable()}, which evaluates the
query and returns an Arrow \link{Table}.
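
The builder methods listed above can be combined as in the following sketch.
It assumes the \code{ScannerBuilder} is obtained from the \verb{$NewScan()}
method of a \link{Dataset}, and it reuses hypothetical example data
(\code{mtcars} written to a temporary directory). The batch size of 1000 is
an arbitrary illustrative value.

\preformatted{
library(arrow)

# Hypothetical example data, written as a temporary Parquet dataset
tmp <- tempfile()
write_dataset(mtcars, tmp)
ds <- open_dataset(tmp)

# Assumed entry point: a Dataset's $NewScan() returns a ScannerBuilder
builder <- ds$NewScan()
builder$Project(c("mpg", "cyl"))                  # column projection
builder$Filter(Expression$field_ref("cyl") > 4)   # row filter
builder$UseThreads(TRUE)                          # opt in to multithreading
builder$BatchSize(1000L)                          # cap rows per record batch

# Finish the builder, then evaluate the query into an Arrow Table
scanner <- builder$Finish()
tab <- scanner$ToTable()
}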
}