% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset.R, R/dataset-factory.R
\name{Dataset}
\alias{Dataset}
\alias{FileSystemDataset}
\alias{UnionDataset}
\alias{InMemoryDataset}
\alias{DatasetFactory}
\alias{FileSystemDatasetFactory}
\title{Multi-file datasets}
\description{
Arrow Datasets allow you to query against data that has been split across
multiple files. This sharding of data may indicate partitioning, which
can accelerate queries that only touch some partitions (files).

A \code{Dataset} contains one or more \code{Fragments}, such as files, of potentially
differing type and partitioning.

For \code{Dataset$create()}, see \code{\link[=open_dataset]{open_dataset()}}, which is an alias for it.

\code{DatasetFactory} is used to provide finer control over the creation of \code{Dataset}s.
}
\section{Factory}{

\code{DatasetFactory} is used to create a \code{Dataset}, inspect the \link{Schema} of the
fragments contained in it, and declare a partitioning.
\code{FileSystemDatasetFactory} is a subclass of \code{DatasetFactory} for
discovering files in the local file system, the only currently supported
file system.

For the \code{DatasetFactory$create()} factory method, see \code{\link[=dataset_factory]{dataset_factory()}}, an
alias for it. A \code{DatasetFactory} has:
\itemize{
\item \verb{$Inspect(unify_schemas)}: If \code{unify_schemas} is \code{TRUE}, all fragments
will be scanned and a unified \link{Schema} will be created from them; if \code{FALSE}
(default), only the first fragment will be inspected for its schema. Rely on the
\code{FALSE} fast path only when you know and trust that all fragments have an
identical schema.
\item \verb{$Finish(schema, unify_schemas)}: Returns a \code{Dataset}. If \code{schema} is provided,
it will be used for the \code{Dataset}; if omitted, a \code{Schema} will be created from
inspecting the fragments (files) in the dataset, following \code{unify_schemas}
as described above.
}
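
The two methods above can be combined as in the following minimal sketch. It
assumes the \code{arrow} package was built with Parquet support; the temporary
directory, file names, and column names are hypothetical and chosen only for
illustration.
\preformatted{
library(arrow)

# Hypothetical sample data: two Parquet files whose columns differ
tmp <- tempfile()
dir.create(tmp)
write_parquet(data.frame(x = 1:3), file.path(tmp, "part-1.parquet"))
write_parquet(
  data.frame(x = 4:6, y = c("a", "b", "c")),
  file.path(tmp, "part-2.parquet")
)

factory <- dataset_factory(tmp, format = "parquet")

# Scan every fragment and merge the schemas that are found
unified <- factory$Inspect(unify_schemas = TRUE)

# Build the Dataset with the explicitly supplied schema
ds <- factory$Finish(schema = unified)
ds$schema
}
Passing the inspected schema to \verb{$Finish()} makes the choice of schema
explicit; calling \code{factory$Finish(unify_schemas = TRUE)} without a
\code{schema} should give an equivalent result.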

\code{FileSystemDatasetFactory$create()} is a lower-level factory method and
takes the following arguments:
\itemize{
\item \code{filesystem}: A \link{FileSystem}
\item \code{selector}: Either a \link{FileSelector} or \code{NULL}
\item \code{paths}: Either a character vector of file paths or \code{NULL}
\item \code{format}: A \link{FileFormat}
\item \code{partitioning}: Either \code{Partitioning}, \code{PartitioningFactory}, or \code{NULL}
}
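
As a rough sketch of how these arguments fit together, the following builds a
factory for a local directory by hand. The directory and file name are
hypothetical, Parquet support is assumed, and the path normalization is a
precaution for Windows rather than a documented requirement.
\preformatted{
library(arrow)

# Hypothetical sample data: a single Parquet file in a temporary directory
tmp <- tempfile()
dir.create(tmp)
write_parquet(data.frame(x = 1:3), file.path(tmp, "part-1.parquet"))

factory <- FileSystemDatasetFactory$create(
  filesystem = LocalFileSystem$create(),
  # Discover files under the directory; `paths` is left as NULL
  selector = FileSelector$create(
    normalizePath(tmp, winslash = "/"),
    recursive = TRUE
  ),
  format = FileFormat$create("parquet")
)

ds <- factory$Finish()
}
In most cases \code{dataset_factory()} is the more convenient entry point,
since it constructs the file system, selector, and format objects itself.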
}

\section{Methods}{


A \code{Dataset} has the following methods:
\itemize{
\item \verb{$NewScan()}: Returns a \link{ScannerBuilder} for building a query
\item \verb{$schema}: Active binding that returns the \link{Schema} of the \code{Dataset}; you
may also replace the dataset's schema by using \code{ds$schema <- new_schema}.
Replacement currently supports only adding, removing, or reordering
fields in the schema: you cannot alter or cast the field types.
}

\code{FileSystemDataset} has the following methods:
\itemize{
\item \verb{$files}: Active binding, returns the files of the \code{FileSystemDataset}
\item \verb{$format}: Active binding, returns the \link{FileFormat} of the \code{FileSystemDataset}
}

\code{UnionDataset} has the following methods:
\itemize{
\item \verb{$children}: Active binding, returns all child \code{Dataset}s.
}
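
The bindings and methods listed above can be exercised as in this minimal
sketch. The sample data and file name are hypothetical, and Parquet support
is assumed.
\preformatted{
library(arrow)

# Hypothetical sample data: a single Parquet file in a temporary directory
tmp <- tempfile()
dir.create(tmp)
write_parquet(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  file.path(tmp, "part-1.parquet")
)

# open_dataset() on a local directory yields a FileSystemDataset
ds <- open_dataset(tmp)

ds$schema   # Schema of the dataset
ds$files    # paths of the files backing the dataset
ds$format   # FileFormat of the dataset

# Build a Scanner from the ScannerBuilder and read everything into a Table
scanner <- ds$NewScan()$Finish()
tab <- scanner$ToTable()
}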
}

\seealso{
\code{\link[=open_dataset]{open_dataset()}} for a simple interface for creating a \code{Dataset}
}