% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset.R, R/dataset-factory.R
\name{Dataset}
\alias{Dataset}
\alias{FileSystemDataset}
\alias{UnionDataset}
\alias{InMemoryDataset}
\alias{DatasetFactory}
\alias{FileSystemDatasetFactory}
\title{Multi-file datasets}
\description{
Arrow Datasets allow you to query against data that has been split across
multiple files. This sharding of data may indicate partitioning, which
can accelerate queries that only touch some partitions (files).

A \code{Dataset} contains one or more \code{Fragments}, such as files, of potentially
differing type and partitioning.

For \code{Dataset$create()}, see \code{\link[=open_dataset]{open_dataset()}}, which is an alias for it.

\code{DatasetFactory} is used to provide finer control over the creation of \code{Dataset}s.
}
\section{Factory}{

\code{DatasetFactory} is used to create a \code{Dataset}, inspect the \link{Schema} of the
fragments contained in it, and declare a partitioning.
\code{FileSystemDatasetFactory} is a subclass of \code{DatasetFactory} for
discovering files in the local file system, the only currently supported
file system.

For the \code{DatasetFactory$create()} factory method, see \code{\link[=dataset_factory]{dataset_factory()}}, an
alias for it. A \code{DatasetFactory} has:
\itemize{
\item \verb{$Inspect(unify_schemas)}: If \code{unify_schemas} is \code{TRUE}, all fragments
will be scanned and a unified \link{Schema} will be created from them; if \code{FALSE}
(default), only the first fragment will be inspected for its schema. Use this
fast path when you know and trust that all fragments have an identical schema.
\item \verb{$Finish(schema, unify_schemas)}: Returns a \code{Dataset}. If \code{schema} is provided,
it will be used for the \code{Dataset}; if omitted, a \code{Schema} will be created from
inspecting the fragments (files) in the dataset, following \code{unify_schemas}
as described above.
}
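
As a rough illustration of the two methods above, the following sketch writes a small
partitioned dataset to a temporary directory and then builds a \code{Dataset} through a
factory. The use of \code{mtcars}, the temporary directory, and the explicit
\code{hive_partition()} argument are assumptions made for the example, not part of this
documentation.

```r
library(arrow)

# Assumption for illustration: a throwaway Parquet dataset partitioned by `cyl`.
tmp <- tempfile()
dir.create(tmp)
write_dataset(mtcars, tmp, format = "parquet", partitioning = "cyl")

# dataset_factory() is the documented alias for DatasetFactory$create().
factory <- dataset_factory(tmp, format = "parquet", partitioning = hive_partition())

# Fast path: infer the schema from the first fragment only.
schema_first <- factory$Inspect(unify_schemas = FALSE)

# Slower path: scan every fragment and unify their schemas.
schema_all <- factory$Inspect(unify_schemas = TRUE)

# Build the Dataset with the schema chosen explicitly.
ds <- factory$Finish(schema = schema_all)
```

Passing an explicit schema to \verb{$Finish()} skips a second inspection; omitting it
should let the factory inspect the fragments itself, following \code{unify_schemas}.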

\code{FileSystemDatasetFactory$create()} is a lower-level factory method and
takes the following arguments:
\itemize{
\item \code{filesystem}: A \link{FileSystem}
\item \code{selector}: Either a \link{FileSelector} or \code{NULL}
\item \code{paths}: Either a character vector of file paths or \code{NULL}
\item \code{format}: A \link{FileFormat}
\item \code{partitioning}: Either \code{Partitioning}, \code{PartitioningFactory}, or \code{NULL}
}
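
A minimal sketch of calling the lower-level factory with the arguments listed above
might look like the following. It assumes that \code{LocalFileSystem$create()},
\code{FileSelector$create()}, and \code{FileFormat$create()} are available as
constructors for the linked classes, and that a selector is supplied in place of
\code{paths}; the directory and data are hypothetical.

```r
library(arrow)

# Assumption for illustration: a throwaway directory of Parquet files.
tmp <- tempfile()
dir.create(tmp)
write_dataset(mtcars, tmp, format = "parquet")

# Discover files recursively under `tmp` on the local file system.
fs <- LocalFileSystem$create()
selector <- FileSelector$create(tmp, recursive = TRUE)

factory <- FileSystemDatasetFactory$create(
  filesystem = fs,
  selector = selector,
  paths = NULL,
  format = FileFormat$create("parquet"),
  partitioning = NULL
)

# Finish() turns the factory into a Dataset.
ds_low <- factory$Finish()
```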
}

\section{Methods}{


A \code{Dataset} has the following methods:
\itemize{
\item \verb{$NewScan()}: Returns a \link{ScannerBuilder} for building a query
\item \verb{$schema}: Active binding that returns the \link{Schema} of the Dataset; you
may also replace the dataset's schema by using \code{ds$schema <- new_schema}.
This method currently supports only adding, removing, or reordering
fields in the schema: you cannot alter or cast the field types.
}
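
The sketch below shows one way these two members might be used together: reading the
schema, then building and running a scan that projects two columns. The dataset
contents and column names are assumptions for the example, and the
\code{$Project()}, \code{$Finish()}, and \code{$ToTable()} calls belong to
\link{ScannerBuilder} and its scanner rather than to \code{Dataset} itself.

```r
library(arrow)

# Assumption for illustration: a throwaway Parquet dataset built from mtcars.
tmp <- tempfile()
dir.create(tmp)
write_dataset(mtcars, tmp, format = "parquet")
ds <- open_dataset(tmp)

# $schema is an active binding, so it is read without parentheses.
sch <- ds$schema

# $NewScan() returns a ScannerBuilder; select columns, then materialize a Table.
builder <- ds$NewScan()
builder$Project(c("mpg", "cyl"))
tab <- builder$Finish()$ToTable()
```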

\code{FileSystemDataset} has the following methods:
\itemize{
\item \verb{$files}: Active binding, returns the files of the \code{FileSystemDataset}
\item \verb{$format}: Active binding, returns the \link{FileFormat} of the \code{FileSystemDataset}
}

\code{UnionDataset} has the following methods:
\itemize{
\item \verb{$children}: Active binding, returns all child \code{Dataset}s.
}
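
As a hedged example of the bindings above, the following sketch opens two small
datasets and combines them. It assumes that passing a list of \code{Dataset} objects
to \code{open_dataset()} produces a \code{UnionDataset}; the directories and the split
of \code{mtcars} are hypothetical.

```r
library(arrow)

# Assumption for illustration: two throwaway datasets with the same schema.
dir_a <- tempfile()
dir_b <- tempfile()
dir.create(dir_a)
dir.create(dir_b)
write_dataset(mtcars[1:16, ], dir_a, format = "parquet")
write_dataset(mtcars[17:32, ], dir_b, format = "parquet")

ds_a <- open_dataset(dir_a)
ds_b <- open_dataset(dir_b)

# Active bindings on a FileSystemDataset.
files_a <- ds_a$files
format_a <- ds_a$format

# Combine both datasets and list the children of the result.
both <- open_dataset(list(ds_a, ds_b))
n_children <- length(both$children)
```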
}

\seealso{
\code{\link[=open_dataset]{open_dataset()}} for a simple interface to creating a \code{Dataset}
}