.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.csv
.. _csv:

Reading and Writing CSV files
=============================

Arrow supports reading and writing columnar data from/to CSV files.
The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.csv.gz``)
* fetching column names from the first row in the CSV file
* column-wise type inference and conversion to one of ``null``, ``int64``,
  ``float64``, ``date32``, ``time32[s]``, ``timestamp[s]``, ``timestamp[ns]``,
  ``string`` or ``binary`` data
* opportunistic dictionary encoding of ``string`` and ``binary`` columns
  (disabled by default)
* detecting various spellings of null values such as ``NaN`` or ``#N/A``
* writing CSV files with options to configure the exact output format

Usage
-----

CSV reading and writing functionality is available through the
:mod:`pyarrow.csv` module. In many cases, you will simply call the
:func:`read_csv` function with the file path you want to read from::

   >>> from pyarrow import csv
   >>> fn = 'tips.csv.gz'
   >>> table = csv.read_csv(fn)
   >>> table
   pyarrow.Table
   total_bill: double
   tip: double
   sex: string
   smoker: string
   day: string
   time: string
   size: int64
   >>> len(table)
   244
   >>> df = table.to_pandas()
   >>> df.head()
      total_bill   tip     sex smoker  day    time  size
   0       16.99  1.01  Female     No  Sun  Dinner     2
   1       10.34  1.66    Male     No  Sun  Dinner     3
   2       21.01  3.50    Male     No  Sun  Dinner     3
   3       23.68  3.31    Male     No  Sun  Dinner     2
   4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call :func:`write_csv` with a
:class:`pyarrow.RecordBatch` or :class:`pyarrow.Table` and a path or
file-like object::

   >>> import pyarrow as pa
   >>> import pyarrow.csv as csv
   >>> csv.write_csv(table, "tips.csv")
   >>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
   ...     csv.write_csv(table, out)

.. note:: The writer does not yet support all Arrow types.

Customized parsing
------------------

To alter the default parsing settings when reading CSV files with an
unusual structure, create a :class:`ParseOptions` instance and pass it
to :func:`read_csv`.

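For example, a file that uses semicolons instead of commas as field
delimiters (the ``data.csv`` name here is hypothetical) could be read
with something like::

   import pyarrow.csv as csv

   # Treat ';' as the field separator instead of the default ','
   parse_options = csv.ParseOptions(delimiter=';')
   table = csv.read_csv('data.csv', parse_options=parse_options)
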
Customized conversion
---------------------

To alter how CSV data is converted to Arrow types and data, you should create
a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::

   import pyarrow as pa
   import pyarrow.csv as csv

   table = csv.read_csv('tips.csv.gz', convert_options=pa.csv.ConvertOptions(
       column_types={
           'total_bill': pa.decimal128(precision=10, scale=2),
           'tip': pa.decimal128(precision=10, scale=2),
       }
   ))

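Conversion options also control how null values are detected. As a sketch,
additional spellings of null values (the ``'N/A'`` spelling here is
illustrative) can be recognized, including in ``string`` columns::

   import pyarrow.csv as csv

   # Recognize '' and 'N/A' as null, even in string columns
   # (note: null_values replaces the default set of null spellings)
   convert_options = csv.ConvertOptions(
       null_values=['', 'N/A'],
       strings_can_be_null=True,
   )
   table = csv.read_csv('tips.csv.gz', convert_options=convert_options)
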

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a CSV file
one batch at a time, using :func:`open_csv`; see the sketch after the
caveats below.

There are a few caveats:

1. For now, the incremental reader is always single-threaded (regardless of
   :attr:`ReadOptions.use_threads`)

2. Type inference is done on the first block and types are frozen afterwards;
   to make sure the right data types are inferred, either set
   :attr:`ReadOptions.block_size` to a large enough value, or use
   :attr:`ConvertOptions.column_types` to set the desired data types explicitly.

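A minimal sketch of batch-at-a-time reading, reusing the ``tips.csv.gz``
file from above::

   import pyarrow as pa
   import pyarrow.csv as csv

   batches = []
   reader = csv.open_csv('tips.csv.gz')
   for batch in reader:
       # Each iteration yields one pyarrow.RecordBatch
       batches.append(batch)
   table = pa.Table.from_batches(batches)
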
Character encoding
------------------

By default, CSV files are expected to be encoded in UTF-8. Non-UTF-8 data
is accepted for ``binary`` columns. The encoding can be changed using
the :class:`ReadOptions` class.

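For instance, a Latin-1 encoded file (the ``data_latin1.csv`` name here is
hypothetical) might be read like this::

   import pyarrow.csv as csv

   # Decode the input as Latin-1 instead of the default UTF-8
   read_options = csv.ReadOptions(encoding='latin-1')
   table = csv.read_csv('data_latin1.csv', read_options=read_options)
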
Customized writing
------------------

To alter the default write settings when writing CSV files with
different conventions, you can create a :class:`WriteOptions` instance and
pass it to :func:`write_csv`::

   >>> import pyarrow as pa
   >>> import pyarrow.csv as csv
   >>> # Omit the header row (include_header=True is the default)
   >>> options = csv.WriteOptions(include_header=False)
   >>> csv.write_csv(table, "data.csv", options)

Incremental writing
-------------------

To write CSV files one batch at a time, create a :class:`CSVWriter`. This
requires the output (a path or file-like object), the schema of the data to
be written, and optionally write options as described above::

   >>> import pyarrow as pa
   >>> import pyarrow.csv as csv
   >>> with csv.CSVWriter("data.csv", table.schema) as writer:
   ...     writer.write_table(table)

Performance
-----------

Due to the structure of CSV files, one cannot expect the same levels of
performance as when reading dedicated binary formats like
:ref:`Parquet <Parquet>`. Nevertheless, Arrow strives to reduce the
overhead of reading CSV files. A reasonable expectation is at least
100 MB/s per core on a performant desktop or laptop computer (measured
in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the :class:`ReadOptions` class.
Multi-threaded reading is the default for highest performance, distributing
the workload efficiently over all available cores.

.. note::
   The number of concurrent threads is automatically inferred by Arrow.
   You can inspect and change it using the :func:`~pyarrow.cpu_count()`
   and :func:`~pyarrow.set_cpu_count()` functions, respectively.
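
As a brief sketch, the thread count can be inspected and single-threaded
reading forced through :attr:`ReadOptions.use_threads`::

   import pyarrow as pa
   import pyarrow.csv as csv

   print(pa.cpu_count())  # number of threads Arrow uses by default

   # Force single-threaded reading, e.g. for reproducible benchmarking
   read_options = csv.ReadOptions(use_threads=False)
   table = csv.read_csv('tips.csv.gz', read_options=read_options)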