# Fast C++ CSV Parser

This is a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.

## Features

  * Automatically rearranges columns by parsing the header line.
  * Disk I/O and CSV parsing are overlapped using threads for efficiency.
  * Parsing features such as escaped strings can be enabled and disabled at compile time using templates. You only pay in speed for the features you actually use.
  * Can read multi-GB files in reasonable time.
  * Supports custom column separators (e.g. tab separated value files), quote-escaped strings, and automatic space trimming.
  * Works with `*`nix and Windows newlines and automatically ignores UTF-8 BOMs.
  * Exception classes carry enough context to format useful error messages. `what()` returns error messages ready to be shown to a user.

## Getting Started

The following small example should contain most of the syntax you need to use the library.

```cpp
#include <string>
#include "csv.h"

int main(){
  // Read 3 columns from the file ram.csv.
  io::CSVReader<3> in("ram.csv");
  // Match columns by name; extra columns in the file are ignored.
  in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
  std::string vendor; int size; double speed;
  while(in.read_row(vendor, size, speed)){
    // do stuff with the data
  }
}
```

## Installation

The library only needs a standard-conformant C++11 compiler. It has no further dependencies. The library is completely contained inside a single header file, so it is sufficient to copy this file to some place on your include path. The library does not have to be explicitly built.

Note, however, that threads are used and some compilers (for example GCC) require you to link against additional libraries to make it work. With GCC it is important to add `-lpthread` as the last item when linking, i.e. the order in

```
g++ -std=c++0x a.o b.o -o prog -lpthread
```

is important. If for some reason you do not want to use threads, you can define `CSV_IO_NO_THREAD` before including the header.

Remember that the library makes use of C++11 features and therefore you have to enable support for them (e.g. add `-std=c++0x` or `-std=gnu++0x`).

The library was developed and tested with GCC 4.6.1.

Note that VS2013 is not C++11 compliant and will therefore not work out of the box. See [here](https://code.google.com/p/fast-cpp-csv-parser/issues/detail?id=6) for what needs to be adjusted to make the code work.

## Documentation

The library provides two classes:

  * `LineReader`: A class to efficiently read large files line by line.
  * `CSVReader`: A class that efficiently reads large CSV files.

Note that everything is contained in the `io` namespace.

### `LineReader`

```cpp
class LineReader{
public:
  // Constructors
  LineReader(some_string_type file_name);
  LineReader(some_string_type file_name, std::FILE*source);
  LineReader(some_string_type file_name, std::istream&source);
  LineReader(some_string_type file_name, std::unique_ptr<ByteSourceBase>source);

  // Reading
  char*next_line();

  // File Location
  void set_file_line(unsigned);
  unsigned get_file_line()const;
  void set_file_name(some_string_type file_name);
  const char*get_truncated_file_name()const;
};
```

The constructor takes a file name and optionally a data source. If no data source is provided, the constructor tries to open the file with the given name and throws an `error::can_not_open_file` exception on failure. If a data source is provided, then the file name is only used to format error messages. In that case you can essentially put any string there. Using a string that describes the data source results in more informative error messages.

`some_string_type` can be a `std::string` or a `char*`. If the data source is a `std::FILE*`, then the library will take care of calling `std::fclose`. If it is a `std::istream`, then the stream is not closed by the library. For best performance open the streams in binary mode. However, using text mode also works. `ByteSourceBase` provides an interface that you can use to implement further data sources.

```cpp
class ByteSourceBase{
public:
  virtual int read(char*buffer, int size)=0;
  virtual ~ByteSourceBase(){}
};
```

The `read` function should fill the provided buffer with at most `size` bytes from the data source. It should return the number of bytes actually written to the buffer. If the data source has run out of bytes (for example because the end of the file was reached), then the function should return 0. If a fatal error occurs, then you can throw an exception. Note that the function can be called both from the main and the worker thread. However, it is guaranteed that they do not call the function at the same time.

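As a minimal sketch of the `read` contract described above, the following data source serves bytes from a string held in memory. `MemoryByteSource` is a hypothetical name chosen for this illustration, and `ByteSourceBase` is re-declared locally only so that the snippet compiles without `csv.h`; in real code you would presumably derive from `io::ByteSourceBase` and hand the object to the `std::unique_ptr<ByteSourceBase>` constructor listed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Local copy of the interface shown above, so that this sketch compiles
// without csv.h. In real code, derive from io::ByteSourceBase instead.
class ByteSourceBase{
public:
  virtual int read(char*buffer, int size)=0;
  virtual ~ByteSourceBase(){}
};

// Hypothetical data source that serves bytes from a string held in memory.
class MemoryByteSource : public ByteSourceBase{
public:
  explicit MemoryByteSource(std::string data):data_(std::move(data)), pos_(0){}

  // Copy at most `size` bytes into `buffer` and report how many were copied.
  int read(char*buffer, int size) override{
    if(size <= 0) return 0;
    const std::size_t remaining = data_.size() - pos_;
    const std::size_t n = std::min(remaining, static_cast<std::size_t>(size));
    std::memcpy(buffer, data_.data()+pos_, n);
    pos_ += n;
    return static_cast<int>(n); // 0 signals that the source is exhausted
  }

private:
  std::string data_;
  std::size_t pos_;
};
```
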
Lines are read by calling the `next_line` function. It returns a pointer to a null-terminated C-string that contains the line. If the end of the file is reached, a null pointer is returned. The newline character is not included in the string. You may modify the string as long as you do not write past the null terminator. The string stays valid until the destructor is called or until `next_line` is called again. Windows and `*`nix newlines are handled transparently. UTF-8 BOMs are automatically ignored and missing newlines at the end of the file are no problem.

**Important:** There is a limit of 2^24-1 (16,777,215) characters per line. If this limit is exceeded, an `error::line_length_limit_exceeded` exception is thrown.

Looping over all the lines in a file can be done in the following way.

```cpp
LineReader in(...);
while(char*line = in.next_line()){
  ...
}
```

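Because the string returned by `next_line` may be modified in place, a line can be split without copying it. The helper below is a hypothetical sketch (`split_in_place` is not part of the library) that overwrites separators with null terminators and records where each column starts. It ignores quoting and trimming, which `CSVReader` handles for you.

```cpp
#include <cstddef>

// Illustrative helper (not part of the library): split a mutable,
// null-terminated line at `sep` by overwriting separators with '\0'.
// Stores up to `max_cols` column pointers in `cols` and returns how many
// were stored. Once `max_cols` is reached, the last column keeps the rest
// of the line, including any further separators.
std::size_t split_in_place(char*line, char sep, char**cols, std::size_t max_cols){
  if(max_cols == 0) return 0;
  std::size_t count = 0;
  cols[count++] = line;
  for(char*p = line; *p; ++p){
    if(*p == sep){
      if(count == max_cols) break; // no room for another column
      *p = '\0';                   // terminate the previous column
      cols[count++] = p+1;         // the next column starts after the separator
    }
  }
  return count;
}
```
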
The remaining functions are mainly used to format error messages. The file line indicates the current position in the file, i.e., after the first `next_line` call it is 1 and after the second 2. Before the first call it is 0. The file name is truncated because C-strings are used internally to avoid `std::bad_alloc` exceptions during error reporting.

**Note:** It is not possible to exchange the line termination character.

### `CSVReader`

`CSVReader` uses policies. These are classes with only static members to allow core functionality to be exchanged in an efficient way.

```cpp
template<
  unsigned column_count,
  class trim_policy = trim_chars<' ', '\t'>,
  class quote_policy = no_quote_escape<','>,
  class overflow_policy = throw_on_overflow,
  class comment_policy = no_comment
>
class CSVReader{
public:
  // Constructors
  // same as for LineReader

  // Parsing Header
  void read_header(ignore_column ignore_policy, some_string_type col_name1, some_string_type col_name2, ...);
  void set_header(some_string_type col_name1, some_string_type col_name2, ...);
  bool has_column(some_string_type col_name)const;

  // Read
  char*next_line();
  bool read_row(ColType1&col1, ColType2&col2, ...);

  // File Location
  void set_file_line(unsigned);
  unsigned get_file_line()const;
  void set_file_name(some_string_type file_name);
  const char*get_truncated_file_name()const;
};
```

The `column_count` template parameter indicates how many columns you want to read from the CSV file. This does not necessarily have to coincide with the actual number of columns in the file. The four policies govern various aspects of the parsing.

The trim policy indicates which characters should be ignored at the beginning and the end of every column. The default ignores spaces and tabs. This makes sure that

```
a,b,c
1,2,3
```

is interpreted in the same way as

```
  a, b,   c
1  , 2,   3
```

`trim_chars` can take any number of template parameters. For example, `trim_chars<' ', '\t', '_'>` is also valid. If no character should be trimmed, use `trim_chars<>`.

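The effect of trimming on a single column can be sketched as follows. `trim_copy` is a hypothetical helper written for this illustration and is not how the library implements `trim_chars`; it only shows that characters are removed from both ends while the interior is left alone.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper (not part of the library): strip leading and trailing
// characters that appear in `chars`; interior characters are untouched.
std::string trim_copy(const std::string&field, const std::string&chars = " \t"){
  const std::size_t first = field.find_first_not_of(chars);
  if(first == std::string::npos) return std::string(); // only trim characters
  const std::size_t last = field.find_last_not_of(chars);
  return field.substr(first, last-first+1);
}
```
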
The quote policy indicates how strings should be escaped. It also specifies the column separator. The predefined policies are:

  * `no_quote_escape<sep>` : Strings are not escaped. "`sep`" is used as column separator.
  * `double_quote_escape<sep, quote>` : Strings are escaped using quotes. Quotes are escaped using two consecutive quotes. "`sep`" is used as column separator and "`quote`" as quoting character.

**Important**: When combining trimming and quoting, the rows are first trimmed and then unquoted. A consequence is that spaces inside the quotes will be conserved. If you want to get rid of spaces inside the quotes, you need to remove them yourself.

**Important**: Quoting can be quite expensive. Disable it if you do not need it.

**Important**: Quoted strings may not contain unescaped newlines. This is currently not supported.

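To make the escaping rule concrete, here is a minimal sketch of double-quote unescaping for one field. `unquote_copy` is a hypothetical helper, not the library's implementation, and it assumes the field has already been trimmed. Note how spaces inside the quotes survive, as described above.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper (not the library's code): undo double-quote escaping
// for one field that has already been trimmed.
std::string unquote_copy(const std::string&field, char quote = '"'){
  // Fields that are not wrapped in quotes are returned unchanged.
  if(field.size() < 2 || field.front() != quote || field.back() != quote)
    return field;
  std::string out;
  for(std::size_t i = 1; i+1 < field.size(); ++i){
    out += field[i];
    // Two consecutive quotes stand for a single literal quote.
    if(field[i] == quote && i+2 < field.size() && field[i+1] == quote)
      ++i;
  }
  return out;
}
```
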
The overflow policy indicates what should be done if the integers in the input are too large to fit into the variables. The following policies are predefined:

  * `throw_on_overflow` : Throw an `error::integer_overflow` or `error::integer_underflow` exception.
  * `ignore_overflow` : Do nothing and let the overflow happen.
  * `set_to_max_on_overflow` : Set the value to `numeric_limits<...>::max()` (or to the corresponding minimum on underflow).

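The difference between throwing and clamping can be sketched with a small stand-alone parser. `parse_unsigned` and its `clamp` flag are hypothetical and exist only for this illustration; the library selects the behavior at compile time through the overflow policy, and its parser may detect overflow differently.

```cpp
#include <limits>
#include <stdexcept>

// Illustrative only (not the library's parser): read an unsigned base-10
// number into T and detect overflow before it can happen.
template<class T>
T parse_unsigned(const char*s, bool clamp){
  const T max = std::numeric_limits<T>::max();
  T value = 0;
  for(; *s; ++s){
    if(*s < '0' || *s > '9') throw std::invalid_argument("not a digit");
    const T digit = static_cast<T>(*s - '0');
    // Tests value*10 + digit > max, rearranged so the check cannot overflow.
    if(value > (max - digit)/10){
      if(clamp) return max;                          // like set_to_max_on_overflow
      throw std::overflow_error("integer overflow"); // like throw_on_overflow
    }
    value = static_cast<T>(value*10 + digit);
  }
  return value;
}
```
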
The comment policy allows skipping lines based on some criteria. Valid predefined policies are:

  * `no_comment` : Do not ignore any line.
  * `empty_line_comment` : Ignore all lines that are empty or contain only spaces and tabs.
  * `single_line_comment<com1, com2, ...>` : Ignore all lines that start with `com1` or `com2` or ... as the first character. There may not be any space between the beginning of the line and the comment character.
  * `single_and_empty_line_comment<com1, com2, ...>` : Ignore all empty lines and single line comments.

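The two criteria in the list above can be sketched as plain predicates. The function names are hypothetical and the code is not taken from the library; it only restates the rules, including the point that a comment character must be the very first character of the line.

```cpp
#include <cstring>

// Illustrative predicates (not the library's code).

// True if the line is empty or holds only spaces and tabs.
bool is_empty_line(const char*line){
  while(*line == ' ' || *line == '\t') ++line;
  return *line == '\0';
}

// True if the very first character of the line is one of `comment_chars`.
bool is_single_line_comment(const char*line, const char*comment_chars){
  // The explicit '\0' check is needed because strchr also finds the terminator.
  return *line != '\0' && std::strchr(comment_chars, *line) != nullptr;
}
```
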
Examples:

  * `CSVReader<4, trim_chars<' '>, double_quote_escape<',','\"'> >` reads 4 columns from a normal CSV file with string escaping enabled.
  * `CSVReader<3, trim_chars<' '>, no_quote_escape<'\t'>, throw_on_overflow, single_line_comment<'#'> >` reads 3 columns from a tab separated file with string escaping disabled. Lines starting with a # are ignored.

The constructors and the file location functions are exactly the same as for `LineReader`. See its documentation for details.

There are three methods that deal with headers. The `read_header` method reads a line from the file and rearranges the columns to match that order. It also checks whether all necessary columns are present. The `set_header` method does *not* read any input. Use it if the file does not have any header. Obviously it is impossible to rearrange columns or check for their availability when using it. The order in the file and in the program must match when using `set_header`. The `has_column` method checks whether a column is present in the file.

The first argument of `read_header` is a bitfield that determines how the function should react to column mismatches. The default behavior is to throw an `error::extra_column_in_header` exception if the file contains more columns than expected and an `error::missing_column_in_header` exception when there are not enough. This behavior can be altered using the following flags.

  * `ignore_no_column`: The default behavior, no flags are set.
  * `ignore_extra_column`: If a column with a name is in the file but not in the argument list, then it is silently ignored.
  * `ignore_missing_column`: If a column with a name is not in the file but is in the argument list, then `read_row` will not modify the corresponding variable.

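Since the first argument is described as a bitfield, the flags can presumably be combined with a bitwise OR. The sketch below uses stand-in declarations so that it compiles without `csv.h`; the numeric values and the helper functions are assumptions made for illustration, so check the header before relying on them.

```cpp
// Stand-in declarations so the sketch compiles without csv.h. The numeric
// values are assumptions chosen for illustration.
typedef unsigned ignore_column;
const ignore_column ignore_no_column = 0;
const ignore_column ignore_extra_column = 1;
const ignore_column ignore_missing_column = 2;

// Tolerate both unknown columns in the file and absent requested columns.
const ignore_column lenient = ignore_extra_column | ignore_missing_column;

// Hypothetical helpers that test individual bits of a policy value.
bool tolerates_extra(ignore_column policy){
  return (policy & ignore_extra_column) != 0;
}
bool tolerates_missing(ignore_column policy){
  return (policy & ignore_missing_column) != 0;
}
```
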
When using `ignore_missing_column` it is a good idea to initialize the variables passed to `read_row` with a default value, for example:

```cpp
// The file only contains column "a"
CSVReader<2>in(...);
in.read_header(ignore_missing_column, "a", "b");
int a,b = 42;
while(in.read_row(a,b)){
  // a contains the value from the file
  // b is left unchanged by read_row, i.e., it is 42
}
```

If only some columns are optional or their default value depends on other columns you have to use `has_column`, for example:

```cpp
// The file only contains the columns "a" and "b"
CSVReader<3>in(...);
in.read_header(ignore_missing_column, "a", "b", "sum");
if(!in.has_column("a") || !in.has_column("b"))
  throw my_neat_error_class();
bool has_sum = in.has_column("sum");
int a,b,sum;
while(in.read_row(a,b,sum)){
  if(!has_sum)
    sum = a+b;
}
```

**Important**: Do not call `has_column` from within the read loop. It would work correctly but would significantly slow down processing.

If two columns have the same name, an `error::duplicated_column_in_header` exception is thrown. If `read_header` is called but the file is empty, an `error::header_missing` exception is thrown.

The `next_line` function reads a line without parsing it. It works analogously to `LineReader::next_line`. This can be used to skip broken lines in a CSV file. However, in nearly all applications you will want to use the `read_row` function.

The `read_row` function reads a line, splits it into the columns and arranges them correctly. It trims the entries and unescapes them. If requested, the content is interpreted as an integer or as a floating point number. The variables passed to `read_row` may be of the following types.

  * builtin signed integer: These are `signed char`, `short`, `int`, `long` and `long long`. The input must be encoded as a base 10 ASCII number optionally preceded by a + or -. The function detects whether the integer would overflow (or underflow) and behaves as indicated by `overflow_policy`.
  * builtin unsigned integer: Just as the signed counterparts except that a leading + or - is not allowed.
  * builtin floating point: These are `float`, `double` and `long double`. The input may have a leading + or -. The number must be base 10 encoded. The decimal point may either be a dot or a comma. (Note that a comma will only work if it is not also used as column separator or the number is escaped.) A base 10 exponent may be specified using the "1e10" syntax. The "e" may be lower- or uppercase. Examples of valid floating point numbers are "1", "-42.42" and "+123.456E789". The input is rounded to the next floating point value, or to infinity if it is too large or small.
  * `char`: The column content must be a single character.
  * `std::string`: The column content is assigned to the string. The `std::string` is filled with the trimmed and unescaped version.
  * `char*`: A pointer directly into the buffer. The string is trimmed, unescaped and null-terminated. This pointer stays valid until `read_row` is called again or the `CSVReader` is destroyed. Use this for user-defined types.

Note that there is no inherent overhead to using `char*` and then interpreting it compared to using one of the parsers directly built into `CSVReader`. The builtin number parsers are pure convenience. If you need a slightly different syntax, then use `char*` and do the parsing yourself.

## FAQ

Q: The library is throwing a `std::system_error` with code -1. How do I get it to work?

A: Your compiler's `std::thread` implementation is broken. Define `CSV_IO_NO_THREAD` to disable threading support.


Q: My values are not just ints or strings. I want to parse my customized type. Is this possible?

A: Read a `char*` and parse the string. At first this seems expensive, but it is not, because the pointer you get points directly into the memory buffer. In fact there is no inherent reason why a custom int-parser realized this way must be any slower than the int-parser built into the library. By reading a `char*` the library takes care of column reordering and quote escaping and leaves the actual parsing to you. Note that using a `std::string` is slower as it involves a memory copy.


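As a minimal sketch of that approach, the parser below turns a `char*` field of the form "YYYY-MM-DD" into a user-defined type. `Date` and `parse_date` are hypothetical names chosen for this illustration. In a read loop you would presumably pass a `char*` variable to `read_row` and call `parse_date` on it before the next `read_row` call invalidates the pointer.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical user-defined type stored in one CSV column as "YYYY-MM-DD".
struct Date{ int year, month, day; };

// Parse the null-terminated field that read_row hands out as a char*.
// Only the overall shape is checked, not whether the date exists.
Date parse_date(const char*field){
  Date d;
  char trailing; // filled only if unexpected characters follow the date
  const int n = std::sscanf(field, "%d-%d-%d%c", &d.year, &d.month, &d.day, &trailing);
  if(n != 3) throw std::invalid_argument("expected YYYY-MM-DD");
  return d;
}
```
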
Q: I get lots of compiler errors when compiling the header! Please fix it. :(

A: Have you enabled the C++11 mode of your compiler? If you use GCC you have to add `-std=c++0x` to the command line. If this does not resolve the problem, then please open a ticket.


Q: The library crashes when parsing large files! Please fix it. :(

A: When using GCC, have you linked against `-lpthread`? Read the installation section for details on how to do this. If this does not resolve the issue, then please open a ticket. (The reason why it only crashes on large files is that the first chunk is read synchronously, and if the whole file fits into this chunk then no asynchronous call is performed.) Alternatively you can define `CSV_IO_NO_THREAD`.


Q: Does the library support UTF?

A: The library has basic UTF-8 support, or to be more precise, it does not break when passing UTF-8 strings through it. If you read a `char*` then you get a pointer to the UTF-8 string. You will have to decode the string on your own. The separator, quoting, and commenting characters used by the library can only be ASCII characters.


Q: Does the library support string fields that span multiple lines?

A: No. This feature has often been requested in the past. However, it is difficult to make it work with the current design without breaking something else.