# s3select

<br />The s3select is an additional S3 request that enables the client to push down an SQL statement (according to the [spec](https://docs.ceph.com/en/latest/radosgw/s3select/#features-support)) into Ceph storage.
<br />The s3select is an implementation of the push-down paradigm.
<br />The push-down paradigm is about moving (“pushing”) the operation close to the data.
<br />It is the opposite of what is commonly done, i.e. moving the data to the “place” of the operation.
<br />In a big-data ecosystem, this makes a big difference.
<br />In order to execute __“select sum(x + y) from s3object where a + b > c”__ without push-down,
<br />the client needs to fetch the entire object and only then execute the operation with an analytic application.
<br />With push-down (s3select) the entire operation is executed on the server side, and only the result is returned to the client.
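
As a concrete illustration (a sketch only: the endpoint, bucket, and object name below are placeholders, and the column references assume a headerless CSV object), such a statement can be pushed down from a regular S3 client, for example with the AWS CLI `select-object-content` call against the RGW endpoint:

```bash
# Placeholders: adjust the endpoint, bucket, key, and column casts to your data.
aws --endpoint-url http://localhost:8000 s3api select-object-content \
    --bucket mybucket --key dataset.csv \
    --expression-type 'SQL' \
    --expression 'select sum(int(_1) + int(_2)) from s3object where int(_1) + int(_2) > int(_3);' \
    --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {}}' \
    /dev/stdout
```
Only the aggregated result travels back to the client; the scan itself runs inside the storage layer.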

## Analyzing a huge amount of cold/warm data without moving or converting it
<br />S3 storage is reliable, efficient, and cheap, and it already contains a huge number of objects; many of them are CSV, JSON, and Parquet objects holding a huge amount of data to analyze.
<br />An ETL pipeline may convert these objects into Parquet and then run queries on the converted objects.
<br />But that comes at a high price: downloading all of these objects close to the analytic application.

<br />The s3select engine, which resides on the S3 storage itself, can do these jobs for many use cases, saving time and resources.

## The s3select engine stands by itself
<br />The engine resides in a dedicated GitHub repo, and it is also capable of executing SQL statements on standard input or on files residing on a local file system.
<br />Users may clone and build this repo and execute various SQL statements from the CLI, as sketched below.
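
A minimal sketch of that flow, assuming the dedicated repo is `https://github.com/ceph/s3select` and a standard CMake build (the exact dependencies and output paths may differ per platform):

```bash
# Assumed repo location and a generic CMake out-of-source build.
git clone https://github.com/ceph/s3select.git
cd s3select
cmake -S . -B build
cmake --build build -j"$(nproc)"
# The demo CLI is then expected somewhere under the build tree, e.g.:
seq 1 10 | ./build/example/s3select_example -q 'select count(0) from stdin;'
```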

## A Docker image containing a development environment
A quick way to get started is the following container.
The container already contains the cloned repo, enabling code review and modification.

### Running the s3select container image
`sudo docker run -w /s3select -it galsl/ubunto_arrow_parquet_s3select:dev`

### Running the Google Test suite (it contains hundreds of queries)
`./test/s3select_test`
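
The suite is a standard Google Test binary, so (assuming no custom test runner) the usual gtest flags can narrow it down; the filter pattern below is only an illustration:

```bash
./test/s3select_test --gtest_list_tests        # list the available tests
./test/s3select_test --gtest_filter='*csv*'    # run a matching subset
```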

### Running SQL statements from the CLI on standard input
`./example/s3select_example` is a small demo app that lets you run queries on a local file or on standard input.
For example, the following runs the engine on standard input.
`seq 1 1000 | ./example/s3select_example -q 'select count(0) from stdin;'`

#### SQL statement on the `ps` command output (standard input)
>`ps -ef | tr -s ' ' | CSV_COLUMN_DELIMETER=' ' CSV_HEADER_INFO= ./example/s3select_example -q 'select PID,CMD from stdin where PPID="1";'`

#### SQL statement processed by the container; the input data is piped into the container
> `seq 1 1000000 | sudo docker run -w /s3select -i galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(0) from stdin;'"`

### Running SQL statements from the CLI on a local file
It is possible to run a query on a local file, as follows.

`./example/s3select_example -q 'select count(0) from /full/path/file_name;'`

#### SQL statement processed by the container; the input data is mapped into the container FS
>`sudo docker run -w /s3select -v /home/gsalomon/work:/work -it galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(*) from /work/datatime.csv;'"`