# s3select

<br />The s3select is an additional S3 request that enables the client to push down an SQL statement (according to the [spec](https://docs.ceph.com/en/latest/radosgw/s3select/#features-support)) into Ceph storage.
<br />The s3select is an implementation of the push-down paradigm.
<br />The push-down paradigm is about moving (“pushing”) the operation close to the data.
<br />It is the opposite of what is commonly done, i.e. moving the data to the “place” of the operation.
<br />In a big-data ecosystem, this makes a big difference.
<br />In order to execute __“select sum(x + y) from s3object where a + b > c”__ without push-down,
<br />the client needs to fetch the entire object and only then execute the operation with an analytic application.
<br />With push-down (s3select) the entire operation is executed on the server side, and only the result is returned to the client.
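
As a concrete illustration (a sketch only: the endpoint, bucket, and object name below are placeholders, and the column references assume a headerless CSV object), such a statement can be pushed down from a regular S3 client, for example with the AWS CLI `select-object-content` call against the RGW endpoint:

```bash
# Placeholders: adjust the endpoint, bucket, key, and column casts to your data.
aws --endpoint-url http://localhost:8000 s3api select-object-content \
    --bucket mybucket --key dataset.csv \
    --expression-type 'SQL' \
    --expression 'select sum(int(_1) + int(_2)) from s3object where int(_1) + int(_2) > int(_3);' \
    --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {}}' \
    /dev/stdout
```
Only the aggregated result travels back to the client; the scan itself runs inside the storage layer.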

## Analyzing a huge amount of cold/warm data without moving or converting it
<br />S3 storage is reliable, efficient, and cheap, and it already contains a huge number of objects; many of them are CSV, JSON, and Parquet objects holding a huge amount of data to analyze.
<br />An ETL pipeline may convert these objects into Parquet and then run queries on the converted objects.
<br />But that comes at a high price: downloading all of these objects close to the analytic application.

<br />The s3select engine, which resides on the S3 storage itself, can do these jobs for many use cases, saving time and resources.

## The s3select engine stands by itself
<br />The engine resides in a dedicated GitHub repo, and it is also capable of executing SQL statements on standard input or on files residing on a local file system.
<br />Users may clone and build this repo and execute various SQL statements from the CLI, as sketched below.
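
A minimal sketch of that flow, assuming the dedicated repo is `https://github.com/ceph/s3select` and a standard CMake build (the exact dependencies and output paths may differ per platform):

```bash
# Assumed repo location and a generic CMake out-of-source build.
git clone https://github.com/ceph/s3select.git
cd s3select
cmake -S . -B build
cmake --build build -j"$(nproc)"
# The demo CLI is then expected somewhere under the build tree, e.g.:
seq 1 10 | ./build/example/s3select_example -q 'select count(0) from stdin;'
```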

## A Docker image containing a development environment
A quick way to get started is the following container.
The container already contains the cloned repo, enabling code review and modification.

### Running the s3select container image
`sudo docker run -w /s3select -it galsl/ubunto_arrow_parquet_s3select:dev`

### Running the Google Test suite (it contains hundreds of queries)
`./test/s3select_test`
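
The suite is a standard Google Test binary, so (assuming no custom test runner) the usual gtest flags can narrow it down; the filter pattern below is only an illustration:

```bash
./test/s3select_test --gtest_list_tests        # list the available tests
./test/s3select_test --gtest_filter='*csv*'    # run a matching subset
```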

### Running SQL statements from the CLI on standard input
`./example/s3select_example` is a small demo app that lets you run queries on a local file or on standard input.
For example, the following runs the engine on standard input.
`seq 1 1000 | ./example/s3select_example -q 'select count(0) from stdin;'`

#### SQL statement on the `ps` command output (standard input)
>`ps -ef | tr -s ' ' | CSV_COLUMN_DELIMETER=' ' CSV_HEADER_INFO= ./example/s3select_example -q 'select PID,CMD from stdin where PPID="1";'`

#### SQL statement processed by the container; the input data is piped into the container
> `seq 1 1000000 | sudo docker run -w /s3select -i galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(0) from stdin;'"`

### Running SQL statements from the CLI on a local file
It is possible to run a query on a local file, as follows.

`./example/s3select_example -q 'select count(0) from /full/path/file_name;'`

#### SQL statement processed by the container; the input data is mapped into the container FS
>`sudo docker run -w /s3select -v /home/gsalomon/work:/work -it galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(*) from /work/datatime.csv;'"`