===============
Ceph s3 select
===============

.. contents::

Overview
--------

| The purpose of the **s3 select** engine is to create an efficient pipe between the user client and the storage node (the engine should be as close as possible to the storage).
| It enables the user to define the exact portion of data that should be received by the client side.
| It also enables higher-level analytic applications (such as SPARK-SQL) to use this feature to improve their latency and throughput.

| For example, given an s3-object of several GB (a CSV file), the user may need to extract a single column, filtered by another column,
| as in the following query:
| ``select customer-id from s3Object where age>30 and age<65;``

| Currently the whole s3-object must be retrieved from the OSD via RGW before the data is filtered and extracted.
| By "pushing down" the query into the OSD, it is possible to save a lot of network traffic and CPU time (serialization / deserialization).

| **The bigger the object, and the more selective the query, the better the performance**.

Basic workflow
--------------

| An s3-select query is sent to RGW via `AWS-CLI <https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html>`_

| It passes the authentication and permission process as an incoming message (POST).
| **RGWSelectObj_ObjStore_S3::send_response_data** is the "entry point"; it handles each fetched chunk according to the input object-key.
| **send_response_data** first handles the input query: it extracts the query and the other CLI parameters.

| For each newly fetched chunk (~4MB), it runs the s3-select query on that chunk.
| The current implementation supports CSV objects, and since chunk boundaries randomly "cut" CSV rows in the middle, those broken lines (the first or last line of each chunk) are skipped while processing the query.
| Those "broken" lines are stored and later merged with the matching broken line (belonging to the next chunk), and finally processed.

| For each processed chunk, an output message is formatted according to the AWS specification and sent back to the client.
| For aggregation queries, the last chunk must be identified as the end of input; upon reaching it, the s3-select engine initiates its end-of-process and produces the aggregated result.

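The chunk-boundary handling described above can be sketched in Python (an illustration only; the engine itself is C++, and ``process_chunks`` / ``run_query`` are hypothetical names, not RGW functions):

```python
def process_chunks(chunks, run_query):
    """Feed CSV chunks to a query callback, merging rows that were
    broken at chunk boundaries (as described in the workflow above)."""
    leftover = ""  # trailing partial row carried over from the previous chunk
    for chunk in chunks:
        data = leftover + chunk
        # the last line of a chunk may be cut in the middle -- hold it back
        complete, _, leftover = data.rpartition("\n")
        if complete:
            run_query(complete + "\n")  # process only whole rows
    if leftover:  # final partial row (object without a trailing newline)
        run_query(leftover + "\n")
```

Each chunk contributes only its complete rows; a broken tail is prepended to the next chunk, mirroring the merge step described above.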
Design Concepts
---------------

AST - Abstract Syntax Tree
~~~~~~~~~~~~~~~~~~~~~~~~~~
| The s3-select main flow starts with parsing the input string (i.e. the user query), and continues
| with building an AST (abstract syntax tree) from it.
| The execution phase is built upon the AST.

| ``Base_statement`` is the base class for all object-nodes participating in the execution phase; it declares the ``eval()`` method, which returns a ``value`` object.

| The ``value`` object handles the known basic types such as int, string, float and timestamp.
| It is able to perform comparisons and basic arithmetic operations on those types.

| The execution flow actually calls the ``eval()`` method on the root node (once per projection); the call goes all the way down, and the actual result (a ``value`` object) is returned from the bottom nodes back up to the root node.

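The evaluation flow just described can be illustrated with a minimal Python sketch (the real classes are C++; ``Node``, ``Const`` and ``Mul`` are simplified stand-ins for ``Base_statement`` subclasses, and plain Python values stand in for the ``value`` object):

```python
class Node:                      # stand-in for Base_statement
    def eval(self):              # returns a plain value (stand-in for `value`)
        raise NotImplementedError

class Const(Node):               # leaf node holding a literal
    def __init__(self, v):
        self.v = v
    def eval(self):
        return self.v

class Mul(Node):                 # inner node: evaluates children, combines results
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self):
        return self.left.eval() * self.right.eval()

# AST for a projection such as `payment * 0.3`
root = Mul(Const(100), Const(0.3))
result = root.eval()             # the result propagates bottom-up to the root
```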
| The **alias** programming construct is an essential part of the s3-select language; it enables much better programming, especially with objects containing many columns or with complex queries.

| Upon parsing a statement containing an alias construct, the alias is replaced with a reference to the correct AST node; at runtime that node is simply evaluated like any other node.

| There is a risk that a self (or cyclic) reference may occur, causing a stack overflow (an endless loop); to address that concern, upon evaluating an alias, it is validated against cyclic references.

| An alias also maintains a result cache, meaning that upon using the same alias more than once, the same node is not evaluated again (it would return the same result); instead the result is taken from the cache.

| Of course, for each new row the cache is invalidated.

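The per-row cache and the cyclic-reference guard can be sketched together (illustrative Python; ``Alias`` and its members are hypothetical names, not the engine's C++ identifiers):

```python
class Alias:
    """Stand-in for an alias AST node: caches its result per row and
    guards against cyclic references."""

    def __init__(self, name, target):
        self.name = name          # alias name, e.g. "a3"
        self.target = target      # referenced AST node (has an eval() method)
        self.cache = None         # cached result for the current row
        self.in_eval = False      # set while this alias is being evaluated

    def eval(self):
        if self.in_eval:          # alias (directly or indirectly) refers to itself
            raise RuntimeError(f"cyclic reference in alias '{self.name}'")
        if self.cache is not None:
            return self.cache     # reuse the per-row cached result
        self.in_eval = True
        try:
            self.cache = self.target.eval()
        finally:
            self.in_eval = False
        return self.cache

    def invalidate(self):         # called once per new input row
        self.cache = None
```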
S3 select parser definition
~~~~~~~~~~~~~~~~~~~~~~~~~~~
| The implementation of s3-select uses `boost::spirit <https://www.boost.org/doc/libs/1_71_0/libs/spirit/classic/doc/grammar.html>`_; the definition of the s3-select command follows AWS.

| When parsing is initiated on the input text and a specific rule is identified, the action bound to that rule is executed.
| Those actions build the AST; each action is unique (as is its rule), and at the end of the process they form a tree-like structure.

| As mentioned, running ``eval()`` on the root node executes the s3-select statement (once per projection).
| The input stream is accessible to the execution tree through the scratch-area object, which is constantly updated for each new row.

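The rule/action coupling can be illustrated with a toy Python parser (boost::spirit itself is a C++ library; the two-rule "grammar" below is a made-up fragment for addition only, not the real s3-select grammar):

```python
import re

def parse_addition(text):
    """Each "rule" (a regex here) is paired with an action that builds AST
    nodes on a stack; at the end the stack holds the tree's root."""
    stack = []

    def on_number(tok):                 # action for the `number` rule
        stack.append(("num", float(tok)))

    def on_plus(_):                     # action for the `+` rule
        pass                            # folding happens when the rhs arrives

    rules = [(re.compile(r"\d+(\.\d+)?"), on_number),
             (re.compile(r"\+"), on_plus)]

    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for rx, action in rules:
            m = rx.match(text, pos)
            if m:
                action(m.group(0))
                pos = m.end()
                if len(stack) == 2:     # second operand arrived: fold into `add`
                    rhs, lhs = stack.pop(), stack.pop()
                    stack.append(("add", lhs, rhs))
                break
        else:                           # no rule matched: parse failure
            raise SyntaxError(f"unexpected input at offset {pos}")
    return stack[0]
```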
Basic functionalities
~~~~~~~~~~~~~~~~~~~~~

| **S3select** has a definite set of functionalities that should be implemented (if we wish to stay compliant with AWS); currently only a portion of it is implemented.

| The implemented software architecture supports basic arithmetic expressions, logical and compare expressions, including nested function calls and casting operators, which alone gives the user reasonable flexibility.
| Review the feature-table_ below.


Memory handling
~~~~~~~~~~~~~~~

| S3select structures and objects are lockless and thread-safe; placement-new is used in order to reduce intensive alloc/dealloc cycles, which could impact the main process hosting s3-select.

| Once the AST is built, there is no need to allocate memory for the execution itself; the AST is "static" for the query-execution life-cycle.

| The execution itself is stream-oriented, meaning there is no pre-allocation before execution; object size has no impact on memory consumption.

| It processes chunk after chunk, row after row; all memory needed for processing resides in the AST.

| The AST is similar to a stack in its behaviour, in that it consumes already-allocated memory and "releases" it upon completing its task.

S3 Object different types
~~~~~~~~~~~~~~~~~~~~~~~~~

| The processing of the input stream is decoupled from the s3-select engine, meaning each input type should have its own parser, converting the s3-object into columns.

| The current implementation includes only a CSV reader; its parsing definitions are according to AWS.
| The parser is implemented using a `boost::state-machine <https://www.boost.org/doc/libs/1_64_0/libs/msm/doc/HTML/index.html>`_.

| The CSV parser handles NULL, quote and escape rules, the field delimiter and the row delimiter, and users may define all of those dynamically (via AWS CLI).

Error Handling
~~~~~~~~~~~~~~
| An s3-select statement may be syntactically correct but semantically wrong; one example is ``select a * b from …`` where ``a`` is a number and ``b`` is a string.
| The current implementation is for CSV file types; CSV has no schema, so column types can only be evaluated at runtime.
| This means that semantically wrong statements may only surface at runtime.

| As for a syntax error such as ``select x frm stdin;``, the built-in parser fails on the first mismatch with the language definition, and produces an error message back to the client (AWS-CLI).
| The error message points at the location of the mismatch.

| Fatal severity (attached to the exception) ends the execution immediately; errors of other severities are counted, and upon reaching 100, execution ends with an error message.

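The severity policy above can be sketched as follows (illustrative Python; the constants and class names are hypothetical, not the engine's actual identifiers — only the "fatal stops immediately, 100 non-fatal errors stop execution" rule comes from the text):

```python
FATAL = "fatal"
ERROR = "error"
MAX_NON_FATAL = 100   # per the text: execution stops after 100 non-fatal errors

class QueryExecutionError(Exception):
    pass

class ErrorPolicy:
    def __init__(self):
        self.non_fatal = 0

    def report(self, severity, message):
        if severity == FATAL:
            # fatal errors end the execution immediately
            raise QueryExecutionError(f"fatal: {message}")
        self.non_fatal += 1
        if self.non_fatal >= MAX_NON_FATAL:
            raise QueryExecutionError(
                f"too many errors ({self.non_fatal}); last: {message}")
```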
AST demonstration
~~~~~~~~~~~~~~~~~
.. ditaa::

                         +-----------------------+
                         |        select         |
             +-----------+-----------+-----------+-----------+
             |                       |                       |
             v                       v                       v
     +---------------+       +---------------+       +---------------+
     |  projections  |       |   s3object    |       |     where     |
     +---------------+       +---------------+       +---------------+
         |       |                                           |
         v       v                                           v
    +----------+ +--------+                          +---------------+
    | multiply | |  date  |                          |      and      |
    +----------+ +--------+                          +---------------+
       |     |                                          |         |
       v     v                                          v         v
    +---------+ +-------+                          +--------+ +--------+
    | payment | |  0.3  |                          |   EQ   | |   LT   |
    +---------+ +-------+                          +--------+ +--------+
                                                     |    |     |    |
                                                     v    v     v    v
                                    +--------+ +------+ +-----+ +------+
                                    | region | | east | | age | |  30  |
                                    +--------+ +------+ +-----+ +------+

Features Support
----------------

.. _feature-table:

The following table describes the current support for s3-select functionalities:

+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Feature                         | Detailed        | Example                                                               |
+=================================+=================+=======================================================================+
| Arithmetic operators            | ^ * / + - ( )   | select (int(_1)+int(_2))*int(_9) from stdin;                          |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
|                                 |                 | select ((1+2)*3.14) ^ 2 from stdin;                                   |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Compare operators               | > < >= <= == != | select _1,_2 from stdin where (int(_1)+int(_3))>int(_5);              |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| logical operator                | AND OR          | select count(*) from stdin where int(_1)>123 and int(_5)<200;         |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| casting operator                | int(expression) | select int(_1),int( 1.2 + 3.4) from stdin;                            |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
|                                 |float(expression)|                                                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
|                                 | timestamp(...)  | select timestamp("1999:10:10-12:23:44") from stdin;                   |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Aggregation Function            | sum             | select sum(int(_1)) from stdin;                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Aggregation Function            | min             | select min( int(_1) * int(_5) ) from stdin;                           |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Aggregation Function            | max             | select max(float(_1)),min(int(_5)) from stdin;                        |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Aggregation Function            | count           | select count(*) from stdin where (int(_1)+int(_3))>int(_5);           |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Timestamp Functions             | extract         | select count(*) from stdin where                                      |
|                                 |                 | extract("year",timestamp(_2)) > 1950                                  |
|                                 |                 | and extract("year",timestamp(_1)) < 1960;                             |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Timestamp Functions             | dateadd         | select count(0) from stdin where                                      |
|                                 |                 | datediff("year",timestamp(_1),dateadd("day",366,timestamp(_1))) == 1; |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Timestamp Functions             | datediff        | select count(0) from stdin where                                      |
|                                 |                 | datediff("month",timestamp(_1),timestamp(_2)) == 2;                   |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Timestamp Functions             | utcnow          | select count(0) from stdin where                                      |
|                                 |                 | datediff("hours",utcnow(),dateadd("day",1,utcnow())) == 24;           |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| String Functions                | substr          | select count(0) from stdin where                                      |
|                                 |                 | int(substr(_1,1,4))>1950 and int(substr(_1,1,4))<1960;                |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| alias support                   |                 | select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3                   |
|                                 |                 | from stdin where a3>100 and a3<300;                                   |
+---------------------------------+-----------------+-----------------------------------------------------------------------+

Sending Query to RGW
--------------------

Syntax
~~~~~~
The CSV default definitions for field-delimiter, row-delimiter, quote-char and escape-char are: { , \\n " \\ }

::

  aws --endpoint-url http://localhost:8000 s3api select-object-content
   --bucket {BUCKET-NAME}
   --expression-type 'SQL'
   --input-serialization
   '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}'
   --output-serialization '{"CSV": {}}'
   --key {OBJECT-NAME}
   --expression "select count(0) from stdin where int(_1)<10;" output.csv

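The same request can also be issued from Python. The sketch below builds the parameter set mirroring the CLI example (bucket, key and endpoint are placeholders); with boto3 installed and a reachable RGW, the dict can be passed to ``select_object_content`` as shown in the trailing comment:

```python
# Request parameters for select-object-content, mirroring the CLI example
# above. "mybucket" / "myobject.csv" are placeholder names.
select_params = {
    "Bucket": "mybucket",
    "Key": "myobject.csv",
    "ExpressionType": "SQL",
    "Expression": "select count(0) from stdin where int(_1)<10;",
    "InputSerialization": {
        "CSV": {
            "FieldDelimiter": ",",
            "QuoteCharacter": '"',
            "RecordDelimiter": "\n",
            "QuoteEscapeCharacter": "\\",
            "FileHeaderInfo": "USE",
        },
        "CompressionType": "NONE",
    },
    "OutputSerialization": {"CSV": {}},
}

# With boto3 (requires a reachable RGW endpoint):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://localhost:8000")
#   resp = s3.select_object_content(**select_params)
#   for event in resp["Payload"]:        # the response is an event stream
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode())
```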
CSV parsing behavior
--------------------

+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Feature                         | Description     | input ==> tokens                                                      |
+=================================+=================+=======================================================================+
| NULL                            | successive      | ,,1,,2, ==> {null}{null}{1}{null}{2}{null}                            |
|                                 | field delimiter |                                                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| QUOTE                           | quote character | 11,22,"a,b,c,d",last ==> {11}{22}{"a,b,c,d"}{last}                    |
|                                 | overrides       |                                                                       |
|                                 | field delimiter |                                                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| Escape                          | escape char     | 11,22,str=\\"abcd\\"\\,str2=\\"123\\",last                            |
|                                 | overrides       | ==> {11}{22}{str="abcd",str2="123"}{last}                             |
|                                 | meta-character; |                                                                       |
|                                 | escape removed  |                                                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| row delimiter                   | no closing quote| 11,22,a="str,44,55,66                                                 |
|                                 | row delimiter is| ==> {11}{22}{a="str,44,55,66}                                         |
|                                 | closing line    |                                                                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
| csv header info                 | FileHeaderInfo  | "**USE**" value means each token on the first line is a column name;  |
|                                 | tag             | "**IGNORE**" value means to skip the first line                       |
+---------------------------------+-----------------+-----------------------------------------------------------------------+
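
The table's rules can be reproduced with a toy tokenizer (illustrative Python only; the real parser is a boost::msm state machine in C++, and ``csv_tokens`` is a made-up name). Empty fields produced by successive delimiters are returned as ``None``, standing in for NULL:

```python
def csv_tokens(line, delim=",", quote='"', escape="\\"):
    """Toy CSV tokenizer following the rules in the table above."""
    tokens, field, i = [], "", 0
    in_quote = False
    saw_any = False   # whether the current field has any characters
    while i < len(line):
        c = line[i]
        if c == escape and i + 1 < len(line):
            field += line[i + 1]     # escape strips the next char's meta-meaning
            saw_any = True
            i += 2
            continue
        if c == quote:
            in_quote = not in_quote  # inside quotes the delimiter loses meaning
            field += c
            saw_any = True
        elif c == delim and not in_quote:
            tokens.append(field if saw_any else None)  # empty field ==> NULL
            field, saw_any = "", False
        else:
            field += c
            saw_any = True
        i += 1
    # end of line: the row delimiter closes the last field, even if a
    # quote was never closed
    tokens.append(field if saw_any else None)
    return tokens
```

Running it on the table's inputs yields exactly the token lists shown in the right-hand column.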