Overview
--------
The **S3 Select** engine creates an efficient pipe between clients and Ceph
back end nodes. The S3 Select engine works best when implemented as closely as
possible to the back end storage.

The S3 Select engine makes it possible to use an SQL-like syntax to select a
restricted subset of the data stored in an S3 object. The S3 Select engine
facilitates the use of higher level analytic applications (for example:
SPARK-SQL). The ability of the S3 Select engine to target a proper subset of
structured data within an S3 object decreases latency and increases throughput.

For example: assume that a user needs to extract a single column that is
filtered by another column, and that these columns are stored in a CSV file in
an S3 object that is several GB in size. The following query performs this
extraction: ``select customer-id from s3Object where age>30 and age<65;``
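The effect of this query can be illustrated with a short Python sketch. This
is illustrative only: the CSV contents and column values below are invented
for the example, and the real engine evaluates the SQL inside radosgw rather
than on the client.

```python
import csv
import io

# Hypothetical CSV content; the column names mirror the example query above.
CSV_DATA = "customer-id,age\nc1,25\nc2,40\nc3,70\nc4,50\n"

def select_customer_ids(data):
    """Emulate: select customer-id from s3Object where age>30 and age<65;"""
    reader = csv.DictReader(io.StringIO(data))
    return [row["customer-id"] for row in reader if 30 < int(row["age"]) < 65]

print(select_customer_ids(CSV_DATA))  # -> ['c2', 'c4']
```

With S3 Select, only the matching ``customer-id`` values cross the network,
instead of the entire multi-gigabyte object.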

Without S3 Select, the whole S3 object must be retrieved from an OSD via RGW
before the data is filtered and extracted. By "pushing down" the query into
radosgw, significant network and CPU overhead (serialization /
deserialization) is saved.

**The bigger the object and the more accurate the query,
the better the performance of s3select**.

Basic Workflow
--------------

S3 Select queries are sent to RGW via the `AWS-CLI
<https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html>`_.
S3 Select passes the authentication and permission parameters as an incoming
message (POST). ``RGWSelectObj_ObjStore_S3::send_response_data`` is the entry
point and handles each fetched chunk according to the object key that was
input. ``send_response_data`` first handles the input query: it extracts the
query and the other CLI parameters.

RGW executes an S3 Select query on each new fetched chunk (up to 4 MB). The
current implementation supports CSV objects. Because chunk boundaries fall at
arbitrary byte offsets, a chunk sometimes "cuts" a CSV row in the middle;
those broken lines (the first or last row of a chunk) are skipped while the
query is processed. Each broken line is stored and later merged with the next
broken line (which belongs to the next chunk), and only then processed.
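The broken-line handling described above can be sketched as follows. This is a
minimal Python model, not the engine's actual C++ implementation, and the
chunks are shrunk far below 4 MB for illustration.

```python
def split_into_records(chunks):
    """Carry the trailing partial line of each chunk over to the next chunk,
    so that rows cut at a chunk boundary are reassembled before processing."""
    carry = ""
    for chunk in chunks:
        data = carry + chunk
        lines = data.split("\n")
        carry = lines.pop()          # last piece may be a broken row
        for line in lines:
            if line:
                yield line           # complete CSV row, safe to process
    if carry:
        yield carry                  # final row of the object

# A row "c2,40" cut across two chunks is reassembled before processing.
chunks = ["customer-id,age\nc1,25\nc2,4", "0\nc3,70\n"]
print(list(split_into_records(chunks)))
# -> ['customer-id,age', 'c1,25', 'c2,40', 'c3,70']
```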

For each processed chunk, an output message is formatted according to the `AWS
specification
<https://docs.aws.amazon.com/AmazonS3/latest/API/archive-RESTObjectSELECTContent.html#archive-RESTObjectSELECTContent-responses>`_
and sent back to the client. RGW supports the following response:
``{:event-type,records} {:content-type,application/octet-stream}
{:message-type,event}``. For aggregation queries, the last chunk must be
identified as the end of input; only then does the s3select engine initiate
end-of-process and produce the aggregated result.
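The aggregation flow can be sketched like so. This is a hypothetical Python
model: ``run_aggregation`` and the ``(is_last, rows)`` chunk format are
inventions for illustration only.

```python
def run_aggregation(chunks):
    """Model an aggregation query (e.g. select count(*) ...): state is
    accumulated per chunk, and the single result record is produced only
    after the last chunk has been identified as the end of input."""
    count = 0
    for is_last, rows in chunks:
        count += len(rows)        # accumulate; emit nothing per chunk
        if is_last:
            return [str(count)]   # end-of-process: the aggregated result

chunks = [(False, ["c1,25", "c2,40"]), (True, ["c3,70"])]
print(run_aggregation(chunks))  # -> ['3']
```
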
Basic Functionalities