===========================
Rados Gateway Data Layout
===========================

Although the source code is the ultimate guide, this document helps
new developers to get up to speed with the implementation details.

Introduction
------------

Swift offers something called a *container*, which we use interchangeably with
the term *bucket*, so we say that RGW's buckets implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View
---------------

Although RADOS only knows about pools and objects with their xattrs and
omap[1], conceptually RGW organizes its data into three different kinds:
metadata, bucket index, and data.

Metadata
^^^^^^^^

We have three 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'.
You can use the following commands to introspect metadata entries: ::

    $ radosgw-admin metadata list
    $ radosgw-admin metadata list bucket
    $ radosgw-admin metadata list bucket.instance
    $ radosgw-admin metadata list user

    $ radosgw-admin metadata get bucket:<bucket>
    $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
    $ radosgw-admin metadata get user:<user>   # get or set

The metadata sections used in the commands above hold the following:

- user: Holds user information
- bucket: Holds a mapping between bucket name and bucket instance ID
- bucket.instance: Holds bucket instance information[2]

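The key passed to ``radosgw-admin metadata get`` is the section name followed by one or more identifiers, joined with colons. The following is a minimal sketch of how such keys could be assembled in a script. The helper names ``metadata_key`` and ``metadata_get_argv`` are hypothetical and are not part of RGW; only the command form itself comes from the listing above.

```python
def metadata_key(section, *parts):
    """Join a metadata section and its identifiers as '<section>:<part>[:<part>]'."""
    allowed = {"user", "bucket", "bucket.instance"}  # the three sections named above
    if section not in allowed:
        raise ValueError(f"unknown metadata section: {section!r}")
    if not parts or any(not p for p in parts):
        raise ValueError("at least one non-empty identifier is required")
    return ":".join((section, *parts))


def metadata_get_argv(section, *parts):
    """Build the argument list for 'radosgw-admin metadata get <key>'."""
    return ["radosgw-admin", "metadata", "get", metadata_key(section, *parts)]


# Bucket name and bucket ID taken from the examples elsewhere in this document.
print(metadata_get_argv("bucket.instance", "testcont", "default.4126.1"))
```

The returned list can be handed to ``subprocess.run()`` on a host that has the admin credentials; the sketch itself never contacts a cluster.
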
Every metadata entry is kept in a single RADOS object. See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
RADOS ``pgls`` operation on the containing pool.

Bucket Index
^^^^^^^^^^^^

The bucket index is a different kind of metadata, and is kept separately. It holds
a key-value map in RADOS objects. By default it is a single RADOS object per
bucket, but it is possible since Hammer to shard that map over multiple RADOS
objects. The map itself is kept in omap, associated with each RADOS object.
Each omap key is the name of an object, and the value holds some basic
metadata of that object, namely the metadata that shows up when listing the bucket.
Also, each omap holds a header, and we keep some bucket accounting metadata
in that header (number of objects, total size, etc.).

Note that we also hold other information in the bucket index, and it's kept in
other key namespaces. We can hold the bucket index log there, and for versioned
objects there is more information that we keep on other keys.

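To make the shape of a bucket index object concrete, here is a toy in-memory model of one index object: an omap keyed by object name plus a header with accounting totals. This is only a conceptual sketch. The class names, the field names, and the ``size``/``etag`` values are assumptions made for illustration, and the real encoded structures in RGW differ. The extra key namespaces (index log, versioning) are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class IndexHeader:
    """Accounting kept in the omap header (field names are illustrative)."""
    num_objects: int = 0
    total_size: int = 0


@dataclass
class BucketIndexObject:
    """Toy stand-in for one bucket index RADOS object and its omap."""
    header: IndexHeader = field(default_factory=IndexHeader)
    omap: dict = field(default_factory=dict)  # object name -> listing metadata

    def put(self, name, size, etag):
        old = self.omap.get(name)
        if old is not None:
            # Overwrite: back out the old entry's contribution first.
            self.header.num_objects -= 1
            self.header.total_size -= old["size"]
        self.omap[name] = {"size": size, "etag": etag}
        self.header.num_objects += 1
        self.header.total_size += size

    def remove(self, name):
        old = self.omap.pop(name)
        self.header.num_objects -= 1
        self.header.total_size -= old["size"]

    def listing(self, prefix=""):
        # Sort by key to mimic an ordered bucket listing.
        return [(k, self.omap[k]) for k in sorted(self.omap) if k.startswith(prefix)]
```

Keeping the totals in the header means that bucket statistics can be answered without walking every key, which is the point of the accounting metadata described above.
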
Data
^^^^

Object data is kept in one or more RADOS objects for each RGW object.

Object Lookup Path
------------------

When accessing objects, REST API requests arrive at RGW with three parameters:
account information (access key in S3 or account name in Swift),
bucket or container name, and object name (or key). At present, RGW only
uses account information to find out the user ID and for access control.
Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user
credentials and not a hashed or mapped identifier.

When accessing a user's data, the user record is loaded from an object
"<user_id>" in pool "default.rgw.meta" with namespace "users.uid".

Bucket names are represented in the pool "default.rgw.meta" with namespace
"root". The bucket record is loaded in order to obtain the so-called marker,
which serves as a bucket ID.

The object is located in pool "default.rgw.buckets.data".
The object name is "<marker>_<key>",
for example "default.7593.4_image.png", where the marker is "default.7593.4"
and the key is "image.png". Since these concatenated names are not parsed,
only passed down to RADOS, the choice of the separator is not important and
causes no ambiguity. For the same reason, slashes are permitted in object
names (keys).

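The lookup path above can be summarized as three name constructions. The sketch below is illustrative only: ``RadosLocation`` and the helper functions are hypothetical names, the pool names are the defaults quoted in the text, and the empty namespace for the data pool is an assumption because the text does not name one.

```python
from typing import NamedTuple


class RadosLocation(NamedTuple):
    pool: str
    namespace: str
    oid: str


META_POOL = "default.rgw.meta"          # default pool names quoted in the text
DATA_POOL = "default.rgw.buckets.data"


def user_record(user_id):
    """Where the user record for <user_id> lives."""
    return RadosLocation(META_POOL, "users.uid", user_id)


def bucket_record(bucket):
    """Where the bucket record (which yields the marker) lives."""
    return RadosLocation(META_POOL, "root", bucket)


def head_object(marker, key):
    """Where the RGW object's head lives (empty namespace is an assumption)."""
    # Marker and key are simply concatenated with "_". Nothing parses the
    # result again, so keys may contain "_" or "/" without ambiguity.
    return RadosLocation(DATA_POOL, "", f"{marker}_{key}")


print(head_object("default.7593.4", "image.png").oid)
```

Because the name is only ever built and never split, a key such as ``photos/2019_trip.png`` needs no escaping.
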
It is also possible to create multiple data pools and make it so that
different users' buckets will be created in different RADOS pools by default,
thus providing the necessary scaling. The layout and naming of these pools
are controlled by a 'policy' setting.[3]

An RGW object may consist of several RADOS objects, the first of which
is the head that contains the metadata, such as manifest, ACLs, content type,
ETag, and user-defined metadata. The metadata is stored in xattrs.
The head may also contain up to :confval:`rgw_max_chunk_size` of object data, for efficiency
and atomicity. The manifest describes how each object is laid out in RADOS
objects.

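As a rough illustration of the head-plus-tail idea, the sketch below plans how an object of a given size could be split into a head holding the first ``max_chunk_size`` bytes and a series of tail objects. It is a simplified model, not RGW's manifest format: the function name, the ``tail_size`` parameter, and the ``<head_oid>.tail.<n>`` naming are all invented for this example, and real tail objects are named differently.

```python
def plan_layout(head_oid, size, max_chunk_size, tail_size):
    """Return a toy manifest: a list of (rados_oid, offset, length) extents.

    The first extent is always the head. Tail object names of the form
    "<head_oid>.tail.<n>" are made up for this sketch.
    """
    if size < 0 or max_chunk_size <= 0 or tail_size <= 0:
        raise ValueError("size must be >= 0 and chunk sizes must be > 0")
    head_len = min(size, max_chunk_size)
    manifest = [(head_oid, 0, head_len)]
    offset, n = head_len, 1
    while offset < size:
        length = min(tail_size, size - offset)
        manifest.append((f"{head_oid}.tail.{n}", offset, length))
        offset += length
        n += 1
    return manifest


# Tiny sizes chosen only to keep the output readable.
for extent in plan_layout("default.7593.4_image.png", 10, 4, 4):
    print(extent)
```

An object that fits entirely within the head yields a single extent, which is the case where one RADOS write covers both the data and the metadata.
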
Bucket and Object Listing
-------------------------

Buckets that belong to a given user are listed in an omap of an object named
"<user_id>.buckets" (for example, "foo.buckets") in pool "default.rgw.meta"
with namespace "users.uid".
These objects are accessed when listing buckets, when updating bucket
contents, and when updating and retrieving bucket statistics (e.g. for quota).

See the user-visible, encoded class 'cls_user_bucket_entry' and its
nested class 'cls_user_bucket' for the values of these omap entries.

These listings are kept consistent with the bucket records in pool
"default.rgw.meta" with namespace "root".

Objects that belong to a given bucket are listed in a bucket index,
as discussed in sub-section 'Bucket Index' above. The default naming
for index objects is ".dir.<marker>" in pool "default.rgw.buckets.index".

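The index object names can be sketched as follows. The unsharded form ".dir.<marker>" is taken from the text. For a sharded index, this document only says that the shard index is appended after the marker, so the "." separator used below is an assumption, and ``index_object_names`` is a hypothetical helper.

```python
def index_object_names(marker, num_shards=0):
    """Return the bucket index object names for a bucket marker.

    num_shards <= 0 means an unsharded index with a single object.
    The "." before the shard number is an assumption of this sketch.
    """
    base = f".dir.{marker}"
    if num_shards <= 0:
        return [base]
    return [f"{base}.{shard}" for shard in range(num_shards)]


print(index_object_names("default.7593.4"))
print(index_object_names("default.7593.4", num_shards=2))
```
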
Footnotes
---------

[1] Omap is a key-value store, associated with an object, in a way similar
to how Extended Attributes associate with a POSIX file. An object's omap
is not physically located in the object's storage, but its precise
implementation is invisible and immaterial to RADOS Gateway.
In Hammer, LevelDB is used to store omap data within each OSD; later releases
default to RocksDB but can be configured to use LevelDB.

[2] Before the Dumpling release, the 'bucket.instance' metadata did not
exist and the 'bucket' metadata contained its information. It is possible
to encounter such buckets in old installations.

[3] Pool names changed with the Infernalis release.
If you are looking at an older setup, some details may be different. In
particular there was a different pool for each of the namespaces that are
now being used inside the ``default.rgw.meta`` pool.

Appendix: Compendium
--------------------

Known pools:

.rgw.root
  Unspecified region, zone, and global information records, one per object.

<zone>.rgw.control
  notify.<N>

<zone>.rgw.meta
  Multiple namespaces with different kinds of metadata:

  namespace: root
    <bucket>
    .bucket.meta.<bucket>:<marker>   # see put_bucket_instance_info()

    The tenant is used to disambiguate buckets, but not bucket instances.
    Example::

      .bucket.meta.prodtx:test%25star:default.84099.6
      .bucket.meta.testcont:default.4126.1
      .bucket.meta.prodtx:testcont:default.84099.4
      prodtx/testcont
      prodtx/test%25star
      testcont

  namespace: users.uid
    Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
    and per-user lists of buckets in omaps of "<user>.buckets" objects.
    The "<user>" may contain the tenant if non-empty, for example::

      prodtx$prodt
      test2.buckets
      prodtx$prodt.buckets
      test2

  namespace: users.email
    Unimportant

  namespace: users.keys
    47UA98JSTJZ9YAN3OS3O

    This allows ``radosgw`` to look up users by their access keys during authentication.

  namespace: users.swift
    test:tester

<zone>.rgw.buckets.index
  Objects are named ".dir.<marker>"; each contains a bucket index.
  If the index is sharded, each shard appends the shard index after
  the marker.

<zone>.rgw.buckets.data
  default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
  <marker>_<key>

An example of a marker would be "default.16004.1" or "default.7593.4".
The current format is "<zone>.<instance_id>.<bucket_id>". But once
generated, a marker is not parsed again, so its format may change
freely in the future.

208 | freely in the future. |