===========================
 Rados Gateway Data Layout
===========================

Although the source code is the ultimate guide, this document helps
new developers get up to speed with the implementation details.

Introduction
------------

Swift offers something called a container, which we use interchangeably with
the term bucket. One may say that RGW's buckets implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View
---------------

Although RADOS only knows about pools and objects with their xattrs and
omap[1], conceptually RGW organizes its data into three different kinds:
metadata, bucket index, and data.

Metadata
^^^^^^^^

We have 3 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'.
You can use the following commands to introspect metadata entries: ::

    $ radosgw-admin metadata list
    $ radosgw-admin metadata list bucket
    $ radosgw-admin metadata list bucket.instance
    $ radosgw-admin metadata list user

    $ radosgw-admin metadata get bucket:<bucket>
    $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
    $ radosgw-admin metadata get user:<user>   # get or set

The sections named in the commands above hold the following:

- user: Holds user information
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]

Every metadata entry is kept in a single rados object.
See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.

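As an illustration of the entry syntax used by the commands above, here is a
minimal Python sketch that builds the argument list for a "metadata get" call.
The helper name "metadata_get_argv" is hypothetical and not part of Ceph; only
the "radosgw-admin metadata get <section>:<key>" form is taken from this
document.

```python
def metadata_get_argv(section, *key_parts):
    """Build argv for `radosgw-admin metadata get <section>:<key>`.

    Hypothetical helper: key parts are joined with ':' to match the
    bucket.instance:<bucket>:<bucket_id> form shown in this document.
    """
    if section not in ("user", "bucket", "bucket.instance"):
        raise ValueError("unknown metadata section: %r" % (section,))
    if not key_parts:
        raise ValueError("at least one key part is required")
    entry = ":".join((section,) + key_parts)
    return ["radosgw-admin", "metadata", "get", entry]


print(metadata_get_argv("bucket", "mybucket"))
print(metadata_get_argv("bucket.instance", "mybucket", "default.7593.4"))
```

The returned list could be handed to subprocess.run() on a host where
radosgw-admin is installed and configured; the sketch itself does not contact
a cluster.
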
Bucket Index
^^^^^^^^^^^^

It's a different kind of metadata, and kept separately. The bucket index holds
a key-value map in rados objects. By default it is a single rados object per
bucket, but it is possible since Hammer to shard that map over multiple rados
objects. The map itself is kept in omap, associated with each rados object.
The key of each omap entry is the name of an object, and the value holds some
basic metadata of that object -- metadata that shows up when listing the bucket.
Also, each omap holds a header, and we keep some bucket accounting metadata
in that header (number of objects, total size, etc.).

Note that we also hold other information in the bucket index, and it's kept in
other key namespaces. We can hold the bucket index log there, and for versioned
objects there is more information that we keep on other keys.

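To make the shape of a bucket index object concrete, the following toy model
in Python mimics a single, unsharded index: a header with accounting data plus
a map keyed by object name. It is only a sketch. The class names and the
per-entry fields ("size", "etag") are assumptions chosen for illustration, not
RGW's actual encoded structures, and the other key namespaces (index log,
versioning) are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class IndexHeader:
    # Bucket accounting kept in the omap header (field names are assumed).
    num_objects: int = 0
    total_size: int = 0


@dataclass
class BucketIndexModel:
    # Toy stand-in for one index object: an omap header plus omap entries.
    header: IndexHeader = field(default_factory=IndexHeader)
    omap: dict = field(default_factory=dict)

    def put(self, name, size, etag):
        """Add or overwrite the entry for `name`, keeping the header in sync."""
        old = self.omap.get(name)
        if old is not None:
            # Overwrite: remove the old entry's contribution first.
            self.header.num_objects -= 1
            self.header.total_size -= old["size"]
        self.omap[name] = {"size": size, "etag": etag}
        self.header.num_objects += 1
        self.header.total_size += size

    def listing(self):
        """Return (name, metadata) pairs sorted by object name."""
        return [(name, self.omap[name]) for name in sorted(self.omap)]


index = BucketIndexModel()
index.put("b.png", 10, "etag-b")
index.put("a.png", 5, "etag-a")
index.put("b.png", 20, "etag-b2")  # overwrites the earlier b.png entry
print(index.header)
print([name for name, _ in index.listing()])
```

Keeping the counters in the header means bucket statistics can be read without
walking every entry, which is the purpose the accounting header serves in the
description above.
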
Data
^^^^

Object data is kept in one or more rados objects for each rgw object.

Object Lookup Path
------------------

When accessing objects, REST APIs come to RGW with three parameters:
account information (access key in S3 or account name in Swift),
bucket or container name, and object name (or key). At present, RGW only
uses account information to find out the user ID and for access control.
Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user
credentials and not a hashed or mapped identifier.

When accessing a user's data, the user record is loaded from an object
"<user_id>" in pool ".users.uid".

Bucket names are represented directly in the pool ".rgw". The bucket record is
loaded in order to obtain the so-called marker, which serves as a bucket ID.

The object is located in pool ".rgw.buckets". The object name is
"<marker>_<key>", for example "default.7593.4_image.png", where the marker is
"default.7593.4" and the key is "image.png". Since these concatenated names are
not parsed, only passed down to RADOS, the choice of the separator is not
important and causes no ambiguity. For the same reason, slashes are permitted
in object names (keys).

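The lookup steps above can be summarized in a short Python sketch that returns
the (pool, object name) pairs involved. The pool names and the "<marker>_<key>"
rule come from this document; the function names are hypothetical, and the
marker is passed in by the caller because the sketch does not read a real
bucket record.

```python
USER_POOL = ".users.uid"
BUCKET_POOL = ".rgw"
DATA_POOL = ".rgw.buckets"


def head_object_name(marker, key):
    """Name of the RADOS object for an RGW object: "<marker>_<key>"."""
    return "%s_%s" % (marker, key)


def lookup_path(user_id, bucket, marker, key):
    """Return the (pool, object name) pairs touched during a lookup.

    In RGW the marker is obtained from the bucket record; here it is an
    argument so that the sketch stays self-contained.
    """
    return [
        (USER_POOL, user_id),                         # user record
        (BUCKET_POOL, bucket),                        # bucket record
        (DATA_POOL, head_object_name(marker, key)),   # the object itself
    ]


for pool, name in lookup_path("foo", "mybucket", "default.7593.4", "image.png"):
    print(pool, name)
```

Because the name is built by plain concatenation and never split again, a key
such as "photos/2015/a.png" needs no special handling.
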
It is also possible to create multiple data pools and make it so that
different users' buckets will be created in different rados pools by default,
thus providing the necessary scaling. The layout and naming of these pools
are controlled by a 'policy' setting.[3]

An RGW object may consist of several RADOS objects, the first of which
is the head that contains the metadata, such as manifest, ACLs, content type,
ETag, and user-defined metadata. The metadata is stored in xattrs.
The head may also contain up to 512 kilobytes of object data, for efficiency
and atomicity. The manifest describes how each object is laid out in RADOS
objects.

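As a rough illustration of the head size limit, the following Python sketch
splits an object's payload into the part that fits in the head and the
remainder. It assumes that "512 kilobytes" means 512 * 1024 bytes, and the
function name is hypothetical. How the remainder is laid out across further
RADOS objects is described by the manifest and is not modeled here.

```python
HEAD_MAX_BYTES = 512 * 1024  # assumption: "512 kilobytes" read as 512 KiB


def split_head(data, head_max=HEAD_MAX_BYTES):
    """Split payload bytes into (head_part, remainder).

    head_part is what could be stored inline in the head object;
    remainder is everything that has to live in other RADOS objects.
    """
    return data[:head_max], data[head_max:]


small_head, small_rest = split_head(b"x" * 1000)
large_head, large_rest = split_head(b"x" * (600 * 1024))
print(len(small_head), len(small_rest))
print(len(large_head), len(large_rest))
```

A small object therefore fits in the head alone, so its data and metadata
share a single RADOS object.
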
Bucket and Object Listing
-------------------------

Buckets that belong to a given user are listed in an omap of an object named
"<user_id>.buckets" (for example, "foo.buckets") in pool ".users.uid".
These objects are accessed when listing buckets, when updating bucket
contents, and when updating and retrieving bucket statistics (e.g. for quota).

See the user-visible, encoded class 'cls_user_bucket_entry' and its
nested class 'cls_user_bucket' for the values of these omap entries.

These listings are kept consistent with buckets in pool ".rgw".

Objects that belong to a given bucket are listed in a bucket index,
as discussed in sub-section 'Bucket Index' above. The default naming
for index objects is ".dir.<marker>" in pool ".rgw.buckets.index".

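The two naming rules in this section can be written as a pair of small Python
helpers. The function names are hypothetical. Only the unsharded index name is
covered, since the exact name of a sharded index object is not spelled out
here.

```python
def user_buckets_object(user_id):
    """(pool, object) whose omap lists the buckets of `user_id`."""
    return (".users.uid", "%s.buckets" % user_id)


def bucket_index_object(marker):
    """(pool, object) holding the unsharded index of the bucket `marker`."""
    return (".rgw.buckets.index", ".dir.%s" % marker)


print(user_buckets_object("foo"))
print(bucket_index_object("default.7593.4"))
```
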
Footnotes
---------

[1] Omap is a key-value store, associated with an object, in a way similar
to how Extended Attributes associate with a POSIX file. An object's omap
is not physically located in the object's storage, but its precise
implementation is invisible and immaterial to RADOS Gateway.
In Hammer, one LevelDB is used to store omap in each OSD.

[2] Before the Dumpling release, the 'bucket.instance' metadata did not
exist and the 'bucket' metadata contained its information. It is possible
to encounter such buckets in old installations.

[3] In Infernalis, a pending commit exists that removes the need to prefix
all the rgw system pools with a period, and also renames all of these pools.
See GitHub pull request #4944 "rgw noperiod".

Appendix: Compendium
--------------------

Known pools:

.rgw.root
  Unspecified region, zone, and global information records, one per object.

.rgw.control
  notify.<N>

.rgw
  <bucket>
  .bucket.meta.<bucket>:<marker>   # see put_bucket_instance_info()

  The tenant is used to disambiguate buckets, but not bucket instances.
  Example::

    .bucket.meta.prodtx:test%25star:default.84099.6
    .bucket.meta.testcont:default.4126.1
    .bucket.meta.prodtx:testcont:default.84099.4
    prodtx/testcont
    prodtx/test%25star
    testcont

.rgw.gc
  gc.<N>

.users.uid
  Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
  and per-user lists of buckets in omaps of "<user>.buckets" objects.
  The "<user>" may contain the tenant if non-empty, for example::

    prodtx$prodt
    test2.buckets
    prodtx$prodt.buckets
    test2

.users.email
  Unimportant

.users
  47UA98JSTJZ9YAN3OS3O
  It's unclear why user ID is not used to name objects in this pool.

.users.swift
  test:tester

.rgw.buckets.index
  Objects are named ".dir.<marker>"; each contains a bucket index.
  If the index is sharded, each shard appends the shard index after
  the marker.

.rgw.buckets
  default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
  <marker>_<key>

An example of a marker would be "default.16004.1" or "default.7593.4".
The current format is "<zone>.<instance_id>.<bucket_id>". But once
generated, a marker is not parsed again, so its format may change
freely in the future.
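
The naming patterns visible in the examples of this appendix can be reproduced
with a few Python helpers. This is a sketch inferred from those examples only:
the function names are hypothetical, the tenant separators (":", "/" and "$")
are read off the sample object names, and any escaping of special characters
in bucket names (as in "test%25star") is not modeled.

```python
def format_marker(zone, instance_id, bucket_id):
    """Build a marker in the current "<zone>.<instance_id>.<bucket_id>" form."""
    return "%s.%s.%s" % (zone, instance_id, bucket_id)


def bucket_entry_object(bucket, tenant=""):
    """Object in pool ".rgw" named after the bucket, e.g. "prodtx/testcont"."""
    return "%s/%s" % (tenant, bucket) if tenant else bucket


def bucket_instance_object(bucket, marker, tenant=""):
    """Object in pool ".rgw" for a bucket instance (".bucket.meta....")."""
    name = "%s:%s" % (tenant, bucket) if tenant else bucket
    return ".bucket.meta.%s:%s" % (name, marker)


def user_object(user, tenant=""):
    """Object in pool ".users.uid" for a user, e.g. "prodtx$prodt"."""
    return "%s$%s" % (tenant, user) if tenant else user


marker = format_marker("default", 84099, 4)
print(bucket_entry_object("testcont", tenant="prodtx"))
print(bucket_instance_object("testcont", marker, tenant="prodtx"))
print(user_object("prodt", tenant="prodtx") + ".buckets")
```

Since a marker is never parsed after it is generated, code should treat it as
an opaque string and only ever build it, as "format_marker" does.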