===========================
 Rados Gateway Data Layout
===========================

Although the source code is the ultimate guide, this document helps
new developers get up to speed with the implementation details.

Introduction
------------

Swift offers something called a container, which we use interchangeably with
the term bucket. One may say that RGW's buckets implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View
---------------

Although RADOS only knows about pools and objects with their xattrs and
omap[1], conceptually RGW organizes its data into three different kinds:
metadata, bucket index, and data.

Metadata
^^^^^^^^

We have 3 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'.
You can use the following commands to introspect metadata entries: ::

    $ radosgw-admin metadata list
    $ radosgw-admin metadata list bucket
    $ radosgw-admin metadata list bucket.instance
    $ radosgw-admin metadata list user

    $ radosgw-admin metadata get bucket:<bucket>
    $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
    $ radosgw-admin metadata get user:<user>   # get or set

The metadata sections used in the commands above are:

- user: Holds user information
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]

Every metadata entry is kept in a single rados object. See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.

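As a minimal sketch of how the keys in the commands above are put together,
the helper below builds the "<section>:<key>" argument given to
"radosgw-admin metadata get". The function name ``metadata_key`` and its
validation rules are hypothetical and exist only for illustration; they are
not part of RGW.

```python
# Hypothetical helper: builds the "<section>:<key>" string used by
# "radosgw-admin metadata get". Not part of RGW itself.
SECTIONS = ("user", "bucket", "bucket.instance")


def metadata_key(section, *parts):
    """Return the metadata key for a section and its identifying parts."""
    if section not in SECTIONS:
        raise ValueError(f"unknown metadata section: {section!r}")
    # bucket.instance entries are addressed by bucket name and bucket id;
    # user and bucket entries are addressed by a single name.
    expected = 2 if section == "bucket.instance" else 1
    if len(parts) != expected:
        raise ValueError(f"{section} expects {expected} part(s), got {len(parts)}")
    return section + ":" + ":".join(parts)


if __name__ == "__main__":
    print(metadata_key("user", "foo"))
    print(metadata_key("bucket", "mybucket"))
    print(metadata_key("bucket.instance", "mybucket", "default.7593.4"))
```

The resulting string can be passed as the last argument of
"radosgw-admin metadata get".
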
Bucket Index
^^^^^^^^^^^^

It's a different kind of metadata, and kept separately. The bucket index holds
a key-value map in rados objects. By default it is a single rados object per
bucket, but it is possible since Hammer to shard that map over multiple rados
objects. The map itself is kept in omap, associated with each rados object.
The key of each omap entry is the name of an object, and the value holds some
basic metadata of that object -- metadata that shows up when listing the bucket.
Also, each omap holds a header, and we keep some bucket accounting metadata
in that header (number of objects, total size, etc.).

Note that we also hold other information in the bucket index, and it's kept in
other key namespaces. We can hold the bucket index log there, and for versioned
objects there is more information that we keep on other keys.

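The structure described above can be modelled in memory as a rough sketch.
All class, field, and function names below are hypothetical; real index
entries are encoded C++ structures. The shard selection uses CRC32 purely as
a stand-in, since this document does not specify the hash RGW actually uses.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class IndexHeader:
    """Accounting data kept in the omap header (illustrative fields only)."""
    num_objects: int = 0
    total_size: int = 0


@dataclass
class BucketIndexShard:
    """One rados index object: an omap header plus name -> metadata entries."""
    header: IndexHeader = field(default_factory=IndexHeader)
    omap: dict = field(default_factory=dict)

    def put(self, name, size, etag):
        # Overwriting an entry must not double-count it in the header.
        old = self.omap.get(name)
        if old is not None:
            self.header.num_objects -= 1
            self.header.total_size -= old["size"]
        self.omap[name] = {"size": size, "etag": etag}
        self.header.num_objects += 1
        self.header.total_size += size

    def remove(self, name):
        old = self.omap.pop(name)
        self.header.num_objects -= 1
        self.header.total_size -= old["size"]

    def list(self, prefix=""):
        # Listing a bucket reads keys in sorted order, as omap keys are sorted.
        return sorted(k for k in self.omap if k.startswith(prefix))


def shard_for(name, num_shards):
    """Pick a shard for an object name. CRC32 is an assumption, not RGW's hash."""
    return zlib.crc32(name.encode("utf-8")) % num_shards


if __name__ == "__main__":
    shard = BucketIndexShard()
    shard.put("image.png", 1024, "abc")
    shard.put("notes/todo.txt", 10, "def")
    print(shard.list(), shard.header)
```

Keeping the counters in the header means bucket statistics can be read
without scanning every key.
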
Data
^^^^

Object data is kept in one or more rados objects for each rgw object.

Object Lookup Path
------------------

When accessing objects, REST APIs come to RGW with three parameters:
account information (access key in S3 or account name in Swift),
bucket or container name, and object name (or key). At present, RGW only
uses account information to find out the user ID and for access control.
Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user
credentials and not a hashed or mapped identifier.

When accessing a user's data, the user record is loaded from an object
"<user_id>" in pool "default.rgw.meta" with namespace "users.uid".

Bucket names are represented in the pool "default.rgw.meta" with namespace
"root". The bucket record is loaded in order to obtain the so-called marker,
which serves as a bucket ID.

The object is located in pool "default.rgw.buckets.data".
The object name is "<marker>_<key>",
for example "default.7593.4_image.png", where the marker is "default.7593.4"
and the key is "image.png". Since these concatenated names are not parsed,
only passed down to RADOS, the choice of the separator is not important and
causes no ambiguity. For the same reason, slashes are permitted in object
names (keys).

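The lookup path above can be summarised with a small sketch that computes
where each record lives. The type ``RadosLocation`` and the three functions
are hypothetical names used only for illustration. The ``zone`` parameter
assumes the "<zone>.rgw.*" pool naming shown in the appendix, with "default"
as the zone.

```python
from typing import NamedTuple


class RadosLocation(NamedTuple):
    """Pool, namespace, and object name of one rados object (illustrative)."""
    pool: str
    namespace: str
    oid: str


def user_record(user_id, zone="default"):
    # The user record is an object named after the user ID itself.
    return RadosLocation(f"{zone}.rgw.meta", "users.uid", user_id)


def bucket_record(bucket, zone="default"):
    # The bucket record is loaded to obtain the marker (bucket ID).
    return RadosLocation(f"{zone}.rgw.meta", "root", bucket)


def head_object(marker, key, zone="default"):
    # The name is a plain concatenation; it is never parsed back, so keys
    # containing "_" or "/" cause no ambiguity.
    return RadosLocation(f"{zone}.rgw.buckets.data", "", f"{marker}_{key}")


if __name__ == "__main__":
    print(user_record("foo"))
    print(bucket_record("mybucket"))
    print(head_object("default.7593.4", "image.png"))
```

Note that the marker has to be read from the bucket record first; it cannot
be derived from the bucket name.
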
It is also possible to create multiple data pools and make it so that
different users' buckets will be created in different rados pools by default,
thus providing the necessary scaling. The layout and naming of these pools
is controlled by a 'policy' setting.[3]

An RGW object may consist of several RADOS objects, the first of which
is the head that contains the metadata, such as manifest, ACLs, content type,
ETag, and user-defined metadata. The metadata is stored in xattrs.
The head may also contain up to 512 kilobytes of object data, for efficiency
and atomicity. The manifest describes how each object is laid out in RADOS
objects.

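To make the head and tail split concrete, the sketch below divides an object
of a given size into a head part and a list of tail parts. The 512 KiB head
limit comes from the text above. The 4 MiB tail part size is an assumption
chosen for illustration, and the function name ``split_object`` is
hypothetical. In a real cluster the layout is whatever the manifest records.

```python
HEAD_MAX = 512 * 1024            # head data limit, from the text above
TAIL_PART_SIZE = 4 * 1024 * 1024  # ASSUMPTION: illustrative tail part size


def split_object(size):
    """Return (head_bytes, [tail_part_bytes, ...]) for an object of `size` bytes."""
    if size < 0:
        raise ValueError("size must be non-negative")
    head = min(size, HEAD_MAX)
    remaining = size - head
    tail = []
    while remaining > 0:
        part = min(remaining, TAIL_PART_SIZE)
        tail.append(part)
        remaining -= part
    return head, tail


if __name__ == "__main__":
    for size in (100, HEAD_MAX, HEAD_MAX + 1, 10 * 1024 * 1024):
        print(size, split_object(size))
```

Small objects fit entirely in the head, so they can be written with a single
RADOS operation.
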
Bucket and Object Listing
-------------------------

Buckets that belong to a given user are listed in an omap of an object named
"<user_id>.buckets" (for example, "foo.buckets") in pool "default.rgw.meta"
with namespace "users.uid".
These objects are accessed when listing buckets, when updating bucket
contents, and when updating and retrieving bucket statistics (e.g. for quota).

See the user-visible, encoded class 'cls_user_bucket_entry' and its
nested class 'cls_user_bucket' for the values of these omap entries.

These listings are kept consistent with the bucket records in pool
"default.rgw.meta" with namespace "root" (pool ".rgw" in older setups[3]).

Objects that belong to a given bucket are listed in a bucket index,
as discussed in sub-section 'Bucket Index' above. The default naming
for index objects is ".dir.<marker>" in pool "default.rgw.buckets.index".

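The names of the objects consulted for listings can be sketched as follows.
Both function names are hypothetical. For sharded indexes this document only
says that the shard index is appended after the marker, so the "." separator
used below is an assumption.

```python
def user_buckets_object(user_id, zone="default"):
    """Return (pool, namespace, oid) of the object whose omap lists a user's buckets."""
    return (f"{zone}.rgw.meta", "users.uid", f"{user_id}.buckets")


def index_objects(marker, num_shards=0, zone="default"):
    """Return the (pool, oid) pairs that hold the index of one bucket."""
    pool = f"{zone}.rgw.buckets.index"
    base = f".dir.{marker}"
    if num_shards <= 0:
        # Unsharded index: a single rados object per bucket.
        return [(pool, base)]
    # ASSUMPTION: shard index appended with a "." separator.
    return [(pool, f"{base}.{shard}") for shard in range(num_shards)]


if __name__ == "__main__":
    print(user_buckets_object("foo"))
    print(index_objects("default.7593.4"))
    print(index_objects("default.7593.4", num_shards=2))
```
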
Footnotes
---------

[1] Omap is a key-value store, associated with an object, in a way similar
to how Extended Attributes associate with a POSIX file. An object's omap
is not physically located in the object's storage, but its precise
implementation is invisible and immaterial to RADOS Gateway.
In Hammer, one LevelDB is used to store omap in each OSD.

[2] Before the Dumpling release, the 'bucket.instance' metadata did not
exist and the 'bucket' metadata contained its information. It is possible
to encounter such buckets in old installations.

[3] The pool names have been changed starting with the Infernalis release.
If you are looking at an older setup, some details may be different. In
particular there was a different pool for each of the namespaces that are
now being used inside the default.rgw.meta pool.

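Appendix: Compendium
--------------------

Known pools:

.rgw.root
  Unspecified region, zone, and global information records, one per object.

<zone>.rgw.control
  notify.<N>

<zone>.rgw.meta
  Multiple namespaces with different kinds of metadata:

  namespace: root
    <bucket>
    .bucket.meta.<bucket>:<marker>   # see put_bucket_instance_info()

    The tenant is used to disambiguate buckets, but not bucket instances.
    Example::

      .bucket.meta.prodtx:test%25star:default.84099.6
      .bucket.meta.testcont:default.4126.1
      .bucket.meta.prodtx:testcont:default.84099.4
      prodtx/testcont
      prodtx/test%25star
      testcont

  namespace: users.uid
    Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
    and per-user lists of buckets in omaps of "<user>.buckets" objects.
    The "<user>" may contain the tenant if non-empty, for example::

      prodtx$prodt
      test2.buckets
      prodtx$prodt.buckets
      test2

  namespace: users.email
    Unimportant

  namespace: users.keys
    47UA98JSTJZ9YAN3OS3O

    This allows radosgw to look up users by their access keys during authentication.

  namespace: users.swift
    test:tester

<zone>.rgw.buckets.index
  Objects are named ".dir.<marker>", each contains a bucket index.
  If the index is sharded, each shard appends the shard index after
  the marker.

<zone>.rgw.buckets.data
  default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
  <marker>_<key>

An example of a marker would be "default.16004.1" or "default.7593.4".
The current format is "<zone>.<instance_id>.<bucket_id>". But once
generated, a marker is not parsed again, so its format may change
freely in the future.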
170 prodtx/testcont
171 prodtx/test%25star
172 testcont
7c673cae 173
31f18b77
FG
174 namespace: users.uid
175 Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
176 and per-user lists of buckets in omaps of "<user>.buckets" objects.
177 The "<user>" may contain the tenant if non-empty, for example::
7c673cae 178
31f18b77
FG
179 prodtx$prodt
180 test2.buckets
181 prodtx$prodt.buckets
182 test2
7c673cae 183
31f18b77
FG
184 namespace: users.email
185 Unimportant
7c673cae 186
31f18b77
FG
187 namespace: users.keys
188 47UA98JSTJZ9YAN3OS3O
7c673cae 189
31f18b77 190 This allows radosgw to look up users by their access keys during authentication.
7c673cae 191
31f18b77
FG
192 namespace: users.swift
193 test:tester
194
195<zone>.rgw.buckets.index
7c673cae
FG
196 Objects are named ".dir.<marker>", each contains a bucket index.
197 If the index is sharded, each shard appends the shard index after
198 the marker.
199
31f18b77 200<zone>.rgw.buckets.data
7c673cae
FG
201 default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
202 <marker>_<key>
203
204An example of a marker would be "default.16004.1" or "default.7593.4".
205The current format is "<zone>.<instance_id>.<bucket_id>". But once
206generated, a marker is not parsed again, so its format may change
207freely in the future.