===========================
 Rados Gateway Data Layout
===========================

Although the source code is the ultimate guide, this document helps
new developers to get up to speed with the implementation details.

Introduction
------------

Swift offers a construct called a container, which this document uses
interchangeably with the term bucket. One may say that RGW's buckets
implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View
---------------

Although RADOS only knows about pools and objects with their xattrs and
omap[1], conceptually RGW organizes its data into three different kinds:
metadata, bucket index, and data.

Metadata
^^^^^^^^

RGW has three 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'.
You can use the following commands to introspect metadata entries: ::

    $ radosgw-admin metadata list
    $ radosgw-admin metadata list bucket
    $ radosgw-admin metadata list bucket.instance
    $ radosgw-admin metadata list user

    $ radosgw-admin metadata get bucket:<bucket>
    $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
    $ radosgw-admin metadata get user:<user>   # use 'metadata put' to set

The metadata sections used in the commands above are:

- user: Holds user information
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]
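
The key syntax used by the commands above can be sketched in Python. This is
only an illustration of the ``<section>:<id>`` pattern shown in this document;
the helper name ``metadata_key`` is hypothetical and is not part of any RGW
tool or library.

```python
# Hypothetical helper: builds the argument passed to
# "radosgw-admin metadata get", following the patterns listed above.
SECTIONS = ("user", "bucket", "bucket.instance")


def metadata_key(section, *parts):
    """Return '<section>:<part>[:<part>...]' for a known metadata section."""
    if section not in SECTIONS:
        raise ValueError("unknown metadata section: %r" % (section,))
    if not parts:
        raise ValueError("at least one identifier is required")
    return section + ":" + ":".join(parts)


if __name__ == "__main__":
    print(metadata_key("user", "foo"))
    print(metadata_key("bucket", "mybucket"))
    # The bucket id below is an example value, not taken from a real cluster.
    print(metadata_key("bucket.instance", "mybucket", "default.7593.4"))
```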

Every metadata entry is kept on a single rados object.
See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.

Bucket Index
^^^^^^^^^^^^

The bucket index is a different kind of metadata, and it is kept separately.
It holds a key-value map in rados objects. By default it is a single rados
object per bucket, but since Hammer it is possible to shard that map over
multiple rados objects. The map itself is kept in omap, associated with each
rados object. Each omap key is the name of an object, and the value holds some
basic metadata of that object, namely the metadata that shows up when listing
the bucket. Each omap also holds a header, and we keep some bucket accounting
metadata in that header (number of objects, total size, etc.).
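
As a conceptual aid, the index described above can be modelled as a dictionary
plus a small accounting header. The sketch below is a toy model only: the class
name ``ToyBucketIndex`` and the field names (``num_objects``, ``total_size``,
``size``, ``etag``) are assumptions chosen for illustration and do not mirror
RGW's encoded structures.

```python
class ToyBucketIndex:
    """Toy model of one bucket index object: an omap plus a header."""

    def __init__(self):
        # omap: object name -> basic per-object metadata
        self.omap = {}
        # header: bucket-wide accounting (hypothetical field names)
        self.header = {"num_objects": 0, "total_size": 0}

    def put(self, name, size, etag):
        """Insert or overwrite an entry and keep the header consistent."""
        old = self.omap.get(name)
        if old is not None:
            self.header["num_objects"] -= 1
            self.header["total_size"] -= old["size"]
        self.omap[name] = {"size": size, "etag": etag}
        self.header["num_objects"] += 1
        self.header["total_size"] += size

    def remove(self, name):
        """Delete an entry and subtract it from the accounting header."""
        old = self.omap.pop(name)
        self.header["num_objects"] -= 1
        self.header["total_size"] -= old["size"]

    def list(self):
        """Return entries sorted by key, as a bucket listing would need."""
        return sorted(self.omap.items())
```

Listing the bucket then only needs to read this map and never has to touch the
data objects themselves.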

Note that we also hold other information in the bucket index, and it's kept in
other key namespaces. We can hold the bucket index log there, and for versioned
objects there is more information that we keep on other keys.

Data
^^^^

Object data is kept in one or more rados objects for each rgw object.

Object Lookup Path
------------------

When accessing objects, REST APIs come to RGW with three parameters:
account information (access key in S3 or account name in Swift),
bucket or container name, and object name (or key). At present, RGW only
uses account information to find out the user ID and for access control.
Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user
credentials and not a hashed or mapped identifier.

When accessing a user's data, the user record is loaded from an object
"<user_id>" in pool ".users.uid".

Bucket names are represented directly in the pool ".rgw". The bucket record is
loaded in order to obtain the so-called marker, which serves as a bucket ID.

The object is located in pool ".rgw.buckets". The object name is
"<marker>_<key>", for example "default.7593.4_image.png", where the marker is
"default.7593.4" and the key is "image.png". Since these concatenated names
are not parsed, only passed down to RADOS, the choice of the separator is not
important and causes no ambiguity. For the same reason, slashes are permitted
in object names (keys).
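
The last step of the lookup path can be sketched as follows. This is a minimal
illustration and not RGW code: the function name ``head_object_location`` is
hypothetical, and the pool name is the default mentioned in this document.

```python
DATA_POOL = ".rgw.buckets"  # default data pool named in this document


def head_object_location(marker, key):
    """Return (pool, rados_object_name) for an RGW object.

    The name is a plain concatenation "<marker>_<key>". It is never parsed
    back, which is why the key may itself contain "_" or "/".
    """
    return DATA_POOL, "%s_%s" % (marker, key)


if __name__ == "__main__":
    print(head_object_location("default.7593.4", "image.png"))
    # A key with slashes is passed through unchanged.
    print(head_object_location("default.7593.4", "photos/2017/image.png"))
```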

It is also possible to create multiple data pools and make it so that
different users' buckets will be created in different rados pools by default,
thus providing the necessary scaling. The layout and naming of these pools
is controlled by a 'policy' setting.[3]

An RGW object may consist of several RADOS objects, the first of which
is the head that contains the metadata, such as manifest, ACLs, content type,
ETag, and user-defined metadata. The metadata is stored in xattrs.
The head may also contain up to 512 kilobytes of object data, for efficiency
and atomicity. The manifest describes how each object is laid out in RADOS
objects.
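
To make the head/tail split concrete, the sketch below computes a toy layout
for an object of a given size. Only the 512 kilobyte head limit comes from the
text above. The 4 MiB tail chunk size, the ``tail.<n>`` part labels, and the
tuple-based "manifest" are assumptions made for illustration; they are not
RGW's actual manifest format or naming.

```python
HEAD_MAX = 512 * 1024          # head data limit stated above
TAIL_CHUNK = 4 * 1024 * 1024   # ASSUMPTION: illustrative tail chunk size


def toy_manifest(object_size):
    """Return a list of (part_label, offset, length) covering the object."""
    if object_size < 0:
        raise ValueError("object_size must be non-negative")
    head_len = min(object_size, HEAD_MAX)
    parts = [("head", 0, head_len)]
    offset = head_len
    index = 1
    while offset < object_size:
        length = min(TAIL_CHUNK, object_size - offset)
        # "tail.<n>" is a hypothetical label, not a real RADOS object name.
        parts.append(("tail.%d" % index, offset, length))
        offset += length
        index += 1
    return parts


if __name__ == "__main__":
    for part in toy_manifest(10 * 1024 * 1024):
        print(part)
```

A small object fits entirely in the head, so it can be written as a single
RADOS object, which is the efficiency and atomicity benefit mentioned above.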

Bucket and Object Listing
-------------------------

Buckets that belong to a given user are listed in an omap of an object named
"<user_id>.buckets" (for example, "foo.buckets") in pool ".users.uid".
These objects are accessed when listing buckets, when updating bucket
contents, and when updating and retrieving bucket statistics (e.g. for quota).

See the user-visible, encoded class 'cls_user_bucket_entry' and its
nested class 'cls_user_bucket' for the values of these omap entries.

These listings are kept consistent with buckets in pool ".rgw".

Objects that belong to a given bucket are listed in a bucket index,
as discussed in sub-section 'Bucket Index' above. The default naming
for index objects is ".dir.<marker>" in pool ".rgw.buckets.index".
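
The two listing objects named above can be located with simple string
formatting. The sketch below assumes the default pool names given in this
document and an unsharded bucket index; the function names are hypothetical.

```python
USER_POOL = ".users.uid"
INDEX_POOL = ".rgw.buckets.index"


def user_buckets_object(user_id):
    """Return (pool, name) of the object whose omap lists a user's buckets."""
    return USER_POOL, "%s.buckets" % user_id


def bucket_index_object(marker):
    """Return (pool, name) of the (unsharded) bucket index object."""
    return INDEX_POOL, ".dir.%s" % marker


if __name__ == "__main__":
    print(user_buckets_object("foo"))
    print(bucket_index_object("default.7593.4"))
```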

Footnotes
---------

[1] Omap is a key-value store, associated with an object, in a way similar
to how Extended Attributes associate with a POSIX file. An object's omap
is not physically located in the object's storage, but its precise
implementation is invisible and immaterial to RADOS Gateway.
In Hammer, each OSD stores omap data in a single LevelDB instance.

[2] Before the Dumpling release, the 'bucket.instance' metadata did not
exist and the 'bucket' metadata contained its information. It is possible
to encounter such buckets in old installations.

[3] In Infernalis, a pending commit exists that removes the need to prefix
all the rgw system pools with a period, and also renames all of these pools.
See GitHub pull request #4944 "rgw noperiod".

Appendix: Compendium
--------------------

Known pools:

.rgw.root
  Unspecified region, zone, and global information records, one per object.

.rgw.control
  notify.<N>

.rgw
  <bucket>
  .bucket.meta.<bucket>:<marker>   # see put_bucket_instance_info()

  The tenant is used to disambiguate buckets, but not bucket instances.
  Example:

  .bucket.meta.prodtx:test%25star:default.84099.6
  .bucket.meta.testcont:default.4126.1
  .bucket.meta.prodtx:testcont:default.84099.4
  prodtx/testcont
  prodtx/test%25star
  testcont

.rgw.gc
  gc.<N>

.users.uid
  Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
  and per-user lists of buckets in omaps of "<user>.buckets" objects.
  The "<user>" may contain the tenant if non-empty, for example:

  prodtx$prodt
  test2.buckets
  prodtx$prodt.buckets
  test2

.users.email
  Unimportant

.users
  47UA98JSTJZ9YAN3OS3O
  It's unclear why user ID is not used to name objects in this pool.

.users.swift
  test:tester

.rgw.buckets.index
  Objects are named ".dir.<marker>", each contains a bucket index.
  If the index is sharded, each shard appends the shard index after
  the marker.

.rgw.buckets
  default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
  <marker>_<key>

An example of a marker would be "default.16004.1" or "default.7593.4".
The current format is "<zone>.<instance_id>.<bucket_id>". But once
generated, a marker is not parsed again, so its format may change
freely in the future.
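
The naming patterns listed in this appendix can be reproduced with a few
formatting helpers. This is a sketch derived only from the example names
above; the function names are hypothetical, and any escaping of special
characters in bucket names (as in "test%25star") is assumed to have been
applied by the caller. Because a marker is never parsed once generated, the
sketch only composes markers and does not split them.

```python
def make_marker(zone, instance_id, bucket_id):
    """Compose a marker in the current "<zone>.<instance_id>.<bucket_id>" form."""
    return "%s.%s.%s" % (zone, instance_id, bucket_id)


def bucket_entry_object(bucket, tenant=""):
    """Name of the bucket entry object in pool ".rgw"."""
    return "%s/%s" % (tenant, bucket) if tenant else bucket


def bucket_instance_object(bucket, marker, tenant=""):
    """Name of the bucket instance object in pool ".rgw"."""
    prefix = "%s:" % tenant if tenant else ""
    return ".bucket.meta.%s%s:%s" % (prefix, bucket, marker)


def user_object(user, tenant=""):
    """Name of the per-user object in pool ".users.uid"."""
    return "%s$%s" % (tenant, user) if tenant else user


if __name__ == "__main__":
    marker = make_marker("default", 84099, 4)
    print(bucket_entry_object("testcont", tenant="prodtx"))
    print(bucket_instance_object("testcont", marker, tenant="prodtx"))
    print(user_object("prodt", tenant="prodtx") + ".buckets")
```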