===========================
 Rados Gateway Data Layout
===========================

Although the source code is the ultimate guide, this document helps
new developers to get up to speed with the implementation details.

Introduction
------------

Swift offers something called a container, a term that we use interchangeably
with bucket. One may say that RGW's buckets implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View
---------------

Although RADOS only knows about pools and objects with their xattrs and
omap[1], conceptually RGW organizes its data into three different kinds:
metadata, bucket index, and data.

Metadata
^^^^^^^^

We have 3 'sections' of metadata: 'user', 'bucket', and 'bucket.instance'.
You can use the following commands to introspect metadata entries: ::

    $ radosgw-admin metadata list
    $ radosgw-admin metadata list bucket
    $ radosgw-admin metadata list bucket.instance
    $ radosgw-admin metadata list user

    $ radosgw-admin metadata get bucket:<bucket>
    $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
    $ radosgw-admin metadata get user:<user>   # get or set
    
The metadata sections used in the commands above are:

- user: Holds user information
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]

Every metadata entry is kept on a single rados object.
See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.
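
The consequence of having no index can be pictured with a short Python sketch.
It is a toy model under stated assumptions: the pool is represented as a plain
list of object names, 'list_section' is a hypothetical helper rather than an
RGW function, and the ".bucket.meta." prefix is taken from the pool listing in
the appendix of this document.

.. code-block:: python

    INSTANCE_PREFIX = ".bucket.meta."


    def list_section(object_names, section):
        """List metadata keys of one section by scanning every name in the pool."""
        if section == "bucket.instance":
            keys = [n[len(INSTANCE_PREFIX):] for n in object_names
                    if n.startswith(INSTANCE_PREFIX)]
        elif section == "bucket":
            keys = [n for n in object_names if not n.startswith(INSTANCE_PREFIX)]
        else:
            raise ValueError(f"unsupported section: {section}")
        return sorted(keys)


    # Object names as they might appear in pool ".rgw" (see the appendix).
    rgw_pool = [
        ".bucket.meta.prodtx:test%25star:default.84099.6",
        ".bucket.meta.testcont:default.4126.1",
        ".bucket.meta.prodtx:testcont:default.84099.4",
        "prodtx/testcont",
        "prodtx/test%25star",
        "testcont",
    ]

    print(list_section(rgw_pool, "bucket"))
    print(list_section(rgw_pool, "bucket.instance"))

Every listing walks the whole pool, so its cost grows with the number of
objects in the pool rather than with the size of the section being listed.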

Bucket Index
^^^^^^^^^^^^

The bucket index is a different kind of metadata, and it is kept separately.
It holds a key-value map in rados objects. By default it is a single rados
object per bucket, but since Hammer it is possible to shard that map over
multiple rados objects. The map itself is kept in omap, associated with each
rados object. The key of each omap entry is the name of an object, and the
value holds some basic metadata of that object -- metadata that shows up when
listing the bucket. Also, each omap holds a header, and we keep some bucket
accounting metadata in that header (number of objects, total size, etc.).
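
As a mental model, one bucket index object can be sketched as a header plus a
key-value map. The Python below is illustrative only: the class name
'BucketIndexShard' and the field names 'num_objects', 'total_size', 'size',
and 'etag' are hypothetical and do not correspond to RGW's encoded structures.

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class BucketIndexShard:
        """Toy model of one bucket index rados object (hypothetical names)."""
        # omap header: bucket accounting metadata
        header: dict = field(
            default_factory=lambda: {"num_objects": 0, "total_size": 0})
        # omap entries: object name -> basic metadata shown in bucket listings
        omap: dict = field(default_factory=dict)

        def put(self, name, size, etag):
            """Add or overwrite an entry and keep the header accounting in step."""
            old = self.omap.get(name)
            if old is None:
                self.header["num_objects"] += 1
            else:
                self.header["total_size"] -= old["size"]
            self.omap[name] = {"size": size, "etag": etag}
            self.header["total_size"] += size

        def list_objects(self):
            """Return object names in key order, as a bucket listing would."""
            return sorted(self.omap)


    idx = BucketIndexShard()
    idx.put("image.png", 100, "abc")
    idx.put("a/b.txt", 50, "def")
    idx.put("image.png", 70, "ghi")   # overwrite: count unchanged, size adjusted
    print(idx.header, idx.list_objects())

The point of the header in this model is that bucket statistics can be read
without walking every entry of the map.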

Note that we also hold other information in the bucket index, and it's kept in
other key namespaces. We can hold the bucket index log there, and for versioned
objects there is more information that we keep on other keys.

Data
^^^^

Object data is kept in one or more rados objects for each rgw object.

Object Lookup Path
------------------

When accessing objects, REST API requests come to RGW with three parameters:
account information (access key in S3 or account name in Swift),
bucket or container name, and object name (or key). At present, RGW only
uses account information to find out the user ID and for access control.
Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user
credentials and not a hashed or mapped identifier.

When accessing a user's data, the user record is loaded from an object
"<user_id>" in pool ".users.uid".

Bucket names are represented directly in the pool ".rgw". The bucket record is
loaded in order to obtain the so-called marker, which serves as a bucket ID.

The object is located in pool ".rgw.buckets". Object name is "<marker>_<key>",
for example "default.7593.4_image.png", where the marker is "default.7593.4"
and the key is "image.png". Since these concatenated names are not parsed,
only passed down to RADOS, the choice of the separator is not important and
causes no ambiguity. For the same reason, slashes are permitted in object
names (keys).
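
The lookup steps above can be sketched in Python. This is a simplified
illustration, not RGW code: pools are modeled as dictionaries, the bucket
record is reduced to a single 'marker' field, and the helpers
'head_object_name' and 'lookup' are hypothetical names.

.. code-block:: python

    def head_object_name(marker: str, key: str) -> str:
        """Build the RADOS object name "<marker>_<key>" by plain concatenation."""
        return f"{marker}_{key}"


    # Toy stand-ins for pools: each maps a RADOS object name to its content.
    pools = {
        ".rgw": {"testcont": {"marker": "default.7593.4"}},
        ".rgw.buckets": {
            "default.7593.4_image.png": b"png bytes",
            "default.7593.4_photos/2015/cat_1.png": b"cat bytes",
        },
    }


    def lookup(bucket: str, key: str) -> bytes:
        """Resolve bucket name -> marker -> object name, then read the object."""
        marker = pools[".rgw"][bucket]["marker"]
        return pools[".rgw.buckets"][head_object_name(marker, key)]


    print(head_object_name("default.7593.4", "image.png"))
    print(lookup("testcont", "photos/2015/cat_1.png"))

The name is only ever built, never split, which is why a key containing
underscores or slashes needs no escaping in this scheme.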

It is also possible to create multiple data pools so that different users'
buckets are created in different rados pools by default, thus providing the
necessary scaling. The layout and naming of these pools is controlled by a
'policy' setting.[3]

An RGW object may consist of several RADOS objects, the first of which
is the head that contains the metadata, such as manifest, ACLs, content type,
ETag, and user-defined metadata. The metadata is stored in xattrs.
The head may also contain up to 512 kilobytes of object data, for efficiency
and atomicity. The manifest describes how each object is laid out in RADOS
objects.
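
To make the head and tail split concrete, the following Python sketch divides
an object's payload into a head portion and tail chunks and records the result
in a simple manifest. It is a simplified model: the 4 MiB tail chunk size, the
".tail_<n>" names, and the manifest layout are assumptions chosen for
illustration, not RGW's actual values or formats.

.. code-block:: python

    HEAD_MAX = 512 * 1024          # the head may hold up to 512 kilobytes of data
    TAIL_CHUNK = 4 * 1024 * 1024   # assumed tail chunk size, illustration only


    def lay_out(data: bytes, head_name: str):
        """Return ({rados_object_name: bytes}, manifest) for one RGW object.

        The manifest is a list of (rados_object_name, offset, length) tuples.
        """
        head = data[:HEAD_MAX]
        objects = {head_name: head}
        manifest = [(head_name, 0, len(head))]
        offset = HEAD_MAX
        n = 0
        while offset < len(data):
            chunk = data[offset:offset + TAIL_CHUNK]
            name = f"{head_name}.tail_{n}"   # hypothetical tail naming
            objects[name] = chunk
            manifest.append((name, offset, len(chunk)))
            offset += len(chunk)
            n += 1
        return objects, manifest


    payload = bytes(5 * 1024 * 1024)   # a 5 MiB object
    objects, manifest = lay_out(payload, "default.7593.4_image.png")
    for name, offset, length in manifest:
        print(name, offset, length)

In this model a small object fits entirely in the head, so it occupies a single
RADOS object; larger objects add tail objects that the manifest ties back to
the head.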

Bucket and Object Listing
-------------------------

Buckets that belong to a given user are listed in an omap of an object named
"<user_id>.buckets" (for example, "foo.buckets") in pool ".users.uid".
These objects are accessed when listing buckets, when updating bucket
contents, and updating and retrieving bucket statistics (e.g. for quota).
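
The naming of the per-user objects can be sketched as follows. The helper
'user_object_names' is hypothetical; the "<tenant>$<user>" form is taken from
the example object names listed for pool ".users.uid" in the appendix.

.. code-block:: python

    def user_object_names(user_id: str, tenant: str = ""):
        """Return (user record name, bucket list name) in pool ".users.uid"."""
        full_id = f"{tenant}${user_id}" if tenant else user_id
        return full_id, f"{full_id}.buckets"


    print(user_object_names("foo"))
    print(user_object_names("prodt", tenant="prodtx"))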

See the user-visible, encoded class 'cls_user_bucket_entry' and its
nested class 'cls_user_bucket' for the values of these omap entries.

These listings are kept consistent with buckets in pool ".rgw".

Objects that belong to a given bucket are listed in a bucket index,
as discussed in sub-section 'Bucket Index' above. The default naming
for index objects is ".dir.<marker>" in pool ".rgw.buckets.index".
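
A short Python sketch of the index object naming follows. The helper
'index_object_names' is hypothetical, and the "." separator placed before the
shard number is an assumption: this document only states that a sharded index
appends the shard index after the marker.

.. code-block:: python

    def index_object_names(marker: str, num_shards: int = 0):
        """Return the names of a bucket's index objects in ".rgw.buckets.index"."""
        if num_shards <= 0:
            # Unsharded index: a single object per bucket.
            return [f".dir.{marker}"]
        # Sharded index: one object per shard (separator is an assumption).
        return [f".dir.{marker}.{shard}" for shard in range(num_shards)]


    print(index_object_names("default.7593.4"))
    print(index_object_names("default.7593.4", num_shards=3))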

Footnotes
---------

[1] Omap is a key-value store, associated with an object, in a way similar
to how Extended Attributes associate with a POSIX file. An object's omap
is not physically located in the object's storage, but its precise
implementation is invisible and immaterial to RADOS Gateway.
In Hammer, one LevelDB is used to store omap in each OSD.

[2] Before the Dumpling release, the 'bucket.instance' metadata did not
exist and the 'bucket' metadata contained its information. It is possible
to encounter such buckets in old installations.

[3] In Infernalis, a pending commit exists that removes the need to prefix
all the rgw system pools with a period, and also renames all of these pools.
See GitHub pull request #4944 "rgw noperiod".

Appendix: Compendium
--------------------

Known pools:

.rgw.root
  Unspecified region, zone, and global information records, one per object.

.rgw.control
  notify.<N>

.rgw
  <bucket>
  .bucket.meta.<bucket>:<marker>   # see put_bucket_instance_info()

  The tenant is used to disambiguate buckets, but not bucket instances.
  Example: ::

    .bucket.meta.prodtx:test%25star:default.84099.6
    .bucket.meta.testcont:default.4126.1
    .bucket.meta.prodtx:testcont:default.84099.4
    prodtx/testcont
    prodtx/test%25star
    testcont

.rgw.gc
  gc.<N>

.users.uid
  Contains _both_ per-user information (RGWUserInfo) in "<user>" objects
  and per-user lists of buckets in omaps of "<user>.buckets" objects.
  The "<user>" may contain the tenant if non-empty, for example: ::

    prodtx$prodt
    test2.buckets
    prodtx$prodt.buckets
    test2

.users.email
  Unimportant

.users
  47UA98JSTJZ9YAN3OS3O
  It's unclear why user ID is not used to name objects in this pool.

.users.swift
  test:tester

.rgw.buckets.index
  Objects are named ".dir.<marker>", each contains a bucket index.
  If the index is sharded, each shard appends the shard index after
  the marker.

.rgw.buckets
  default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1
  <marker>_<key>

An example of a marker would be "default.16004.1" or "default.7593.4".
The current format is "<zone>.<instance_id>.<bucket_id>". But once
generated, a marker is not parsed again, so its format may change
freely in the future.