==============
Architecture
==============

:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability: thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically.

.. image:: images/stack.png

.. _arch-ceph-storage-cluster:

The Ceph Storage Cluster
========================

Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters`_.

A Ceph Storage Cluster consists of multiple types of daemons:

- :term:`Ceph Monitor`
- :term:`Ceph OSD Daemon`
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`

.. _arch_monitor:

Ceph Monitors maintain the master copy of the cluster map, which they provide
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
availability in the event that one of the monitor daemons or its host fails.

A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.

A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.

A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.

Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to compute information about data location. This means that clients and OSDs
are not bottlenecked by a central lookup table. Ceph's high-level features
include a native interface to the Ceph Storage Cluster via ``librados``, and a
number of service interfaces built on top of ``librados``.

Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System`, or a custom implementation that you create by using
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
objects. Each object is stored on an :term:`Object Storage Device` (this is
also called an "OSD"). Ceph OSDs control read, write, and replication
operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.

.. ditaa::

           /------\       +-----+       +-----+
           | obj  |------>| {d} |------>| {s} |
           \------/       +-----+       +-----+

            Object         OSD          Drive

Ceph OSD Daemons store data as objects in a flat namespace. This means that
objects are not stored in a hierarchy of directories. An object has an
identifier, binary data, and metadata consisting of name/value pairs.
:term:`Ceph Client`\s determine the semantics of the object data. For example,
CephFS uses metadata to store file attributes such as the file owner, the
created date, and the last modified date.

.. ditaa::

   /------+------------------------------+----------------\
   | ID   | Binary Data                  | Metadata       |
   +------+------------------------------+----------------+
   | 1234 | 0101010101010100110101010010 | name1 = value1 |
   |      | 0101100001010100110101010010 | name2 = value2 |
   |      | 0101100001010100110101010010 | nameN = valueN |
   \------+------------------------------+----------------/

.. note:: An object ID is unique across the entire cluster, not just the local
   filesystem.
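
The following is a minimal sketch of these concepts using the Python
``librados`` bindings. The pool name ``mypool``, the object name, and the
attribute names are illustrative and not part of the architecture itself:

.. code-block:: python

    import rados

    # Connect to the cluster described by the local ceph.conf.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('mypool')        # pool must already exist
        try:
            # An object is an identifier plus binary data ...
            ioctx.write_full('1234', b'binary payload')
            # ... plus metadata stored as name/value pairs.
            ioctx.set_xattr('1234', 'name1', b'value1')
            print(ioctx.read('1234'))
            print(ioctx.get_xattr('1234', 'name1'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()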


.. index:: architecture; high availability, scalability

.. _arch_scalability_and_high_availability:

Scalability and High Availability
---------------------------------

In traditional architectures, clients talk to a centralized component. This
centralized component might be a gateway, a broker, an API, or a facade. A
centralized component of this kind acts as a single point of entry to a complex
subsystem. Architectures that rely upon such a centralized component have a
single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.

Ceph eliminates this centralized component. This enables clients to interact
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
to ensure data safety and high availability. Ceph also uses a cluster of
monitors to ensure high availability. To eliminate centralization, Ceph uses an
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.


.. index:: CRUSH; architecture

CRUSH Introduction
~~~~~~~~~~~~~~~~~~

Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to compute information about
object location instead of relying upon a central lookup table. CRUSH provides
a better data management mechanism than do older approaches, and CRUSH enables
massive scale by distributing the work to all the OSD daemons in the cluster
and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale
storage. The following sections provide additional details on how CRUSH works.
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Decentralized Placement of Replicated Data`_.

.. index:: architecture; cluster map

.. _architecture_cluster_map:

Cluster Map
~~~~~~~~~~~

In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
must have current information about the cluster's topology. Current information
is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:

#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
   the address, and the TCP port of each monitor. The monitor map specifies the
   current epoch, the time of the monitor map's creation, and the time of the
   monitor map's last modification. To view a monitor map, run ``ceph mon
   dump``.

#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
   creation, the time of the OSD map's last modification, a list of pools, a
   list of replica sizes, a list of PG numbers, and a list of OSDs and their
   statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
   osd dump``.

#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
   epoch, the full ratios, and the details of each placement group. This
   includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
   example, ``active + clean``), and data usage statistics for each pool.

#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
   hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
   and rules for traversing the hierarchy when storing data. To view a CRUSH
   map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
   running ``crushtool -d {comp-crushmap-filename} -o
   {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
   decompiled map.

#. **The MDS Map:** Contains the current MDS map epoch, when the map was
   created, and the last time it changed. It also contains the pool for
   storing metadata, a list of metadata servers, and which metadata servers
   are ``up`` and ``in``. To view an MDS map, run ``ceph fs dump``.

Each map maintains a history of changes to its operating state. Ceph Monitors
maintain a master copy of the cluster map. This master copy includes the
cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
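
Each of these maps can also be retrieved programmatically. Below is a hedged
sketch that uses the monitor-command interface of the Python ``librados``
bindings to fetch the OSD map as JSON, equivalent to running ``ceph osd dump
--format json`` on the command line:

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        cmd = json.dumps({'prefix': 'osd dump', 'format': 'json'})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        if ret == 0:
            osdmap = json.loads(outbuf)
            print('OSD map epoch:', osdmap['epoch'])
    finally:
        cluster.shutdown()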

.. index:: high availability; monitor architecture

High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~

A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
cluster map in order to read data from or to write data to the Ceph cluster.

It is possible for a Ceph cluster to function properly with only a single
monitor, but a Ceph cluster that has only a single monitor has a single point
of failure: if the monitor goes down, Ceph clients will be unable to read data
from or write data to the cluster.

Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).

See the `Monitor Config Reference`_ for more detail on configuring monitors.

.. index:: architecture; high availability authentication

.. _arch_high_availability_authentication:

High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``cephx`` authentication system is used by Ceph to authenticate users and
daemons and to protect against man-in-the-middle attacks.

.. note:: The ``cephx`` protocol does not address data encryption in transport
   (for example, SSL/TLS) or encryption at rest.

``cephx`` uses shared secret keys for authentication. This means that both the
client and the monitor cluster keep a copy of the client's secret key.

The ``cephx`` protocol makes it possible for each party to prove to the other
that it has a copy of the key without revealing it. This provides mutual
authentication: the cluster is assured that the user possesses the secret key,
and the user is assured that the cluster has a copy of the secret key.

As stated in :ref:`Scalability and High Availability
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
interface between clients and the Ceph object store. By avoiding such a
centralized interface, Ceph avoids the bottlenecks that attend such centralized
interfaces. However, this means that clients must interact directly with OSDs.
Direct interactions between Ceph clients and OSDs require authenticated
connections. The ``cephx`` authentication system establishes and sustains these
authenticated connections.

The ``cephx`` protocol operates in a manner similar to `Kerberos`_.

A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.

Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.

An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.

.. note:: The ``client.admin`` user must provide the user ID and
   secret key to the user in a secure manner.

.. ditaa::

           +---------+     +---------+
           | Client  |     | Monitor |
           +---------+     +---------+
                |  request to   |
                | create a user |
                |-------------->|----------+ create user
                |               |          | and
                |<--------------|<---------+ store key
                | transmit key  |
                |               |

Here is how a client authenticates with a monitor. The client passes the user
name to the monitor. The monitor generates a session key that is encrypted with
the secret key associated with the ``username``. The monitor transmits the
encrypted ticket to the client. The client uses the shared secret key to
decrypt the payload. The session key identifies the user, and this act of
identification will last for the duration of the session. The client requests
a ticket for the user, and the ticket is signed with the session key. The
monitor generates a ticket and uses the user's secret key to encrypt it. The
encrypted ticket is transmitted to the client. The client decrypts the ticket
and uses it to sign requests to OSDs and to metadata servers in the cluster.

.. ditaa::

           +---------+     +---------+
           | Client  |     | Monitor |
           +---------+     +---------+
                |  authenticate |
                |-------------->|----------+ generate and
                |               |          | encrypt
                |<--------------|<---------+ session key
                | transmit      |
                | encrypted     |
                | session key   |
                |               |
                |-----+ decrypt |
                |     | session |
                |<----+ key     |
                |               |
                |  req. ticket  |
                |-------------->|----------+ generate and
                |               |          | encrypt
                |<--------------|<---------+ ticket
                | recv. ticket  |
                |               |
                |-----+ decrypt |
                |     | ticket  |
                |<----+         |


The ``cephx`` protocol authenticates ongoing communications between clients
and Ceph daemons. After initial authentication, each message sent between a
client and a daemon is signed using a ticket that can be verified by monitors,
OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.

.. ditaa::

    +---------+     +---------+     +-------+     +-------+
    | Client  |     | Monitor |     |  MDS  |     |  OSD  |
    +---------+     +---------+     +-------+     +-------+
         |  request to   |              |             |
         | create a user |              |             |
         |-------------->| mon and      |             |
         |<--------------| client share |             |
         |    receive    | a secret.    |             |
         | shared secret |              |             |
         |               |<------------>|             |
         |               |<-------------+------------>|
         |               |  mon, mds,   |             |
         | authenticate  |  and osd     |             |
         |-------------->|  share       |             |
         |<--------------|  a secret    |             |
         |  session key  |              |             |
         |               |              |             |
         |  req. ticket  |              |             |
         |-------------->|              |             |
         |<--------------|              |             |
         |  recv. ticket |              |             |
         |               |              |             |
         |   make request (CephFS only) |             |
         |----------------------------->|             |
         |<-----------------------------|             |
         | receive response (CephFS only)             |
         |                                            |
         |                make request                |
         |------------------------------------------->|
         |<-------------------------------------------|
         |              receive response              |

This authentication protects only the connections between Ceph clients and Ceph
daemons. The authentication is not extended beyond the Ceph client. If a user
accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.
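
From the client's perspective, this entire handshake is hidden behind
connection setup. The following hedged sketch shows a Python ``librados``
client authenticating as a named ``cephx`` user; the user name and keyring
path are illustrative, and the key would have been created beforehand with
``ceph auth get-or-create-key``:

.. code-block:: python

    import rados

    cluster = rados.Rados(
        name='client.admin',                # cephx user to authenticate as
        conffile='/etc/ceph/ceph.conf',
        conf={'keyring': '/etc/ceph/ceph.client.admin.keyring'})
    cluster.connect()                       # performs the cephx handshake
    print(cluster.get_fsid())               # the cluster is now usable
    cluster.shutdown()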

See the `Cephx Config Guide`_ for more configuration details.

See `User Management`_ for more on user management.

See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.

.. index:: architecture; smart daemons and scalability

Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.

Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.

Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
easily perform tasks that a cluster with a centralized interface would struggle
to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:

#. **OSDs Service Clients Directly:** Network devices can support only a
   limited number of concurrent connections. Because Ceph clients contact
   Ceph OSD daemons directly without first connecting to a central interface,
   Ceph enjoys improved performance and increased system capacity relative to
   storage redundancy strategies that include a central interface. Ceph clients
   maintain sessions only when needed, and maintain those sessions with only
   particular Ceph OSD daemons, not with a centralized interface.

#. **OSD Membership and Status:** When Ceph OSD Daemons join a cluster, they
   report their status. At the lowest level, the Ceph OSD Daemon status is
   ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
   able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
   ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
   Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
   the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
   OSDs periodically send messages to the Ceph Monitor (in releases prior to
   Luminous, this was done by means of ``MPGStats``, and beginning with the
   Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
   Monitors receive no such message after a configurable period of time,
   then they mark the OSD ``down``. This mechanism is a failsafe, however.
   Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
   report it to the Ceph Monitors. This contributes to making Ceph Monitors
   lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
   additional details.

#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
   RADOS objects. Ceph OSD Daemons compare the metadata of their own local
   objects against the metadata of the replicas of those objects, which are
   stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
   mismatches in object size and finds metadata mismatches, and is usually
   performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
   data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
   bad sectors on drives that are not detectable with light scrubs. See `Data
   Scrubbing`_ for details on configuring scrubbing.

#. **Replication:** Data replication involves a collaboration between Ceph
   Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
   determine the storage location of object replicas. Ceph clients use the
   CRUSH algorithm to determine the storage location of an object, then the
   object is mapped to a pool and to a placement group, and then the client
   consults the CRUSH map to identify the placement group's primary OSD.

   After identifying the target placement group, the client writes the object
   to the identified placement group's primary OSD. The primary OSD then
   consults its own copy of the CRUSH map to identify secondary and tertiary
   OSDs, replicates the object to the placement groups in those secondary and
   tertiary OSDs, confirms that the object was stored successfully in the
   secondary and tertiary OSDs, and reports to the client that the object
   was stored successfully.

.. ditaa::

                 +----------+
                 |  Client  |
                 |          |
                 +----------+
                     *   ^
          Write (1)  |   |  Ack (6)
                     |   |
                     v   *
                +-------------+
                | Primary OSD |
                |             |
                +-------------+
                  *  ^   ^  *
        Write (2) |  |   |  |  Write (3)
           +------+  |   |  +------+
           |  +------+   +------+  |
           |  | Ack (4)  Ack (5)|  |
           v  *                 *  v
        +---------------+   +---------------+
        | Secondary OSD |   |  Tertiary OSD |
        |               |   |               |
        +---------------+   +---------------+

By performing this act of data replication, Ceph OSD Daemons relieve Ceph
clients of the burden of replicating data.

Dynamic Cluster Management
--------------------------

In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the
cluster, and recover from faults adaptively.

.. index:: architecture; pools

About Pools
~~~~~~~~~~~

The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.

Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
objects to pools. The way that Ceph places the data in the pools is determined
by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
placement groups in the pool.

.. ditaa::

    +--------+  Retrieves  +---------------+
    | Client |------------>|  Cluster Map  |
    +--------+             +---------------+
         |
         v      Writes
      /-----\
      | obj |
      \-----/
         |      To
         v
    +--------+             +---------------+
    |  Pool  |------------>|  CRUSH Rule   |
    +--------+   Selects   +---------------+


Pools set at least the following parameters:

- Ownership/Access to Objects
- The Number of Placement Groups, and
- The CRUSH Rule to Use.

See `Set Pool Values`_ for details.
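
As a hedged illustration, pools can also be created and listed from the Python
``librados`` bindings (the pool name is an example); parameters that are not
set explicitly, such as the replica count, PG count, and CRUSH rule, fall back
to the cluster's configured defaults:

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        if not cluster.pool_exists('mypool'):
            cluster.create_pool('mypool')   # default size, PGs, CRUSH rule
        print(cluster.list_pools())
    finally:
        cluster.shutdown()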


.. index:: architecture; placement group mapping

Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~

Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
object to a PG.

This mapping of RADOS objects to PGs implements an abstraction and indirection
layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
be able to grow (or shrink) and redistribute data adaptively when the internal
topology changes.

If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
RADOS object to a placement group and then maps each placement group to one or
more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.

.. ditaa::

   /-----\  /-----\  /-----\  /-----\  /-----\
   | obj |  | obj |  | obj |  | obj |  | obj |
   \-----/  \-----/  \-----/  \-----/  \-----/
      |        |        |        |        |
      +--------+--------+        +---+----+
               |                     |
               v                     v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                     |
               |      +--------------+------------+---+
        +------+------+-------------+             |   |
        |             |             |             |   |
        v             v             v             v   v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #3  |  |  OSD #4  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

The client uses its copy of the cluster map and the CRUSH algorithm to compute
precisely which OSD it will use when reading or writing a particular object.
576 | |
577 | .. index:: architecture; calculating PG IDs | |
578 | ||
579 | Calculating PG IDs | |
580 | ~~~~~~~~~~~~~~~~~~ | |
581 | ||
aee94f69 TL |
582 | When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of |
583 | the `Cluster Map`_. When a client has been equipped with a copy of the cluster | |
584 | map, it is aware of all the monitors, OSDs, and metadata servers in the | |
585 | cluster. **However, even equipped with a copy of the latest version of the | |
586 | cluster map, the client doesn't know anything about object locations.** | |
587 | ||
588 | **Object locations must be computed.** | |
589 | ||
590 | The client requies only the object ID and the name of the pool in order to | |
591 | compute the object location. | |
592 | ||
593 | Ceph stores data in named pools (for example, "liverpool"). When a client | |
594 | stores a named object (for example, "john", "paul", "george", or "ringo") it | |
595 | calculates a placement group by using the object name, a hash code, the number | |
596 | of PGs in the pool, and the pool name. Ceph clients use the following steps to | |
597 | compute PG IDs. | |
598 | ||
599 | #. The client inputs the pool name and the object ID. (for example: pool = | |
600 | "liverpool" and object-id = "john") | |
601 | #. Ceph hashes the object ID. | |
602 | #. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to | |
603 | get a PG ID. | |
604 | #. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" = | |
605 | ``4``) | |
606 | #. Ceph prepends the pool ID to the PG ID (for example: ``4.58``). | |
607 | ||
608 | It is much faster to compute object locations than to perform object location | |
609 | query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under | |
610 | Scalable Hashing)` algorithm allows a client to compute where objects are | |
611 | expected to be stored, and enables the client to contact the primary OSD to | |
612 | store or retrieve the objects. | |
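
The sketch below mimics these steps in Python. It is illustrative only: a real
Ceph client uses the ``rjenkins`` hash and a "stable mod" rather than CRC32,
and it takes the pool ID from the cluster map rather than from a hard-coded
value:

.. code-block:: python

    import zlib

    def compute_pg_id(pool_id: int, object_id: str, pg_num: int) -> str:
        """Hash the object ID, take it modulo the number of PGs in the
        pool, and prepend the pool ID (steps 2 through 5 above)."""
        ps = zlib.crc32(object_id.encode()) % pg_num    # placement seed
        return '{}.{:x}'.format(pool_id, ps)

    # pool "liverpool" has ID 4 and 58 PGs; the object is named "john"
    print(compute_pg_id(4, 'john', 58))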

.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.

.. note:: PGs that agree on the state of the cluster do not necessarily have
   the current data yet.

The Ceph Storage Cluster was designed to store at least two copies of an object
(that is, ``size = 2``), which is the minimum requirement for data safety. For
high availability, a Ceph Storage Cluster should store more than two copies of
an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
to run in a ``degraded`` state while maintaining data safety.

.. warning:: Although we say here that R2 (replication with two copies) is the
   minimum requirement for data safety, R3 (replication with three copies) is
   recommended. On a long enough timeline, data stored with an R2 strategy will
   be lost.

As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.

The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch.

The Ceph OSD daemons that are part of an *Acting Set* might not always be
``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
The *Up Set* is an important distinction, because Ceph can remap PGs to other
Ceph OSD Daemons when an OSD fails.

.. note:: Consider a hypothetical *Acting Set* for a PG that contains
   ``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``) is the
   *Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
   *Primary*, and ``osd.25`` is removed from the *Up Set*.
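
The *Up Set* and *Acting Set* of any PG can be inspected by running ``ceph pg
map {pg-id}``. The hedged sketch below issues the same command through the
Python ``librados`` bindings; the PG ID is illustrative, and the exact JSON
field names are an assumption to verify against your release:

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        cmd = json.dumps({'prefix': 'pg map', 'pgid': '4.58',
                          'format': 'json'})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        if ret == 0:
            pgmap = json.loads(outbuf)
            print('up set:    ', pgmap.get('up'))
            print('acting set:', pgmap.get('acting'))
    finally:
        cluster.shutdown()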

.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all, of the PGs migrate from existing OSDs
(OSD 1 and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. Many of the placement groups remain in their original configuration,
and each OSD gets some added capacity, so there are no load spikes on the
new OSD after rebalancing is complete.


.. ditaa::

             +--------+     +--------+
    Before   | OSD 1  |     | OSD 2  |
             +--------+     +--------+
             |  PG #1 |     |  PG #6 |
             |  PG #2 |     |  PG #7 |
             |  PG #3 |     |  PG #8 |
             |  PG #4 |     |  PG #9 |
             |  PG #5 |     | PG #10 |
             +--------+     +--------+

             +--------+     +--------+     +--------+
     After   | OSD 1  |     | OSD 2  |     | OSD 3  |
             +--------+     +--------+     +--------+
             |  PG #1 |     |  PG #7 |     |  PG #3 |
             |  PG #2 |     |  PG #8 |     |  PG #6 |
             |  PG #4 |     | PG #10 |     |  PG #9 |
             |  PG #5 |     |        |     |        |
             |        |     |        |     |        |
             +--------+     +--------+     +--------+

.. index:: architecture; Data Scrubbing

Data Consistency
~~~~~~~~~~~~~~~~

As part of maintaining data consistency and cleanliness, Ceph OSDs also scrub
objects within placement groups. That is, Ceph OSDs compare object metadata in
one placement group with its replicas in placement groups stored in other
OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
errors, often as a result of hardware issues. OSDs also perform deeper
scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (by default
performed weekly) finds bad blocks on a drive that weren't apparent in a light
scrub.

See `Data Scrubbing`_ for details on configuring scrubbing.



.. index:: erasure coding

Erasure Coding
--------------

An erasure-coded pool stores each object as ``K+M`` chunks: ``K`` data chunks
and ``M`` coding chunks. The pool is configured to have a size of ``K+M`` so
that each chunk is stored in an OSD in the acting set. The rank of the chunk is
stored as an attribute of the object.

For instance, an erasure-coded pool can be created to use five OSDs (``K+M =
5``) and sustain the loss of two of them (``M = 2``).
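
To make the chunking concrete, here is an illustrative sketch only: it uses
``K = 2`` data chunks plus a single XOR coding chunk (``M = 1``), the same
shape as the `Interrupted Full Writes`_ example below, whereas Ceph itself
delegates encoding to pluggable erasure-code plugins such as jerasure:

.. code-block:: python

    K = 2   # data chunks; one XOR coding chunk gives M = 1

    def encode(payload: bytes):
        """Split payload into K padded data chunks plus one XOR coding
        chunk; with M = 1, any single lost chunk can be rebuilt."""
        chunk_len = -(-len(payload) // K)             # ceiling division
        padded = payload.ljust(K * chunk_len, b'\0')  # pad to a multiple of K
        data = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(K)]
        coding = bytes(a ^ b for a, b in zip(data[0], data[1]))
        return data, coding

    def recover(surviving: bytes, coding: bytes) -> bytes:
        """A lost data chunk is the XOR of the survivor and coding chunk."""
        return bytes(a ^ b for a, b in zip(surviving, coding))

    data, coding = encode(b'ABCDEFGHI')
    print(data)                       # [b'ABCDE', b'FGHI\x00'] (padded)
    assert recover(data[1], coding) == data[0]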

Reading and Writing Encoded Chunks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the
erasure encoding function splits the content into three data chunks simply by
dividing the content in three: the first contains ``ABC``, the second ``DEF``
and the last ``GHI``. The content will be padded if the content length is not
a multiple of ``K``. The function also creates two coding chunks: the fourth
with ``YXY`` and the fifth with ``QGC``. Each chunk is stored in an OSD in the
acting set. The chunks are stored in objects that have the same name
(**NYAN**) but reside on different OSDs. The order in which the chunks were
created must be preserved and is stored as an attribute of the object
(``shard_t``), in addition to its name. Chunk 1 contains ``ABC`` and is stored
on **OSD5** while chunk 4 contains ``YXY`` and is stored on **OSD3**.


.. ditaa::

                             +-------------------+
                        name |        NYAN       |
                             +-------------------+
                     content |      ABCDEFGHI    |
                             +--------+----------+
                                      |
                                      |
                                      v
                               +------+------+
               +---------------+ encode(3,2) +-----------+
               |               +--+--+---+---+           |
               |                  |  |   |               |
               |          +-------+  |   +-----+         |
               |          |          |         |         |
            +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
      name  | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
            +------+   +------+   +------+  +------+  +------+
     shard  |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
            +------+   +------+   +------+  +------+  +------+
    content |  ABC |   |  DEF |   |  GHI |  |  YXY |  |  QGC |
            +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
               |          |          |         |         |
               |          |          v         |         |
               |          |       +--+---+     |         |
               |          |       | OSD1 |     |         |
               |          |       +------+     |         |
               |          |                    |         |
               |          |       +------+     |         |
               |          +------>| OSD2 |     |         |
               |                  +------+     |         |
               |                               |         |
               |                  +------+     |         |
               |                  | OSD3 |<----+         |
               |                  +------+               |
               |                                         |
               |                  +------+               |
               |                  | OSD4 |<--------------+
               |                  +------+
               |
               |                  +------+
               +----------------->| OSD5 |
                                  +------+


When the object **NYAN** is read from the erasure-coded pool, the decoding
function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
``GHI`` and chunk 4 containing ``YXY``. Then it rebuilds the original content
of the object, ``ABCDEFGHI``. The decoding function is informed that chunks 2
and 5 are missing (they are called 'erasures'). Chunk 5 could not be read
because **OSD4** is out. The decoding function can be called as soon as three
chunks are read: **OSD2** was the slowest and its chunk was not taken into
account.

.. ditaa::

                             +-------------------+
                        name |        NYAN       |
                             +-------------------+
                     content |      ABCDEFGHI    |
                             +---------+---------+
                                       ^
                                       |
                                       |
                               +-------+-------+
                               |  decode(3,2)  |
               +-------------->+  erasures 2,5 +<-+
               |               |               |  |
               |               +-------+-------+  |
               |                       ^          |
               |                       |          |
               |                       |          |
            +--+---+   +------+   +---+--+   +---+--+
      name  | NYAN |   | NYAN |   | NYAN |   | NYAN |
            +------+   +------+   +------+   +------+
     shard  |  1   |   |  2   |   |  3   |   |  4   |
            +------+   +------+   +------+   +------+
    content |  ABC |   |  DEF |   |  GHI |   |  YXY |
            +--+---+   +--+---+   +--+---+   +--+---+
               ^          .          ^          ^
               |    TOO   .          |          |
               |    SLOW  .       +--+---+      |
               |          ^       | OSD1 |      |
               |          |       +------+      |
               |          |                     |
               |          |       +------+      |
               |          +-------| OSD2 |      |
               |                  +------+      |
               |                                |
               |                  +------+      |
               |                  | OSD3 |------+
               |                  +------+
               |
               |                  +------+
               |                  | OSD4 |  OUT
               |                  +------+
               |
               |                  +------+
               +------------------| OSD5 |
                                  +------+


Interrupted Full Writes
~~~~~~~~~~~~~~~~~~~~~~~

In an erasure-coded pool, the primary OSD in the up set receives all write
operations. It is responsible for encoding the payload into ``K+M`` chunks and
sending them to the other OSDs. It is also responsible for maintaining an
authoritative version of the placement group logs.

In the following diagram, an erasure-coded placement group has been created
with ``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one
for ``M``. The acting set of the placement group is made of **OSD 1**, **OSD
2** and **OSD 3**. An object has been encoded and stored in the OSDs: the
chunk ``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1``
on **OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD
3**. The placement group logs on each OSD are identical (i.e. ``1,1`` for
epoch 1, version 1).


.. ditaa::

                 Primary OSD

               +-------------+
               |    OSD 1    |             +-------------+
               |    log      |  Write Full |             |
               |  +----+     |<------------+ Ceph Client |
               |  |D1v1| 1,1 |      v1     |             |
               |  +----+     |             +-------------+
               +------+------+
                      |
                      |
                      |        +-------------+
                      |        |    OSD 2    |
                      |        |    log      |
                      +------->+  +----+     |
                      |        |  |D2v1| 1,1 |
                      |        |  +----+     |
                      |        +-------------+
                      |
                      |        +-------------+
                      |        |    OSD 3    |
                      |        |    log      |
                      +------->|  +----+     |
                               |  |C1v1| 1,1 |
                               |  +----+     |
                               +-------------+

**OSD 1** is the primary and receives a **WRITE FULL** from a client, which
means the payload is to replace the object entirely instead of overwriting a
portion of it. Version 2 (v2) of the object is created to override version 1
(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is
sent to the target OSD, including the primary OSD, which is responsible for
storing chunks in addition to handling write operations and maintaining an
authoritative version of the placement group logs. When an OSD receives the
message instructing it to write the chunk, it also creates a new entry in the
placement group logs to reflect the change. For instance, as soon as **OSD 3**
stores ``C1v2``, it adds the entry ``1,2`` (i.e. epoch 1, version 2) to its
logs. Because the OSDs work asynchronously, some chunks may still be in flight
(such as ``D2v2``) while others are acknowledged and persisted to storage
drives (such as ``C1v1`` and ``D1v1``).

.. ditaa::

                 Primary OSD

               +-------------+
               |    OSD 1    |
               |    log      |
               |  +----+     |             +-------------+
               |  |D1v2| 1,2 |  Write Full |             |
               |  +----+     +<------------+ Ceph Client |
               |             |      v2     |             |
               |  +----+     |             +-------------+
               |  |D1v1| 1,1 |
               |  +----+     |
               +------+------+
                      |
                      |
                      |           +------+------+
                      |           |    OSD 2    |
                      |  +------+ |    log      |
                      +->| D2v2 | |  +----+     |
                      |  +------+ |  |D2v1| 1,1 |
                      |           |  +----+     |
                      |           +-------------+
                      |
                      |           +-------------+
                      |           |    OSD 3    |
                      |           |    log      |
                      |           |  +----+     |
                      |           |  |C1v2| 1,2 |
                      +---------->+  +----+     |
                                  |             |
                                  |  +----+     |
                                  |  |C1v1| 1,1 |
                                  |  +----+     |
                                  +-------------+


If all goes well, the chunks are acknowledged on each OSD in the acting set and
the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.

.. ditaa::

                 Primary OSD

               +-------------+
               |    OSD 1    |
               |    log      |
               |  +----+     |             +-------------+
               |  |D1v2| 1,2 |  Write Full |             |
               |  +----+     +<------------+ Ceph Client |
               |             |      v2     |             |
               |  +----+     |             +-------------+
               |  |D1v1| 1,1 |
               |  +----+     |
               +------+------+
                      |
                      |           +-------------+
                      |           |    OSD 2    |
                      |           |    log      |
                      |           |  +----+     |
                      |           |  |D2v2| 1,2 |
                      +---------->+  +----+     |
                      |           |             |
                      |           |  +----+     |
                      |           |  |D2v1| 1,1 |
                      |           |  +----+     |
                      |           +-------------+
                      |
                      |           +-------------+
                      |           |    OSD 3    |
                      |           |    log      |
                      |           |  +----+     |
                      |           |  |C1v2| 1,2 |
                      +---------->+  +----+     |
                                  |             |
                                  |  +----+     |
                                  |  |C1v1| 1,1 |
                                  |  +----+     |
                                  +-------------+
1002 | ||
1003 | Finally, the files used to store the chunks of the previous version of the | |
1004 | object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1`` | |
1005 | on **OSD 3**. | |
1006 | ||
1007 | .. ditaa:: | |
f91f0fd5 | 1008 | |
7c673cae FG |
1009 | Primary OSD |
1010 | ||
1011 | +-------------+ | |
1012 | | OSD 1 | | |
1013 | | log | | |
1014 | | +----+ | | |
1015 | | |D1v2| 1,2 | | |
1016 | | +----+ | | |
1017 | +------+------+ | |
1018 | | | |
1019 | | | |
1020 | | +-------------+ | |
1021 | | | OSD 2 | | |
1022 | | | log | | |
1023 | +--------->+ +----+ | | |
1024 | | | |D2v2| 1,2 | | |
1025 | | | +----+ | | |
1026 | | +-------------+ | |
1027 | | | |
1028 | | +-------------+ | |
1029 | | | OSD 3 | | |
1030 | | | log | | |
1031 | +--------->| +----+ | | |
1032 | | |C1v2| 1,2 | | |
1033 | | +----+ | | |
1034 | +-------------+ | |
1035 | ||
1036 | ||
But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
the object's version 2 is partially written: **OSD 3** has one chunk but that is
not enough to recover. Two chunks are lost: ``D1v2`` and ``D2v2``, and the
erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks
be available to rebuild the third. **OSD 4** becomes the new primary and finds
that the ``last_complete`` log entry (i.e., all objects before this entry were
known to be available on all OSDs in the previous acting set) is ``1,1``, and
that entry will be the head of the new authoritative log.

.. ditaa::

               +-------------+
               |    OSD 1    |
               |   (down)    |
               | c333        |
               +------+------+
                      |
                      |        +-------------+
                      |        |    OSD 2    |
                      |        |    log      |
                      |        |  +----+     |
                      +------->+  |D2v1| 1,1 |
                      |        |  +----+     |
                      |        |             |
                      |        +-------------+
                      |
                      |        +-------------+
                      |        |    OSD 3    |
                      |        |    log      |
                      |        |  +----+     |
                      |        |  |C1v2| 1,2 |
                      +------->+  +----+     |
                               |             |
                               |  +----+     |
                               |  |C1v1| 1,1 |
                               |  +----+     |
                               +-------------+
                 Primary OSD
               +-------------+
               |    OSD 4    |
               |    log      |
               |             |
               |         1,1 |
               |             |
               +------+------+


The log entry ``1,2`` found on **OSD 3** is divergent from the new
authoritative log provided by **OSD 4**: it is discarded and the file
containing the ``C1v2`` chunk is removed. The ``D1v1`` chunk is rebuilt with
the ``decode`` function of the erasure coding library during scrubbing and
stored on the new primary **OSD 4**.


.. ditaa::

                 Primary OSD

               +-------------+
               |    OSD 4    |
               |    log      |
               |  +----+     |
               |  |D1v1| 1,1 |
               |  +----+     |
               +------+------+
                      ^
                      |
                      |        +-------------+
                      |        |    OSD 2    |
                      |        |    log      |
                      +--------+  +----+     |
                      |        |  |D2v1| 1,1 |
                      |        |  +----+     |
                      |        +-------------+
                      |
                      |        +-------------+
                      |        |    OSD 3    |
                      |        |    log      |
                      +--------|  +----+     |
                               |  |C1v1| 1,1 |
                               |  +----+     |
                               +-------------+

               +-------------+
               |    OSD 1    |
               |   (down)    |
               | c333        |
               +-------------+

See `Erasure Code Notes`_ for additional details.



Cache Tiering
-------------

.. note:: Cache tiering is deprecated in Reef.

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.


.. ditaa::

               +-------------+
               | Ceph Client |
               +------+------+
                      ^
     Tiering is       |
    Transparent       |              Faster I/O
        to Ceph       |           +---------------+
     Client Ops       |           |               |
                      |    +----->+   Cache Tier  |
                      |    |      |               |
                      |    |      +-----+---+-----+
                      |    |            |   ^
                      v    v            |   |   Active Data in Cache Tier
               +------+----+--+         |   |
               |   Objecter   |         |   |
               +-----------+--+         |   |
                           ^            |   |   Inactive Data in Storage Tier
                           |            v   |
                           |      +-----+---+-----+
                           |      |               |
                           +----->|  Storage Tier |
                                  |               |
                                  +---------------+
                                     Slower I/O

See `Cache Tiering`_ for additional details. Note that Cache Tiers can be
tricky and their use is now discouraged.


.. index:: Extensibility, Ceph Classes

Extending Ceph
--------------

You can extend Ceph by creating shared object classes called 'Ceph Classes'.
Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
can create new object methods that have the ability to call the native methods
in the Ceph Object Store, or other class methods you incorporate via libraries
or create yourself.

On writes, Ceph Classes can call native or class methods, perform any series of
operations on the inbound data and generate a resulting write transaction that
Ceph will apply atomically.

On reads, Ceph Classes can call native or class methods, perform any series of
operations on the outbound data and return the data to the client.

.. topic:: Ceph Class Example

   A Ceph class for a content management system that presents pictures of a
   particular size and aspect ratio could take an inbound bitmap image, crop it
   to a particular aspect ratio, resize it and embed an invisible copyright or
   watermark to help protect the intellectual property; then, save the
   resulting bitmap image to the object store.

See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
exemplary implementations.
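
Class methods are invoked from clients through the object-class ``exec``
operation. As a hedged sketch, the call below uses the Python binding's
``Ioctx.execute()`` to invoke the ``say_hello`` method of the sample ``hello``
class that ships in ``src/cls/hello``; the pool name is illustrative, and the
binding's exact return shape may vary across releases:

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')
    try:
        ioctx.write_full('greeting', b'')       # the object must exist
        # Invoke <class>.<method> on the object; input/output are blobs.
        ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'')
        print(out)
    finally:
        ioctx.close()
        cluster.shutdown()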


Summary
-------

Ceph Storage Clusters are dynamic--like a living organism. Although many
storage appliances do not fully utilize the CPU and RAM of a typical commodity
server, Ceph does. From heartbeats to peering to rebalancing the cluster or
recovering from faults, Ceph offloads work from clients (and from a centralized
gateway, which doesn't exist in the Ceph architecture) and uses the computing
power of the OSDs to perform the work. When referring to `Hardware
Recommendations`_ and the `Network Config Reference`_, be cognizant of the
foregoing concepts to understand how Ceph utilizes computing resources.

.. index:: Ceph Protocol, librados

Ceph Protocol
=============

Ceph Clients use the native protocol for interacting with the Ceph Storage
Cluster. Ceph packages this functionality into the ``librados`` library so that
you can create your own custom Ceph Clients. The following diagram depicts the
basic architecture.

.. ditaa::

    +---------------------------------+
    |  Ceph Storage Cluster Protocol  |
    |           (librados)            |
    +---------------------------------+
    +---------------+ +---------------+
    |      OSDs     | |    Monitors   |
    +---------------+ +---------------+


Native Protocol and ``librados``
--------------------------------

Modern applications need a simple object storage interface with asynchronous
communication capability. The Ceph Storage Cluster provides exactly such an
interface, offering direct, parallel access to objects throughout the cluster.
Its capabilities include the following (a usage sketch appears after the
list):

- Pool Operations
- Snapshots and Copy-on-write Cloning
- Read/Write Objects

  - Create or Remove
  - Entire Object or Byte Range
  - Append or Truncate

- Create/Set/Get/Remove XATTRs
- Create/Set/Get/Remove Key/Value Pairs
- Compound operations and dual-ack semantics
- Object Classes

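The following is a minimal sketch of several of these operations using the
Python ``rados`` binding; the pool name ``mypool`` and the object names are
placeholders, and error handling is omitted for brevity:

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('mypool')  # assumed pool name
        try:
            # Write an entire object, then read it back.
            ioctx.write_full('hw', b'Hello, RADOS!')
            data = ioctx.read('hw')

            # Append to the object and read back a byte range.
            ioctx.append('hw', b' More data.')
            tail = ioctx.read('hw', length=11, offset=13)

            # Set and get an extended attribute (XATTR).
            ioctx.set_xattr('hw', 'lang', b'en')
            lang = ioctx.get_xattr('hw', 'lang')

            # Asynchronous write: the completion fires on commit.
            def on_complete(completion):
                print('aio write done, rc =', completion.get_return_value())

            comp = ioctx.aio_write_full('hw2', b'async data',
                                        oncomplete=on_complete)
            comp.wait_for_complete()

            # Remove the objects.
            ioctx.remove_object('hw')
            ioctx.remove_object('hw2')
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()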

.. index:: architecture; watch/notify

Object Watch/Notify
-------------------

A client can register a persistent interest in an object and keep a session to
the primary OSD open. The client can send a notification message and a payload
to all watchers of the object and receive a response when the watchers have
received the notification. This enables a client to use any object as a
synchronization/communication channel.


.. ditaa::

           +----------+     +----------+     +----------+     +---------------+
           | Client 1 |     | Client 2 |     | Client 3 |     | OSD:Object ID |
           +----------+     +----------+     +----------+     +---------------+
                 |                |                |                  |
                 |                |                |                  |
                 |                |  Watch Object  |                  |
                 |--------------------------------------------------->|
                 |                |                |                  |
                 |<---------------------------------------------------|
                 |                |   Ack/Commit   |                  |
                 |                |                |                  |
                 |                |  Watch Object  |                  |
                 |                |---------------------------------->|
                 |                |                |                  |
                 |                |<----------------------------------|
                 |                |   Ack/Commit   |                  |
                 |                |                |   Watch Object   |
                 |                |                |----------------->|
                 |                |                |                  |
                 |                |                |<-----------------|
                 |                |                |    Ack/Commit    |
                 |                |     Notify     |                  |
                 |--------------------------------------------------->|
                 |                |                |                  |
                 |<---------------------------------------------------|
                 |                |     Notify     |                  |
                 |                |                |                  |
                 |                |<----------------------------------|
                 |                |     Notify     |                  |
                 |                |                |<-----------------|
                 |                |                |      Notify      |
                 |                |      Ack       |                  |
                 |----------------+---------------------------------->|
                 |                |                |                  |
                 |                |      Ack       |                  |
                 |                +---------------------------------->|
                 |                |                |                  |
                 |                |                |        Ack       |
                 |                |                |----------------->|
                 |                |                |                  |
                 |<---------------+----------------+------------------|
                 |                     Complete

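A minimal watch/notify sketch with the Python ``rados`` binding follows. It
assumes a recent binding that exposes ``Ioctx.watch()`` and ``Ioctx.notify()``
(exact signatures vary by release); the pool and object names are
placeholders:

.. code-block:: python

    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')  # assumed pool name

    # The object used as a communication channel must exist.
    ioctx.write_full('channel', b'')

    def on_notify(notify_id, notifier_id, watch_id, data):
        # Runs in this client when a notification for 'channel' arrives.
        print('received notification:', data)

    # Register a persistent interest in the object. This keeps a session
    # to the primary OSD open until the watch is closed.
    watch = ioctx.watch('channel', on_notify)

    # Send a notification (and payload) to every watcher of the object;
    # notify() returns once the watchers have acknowledged or timed out.
    ioctx.notify('channel', 'hello watchers')

    time.sleep(1)  # give the callback a moment to run in this demo
    watch.close()
    ioctx.close()
    cluster.shutdown()
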
.. index:: architecture; Striping

Data Striping
-------------

Storage devices have throughput limitations, which impact performance and
scalability. So storage systems often support `striping`_--storing sequential
pieces of information across multiple storage devices--to increase throughput
and performance. The most common form of data striping comes from `RAID`_.
The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
volume'. Ceph's striping offers the throughput of RAID 0 striping, the
reliability of n-way RAID mirroring, and faster recovery.

Ceph provides three types of clients: Ceph Block Device, Ceph File System, and
Ceph Object Storage. A Ceph Client converts its data from the representation
format it provides to its users (a block device image, RESTful objects, CephFS
filesystem directories) into objects for storage in the Ceph Storage Cluster.

.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
   Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe
   their data over multiple Ceph Storage Cluster objects. Ceph Clients that
   write directly to the Ceph Storage Cluster via ``librados`` must perform
   the striping (and parallel I/O) for themselves to obtain these benefits.

The simplest Ceph striping format involves a stripe count of 1 object. Ceph
Clients write stripe units to a Ceph Storage Cluster object until the object
is at its maximum capacity, and then create another object for additional
stripes of data. The simplest form of striping may be sufficient for small
block device images, S3 or Swift objects, and CephFS files. However, this
simple form doesn't take maximum advantage of Ceph's ability to distribute
data across placement groups, and consequently doesn't improve performance
very much. The following diagram depicts the simplest form of striping:

.. ditaa::

                 +---------------+
                 |  Client Data  |
                 |     Format    |
                 | cCCC          |
                 +---------------+
                         |
                +--------+-------+
                |                |
                v                v
          /-----------\    /-----------\
          | Begin cCCC|    | Begin cCCC|
          |  Object 0 |    |  Object 1 |
          +-----------+    +-----------+
          |  stripe   |    |  stripe   |
          |  unit 1   |    |  unit 5   |
          +-----------+    +-----------+
          |  stripe   |    |  stripe   |
          |  unit 2   |    |  unit 6   |
          +-----------+    +-----------+
          |  stripe   |    |  stripe   |
          |  unit 3   |    |  unit 7   |
          +-----------+    +-----------+
          |  stripe   |    |  stripe   |
          |  unit 4   |    |  unit 8   |
          +-----------+    +-----------+
          |  End cCCC |    |  End cCCC |
          |  Object 0 |    |  Object 1 |
          \-----------/    \-----------/


If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
or large CephFS directories, you may see considerable read/write performance
improvements by striping client data over multiple objects within an object
set. A significant write performance improvement occurs when the client writes
the stripe units to their corresponding objects in parallel. Since objects get
mapped to different placement groups and further mapped to different OSDs,
each write occurs in parallel at the maximum write speed. A write to a single
drive would be limited by the head movement (e.g. 6ms per seek) and bandwidth
of that one device (e.g. 100MB/s). By spreading that write over multiple
objects (which map to different placement groups and OSDs) Ceph can reduce the
number of seeks per drive and combine the throughput of multiple drives to
achieve much faster write (or read) speeds.

.. note:: Striping is independent of object replicas. Since CRUSH
   replicates objects across OSDs, stripes get replicated automatically.

In the following diagram, client data gets striped across an object set
(``object set 1`` in the following diagram) consisting of 4 objects, where the
first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe,
the client determines if the object set is full. If the object set is not
full, the client begins writing a stripe to the first object again (``object
0`` in the following diagram). If the object set is full, the client creates a
new object set (``object set 2`` in the following diagram), and begins writing
to the first stripe (``stripe unit 16``) in the first object in the new object
set (``object 4`` in the diagram below).

.. ditaa::

                  +---------------+
                  |  Client Data  |
                  |     Format    |
                  | cCCC          |
                  +---------------+
                          |
       +------------------+-----------------+------------------+         +--\
       |                  |                 |                  |            |
       v                  v                 v                  v            |
 /-----------\      /-----------\     /-----------\      /-----------\      |
 | Begin cCCC|      | Begin cCCC|     | Begin cCCC|      | Begin cCCC|      |
 |  Object 0 |      |  Object 1 |     |  Object 2 |      |  Object 3 |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |
 |  unit 0   |      |  unit 1   |     |  unit 2   |      |  unit 3   |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      +-\
 |  unit 4   |      |  unit 5   |     |  unit 6   |      |  unit 7   |      | Object
 +-----------+      +-----------+     +-----------+      +-----------+      +- Set
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |   1
 |  unit 8   |      |  unit 9   |     |  unit 10  |      |  unit 11  |      +-/
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |
 |  unit 12  |      |  unit 13  |     |  unit 14  |      |  unit 15  |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  End cCCC |      |  End cCCC |     |  End cCCC |      |  End cCCC |      |
 |  Object 0 |      |  Object 1 |     |  Object 2 |      |  Object 3 |      |
 \-----------/      \-----------/     \-----------/      \-----------/      |
                                                                            |
                                                                         +--/

                                                                         +--\
                                                                            |
 /-----------\      /-----------\     /-----------\      /-----------\      |
 | Begin cCCC|      | Begin cCCC|     | Begin cCCC|      | Begin cCCC|      |
 |  Object 4 |      |  Object 5 |     |  Object 6 |      |  Object 7 |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |
 |  unit 16  |      |  unit 17  |     |  unit 18  |      |  unit 19  |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      +-\
 |  unit 20  |      |  unit 21  |     |  unit 22  |      |  unit 23  |      | Object
 +-----------+      +-----------+     +-----------+      +-----------+      +- Set
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |   2
 |  unit 24  |      |  unit 25  |     |  unit 26  |      |  unit 27  |      +-/
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  stripe   |      |  stripe   |     |  stripe   |      |  stripe   |      |
 |  unit 28  |      |  unit 29  |     |  unit 30  |      |  unit 31  |      |
 +-----------+      +-----------+     +-----------+      +-----------+      |
 |  End cCCC |      |  End cCCC |     |  End cCCC |      |  End cCCC |      |
 |  Object 4 |      |  Object 5 |     |  Object 6 |      |  Object 7 |      |
 \-----------/      \-----------/     \-----------/      \-----------/      |
                                                                            |
                                                                         +--/

Three important variables determine how Ceph stripes data:

- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
  configurable size (e.g., 2 MB, 4 MB, etc.). The object size should be large
  enough to accommodate many stripe units, and should be a multiple of
  the stripe unit.

- **Stripe Width:** Stripes have a configurable unit size (e.g., 64 KB).
  The Ceph Client divides the data it will write to objects into equally
  sized stripe units, except for the last stripe unit. A stripe width
  should be a fraction of the Object Size so that an object may contain
  many stripe units.

- **Stripe Count:** The Ceph Client writes a sequence of stripe units
  over a series of objects determined by the stripe count. The series
  of objects is called an object set. After the Ceph Client writes to
  the last object in the object set, it returns to the first object in
  the object set.

.. important:: Test the performance of your striping configuration before
   putting your cluster into production. You CANNOT change these striping
   parameters after you stripe the data and write it to objects.
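
The interaction of the three variables can be made precise. The sketch below
is a worked example of the mapping shown in the diagrams above; the concrete
sizes (64 KB stripe units, four stripe units per object, a stripe count of 4)
are illustrative assumptions. Given a byte offset in the client's data, it
computes which object holds that byte and at what offset within the object:

.. code-block:: python

    # Map a byte offset in a striped stream to (object number, offset in
    # object), following the object-set layout described above.
    def map_offset(offset, stripe_unit, stripe_count, object_size):
        assert object_size % stripe_unit == 0, \
            "object size must be a multiple of the stripe unit"
        su_per_object = object_size // stripe_unit

        blockno = offset // stripe_unit      # which stripe unit overall
        stripeno = blockno // stripe_count   # which horizontal stripe
        stripepos = blockno % stripe_count   # which object within the set
        objectsetno = stripeno // su_per_object
        objectno = objectsetno * stripe_count + stripepos
        obj_offset = ((stripeno % su_per_object) * stripe_unit
                      + offset % stripe_unit)
        return objectno, obj_offset

    # With 64 KB stripe units, 4 units per object, and a stripe count of 4,
    # stripe unit 16 (byte offset 16 * 64 KB) begins object set 2 in
    # object 4, exactly as in the diagram above.
    unit = 64 * 1024
    print(map_offset(16 * unit, unit, 4, 4 * unit))  # -> (4, 0)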

Once the Ceph Client has striped data to stripe units and mapped the stripe
units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
and the placement groups to Ceph OSD Daemons before the objects are stored as
files on a storage drive.

.. note:: Since a client writes to a single pool, all data striped into
   objects is mapped to placement groups in the same pool, so it uses the
   same CRUSH map and the same access controls.


.. index:: architecture; Ceph Clients

Ceph Clients
============

Ceph Clients include a number of service interfaces:

- **Block Devices:** The :term:`Ceph Block Device` (a.k.a. RBD) service
  provides resizable, thin-provisioned block devices that can be snapshotted
  and cloned. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
  that uses ``librbd`` directly--avoiding the kernel object overhead for
  virtualized systems.

- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a. RGW) service
  provides RESTful APIs with interfaces that are compatible with Amazon S3
  and OpenStack Swift.

- **Filesystem**: The :term:`Ceph File System` (CephFS) service provides
  a POSIX-compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).

Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
and high availability. The following diagram depicts the high-level
architecture.

.. ditaa::

    +--------------+  +----------------+  +-------------+
    | Block Device |  | Object Storage |  |   CephFS    |
    +--------------+  +----------------+  +-------------+

    +--------------+  +----------------+  +-------------+
    |    librbd    |  |     librgw     |  |  libcephfs  |
    +--------------+  +----------------+  +-------------+

    +---------------------------------------------------+
    |      Ceph Storage Cluster Protocol (librados)     |
    +---------------------------------------------------+

    +---------------+ +---------------+ +---------------+
    |      OSDs     | |      MDSs     | |    Monitors   |
    +---------------+ +---------------+ +---------------+


.. index:: architecture; Ceph Object Storage

Ceph Object Storage
-------------------

The Ceph Object Storage daemon, ``radosgw``, is an HTTP service that provides
a RESTful_ API to store objects and metadata. It layers on top of the Ceph
Storage Cluster with its own data formats, and maintains its own user
database, authentication, and access control. The RADOS Gateway uses a unified
namespace, which means you can use either the OpenStack Swift-compatible API
or the Amazon S3-compatible API. For example, you can write data using the
S3-compatible API with one application and then read data using the
Swift-compatible API with another application.
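
Writing through the S3-compatible API requires nothing Ceph-specific on the
client. The following is a minimal sketch using the standard ``boto3`` S3
client; the endpoint URL and credentials are placeholders for values issued by
your RADOS Gateway:

.. code-block:: python

    import boto3

    # Point a standard S3 client at the RADOS Gateway endpoint.
    # The URL and keys below are placeholders, not real values.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo')
    s3.put_object(Bucket='demo', Key='hello.txt', Body=b'written via S3 API')

    # Because RGW uses a unified namespace, the same object can now be
    # read back through the Swift-compatible API by another application.
    obj = s3.get_object(Bucket='demo', Key='hello.txt')
    print(obj['Body'].read())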

.. topic:: S3/Swift Objects and Storage Cluster Objects Compared

   Ceph's Object Storage uses the term *object* to describe the data it
   stores. S3 and Swift objects are not the same as the objects that Ceph
   writes to the Ceph Storage Cluster. Ceph Object Storage objects are mapped
   to Ceph Storage Cluster objects. The S3 and Swift objects do not
   necessarily correspond in a 1:1 manner with an object stored in the
   storage cluster. It is possible for an S3 or Swift object to map to
   multiple Ceph objects.

See `Ceph Object Storage`_ for details.


.. index:: Ceph Block Device; block device; RBD; Rados Block Device

Ceph Block Device
-----------------

A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster, where each object gets mapped to a placement group and
distributed, and the placement groups are spread across separate ``ceph-osd``
daemons throughout the cluster.

.. important:: Striping allows RBD block devices to perform better than a
   single server could!

Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
virtualization and cloud computing. In virtual machine scenarios, people
typically deploy a Ceph Block Device with the ``rbd`` network storage driver
in QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU
and ``libvirt`` to support OpenStack and CloudStack among other solutions.

While we do not provide ``librbd`` support with other hypervisors at this
time, you may also use Ceph Block Device kernel objects to provide a block
device to a client. Other virtualization technologies such as Xen can access
the Ceph Block Device kernel object(s). This is done with the command-line
tool ``rbd``.
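
The Python ``rbd`` binding that accompanies ``librbd`` illustrates the
layering: an image is created in a pool and its data is striped over RADOS
objects. This is a minimal sketch, assuming the standard ``rbd`` binding; the
pool name is a placeholder:

.. code-block:: python

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # assumed pool name

    # Create a 1 GiB thin-provisioned image; it consumes space only as
    # data is written. Striping parameters are fixed at creation time
    # (recent releases accept optional stripe_unit/stripe_count arguments).
    rbd.RBD().create(ioctx, 'vm-disk-1', 1024 ** 3)

    # Open the image and write; librbd stripes the data over RADOS
    # objects, which CRUSH distributes across placement groups and OSDs.
    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.write(b'bootsector', 0)
        image.create_snap('first')  # snapshot support comes for free

    ioctx.close()
    cluster.shutdown()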


.. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds

.. _arch-cephfs:

Ceph File System
----------------

The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
service that is layered on top of the object-based Ceph Storage Cluster.
CephFS files get mapped to objects that Ceph stores in the Ceph Storage
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
a Filesystem in User Space (FUSE).

.. ditaa::

    +-----------------------+  +------------------------+
    | CephFS Kernel Object  |  |      CephFS FUSE       |
    +-----------------------+  +------------------------+

    +---------------------------------------------------+
    |            CephFS Library (libcephfs)             |
    +---------------------------------------------------+

    +---------------------------------------------------+
    |      Ceph Storage Cluster Protocol (librados)     |
    +---------------------------------------------------+

    +---------------+ +---------------+ +---------------+
    |      OSDs     | |      MDSs     | |    Monitors   |
    +---------------+ +---------------+ +---------------+


The Ceph File System service includes the Ceph Metadata Server (MDS) deployed
with the Ceph Storage Cluster. The purpose of the MDS is to store all the
filesystem metadata (directories, file ownership, access modes, etc.) in
high-availability Ceph Metadata Servers where the metadata resides in memory.
The reason for the MDS (a daemon called ``ceph-mds``) is that simple
filesystem operations like listing a directory or changing a directory
(``ls``, ``cd``) would tax the Ceph OSD Daemons unnecessarily. So separating
the metadata from the data means that the Ceph File System can provide high
performance services without taxing the Ceph Storage Cluster.

CephFS separates the metadata from the data, storing the metadata in the MDS,
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed out to multiple physical machines,
either for high availability or for scalability.

- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is easy because all the data, including the journal, is
  stored on RADOS. The transition is triggered automatically by ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single
  busy directory), effectively balancing the load amongst all `active`
  servers.

Combinations of `standby` and `active` MDS daemons are possible: for example,
running three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.

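For completeness, here is a minimal sketch of the library path using the
Python ``cephfs`` binding over ``libcephfs``. The method names follow recent
releases and should be treated as assumptions; most applications simply use
the kernel or FUSE mounts instead:

.. code-block:: python

    import os
    import cephfs

    # Connect to the cluster and mount the filesystem at its root.
    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    # Directory and file metadata operations are served by the MDS;
    # file data goes directly to objects in the Ceph Storage Cluster.
    fs.mkdir('/demo', 0o755)
    fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
    fs.write(fd, b'hello cephfs', 0)
    fs.close(fd)

    fs.unmount()
    fs.shutdown()
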
1663 | ||
39ae355f | 1664 | .. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.io/assets/pdfs/weil-rados-pdsw07.pdf |
11fdf7f2 | 1665 | .. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) |
7c673cae FG |
1666 | .. _Monitor Config Reference: ../rados/configuration/mon-config-ref |
1667 | .. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg | |
1668 | .. _Heartbeats: ../rados/configuration/mon-osd-interaction | |
1669 | .. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds | |
39ae355f | 1670 | .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf |
7c673cae FG |
1671 | .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing |
1672 | .. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure | |
1673 | .. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure | |
1674 | .. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/ | |
1675 | .. _Hardware Recommendations: ../start/hardware-recommendations | |
1676 | .. _Network Config Reference: ../rados/configuration/network-config-ref | |
1677 | .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing | |
11fdf7f2 TL |
1678 | .. _striping: https://en.wikipedia.org/wiki/Data_striping |
1679 | .. _RAID: https://en.wikipedia.org/wiki/RAID | |
1680 | .. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0 | |
7c673cae | 1681 | .. _Ceph Object Storage: ../radosgw/ |
11fdf7f2 | 1682 | .. _RESTful: https://en.wikipedia.org/wiki/RESTful |
7c673cae FG |
1683 | .. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst |
1684 | .. _Cache Tiering: ../rados/operations/cache-tiering | |
1685 | .. _Set Pool Values: ../rados/operations/pools#set-pool-values | |
11fdf7f2 | 1686 | .. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol) |
7c673cae FG |
1687 | .. _Cephx Config Guide: ../rados/configuration/auth-config-ref |
1688 | .. _User Management: ../rados/operations/user-management |