1 | ============== |
2 | Architecture | |
3 | ============== | |
4 | ||
5 | :term:`Ceph` uniquely delivers **object, block, and file storage** in one | |
6 | unified system. Ceph is highly reliable, easy to manage, and free. The power of | |
7 | Ceph can transform your company's IT infrastructure and your ability to manage | |
8 | vast amounts of data. Ceph delivers extraordinary scalability--thousands of | |
9 | clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages | |
10 | commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster` | |
11 | accommodates large numbers of nodes, which communicate with each other to | |
12 | replicate and redistribute data dynamically. | |
13 | ||
14 | .. image:: images/stack.png | |
15 | ||
16 | ||
17 | The Ceph Storage Cluster | |
18 | ======================== | |
19 | ||
20 | Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon | |
21 | :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read | |
22 | about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale | |
23 | Storage Clusters`_. | |
24 | ||
25 | A Ceph Storage Cluster consists of two types of daemons: | |
26 | ||
27 | - :term:`Ceph Monitor` | |
28 | - :term:`Ceph OSD Daemon` | |
29 | ||
30 | .. ditaa:: +---------------+ +---------------+ | |
31 | | OSDs | | Monitors | | |
32 | +---------------+ +---------------+ | |
33 | ||
34 | A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph | |
35 | monitors ensures high availability should a monitor daemon fail. Storage cluster | |
36 | clients retrieve a copy of the cluster map from the Ceph Monitor. | |
37 | ||
38 | A Ceph OSD Daemon checks its own state and the state of other OSDs and reports | |
39 | back to monitors. | |
40 | ||
41 | Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm | |
42 | to efficiently compute information about data location, instead of having to | |
43 | depend on a central lookup table. Ceph's high-level features include providing a | |
44 | native interface to the Ceph Storage Cluster via ``librados``, and a number of | |
45 | service interfaces built on top of ``librados``. | |
46 | ||
47 | ||
48 | ||
49 | Storing Data | |
50 | ------------ | |
51 | ||
52 | The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it | |
53 | comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the | |
54 | :term:`Ceph Filesystem` or a custom implementation you create using | |
55 | ``librados``--and it stores the data as objects. Each object corresponds to a | |
56 | file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph | |
57 | OSD Daemons handle the read/write operations on the storage disks. | |
58 | ||
59 | .. ditaa:: /-----\ +-----+ +-----+ | |
60 | | obj |------>| {d} |------>| {s} | | |
61 | \-----/ +-----+ +-----+ | |
62 | ||
63 | Object File Disk | |
64 | ||
65 | Ceph OSD Daemons store all data as objects in a flat namespace (i.e., no | |
66 | hierarchy of directories). An object has an identifier, binary data, and | |
67 | metadata consisting of a set of name/value pairs. The semantics are completely | |
68 | up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file | |
69 | attributes such as the file owner, created date, last modified date, and so | |
70 | forth. | |
71 | ||
72 | ||
73 | .. ditaa:: /------+------------------------------+----------------\ | |
74 | | ID | Binary Data | Metadata | | |
75 | +------+------------------------------+----------------+ | |
76 | | 1234 | 0101010101010100110101010010 | name1 = value1 | | |
77 | | | 0101100001010100110101010010 | name2 = value2 | | |
78 | | | 0101100001010100110101010010 | nameN = valueN | | |
79 | \------+------------------------------+----------------/ | |
80 | ||
81 | .. note:: An object ID is unique across the entire cluster, not just the local | |
82 | filesystem. | |
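
Below is a minimal sketch of storing such an object through the Python
``rados`` binding to ``librados``. It assumes a reachable cluster, a readable
``/etc/ceph/ceph.conf``, and an existing pool named ``rbd``; the object name,
payload, and metadata values are illustrative only.

.. code-block:: python

    import rados

    # Connect to the Ceph Storage Cluster using the local configuration file.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Open an I/O context for an existing pool (the pool name is an assumption).
        ioctx = cluster.open_ioctx('rbd')
        try:
            # Object ID, binary data, and name/value metadata pairs.
            ioctx.write_full('john', b'binary payload for the object')
            ioctx.set_xattr('john', 'owner', b'liverpool')
            print(ioctx.get_xattr('john', 'owner'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
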
83 | ||
84 | ||
85 | .. index:: architecture; high availability, scalability | |
86 | ||
87 | Scalability and High Availability | |
88 | --------------------------------- | |
89 | ||
90 | In traditional architectures, clients talk to a centralized component (e.g., a | |
91 | gateway, broker, API, facade, etc.), which acts as a single point of entry to a | |
92 | complex subsystem. This imposes a limit to both performance and scalability, | |
93 | while introducing a single point of failure (i.e., if the centralized component | |
94 | goes down, the whole system goes down, too). | |
95 | ||
96 | Ceph eliminates the centralized gateway to enable clients to interact with | |
97 | Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other | |
98 | Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster | |
99 | of monitors to ensure high availability. To eliminate centralization, Ceph | |
100 | uses an algorithm called CRUSH. | |
101 | ||
102 | ||
103 | .. index:: CRUSH; architecture | |
104 | ||
105 | CRUSH Introduction | |
106 | ~~~~~~~~~~~~~~~~~~ | |
107 | ||
108 | Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled | |
109 | Replication Under Scalable Hashing)` algorithm to efficiently compute | |
110 | information about object location, instead of having to depend on a | |
111 | central lookup table. CRUSH provides a better data management mechanism compared | |
112 | to older approaches, and enables massive scale by cleanly distributing the work | |
113 | to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data | |
114 | replication to ensure resiliency, which is better suited to hyper-scale storage. | |
115 | The following sections provide additional details on how CRUSH works. For a | |
116 | detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized | |
117 | Placement of Replicated Data`_. | |
118 | ||
119 | .. index:: architecture; cluster map | |
120 | ||
121 | Cluster Map | |
122 | ~~~~~~~~~~~ | |
123 | ||
124 | Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the | |
125 | cluster topology, which includes five maps collectively referred to as the | |
126 | "Cluster Map": | |
127 | ||
128 | #. **The Monitor Map:** Contains the cluster ``fsid``, the position, name, | |
129 | address, and port of each monitor. It also indicates the current epoch, | |
130 | when the map was created, and the last time it changed. To view a monitor | |
131 | map, execute ``ceph mon dump``. | |
132 | ||
133 | #. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and | |
134 | last modified, a list of pools, replica sizes, PG numbers, a list of OSDs | |
135 | and their status (e.g., ``up``, ``in``). To view an OSD map, execute | |
136 | ``ceph osd dump``. | |
137 | ||
138 | #. **The PG Map:** Contains the PG version, its time stamp, the last OSD | |
139 | map epoch, the full ratios, and details on each placement group such as | |
140 | the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g., | |
141 | ``active + clean``), and data usage statistics for each pool. | |
142 | ||
143 | #. **The CRUSH Map:** Contains a list of storage devices, the failure domain | |
144 | hierarchy (e.g., device, host, rack, row, room, etc.), and rules for | |
145 | traversing the hierarchy when storing data. To view a CRUSH map, execute | |
146 | ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing | |
147 | ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``. | |
148 | You can view the decompiled map in a text editor or with ``cat``. | |
149 | ||
150 | #. **The MDS Map:** Contains the current MDS map epoch, when the map was | |
151 | created, and the last time it changed. It also contains the pool for | |
152 | storing metadata, a list of metadata servers, and which metadata servers | |
153 | are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``. | |
154 | ||
155 | Each map maintains an iterative history of its operating state changes. Ceph | |
156 | Monitors maintain a master copy of the cluster map including the cluster | |
157 | members, state, changes, and the overall health of the Ceph Storage Cluster. | |
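
The same maps can be retrieved programmatically. The following sketch, which
assumes the ``mon_command`` helper of the Python ``rados`` binding and a
reachable monitor quorum, fetches the OSD map in JSON form, mirroring what
``ceph osd dump`` prints on the command line.

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Ask the monitors for the current OSD map; equivalent to ``ceph osd dump``.
        cmd = json.dumps({'prefix': 'osd dump', 'format': 'json'})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        if ret == 0:
            osdmap = json.loads(outbuf)
            print('OSD map epoch:', osdmap['epoch'])
    finally:
        cluster.shutdown()
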
158 | ||
159 | .. index:: high availability; monitor architecture | |
160 | ||
161 | High Availability Monitors | |
162 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
163 | ||
164 | Before Ceph Clients can read or write data, they must contact a Ceph Monitor | |
165 | to obtain the most recent copy of the cluster map. A Ceph Storage Cluster | |
166 | can operate with a single monitor; however, this introduces a single | |
167 | point of failure (i.e., if the monitor goes down, Ceph Clients cannot | |
168 | read or write data). | |
169 | ||
170 | For added reliability and fault tolerance, Ceph supports a cluster of monitors. | |
171 | In a cluster of monitors, latency and other faults can cause one or more | |
172 | monitors to fall behind the current state of the cluster. For this reason, Ceph | |
173 | must have agreement among various monitor instances regarding the state of the | |
174 | cluster. Ceph always uses a majority of monitors (e.g., 1, 2 out of 3, 3 out of 5, 4 out of 6, etc.) | |
175 | and the `Paxos`_ algorithm to establish a consensus among the monitors about the | |
176 | current state of the cluster. | |
177 | ||
178 | For details on configuring monitors, see the `Monitor Config Reference`_. | |
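
The majorities listed above follow directly from the quorum rule. The snippet
below is only an arithmetic illustration of that rule, not Ceph code.

.. code-block:: python

    def quorum_size(num_monitors):
        """Smallest majority of monitors that must agree (e.g., 2 of 3, 3 of 5)."""
        return num_monitors // 2 + 1

    for n in (1, 3, 5, 7):
        print('{} monitors -> {} needed for quorum'.format(n, quorum_size(n)))
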
179 | ||
180 | .. index:: architecture; high availability authentication | |
181 | ||
182 | High Availability Authentication | |
183 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
184 | ||
185 | To identify users and protect against man-in-the-middle attacks, Ceph provides | |
186 | its ``cephx`` authentication system to authenticate users and daemons. | |
187 | ||
188 | .. note:: The ``cephx`` protocol does not address data encryption in transport | |
189 | (e.g., SSL/TLS) or encryption at rest. | |
190 | ||
191 | Cephx uses shared secret keys for authentication, meaning both the client and | |
192 | the monitor cluster have a copy of the client's secret key. The authentication | |
193 | protocol is such that both parties are able to prove to each other they have a | |
194 | copy of the key without actually revealing it. This provides mutual | |
195 | authentication, which means the cluster is sure the user possesses the secret | |
196 | key, and the user is sure that the cluster has a copy of the secret key. | |
197 | ||
198 | A key scalability feature of Ceph is to avoid a centralized interface to the | |
199 | Ceph object store, which means that Ceph clients must be able to interact with | |
200 | OSDs directly. To protect data, Ceph provides its ``cephx`` authentication | |
201 | system, which authenticates users operating Ceph clients. The ``cephx`` protocol | |
202 | operates in a manner similar to `Kerberos`_. | |
203 | ||
204 | A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each | |
205 | monitor can authenticate users and distribute keys, so there is no single point | |
206 | of failure or bottleneck when using ``cephx``. The monitor returns an | |
207 | authentication data structure similar to a Kerberos ticket that contains a | |
208 | session key for use in obtaining Ceph services. This session key is itself | |
209 | encrypted with the user's permanent secret key, so that only the user can | |
210 | request services from the Ceph Monitor(s). The client then uses the session key | |
211 | to request its desired services from the monitor, and the monitor provides the | |
212 | client with a ticket that will authenticate the client to the OSDs that actually | |
213 | handle data. Ceph Monitors and OSDs share a secret, so the client can use the | |
214 | ticket provided by the monitor with any OSD or metadata server in the cluster. | |
215 | Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired | |
216 | ticket or session key obtained surreptitiously. This form of authentication will | |
217 | prevent attackers with access to the communications medium from either creating | |
218 | bogus messages under another user's identity or altering another user's | |
219 | legitimate messages, as long as the user's secret key is not divulged before it | |
220 | expires. | |
221 | ||
222 | To use ``cephx``, an administrator must set up users first. In the following | |
223 | diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from | |
224 | the command line to generate a username and secret key. Ceph's ``auth`` | |
225 | subsystem generates the username and key, stores a copy with the monitor(s) and | |
226 | transmits the user's secret back to the ``client.admin`` user. This means that | |
227 | the client and the monitor share a secret key. | |
228 | ||
229 | .. note:: The ``client.admin`` user must provide the user ID and | |
230 | secret key to the user in a secure manner. | |
231 | ||
232 | .. ditaa:: +---------+ +---------+ | |
233 | | Client | | Monitor | | |
234 | +---------+ +---------+ | |
235 | | request to | | |
236 | | create a user | | |
237 | |-------------->|----------+ create user | |
238 | | | | and | |
239 | |<--------------|<---------+ store key | |
240 | | transmit key | | |
241 | | | | |
242 | ||
243 | ||
244 | To authenticate with the monitor, the client passes in the user name to the | |
245 | monitor, and the monitor generates a session key and encrypts it with the secret | |
246 | key associated to the user name. Then, the monitor transmits the encrypted | |
247 | ticket back to the client. The client then decrypts the payload with the shared | |
248 | secret key to retrieve the session key. The session key identifies the user for | |
249 | the current session. The client then requests a ticket on behalf of the user | |
250 | signed by the session key. The monitor generates a ticket, encrypts it with the | |
251 | user's secret key and transmits it back to the client. The client decrypts the | |
252 | ticket and uses it to sign requests to OSDs and metadata servers throughout the | |
253 | cluster. | |
254 | ||
255 | .. ditaa:: +---------+ +---------+ | |
256 | | Client | | Monitor | | |
257 | +---------+ +---------+ | |
258 | | authenticate | | |
259 | |-------------->|----------+ generate and | |
260 | | | | encrypt | |
261 | |<--------------|<---------+ session key | |
262 | | transmit | | |
263 | | encrypted | | |
264 | | session key | | |
265 | | | | |
266 | |-----+ decrypt | | |
267 | | | session | | |
268 | |<----+ key | | |
269 | | | | |
270 | | req. ticket | | |
271 | |-------------->|----------+ generate and | |
272 | | | | encrypt | |
273 | |<--------------|<---------+ ticket | |
274 | | recv. ticket | | |
275 | | | | |
276 | |-----+ decrypt | | |
277 | | | ticket | | |
278 | |<----+ | | |
279 | ||
280 | ||
281 | The ``cephx`` protocol authenticates ongoing communications between the client | |
282 | machine and the Ceph servers. Each message sent between a client and server, | |
283 | subsequent to the initial authentication, is signed using a ticket that the | |
284 | monitors, OSDs and metadata servers can verify with their shared secret. | |
285 | ||
286 | .. ditaa:: +---------+ +---------+ +-------+ +-------+ | |
287 | | Client | | Monitor | | MDS | | OSD | | |
288 | +---------+ +---------+ +-------+ +-------+ | |
289 | | request to | | | | |
290 | | create a user | | | | |
291 | |-------------->| mon and | | | |
292 | |<--------------| client share | | | |
293 | | receive | a secret. | | | |
294 | | shared secret | | | | |
295 | | |<------------>| | | |
296 | | |<-------------+------------>| | |
297 | | | mon, mds, | | | |
298 | | authenticate | and osd | | | |
299 | |-------------->| share | | | |
300 | |<--------------| a secret | | | |
301 | | session key | | | | |
302 | | | | | | |
303 | | req. ticket | | | | |
304 | |-------------->| | | | |
305 | |<--------------| | | | |
306 | | recv. ticket | | | | |
307 | | | | | | |
308 | | make request (CephFS only) | | | |
309 | |----------------------------->| | | |
310 | |<-----------------------------| | | |
311 | | receive response (CephFS only) | | |
312 | | | | |
313 | | make request | | |
314 | |------------------------------------------->| | |
315 | |<-------------------------------------------| | |
316 | receive response | |
317 | ||
318 | The protection offered by this authentication is between the Ceph client and the | |
319 | Ceph server hosts. The authentication is not extended beyond the Ceph client. If | |
320 | the user accesses the Ceph client from a remote host, Ceph authentication is not | |
321 | applied to the connection between the user's host and the client host. | |
322 | ||
323 | ||
324 | For configuration details, see `Cephx Config Guide`_. For user management | |
325 | details, see `User Management`_. | |
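
As a concrete illustration, a ``librados`` client authenticates by naming a
user and pointing at the keyring that holds that user's secret key. The sketch
below uses the Python ``rados`` binding; the user name and keyring path are
assumptions for this example.

.. code-block:: python

    import rados

    # Connect as a specific Ceph user; cephx lets the client prove possession of
    # the user's secret key without sending the key itself over the wire.
    cluster = rados.Rados(
        conffile='/etc/ceph/ceph.conf',
        rados_id='admin',  # i.e., the ``client.admin`` user set up by the administrator
        conf={'keyring': '/etc/ceph/ceph.client.admin.keyring'})
    cluster.connect()
    print('Connected to cluster:', cluster.get_fsid())
    cluster.shutdown()
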
326 | ||
327 | ||
328 | .. index:: architecture; smart daemons and scalability | |
329 | ||
330 | Smart Daemons Enable Hyperscale | |
331 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
332 | ||
333 | In many clustered architectures, the primary purpose of cluster membership is | |
334 | so that a centralized interface knows which nodes it can access. Then the | |
335 | centralized interface provides services to the client through a double | |
336 | dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale. | |
337 | ||
338 | Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster | |
339 | aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD | |
340 | Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with | |
341 | other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients | |
342 | to interact directly with Ceph OSD Daemons. | |
343 | ||
344 | The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with | |
345 | each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph | |
346 | nodes to easily perform tasks that would bog down a centralized server. The | |
347 | ability to leverage this computing power leads to several major benefits: | |
348 | ||
349 | #. **OSDs Service Clients Directly:** Since any network device has a limit to | |
350 | the number of concurrent connections it can support, a centralized system | |
351 | has a low physical limit at high scales. By enabling Ceph Clients to contact | |
352 | Ceph OSD Daemons directly, Ceph increases both performance and total system | |
353 | capacity simultaneously, while removing a single point of failure. Ceph | |
354 | Clients can maintain a session when they need to, and with a particular Ceph | |
355 | OSD Daemon instead of a centralized server. | |
356 | ||
357 | #. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report | |
358 | on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` | |
359 | or ``down`` reflecting whether or not it is running and able to service | |
360 | Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph | |
361 | Storage Cluster, this status may indicate the failure of the Ceph OSD | |
362 | Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD | |
363 | Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs | |
364 | periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous, | |
365 | and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that | |
366 | message after a configurable period of time, it marks the OSD ``down``. | |
367 | This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will | |
368 | determine if a neighboring OSD is down and report it to the Ceph Monitor(s). | |
369 | This assures that Ceph Monitors are lightweight processes. See `Monitoring | |
370 | OSDs`_ and `Heartbeats`_ for additional details. | |
371 | ||
372 | #. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, | |
373 | Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph | |
374 | OSD Daemons can compare object metadata in one placement group with its | |
375 | replicas in placement groups stored on other OSDs. Scrubbing (usually | |
376 | performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also | |
377 | perform deeper scrubbing by comparing data in objects bit-for-bit. Deep | |
378 | scrubbing (usually performed weekly) finds bad sectors on a drive that | |
379 | weren't apparent in a light scrub. See `Data Scrubbing`_ for details on | |
380 | configuring scrubbing. | |
381 | ||
382 | #. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH | |
383 | algorithm, but the Ceph OSD Daemon uses it to compute where replicas of | |
384 | objects should be stored (and for rebalancing). In a typical write scenario, | |
385 | a client uses the CRUSH algorithm to compute where to store an object, maps | |
386 | the object to a pool and placement group, then looks at the CRUSH map to | |
387 | identify the primary OSD for the placement group. | |
388 | ||
389 | The client writes the object to the identified placement group in the | |
390 | primary OSD. Then, the primary OSD with its own copy of the CRUSH map | |
391 | identifies the secondary and tertiary OSDs for replication purposes, and | |
392 | replicates the object to the appropriate placement groups in the secondary | |
393 | and tertiary OSDs (as many OSDs as additional replicas), and responds to the | |
394 | client once it has confirmed the object was stored successfully. | |
395 | ||
396 | .. ditaa:: | |
397 | +----------+ | |
398 | | Client | | |
399 | | | | |
400 | +----------+ | |
401 | * ^ | |
402 | Write (1) | | Ack (6) | |
403 | | | | |
404 | v * | |
405 | +-------------+ | |
406 | | Primary OSD | | |
407 | | | | |
408 | +-------------+ | |
409 | * ^ ^ * | |
410 | Write (2) | | | | Write (3) | |
411 | +------+ | | +------+ | |
412 | | +------+ +------+ | | |
413 | | | Ack (4) Ack (5)| | | |
414 | v * * v | |
415 | +---------------+ +---------------+ | |
416 | | Secondary OSD | | Tertiary OSD | | |
417 | | | | | | |
418 | +---------------+ +---------------+ | |
419 | ||
420 | With the ability to perform data replication, Ceph OSD Daemons relieve Ceph | |
421 | clients from that duty, while ensuring high data availability and data safety. | |
422 | ||
423 | ||
424 | Dynamic Cluster Management | |
425 | -------------------------- | |
426 | ||
427 | In the `Scalability and High Availability`_ section, we explained how Ceph uses | |
428 | CRUSH, cluster awareness and intelligent daemons to scale and maintain high | |
429 | availability. Key to Ceph's design is the autonomous, self-healing, and | |
430 | intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to | |
431 | enable modern cloud storage infrastructures to place data, rebalance the cluster | |
432 | and recover from faults dynamically. | |
433 | ||
434 | .. index:: architecture; pools | |
435 | ||
436 | About Pools | |
437 | ~~~~~~~~~~~ | |
438 | ||
439 | The Ceph storage system supports the notion of 'Pools', which are logical | |
440 | partitions for storing objects. | |
441 | ||
442 | Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to | |
443 | pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the | |
444 | number of placement groups determine how Ceph will place the data. | |
445 | ||
446 | .. ditaa:: | |
447 | +--------+ Retrieves +---------------+ | |
448 | | Client |------------>| Cluster Map | | |
449 | +--------+ +---------------+ | |
450 | | | |
451 | v Writes | |
452 | /-----\ | |
453 | | obj | | |
454 | \-----/ | |
455 | | To | |
456 | v | |
457 | +--------+ +---------------+ | |
458 | | Pool |---------->| CRUSH Ruleset | | |
459 | +--------+ Selects +---------------+ | |
460 | ||
461 | ||
462 | Pools set at least the following parameters: | |
463 | ||
464 | - Ownership/Access to Objects | |
465 | - The Number of Placement Groups, and | |
466 | - The CRUSH Ruleset to Use. | |
467 | ||
468 | See `Set Pool Values`_ for details. | |
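
For illustration, the sketch below creates a pool and reads back its replica
count using the Python ``rados`` binding. It assumes a reachable cluster,
default placement-group and CRUSH settings, and that the hypothetical pool
name ``liverpool`` is free to use.

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Create the pool with default PG count and CRUSH rule.
        if not cluster.pool_exists('liverpool'):
            cluster.create_pool('liverpool')

        # Read the pool's replica count, as ``ceph osd pool get liverpool size`` would.
        cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'liverpool',
                          'var': 'size', 'format': 'json'})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        if ret == 0:
            print(json.loads(outbuf))
    finally:
        cluster.shutdown()
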
469 | ||
470 | ||
471 | .. index: architecture; placement group mapping | |
472 | ||
473 | Mapping PGs to OSDs | |
474 | ~~~~~~~~~~~~~~~~~~~ | |
475 | ||
476 | Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. | |
477 | When a Ceph Client stores objects, CRUSH will map each object to a placement | |
478 | group. | |
479 | ||
480 | Mapping objects to placement groups creates a layer of indirection between the | |
481 | Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to | |
482 | grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph | |
483 | Client "knew" which Ceph OSD Daemon had which object, that would create a tight | |
484 | coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH | |
485 | algorithm maps each object to a placement group and then maps each placement | |
486 | group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to | |
487 | rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices | |
488 | come online. The following diagram depicts how CRUSH maps objects to placement | |
489 | groups, and placement groups to OSDs. | |
490 | ||
491 | .. ditaa:: | |
492 | /-----\ /-----\ /-----\ /-----\ /-----\ | |
493 | | obj | | obj | | obj | | obj | | obj | | |
494 | \-----/ \-----/ \-----/ \-----/ \-----/ | |
495 | | | | | | | |
496 | +--------+--------+ +---+----+ | |
497 | | | | |
498 | v v | |
499 | +-----------------------+ +-----------------------+ | |
500 | | Placement Group #1 | | Placement Group #2 | | |
501 | | | | | | |
502 | +-----------------------+ +-----------------------+ | |
503 | | | | |
504 | | +-----------------------+---+ | |
505 | +------+------+-------------+ | | |
506 | | | | | | |
507 | v v v v | |
508 | /----------\ /----------\ /----------\ /----------\ | |
509 | | | | | | | | | | |
510 | | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 | | |
511 | | | | | | | | | | |
512 | \----------/ \----------/ \----------/ \----------/ | |
513 | ||
514 | With a copy of the cluster map and the CRUSH algorithm, the client can compute | |
515 | exactly which OSD to use when reading or writing a particular object. | |
516 | ||
517 | .. index:: architecture; calculating PG IDs | |
518 | ||
519 | Calculating PG IDs | |
520 | ~~~~~~~~~~~~~~~~~~ | |
521 | ||
522 | When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the | |
523 | `Cluster Map`_. With the cluster map, the client knows about all of the monitors, | |
524 | OSDs, and metadata servers in the cluster. **However, it doesn't know anything | |
525 | about object locations.** | |
526 | ||
527 | .. epigraph:: | |
528 | ||
529 | Object locations get computed. | |
530 | ||
531 | ||
532 | The only input required by the client is the object ID and the pool. | |
533 | It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client | |
534 | wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.) | |
535 | it calculates a placement group using the object name, a hash code, the | |
536 | number of PGs in the pool and the pool name. Ceph clients use the following | |
537 | steps to compute PG IDs. | |
538 | ||
539 | #. The client inputs the pool ID and the object ID. (e.g., pool = "liverpool" | |
540 | and object-id = "john") | |
541 | #. Ceph takes the object ID and hashes it. | |
542 | #. Ceph calculates the hash modulo the number of PGs (e.g., ``58``) to get | |
543 | a PG ID. | |
544 | #. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``) | |
545 | #. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``). | |
546 | ||
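The sketch below mirrors those steps in simplified form. Ceph itself hashes
the object name with its own hash function (rjenkins) and uses a stable modulo
against the PG count, so the numbers here are illustrative rather than the
values a real cluster would compute.

.. code-block:: python

    import zlib

    def pg_id(pool_id, object_name, pg_num):
        """Simplified PG ID calculation: hash the object name, take it modulo the
        number of PGs, and prepend the pool ID. Ceph uses rjenkins, not crc32."""
        obj_hash = zlib.crc32(object_name.encode('utf-8'))  # step 2: hash the object ID
        pg = obj_hash % pg_num                              # step 3: modulo the PG count
        return '{0}.{1:x}'.format(pool_id, pg)              # step 5: prepend the pool ID

    # Pool "liverpool" has ID 4 (step 4) and, say, 128 PGs; the object ID is "john".
    print(pg_id(4, 'john', 128))
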
547 | Computing object locations is much faster than performing object location queries | |
548 | over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable | |
549 | Hashing)` algorithm allows a client to compute where objects *should* be stored, | |
550 | and enables the client to contact the primary OSD to store or retrieve the | |
551 | objects. | |
552 | ||
553 | .. index:: architecture; PG Peering | |
554 | ||
555 | Peering and Sets | |
556 | ~~~~~~~~~~~~~~~~ | |
557 | ||
558 | In previous sections, we noted that Ceph OSD Daemons check each other's | |
559 | heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons | |
560 | do is called 'peering', which is the process of bringing all of the OSDs that | |
561 | store a Placement Group (PG) into agreement about the state of all of the | |
562 | objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report | |
563 | Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve | |
564 | themselves; however, if the problem persists, you may need to refer to the | |
565 | `Troubleshooting Peering Failure`_ section. | |
566 | ||
567 | .. Note:: Agreeing on the state does not mean that the PGs have the latest contents. | |
568 | ||
569 | The Ceph Storage Cluster was designed to store at least two copies of an object | |
570 | (i.e., ``size = 2``), which is the minimum requirement for data safety. For high | |
571 | availability, a Ceph Storage Cluster should store more than two copies of an object | |
572 | (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a | |
573 | ``degraded`` state while maintaining data safety. | |
574 | ||
575 | Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not | |
576 | name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but | |
577 | rather refer to them as *Primary*, *Secondary*, and so forth. By convention, | |
578 | the *Primary* is the first OSD in the *Acting Set*, and is responsible for | |
579 | coordinating the peering process for each placement group where it acts as | |
580 | the *Primary*, and is the **ONLY** OSD that will accept client-initiated | |
581 | writes to objects for a given placement group where it acts as the *Primary*. | |
582 | ||
583 | When a series of OSDs is responsible for a placement group, we refer to that | |
584 | series of OSDs as an *Acting Set*. An *Acting Set* may refer to the Ceph | |
585 | OSD Daemons that are currently responsible for the placement group, or the Ceph | |
586 | OSD Daemons that were responsible for a particular placement group as of some | |
587 | epoch. | |
588 | ||
589 | The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``. | |
590 | When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up | |
591 | Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD | |
592 | Daemons when an OSD fails. | |
593 | ||
594 | .. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and | |
595 | ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails, | |
596 | the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be | |
597 | removed from the *Up Set*. | |
598 | ||
599 | ||
600 | .. index:: architecture; Rebalancing | |
601 | ||
602 | Rebalancing | |
603 | ~~~~~~~~~~~ | |
604 | ||
605 | When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets | |
606 | updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes | |
607 | the cluster map. Consequently, it changes object placement, because it changes | |
608 | an input for the calculations. The following diagram depicts the rebalancing | |
609 | process (albeit rather crudely, since it is substantially less impactful with | |
610 | large clusters) where some, but not all of the PGs migrate from existing OSDs | |
611 | (OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is | |
612 | stable. Many of the placement groups remain in their original configuration, | |
613 | and each OSD gets some added capacity, so there are no load spikes on the | |
614 | new OSD after rebalancing is complete. | |
615 | ||
616 | ||
617 | .. ditaa:: | |
618 | +--------+ +--------+ | |
619 | Before | OSD 1 | | OSD 2 | | |
620 | +--------+ +--------+ | |
621 | | PG #1 | | PG #6 | | |
622 | | PG #2 | | PG #7 | | |
623 | | PG #3 | | PG #8 | | |
624 | | PG #4 | | PG #9 | | |
625 | | PG #5 | | PG #10 | | |
626 | +--------+ +--------+ | |
627 | ||
628 | +--------+ +--------+ +--------+ | |
629 | After | OSD 1 | | OSD 2 | | OSD 3 | | |
630 | +--------+ +--------+ +--------+ | |
631 | | PG #1 | | PG #7 | | PG #3 | | |
632 | | PG #2 | | PG #8 | | PG #6 | | |
633 | | PG #4 | | PG #10 | | PG #9 | | |
634 | | PG #5 | | | | | | |
635 | | | | | | | | |
636 | +--------+ +--------+ +--------+ | |
637 | ||
638 | ||
639 | .. index:: architecture; Data Scrubbing | |
640 | ||
641 | Data Consistency | |
642 | ~~~~~~~~~~~~~~~~ | |
643 | ||
644 | As part of maintaining data consistency and cleanliness, Ceph OSDs can also | |
645 | scrub objects within placement groups. That is, Ceph OSDs can compare object | |
646 | metadata in one placement group with its replicas in placement groups stored in | |
647 | other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem | |
648 | errors. OSDs can also perform deeper scrubbing by comparing data in objects | |
649 | bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a | |
650 | disk that weren't apparent in a light scrub. | |
651 | ||
652 | See `Data Scrubbing`_ for details on configuring scrubbing. | |
653 | ||
654 | ||
655 | ||
656 | ||
657 | ||
658 | .. index:: erasure coding | |
659 | ||
660 | Erasure Coding | |
661 | -------------- | |
662 | ||
663 | An erasure coded pool stores each object as ``K+M`` chunks: each object is divided into | |
664 | ``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size | |
665 | of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of | |
666 | the chunk is stored as an attribute of the object. | |
667 | ||
668 | For instance, an erasure coded pool can be created to use five OSDs (``K+M = 5``) and | |
669 | sustain the loss of two of them (``M = 2``). | |
670 | ||
671 | Reading and Writing Encoded Chunks | |
672 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
673 | ||
674 | When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure | |
675 | encoding function splits the content into three data chunks simply by dividing | |
676 | the content in three: the first contains ``ABC``, the second ``DEF`` and the | |
677 | last ``GHI``. The content will be padded if the content length is not a multiple | |
678 | of ``K``. The function also creates two coding chunks: the fourth with ``YXY`` | |
679 | and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting set. | |
680 | The chunks are stored in objects that have the same name (**NYAN**) but reside | |
681 | on different OSDs. The order in which the chunks were created must be preserved | |
682 | and is stored as an attribute of the object (``shard_t``), in addition to its | |
683 | name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains | |
684 | ``YXY`` and is stored on **OSD3**. | |
685 | ||
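A rough sketch of the ``K``-way split and padding described above is shown
below; the actual encode function, and the ``M`` coding chunks it produces,
come from the pool's erasure-code plugin (for example ``jerasure``) and are
not reproduced here.

.. code-block:: python

    def split_into_data_chunks(content, k):
        """Split an object payload into K equal data chunks, zero-padding the end
        so the length is a multiple of K. Coding chunks are computed separately
        by the erasure-code plugin and are not shown here."""
        chunk_len = -(-len(content) // k)            # ceiling division
        padded = content.ljust(chunk_len * k, b'\0')
        return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

    print(split_into_data_chunks(b'ABCDEFGHI', 3))   # [b'ABC', b'DEF', b'GHI']
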
686 | ||
687 | .. ditaa:: | |
688 | +-------------------+ | |
689 | name | NYAN | | |
690 | +-------------------+ | |
691 | content | ABCDEFGHI | | |
692 | +--------+----------+ | |
693 | | | |
694 | | | |
695 | v | |
696 | +------+------+ | |
697 | +---------------+ encode(3,2) +-----------+ | |
698 | | +--+--+---+---+ | | |
699 | | | | | | | |
700 | | +-------+ | +-----+ | | |
701 | | | | | | | |
702 | +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ | |
703 | name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | | |
704 | +------+ +------+ +------+ +------+ +------+ | |
705 | shard | 1 | | 2 | | 3 | | 4 | | 5 | | |
706 | +------+ +------+ +------+ +------+ +------+ | |
707 | content | ABC | | DEF | | GHI | | YXY | | QGC | | |
708 | +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ | |
709 | | | | | | | |
710 | | | v | | | |
711 | | | +--+---+ | | | |
712 | | | | OSD1 | | | | |
713 | | | +------+ | | | |
714 | | | | | | |
715 | | | +------+ | | | |
716 | | +------>| OSD2 | | | | |
717 | | +------+ | | | |
718 | | | | | |
719 | | +------+ | | | |
720 | | | OSD3 |<----+ | | |
721 | | +------+ | | |
722 | | | | |
723 | | +------+ | | |
724 | | | OSD4 |<--------------+ | |
725 | | +------+ | |
726 | | | |
727 | | +------+ | |
728 | +----------------->| OSD5 | | |
729 | +------+ | |
730 | ||
731 | ||
732 | When the object **NYAN** is read from the erasure coded pool, the decoding | |
733 | function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing | |
734 | ``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content | |
735 | of the object ``ABCDEFGHI``. The decoding function is informed that chunks 2 | |
736 | and 5 are missing (they are called 'erasures'). Chunk 5 could not be read | |
737 | because **OSD4** is out. The decoding function can be called as soon as | |
738 | three chunks are read: **OSD2** was the slowest and its chunk was not taken into | |
739 | account. | |
740 | ||
741 | .. ditaa:: | |
742 | +-------------------+ | |
743 | name | NYAN | | |
744 | +-------------------+ | |
745 | content | ABCDEFGHI | | |
746 | +---------+---------+ | |
747 | ^ | |
748 | | | |
749 | | | |
750 | +-------+-------+ | |
751 | | decode(3,2) | | |
752 | +------------->+ erasures 2,5 +<-+ | |
753 | | | | | | |
754 | | +-------+-------+ | | |
755 | | ^ | | |
756 | | | | | |
757 | | | | | |
758 | +--+---+ +------+ +---+--+ +---+--+ | |
759 | name | NYAN | | NYAN | | NYAN | | NYAN | | |
760 | +------+ +------+ +------+ +------+ | |
761 | shard | 1 | | 2 | | 3 | | 4 | | |
762 | +------+ +------+ +------+ +------+ | |
763 | content | ABC | | DEF | | GHI | | YXY | | |
764 | +--+---+ +--+---+ +--+---+ +--+---+ | |
765 | ^ . ^ ^ | |
766 | | TOO . | | | |
767 | | SLOW . +--+---+ | | |
768 | | ^ | OSD1 | | | |
769 | | | +------+ | | |
770 | | | | | |
771 | | | +------+ | | |
772 | | +-------| OSD2 | | | |
773 | | +------+ | | |
774 | | | | |
775 | | +------+ | | |
776 | | | OSD3 |------+ | |
777 | | +------+ | |
778 | | | |
779 | | +------+ | |
780 | | | OSD4 | OUT | |
781 | | +------+ | |
782 | | | |
783 | | +------+ | |
784 | +------------------| OSD5 | | |
785 | +------+ | |
786 | ||
787 | ||
788 | Interrupted Full Writes | |
789 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
790 | ||
791 | In an erasure coded pool, the primary OSD in the up set receives all write | |
792 | operations. It is responsible for encoding the payload into ``K+M`` chunks and | |
793 | sending them to the other OSDs. It is also responsible for maintaining an | |
794 | authoritative version of the placement group logs. | |
795 | ||
796 | In the following diagram, an erasure coded placement group has been created with | |
797 | ``K = 2 + M = 1`` and is supported by three OSDs, two for ``K`` and one for | |
798 | ``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and | |
799 | **OSD 3**. An object has been encoded and stored in the OSDs : the chunk | |
800 | ``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on | |
801 | **OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The | |
802 | placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1, | |
803 | version 1). | |
804 | ||
805 | ||
806 | .. ditaa:: | |
807 | Primary OSD | |
808 | ||
809 | +-------------+ | |
810 | | OSD 1 | +-------------+ | |
811 | | log | Write Full | | | |
812 | | +----+ |<------------+ Ceph Client | | |
813 | | |D1v1| 1,1 | v1 | | | |
814 | | +----+ | +-------------+ | |
815 | +------+------+ | |
816 | | | |
817 | | | |
818 | | +-------------+ | |
819 | | | OSD 2 | | |
820 | | | log | | |
821 | +--------->+ +----+ | | |
822 | | | |D2v1| 1,1 | | |
823 | | | +----+ | | |
824 | | +-------------+ | |
825 | | | |
826 | | +-------------+ | |
827 | | | OSD 3 | | |
828 | | | log | | |
829 | +--------->| +----+ | | |
830 | | |C1v1| 1,1 | | |
831 | | +----+ | | |
832 | +-------------+ | |
833 | ||
834 | **OSD 1** is the primary and receives a **WRITE FULL** from a client, which | |
835 | means the payload is to replace the object entirely instead of overwriting a | |
836 | portion of it. Version 2 (v2) of the object is created to override version 1 | |
837 | (v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data | |
838 | chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and | |
839 | ``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent | |
840 | to the target OSD, including the primary OSD which is responsible for storing | |
841 | chunks in addition to handling write operations and maintaining an authoritative | |
842 | version of the placement group logs. When an OSD receives the message | |
843 | instructing it to write the chunk, it also creates a new entry in the placement | |
844 | group logs to reflect the change. For instance, as soon as **OSD 3** stores | |
845 | ``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs. | |
846 | Because the OSDs work asynchronously, some chunks may still be in flight ( such | |
847 | as ``D2v2`` ) while others are acknowledged and on disk ( such as ``C1v1`` and | |
848 | ``D1v1``). | |
849 | ||
850 | .. ditaa:: | |
851 | ||
852 | Primary OSD | |
853 | ||
854 | +-------------+ | |
855 | | OSD 1 | | |
856 | | log | | |
857 | | +----+ | +-------------+ | |
858 | | |D1v2| 1,2 | Write Full | | | |
859 | | +----+ +<------------+ Ceph Client | | |
860 | | | v2 | | | |
861 | | +----+ | +-------------+ | |
862 | | |D1v1| 1,1 | | |
863 | | +----+ | | |
864 | +------+------+ | |
865 | | | |
866 | | | |
867 | | +------+------+ | |
868 | | | OSD 2 | | |
869 | | +------+ | log | | |
870 | +->| D2v2 | | +----+ | | |
871 | | +------+ | |D2v1| 1,1 | | |
872 | | | +----+ | | |
873 | | +-------------+ | |
874 | | | |
875 | | +-------------+ | |
876 | | | OSD 3 | | |
877 | | | log | | |
878 | | | +----+ | | |
879 | | | |C1v2| 1,2 | | |
880 | +---------->+ +----+ | | |
881 | | | | |
882 | | +----+ | | |
883 | | |C1v1| 1,1 | | |
884 | | +----+ | | |
885 | +-------------+ | |
886 | ||
887 | ||
888 | If all goes well, the chunks are acknowledged on each OSD in the acting set and | |
889 | the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``. | |
890 | ||
891 | .. ditaa:: | |
892 | ||
893 | Primary OSD | |
894 | ||
895 | +-------------+ | |
896 | | OSD 1 | | |
897 | | log | | |
898 | | +----+ | +-------------+ | |
899 | | |D1v2| 1,2 | Write Full | | | |
900 | | +----+ +<------------+ Ceph Client | | |
901 | | | v2 | | | |
902 | | +----+ | +-------------+ | |
903 | | |D1v1| 1,1 | | |
904 | | +----+ | | |
905 | +------+------+ | |
906 | | | |
907 | | +-------------+ | |
908 | | | OSD 2 | | |
909 | | | log | | |
910 | | | +----+ | | |
911 | | | |D2v2| 1,2 | | |
912 | +---------->+ +----+ | | |
913 | | | | | |
914 | | | +----+ | | |
915 | | | |D2v1| 1,1 | | |
916 | | | +----+ | | |
917 | | +-------------+ | |
918 | | | |
919 | | +-------------+ | |
920 | | | OSD 3 | | |
921 | | | log | | |
922 | | | +----+ | | |
923 | | | |C1v2| 1,2 | | |
924 | +---------->+ +----+ | | |
925 | | | | |
926 | | +----+ | | |
927 | | |C1v1| 1,1 | | |
928 | | +----+ | | |
929 | +-------------+ | |
930 | ||
931 | ||
932 | Finally, the files used to store the chunks of the previous version of the | |
933 | object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1`` | |
934 | on **OSD 3**. | |
935 | ||
936 | .. ditaa:: | |
937 | Primary OSD | |
938 | ||
939 | +-------------+ | |
940 | | OSD 1 | | |
941 | | log | | |
942 | | +----+ | | |
943 | | |D1v2| 1,2 | | |
944 | | +----+ | | |
945 | +------+------+ | |
946 | | | |
947 | | | |
948 | | +-------------+ | |
949 | | | OSD 2 | | |
950 | | | log | | |
951 | +--------->+ +----+ | | |
952 | | | |D2v2| 1,2 | | |
953 | | | +----+ | | |
954 | | +-------------+ | |
955 | | | |
956 | | +-------------+ | |
957 | | | OSD 3 | | |
958 | | | log | | |
959 | +--------->| +----+ | | |
960 | | |C1v2| 1,2 | | |
961 | | +----+ | | |
962 | +-------------+ | |
963 | ||
964 | ||
965 | But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight, | |
966 | the object's version 2 is partially written: **OSD 3** has one chunk but that is | |
967 | not enough to recover. Two chunks are lost: ``D1v2`` and ``D2v2``, and the | |
968 | erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are | |
969 | available to rebuild the third. **OSD 4** becomes the new primary and finds that | |
970 | the ``last_complete`` log entry (i.e., all objects before this entry were known | |
971 | to be available on all OSDs in the previous acting set ) is ``1,1`` and that | |
972 | will be the head of the new authoritative log. | |
973 | ||
974 | .. ditaa:: | |
975 | +-------------+ | |
976 | | OSD 1 | | |
977 | | (down) | | |
978 | | c333 | | |
979 | +------+------+ | |
980 | | | |
981 | | +-------------+ | |
982 | | | OSD 2 | | |
983 | | | log | | |
984 | | | +----+ | | |
985 | +---------->+ |D2v1| 1,1 | | |
986 | | | +----+ | | |
987 | | | | | |
988 | | +-------------+ | |
989 | | | |
990 | | +-------------+ | |
991 | | | OSD 3 | | |
992 | | | log | | |
993 | | | +----+ | | |
994 | | | |C1v2| 1,2 | | |
995 | +---------->+ +----+ | | |
996 | | | | |
997 | | +----+ | | |
998 | | |C1v1| 1,1 | | |
999 | | +----+ | | |
1000 | +-------------+ | |
1001 | Primary OSD | |
1002 | +-------------+ | |
1003 | | OSD 4 | | |
1004 | | log | | |
1005 | | | | |
1006 | | 1,1 | | |
1007 | | | | |
1008 | +------+------+ | |
1009 | ||
1010 | ||
1011 | ||
1012 | The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log | |
1013 | provided by **OSD 4**: it is discarded and the file containing the ``C1v2`` | |
1014 | chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of | |
1015 | the erasure coding library during scrubbing and stored on the new primary | |
1016 | **OSD 4**. | |
1017 | ||
1018 | ||
1019 | .. ditaa:: | |
1020 | Primary OSD | |
1021 | ||
1022 | +-------------+ | |
1023 | | OSD 4 | | |
1024 | | log | | |
1025 | | +----+ | | |
1026 | | |D1v1| 1,1 | | |
1027 | | +----+ | | |
1028 | +------+------+ | |
1029 | ^ | |
1030 | | | |
1031 | | +-------------+ | |
1032 | | | OSD 2 | | |
1033 | | | log | | |
1034 | +----------+ +----+ | | |
1035 | | | |D2v1| 1,1 | | |
1036 | | | +----+ | | |
1037 | | +-------------+ | |
1038 | | | |
1039 | | +-------------+ | |
1040 | | | OSD 3 | | |
1041 | | | log | | |
1042 | +----------| +----+ | | |
1043 | | |C1v1| 1,1 | | |
1044 | | +----+ | | |
1045 | +-------------+ | |
1046 | ||
1047 | +-------------+ | |
1048 | | OSD 1 | | |
1049 | | (down) | | |
1050 | | c333 | | |
1051 | +-------------+ | |
1052 | ||
1053 | See `Erasure Code Notes`_ for additional details. | |
1054 | ||
1055 | ||
1056 | ||
1057 | Cache Tiering | |
1058 | ------------- | |
1059 | ||
1060 | A cache tier provides Ceph Clients with better I/O performance for a subset of | |
1061 | the data stored in a backing storage tier. Cache tiering involves creating a | |
1062 | pool of relatively fast/expensive storage devices (e.g., solid state drives) | |
1063 | configured to act as a cache tier, and a backing pool of either erasure-coded | |
1064 | or relatively slower/cheaper devices configured to act as an economical storage | |
1065 | tier. The Ceph objecter handles where to place the objects and the tiering | |
1066 | agent determines when to flush objects from the cache to the backing storage | |
1067 | tier. So the cache tier and the backing storage tier are completely transparent | |
1068 | to Ceph clients. | |
1069 | ||
1070 | ||
1071 | .. ditaa:: | |
1072 | +-------------+ | |
1073 | | Ceph Client | | |
1074 | +------+------+ | |
1075 | ^ | |
1076 | Tiering is | | |
1077 | Transparent | Faster I/O | |
1078 | to Ceph | +---------------+ | |
1079 | Client Ops | | | | |
1080 | | +----->+ Cache Tier | | |
1081 | | | | | | |
1082 | | | +-----+---+-----+ | |
1083 | | | | ^ | |
1084 | v v | | Active Data in Cache Tier | |
1085 | +------+----+--+ | | | |
1086 | | Objecter | | | | |
1087 | +-----------+--+ | | | |
1088 | ^ | | Inactive Data in Storage Tier | |
1089 | | v | | |
1090 | | +-----+---+-----+ | |
1091 | | | | | |
1092 | +----->| Storage Tier | | |
1093 | | | | |
1094 | +---------------+ | |
1095 | Slower I/O | |
1096 | ||
1097 | See `Cache Tiering`_ for additional details. | |
1098 | ||
1099 | ||
1100 | .. index:: Extensibility, Ceph Classes | |
1101 | ||
1102 | Extending Ceph | |
1103 | -------------- | |
1104 | ||
1105 | You can extend Ceph by creating shared object classes called 'Ceph Classes'. | |
1106 | Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically | |
1107 | (i.e., ``$libdir/rados-classes`` by default). When you implement a class, you | |
1108 | can create new object methods that have the ability to call the native methods | |
1109 | in the Ceph Object Store, or other class methods you incorporate via libraries | |
1110 | or create yourself. | |
1111 | ||
1112 | On writes, Ceph Classes can call native or class methods, perform any series of | |
1113 | operations on the inbound data and generate a resulting write transaction that | |
1114 | Ceph will apply atomically. | |
1115 | ||
1116 | On reads, Ceph Classes can call native or class methods, perform any series of | |
1117 | operations on the outbound data and return the data to the client. | |
1118 | ||
1119 | .. topic:: Ceph Class Example | |
1120 | ||
1121 | A Ceph class for a content management system that presents pictures of a | |
1122 | particular size and aspect ratio could take an inbound bitmap image, crop it | |
1123 | to a particular aspect ratio, resize it and embed an invisible copyright or | |
1124 | watermark to help protect the intellectual property; then, save the | |
1125 | resulting bitmap image to the object store. | |
1126 | ||
1127 | See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for | |
1128 | exemplary implementations. | |
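
A client can invoke a class method on an object through ``librados``. The
sketch below uses the Python binding's ``execute`` call; it assumes the
example ``hello`` class shipped with Ceph is loadable on the OSDs and that a
pool named ``rbd`` exists.

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')          # pool name is an assumption
        try:
            ioctx.write_full('greeting', b'')       # object the class method runs against
            # The OSD executes the method next to the data and returns its output.
            ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'')
            print(ret, out)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
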
1129 | ||
1130 | ||
1131 | Summary | |
1132 | ------- | |
1133 | ||
1134 | Ceph Storage Clusters are dynamic--like a living organism. Whereas many storage | |
1135 | appliances do not fully utilize the CPU and RAM of a typical commodity server, | |
1136 | Ceph does. From heartbeats, to peering, to rebalancing the cluster or | |
1137 | recovering from faults, Ceph offloads work from clients (and from a centralized | |
1138 | gateway which doesn't exist in the Ceph architecture) and uses the computing | |
1139 | power of the OSDs to perform the work. When referring to `Hardware | |
1140 | Recommendations`_ and the `Network Config Reference`_, be cognizant of the | |
1141 | foregoing concepts to understand how Ceph utilizes computing resources. | |
1142 | ||
1143 | .. index:: Ceph Protocol, librados | |
1144 | ||
1145 | Ceph Protocol | |
1146 | ============= | |
1147 | ||
1148 | Ceph Clients use the native protocol for interacting with the Ceph Storage | |
1149 | Cluster. Ceph packages this functionality into the ``librados`` library so that | |
1150 | you can create your own custom Ceph Clients. The following diagram depicts the | |
1151 | basic architecture. | |
1152 | ||
1153 | .. ditaa:: | |
1154 | +---------------------------------+ | |
1155 | | Ceph Storage Cluster Protocol | | |
1156 | | (librados) | | |
1157 | +---------------------------------+ | |
1158 | +---------------+ +---------------+ | |
1159 | | OSDs | | Monitors | | |
1160 | +---------------+ +---------------+ | |
1161 | ||
1162 | ||
1163 | Native Protocol and ``librados`` | |
1164 | -------------------------------- | |
1165 | ||
1166 | Modern applications need a simple object storage interface with asynchronous | |
1167 | communication capability. The Ceph Storage Cluster provides such an interface, | |
1168 | with direct, parallel access to objects throughout the cluster, including the | |
1169 | operations listed below (a brief example follows the list). | |
1170 | ||
1171 | ||
1172 | - Pool Operations | |
1173 | - Snapshots and Copy-on-write Cloning | |
1174 | - Read/Write Objects | |
1175 | - Create or Remove | |
1176 | - Entire Object or Byte Range | |
1177 | - Append or Truncate | |
1178 | - Create/Set/Get/Remove XATTRs | |
1179 | - Create/Set/Get/Remove Key/Value Pairs | |
1180 | - Compound operations and dual-ack semantics | |
1181 | - Object Classes | |
1182 | ||
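The sketch below exercises the asynchronous side of this interface with the
Python ``rados`` binding: it queues a write and is notified by callback when
the cluster acknowledges it. The pool and object names are assumptions.

.. code-block:: python

    import threading
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')              # an existing pool; name is an assumption

    done = threading.Event()

    def on_complete(completion):
        # Runs on a librados callback thread once the write is acknowledged.
        print('write returned', completion.get_return_value())
        done.set()

    # Queue an asynchronous write; the client is free to do other work meanwhile.
    ioctx.aio_write_full('paul', b'asynchronous hello', oncomplete=on_complete)
    done.wait(10)

    ioctx.close()
    cluster.shutdown()
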
1183 | ||
1184 | .. index:: architecture; watch/notify | |
1185 | ||
1186 | Object Watch/Notify | |
1187 | ------------------- | |
1188 | ||
1189 | A client can register a persistent interest with an object and keep a session to | |
1190 | the primary OSD open. The client can send a notification message and a payload to | |
1191 | all watchers and receive acknowledgement when the watchers receive the | |
1192 | notification. This enables a client to use any object as a | |
1193 | synchronization/communication channel. | |
1194 | ||
1195 | ||
1196 | .. ditaa:: +----------+ +----------+ +----------+ +---------------+ | |
1197 | | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID | | |
1198 | +----------+ +----------+ +----------+ +---------------+ | |
1199 | | | | | | |
1200 | | | | | | |
1201 | | | Watch Object | | | |
1202 | |--------------------------------------------------->| | |
1203 | | | | | | |
1204 | |<---------------------------------------------------| | |
1205 | | | Ack/Commit | | | |
1206 | | | | | | |
1207 | | | Watch Object | | | |
1208 | | |---------------------------------->| | |
1209 | | | | | | |
1210 | | |<----------------------------------| | |
1211 | | | Ack/Commit | | | |
1212 | | | | Watch Object | | |
1213 | | | |----------------->| | |
1214 | | | | | | |
1215 | | | |<-----------------| | |
1216 | | | | Ack/Commit | | |
1217 | | | Notify | | | |
1218 | |--------------------------------------------------->| | |
1219 | | | | | | |
1220 | |<---------------------------------------------------| | |
1221 | | | Notify | | | |
1222 | | | | | | |
1223 | | |<----------------------------------| | |
1224 | | | Notify | | | |
1225 | | | |<-----------------| | |
1226 | | | | Notify | | |
1227 | | | Ack | | | |
1228 | |----------------+---------------------------------->| | |
1229 | | | | | | |
1230 | | | Ack | | | |
1231 | | +---------------------------------->| | |
1232 | | | | | | |
1233 | | | | Ack | | |
1234 | | | |----------------->| | |
1235 | | | | | | |
1236 | |<---------------+----------------+------------------| | |
1237 | | Complete | |
1238 | ||
1239 | .. index:: architecture; Striping | |
1240 | ||
1241 | Data Striping | |
1242 | ------------- | |
1243 | ||
1244 | Storage devices have throughput limitations, which impact performance and | |
1245 | scalability. So storage systems often support `striping`_--storing sequential | |
1246 | pieces of information across multiple storage devices--to increase throughput | |
1247 | and performance. The most common form of data striping comes from `RAID`_. | |
1248 | The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped | |
1249 | volume'. Ceph's striping offers the throughput of RAID 0 striping, the | |
1250 | reliability of n-way RAID mirroring and faster recovery. | |
1251 | ||
1252 | Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and | |
1253 | Ceph Object Storage. A Ceph Client converts its data from the representation | |
1254 | format it provides to its users (a block device image, RESTful objects, CephFS | |
1255 | filesystem directories) into objects for storage in the Ceph Storage Cluster. | |
1256 | ||
1257 | .. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped. | |
1258 | Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their | |
1259 | data over multiple Ceph Storage Cluster objects. Ceph Clients that write | |
1260 | directly to the Ceph Storage Cluster via ``librados`` must perform the | |
1261 | striping (and parallel I/O) for themselves to obtain these benefits. | |
1262 | ||
1263 | The simplest Ceph striping format involves a stripe count of 1 object. Ceph | |
1264 | Clients write stripe units to a Ceph Storage Cluster object until the object is | |
1265 | at its maximum capacity, and then create another object for additional stripes | |
1266 | of data. The simplest form of striping may be sufficient for small block device | |
1267 | images, S3 or Swift objects and CephFS files. However, this simple form doesn't | |
1268 | take maximum advantage of Ceph's ability to distribute data across placement | |
1269 | groups, and consequently doesn't improve performance very much. The following | |
1270 | diagram depicts the simplest form of striping: | |
1271 | ||
1272 | .. ditaa:: | |
1273 | +---------------+ | |
1274 | | Client Data | | |
1275 | | Format | | |
1276 | | cCCC | | |
1277 | +---------------+ | |
1278 | | | |
1279 | +--------+-------+ | |
1280 | | | | |
1281 | v v | |
1282 | /-----------\ /-----------\ | |
1283 | | Begin cCCC| | Begin cCCC| | |
1284 | | Object 0 | | Object 1 | | |
1285 | +-----------+ +-----------+ | |
1286 | | stripe | | stripe | | |
1287 | | unit 1 | | unit 5 | | |
1288 | +-----------+ +-----------+ | |
1289 | | stripe | | stripe | | |
1290 | | unit 2 | | unit 6 | | |
1291 | +-----------+ +-----------+ | |
1292 | | stripe | | stripe | | |
1293 | | unit 3 | | unit 7 | | |
1294 | +-----------+ +-----------+ | |
1295 | | stripe | | stripe | | |
1296 | | unit 4 | | unit 8 | | |
1297 | +-----------+ +-----------+ | |
1298 | | End cCCC | | End cCCC | | |
1299 | | Object 0 | | Object 1 | | |
1300 | \-----------/ \-----------/ | |
1301 | ||
1302 | ||
1303 | If you anticipate large image sizes, large S3 or Swift objects (e.g., video), | |
1304 | or large CephFS directories, you may see considerable read/write performance | |
1305 | improvements by striping client data over multiple objects within an object set. | |
1306 | A significant write performance improvement occurs when the client writes the stripe units to | |
1307 | their corresponding objects in parallel. Since objects get mapped to different | |
1308 | placement groups and further mapped to different OSDs, each write occurs in | |
1309 | parallel at the maximum write speed. A write to a single disk would be limited | |
1310 | by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g. | |
1311 | 100MB/s). By spreading that write over multiple objects (which map to different | |
1312 | placement groups and OSDs) Ceph can reduce the number of seeks per drive and | |
1313 | combine the throughput of multiple drives to achieve much faster write (or read) | |
1314 | speeds. | |
1315 | ||
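As a rough illustration of the arithmetic (the 6 ms seek and 100 MB/s figures are the example values from the paragraph above, not measurements), spreading one write over several drives reduces the data each drive has to absorb:

.. code-block:: python

   # Back-of-the-envelope arithmetic using the example figures above.
   # Real throughput also depends on replication, journaling, and the network.
   seek_s = 0.006            # example seek time per drive (6 ms)
   bandwidth_mbs = 100.0     # example per-drive bandwidth (100 MB/s)

   def write_time_ms(size_mb, drives):
       """Time to write size_mb when the write is spread evenly over `drives` drives."""
       per_drive_mb = size_mb / drives
       return (seek_s + per_drive_mb / bandwidth_mbs) * 1000

   print(write_time_ms(32, 1))   # one drive:    ~326 ms
   print(write_time_ms(32, 8))   # eight drives:  ~46 ms
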
1316 | .. note:: Striping is independent of object replicas. Since CRUSH | |
1317 | replicates objects across OSDs, stripes get replicated automatically. | |
1318 | ||
1319 | In the following diagram, client data gets striped across an object set | |
1320 | (``object set 1`` in the following diagram) consisting of 4 objects, where the | |
1321 | first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe | |
1322 | unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the | |
1323 | client determines if the object set is full. If the object set is not full, the | |
1324 | client begins writing a stripe to the first object again (``object 0`` in the | |
1325 | following diagram). If the object set is full, the client creates a new object | |
1326 | set (``object set 2`` in the following diagram), and begins writing to the first | |
1327 | stripe (``stripe unit 16``) in the first object in the new object set (``object | |
1328 | 4`` in the diagram below). | |
1329 | ||
1330 | .. ditaa:: | |
1331 | +---------------+ | |
1332 | | Client Data | | |
1333 | | Format | | |
1334 | | cCCC | | |
1335 | +---------------+ | |
1336 | | | |
1337 | +-----------------+--------+--------+-----------------+ | |
1338 | | | | | +--\ | |
1339 | v v v v | | |
1340 | /-----------\ /-----------\ /-----------\ /-----------\ | | |
1341 | | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | | |
1342 | | Object 0 | | Object 1 | | Object 2 | | Object 3 | | | |
1343 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1344 | | stripe | | stripe | | stripe | | stripe | | | |
1345 | | unit 0 | | unit 1 | | unit 2 | | unit 3 | | | |
1346 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1347 | | stripe | | stripe | | stripe | | stripe | +-\ | |
1348 | | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object | |
1349 | +-----------+ +-----------+ +-----------+ +-----------+ +- Set | |
1350 | | stripe | | stripe | | stripe | | stripe | | 1 | |
1351 | | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/ | |
1352 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1353 | | stripe | | stripe | | stripe | | stripe | | | |
1354 | | unit 12 | | unit 13 | | unit 14 | | unit 15 | | | |
1355 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1356 | | End cCCC | | End cCCC | | End cCCC | | End cCCC | | | |
1357 | | Object 0 | | Object 1 | | Object 2 | | Object 3 | | | |
1358 | \-----------/ \-----------/ \-----------/ \-----------/ | | |
1359 | | | |
1360 | +--/ | |
1361 | ||
1362 | +--\ | |
1363 | | | |
1364 | /-----------\ /-----------\ /-----------\ /-----------\ | | |
1365 | | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | | |
1366 | | Object 4 | | Object 5 | | Object 6 | | Object 7 | | | |
1367 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1368 | | stripe | | stripe | | stripe | | stripe | | | |
1369 | | unit 16 | | unit 17 | | unit 18 | | unit 19 | | | |
1370 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1371 | | stripe | | stripe | | stripe | | stripe | +-\ | |
1372 | | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object | |
1373 | +-----------+ +-----------+ +-----------+ +-----------+ +- Set | |
1374 | | stripe | | stripe | | stripe | | stripe | | 2 | |
1375 | | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/ | |
1376 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1377 | | stripe | | stripe | | stripe | | stripe | | | |
1378 | | unit 28 | | unit 29 | | unit 30 | | unit 31 | | | |
1379 | +-----------+ +-----------+ +-----------+ +-----------+ | | |
1380 | | End cCCC | | End cCCC | | End cCCC | | End cCCC | | | |
1381 | | Object 4 | | Object 5 | | Object 6 | | Object 7 | | | |
1382 | \-----------/ \-----------/ \-----------/ \-----------/ | | |
1383 | | | |
1384 | +--/ | |
1385 | ||
1386 | Three important variables determine how Ceph stripes data: | |
1387 | ||
1388 | - **Object Size:** Objects in the Ceph Storage Cluster have a maximum | |
1389 | configurable size (e.g., 2 MB or 4 MB). The object size should be large | |
1390 | enough to accommodate many stripe units, and should be a multiple of | |
1391 | the stripe unit. | |
1392 | ||
1393 | - **Stripe Width:** Stripes have a configurable unit size (e.g., 64 KB). | |
1394 | The Ceph Client divides the data it will write to objects into equally | |
1395 | sized stripe units, except for the last stripe unit. A stripe width | |
1396 | should be a fraction of the Object Size so that an object may contain | |
1397 | many stripe units. | |
1398 | ||
1399 | - **Stripe Count:** The Ceph Client writes a sequence of stripe units | |
1400 | over a series of objects determined by the stripe count. The series | |
1401 | of objects is called an object set. After the Ceph Client writes to | |
1402 | the last object in the object set, it returns to the first object in | |
1403 | the object set. | |
1404 | ||
1405 | .. important:: Test the performance of your striping configuration before | |
1406 | putting your cluster into production. You CANNOT change these striping | |
1407 | parameters after you stripe the data and write it to objects. | |
1408 | ||
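To make the interaction of these three variables concrete, the following sketch (illustrative Python, not code from Ceph) maps a logical byte offset in the client's data to an object set, an object, and an offset within that object, following the description above; the helper name ``map_stripe`` and the example values are assumptions.

.. code-block:: python

   # Illustrative model of the striping layout described above -- not Ceph code.
   # object_size must be a multiple of stripe_unit (both in bytes).
   def map_stripe(offset, stripe_unit, stripe_count, object_size):
       """Map a logical byte offset to (object set, object number, offset in object)."""
       stripes_per_object = object_size // stripe_unit

       block = offset // stripe_unit          # which stripe unit overall
       column = block % stripe_count          # which object within the object set
       row = block // stripe_count            # which row of stripe units

       object_set = row // stripes_per_object
       row_in_object = row % stripes_per_object

       object_number = object_set * stripe_count + column
       offset_in_object = row_in_object * stripe_unit + offset % stripe_unit
       return object_set, object_number, offset_in_object

   su = 64 * 1024                               # 64 KB stripe unit
   # With a stripe count of 4 and four stripe units per object (as in the
   # diagram above), stripe unit 16 lands at the start of object 4, the first
   # object of the second object set (indices are zero-based here).
   print(map_stripe(16 * su, su, 4, 4 * su))    # -> (1, 4, 0)
   # A stripe count of 1 gives the simplest form: fill object 0, then object 1.
   print(map_stripe(5 * su, su, 1, 4 * su))     # -> (1, 1, 65536)
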
1409 | Once the Ceph Client has striped data to stripe units and mapped the stripe | |
1410 | units to objects, Ceph's CRUSH algorithm maps the objects to placement groups, | |
1411 | and the placement groups to Ceph OSD Daemons before the objects are stored as | |
1412 | files on a storage disk. | |
1413 | ||
1414 | .. note:: Since a client writes to a single pool, all data striped into objects | |
1415 | gets mapped to placement groups in the same pool, so it uses the same CRUSH | |
1416 | map and the same access controls. | |
1417 | ||
1418 | ||
1419 | .. index:: architecture; Ceph Clients | |
1420 | ||
1421 | Ceph Clients | |
1422 | ============ | |
1423 | ||
1424 | Ceph Clients include a number of service interfaces: | |
1425 | ||
1426 | - **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service | |
1427 | provides resizable, thin-provisioned block devices with snapshotting and | |
1428 | cloning. Ceph stripes a block device across the cluster for high | |
1429 | performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor | |
1430 | that uses ``librbd`` directly--avoiding the kernel object overhead for | |
1431 | virtualized systems. | |
1432 | ||
1433 | - **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service | |
1434 | provides RESTful APIs with interfaces that are compatible with Amazon S3 | |
1435 | and OpenStack Swift. | |
1436 | ||
1437 | - **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides | |
1438 | a POSIX compliant filesystem usable with ``mount`` or as | |
1439 | a filesystem in user space (FUSE). | |
1440 | ||
1441 | Ceph can run additional instances of OSDs, MDSs, and monitors for scalability | |
1442 | and high availability. The following diagram depicts the high-level | |
1443 | architecture. | |
1444 | ||
1445 | .. ditaa:: | |
1446 | +--------------+ +----------------+ +-------------+ | |
1447 | | Block Device | | Object Storage | | Ceph FS | | |
1448 | +--------------+ +----------------+ +-------------+ | |
1449 | ||
1450 | +--------------+ +----------------+ +-------------+ | |
1451 | | librbd | | librgw | | libcephfs | | |
1452 | +--------------+ +----------------+ +-------------+ | |
1453 | ||
1454 | +---------------------------------------------------+ | |
1455 | | Ceph Storage Cluster Protocol (librados) | | |
1456 | +---------------------------------------------------+ | |
1457 | ||
1458 | +---------------+ +---------------+ +---------------+ | |
1459 | | OSDs | | MDSs | | Monitors | | |
1460 | +---------------+ +---------------+ +---------------+ | |
1461 | ||
1462 | ||
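An application can also talk to the Ceph Storage Cluster Protocol layer directly through the ``librados`` Python binding. The following is a minimal sketch; the pool name ``data``, the object name, and the ``ceph.conf`` path are assumptions for illustration:

.. code-block:: python

   # Minimal write/read through the librados Python binding (python-rados).
   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()

   ioctx = cluster.open_ioctx('data')            # I/O context for a pool
   try:
       ioctx.write_full('hello-object', b'hello from librados')
       print(ioctx.read('hello-object'))         # b'hello from librados'
   finally:
       ioctx.close()
       cluster.shutdown()

Note that, as mentioned in the striping section, an application writing directly through ``librados`` like this must perform its own striping and parallel I/O to obtain those benefits.
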
1463 | .. index:: architecture; Ceph Object Storage | |
1464 | ||
1465 | Ceph Object Storage | |
1466 | ------------------- | |
1467 | ||
1468 | The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides | |
1469 | a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph | |
1470 | Storage Cluster with its own data formats, and maintains its own user database, | |
1471 | authentication, and access control. The RADOS Gateway uses a unified namespace, | |
1472 | which means you can use either the OpenStack Swift-compatible API or the Amazon | |
1473 | S3-compatible API. For example, you can write data using the S3-compatible API | |
1474 | with one application and then read data using the Swift-compatible API with | |
1475 | another application. | |
1476 | ||
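As a sketch of that interoperability, one application might write an object through the S3-compatible API with a standard S3 client such as ``boto3``; the endpoint URL, credentials, and bucket name below are placeholders:

.. code-block:: python

   # Writing through the RGW S3-compatible API with boto3.
   # Endpoint, access keys, and bucket name are placeholders.
   import boto3

   s3 = boto3.client(
       's3',
       endpoint_url='http://rgw.example.com:8080',
       aws_access_key_id='ACCESS_KEY',
       aws_secret_access_key='SECRET_KEY',
   )

   s3.create_bucket(Bucket='demo-bucket')
   s3.put_object(Bucket='demo-bucket', Key='hello.txt',
                 Body=b'written via the S3-compatible API')

Another application could then read ``hello.txt`` back through the Swift-compatible API, since both APIs address the same unified namespace.
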
1477 | .. topic:: S3/Swift Objects and Storage Cluster Objects Compared | |
1478 | ||
1479 | Ceph's Object Storage uses the term *object* to describe the data it stores. | |
1480 | S3 and Swift objects are not the same as the objects that Ceph writes to the | |
1481 | Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage | |
1482 | Cluster objects. The S3 and Swift objects do not necessarily | |
1483 | correspond in a 1:1 manner with an object stored in the storage cluster. It | |
1484 | is possible for an S3 or Swift object to map to multiple Ceph objects. | |
1485 | ||
1486 | See `Ceph Object Storage`_ for details. | |
1487 | ||
1488 | ||
1489 | .. index:: Ceph Block Device; block device; RBD; Rados Block Device | |
1490 | ||
1491 | Ceph Block Device | |
1492 | ----------------- | |
1493 | ||
1494 | A Ceph Block Device stripes a block device image over multiple objects in the | |
1495 | Ceph Storage Cluster, where each object gets mapped to a placement group and | |
1496 | distributed, and the placement groups are spread across separate ``ceph-osd`` | |
1497 | daemons throughout the cluster. | |
1498 | ||
1499 | .. important:: Striping allows RBD block devices to perform better than a single | |
1500 | server could! | |
1501 | ||
1502 | Thin-provisioned snapshottable Ceph Block Devices are an attractive option for | |
1503 | virtualization and cloud computing. In virtual machine scenarios, people | |
1504 | typically deploy a Ceph Block Device with the ``rbd`` network storage driver in | |
1505 | QEMU/KVM, where the host machine uses ``librbd`` to provide a block device | |
1506 | service to the guest. Many cloud computing stacks use ``libvirt`` to integrate | |
1507 | with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and | |
1508 | ``libvirt`` to support OpenStack and CloudStack among other solutions. | |
1509 | ||
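For example, the ``rbd`` Python binding (used together with ``rados``) can create and write to a thin-provisioned image; the pool name ``rbd``, the image name, and the image size below are illustrative choices:

.. code-block:: python

   # Create and write a thin-provisioned RBD image with the Python bindings.
   import rados
   import rbd

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   ioctx = cluster.open_ioctx('rbd')

   try:
       rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)   # 4 GiB, allocated lazily
       with rbd.Image(ioctx, 'demo-image') as image:
           image.write(b'first block of guest data', 0)     # striped over RADOS objects
   finally:
       ioctx.close()
       cluster.shutdown()
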
1510 | While we do not provide ``librbd`` support with other hypervisors at this time, | |
1511 | you may also use Ceph Block Device kernel objects to provide a block device to a | |
1512 | client. Other virtualization technologies such as Xen can access the Ceph Block | |
1513 | Device kernel object(s). This is done with the command-line tool ``rbd``. | |
1514 | ||
1515 | ||
1516 | .. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds | |
1517 | ||
1518 | Ceph Filesystem | |
1519 | --------------- | |
1520 | ||
1521 | The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a | |
1522 | service that is layered on top of the object-based Ceph Storage Cluster. | |
1523 | Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage | |
1524 | Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as | |
1525 | a Filesystem in User Space (FUSE). | |
1526 | ||
1527 | .. ditaa:: | |
1528 | +-----------------------+ +------------------------+ | |
1529 | | CephFS Kernel Object | | CephFS FUSE | | |
1530 | +-----------------------+ +------------------------+ | |
1531 | ||
1532 | +---------------------------------------------------+ | |
1533 | | Ceph FS Library (libcephfs) | | |
1534 | +---------------------------------------------------+ | |
1535 | ||
1536 | +---------------------------------------------------+ | |
1537 | | Ceph Storage Cluster Protocol (librados) | | |
1538 | +---------------------------------------------------+ | |
1539 | ||
1540 | +---------------+ +---------------+ +---------------+ | |
1541 | | OSDs | | MDSs | | Monitors | | |
1542 | +---------------+ +---------------+ +---------------+ | |
1543 | ||
1544 | ||
1545 | The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed | |
1546 | with the Ceph Storage Cluster. The purpose of the MDS is to store all the | |
1547 | filesystem metadata (directories, file ownership, access modes, etc.) in | |
1548 | high-availability Ceph Metadata Servers where the metadata resides in memory. | |
1549 | The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem | |
1550 | operations like listing a directory or changing a directory (``ls``, ``cd``) | |
1551 | would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from | |
1552 | the data means that the Ceph Filesystem can provide high performance services | |
1553 | without taxing the Ceph Storage Cluster. | |
1554 | ||
1555 | Ceph FS separates the metadata from the data, storing the metadata in the MDS, | |
1556 | and storing the file data in one or more objects in the Ceph Storage Cluster. | |
1557 | The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a | |
1558 | single process, or it can be distributed out to multiple physical machines, | |
1559 | either for high availability or for scalability. | |
1560 | ||
1561 | - **High Availability**: The extra ``ceph-mds`` instances can be `standby`, | |
1562 | ready to take over the duties of any failed ``ceph-mds`` that was | |
1563 | `active`. This is easy because all the data, including the journal, is | |
1564 | stored on RADOS. The transition is triggered automatically by ``ceph-mon``. | |
1565 | ||
1566 | - **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they | |
1567 | will split the directory tree into subtrees (and shards of a single | |
1568 | busy directory), effectively balancing the load amongst all `active` | |
1569 | servers. | |
1570 | ||
1571 | Combinations of `standby` and `active` instances are possible, for example | |
1572 | running 3 `active` ``ceph-mds`` instances for scaling and one `standby` | |
1573 | instance for high availability. | |
1574 | ||
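Because CephFS is POSIX-compliant, ordinary file APIs work against a mounted CephFS tree. In the sketch below the mount point ``/mnt/cephfs`` is an assumption; the metadata operations (creating and listing directories) are served by the MDS, while the file data is stored as objects in the Ceph Storage Cluster:

.. code-block:: python

   # Ordinary POSIX file operations on a CephFS mount (kernel client or ceph-fuse).
   # The mount point /mnt/cephfs is an assumption for illustration.
   import os

   base = '/mnt/cephfs/projects/demo'
   os.makedirs(base, exist_ok=True)                       # metadata operation -> MDS

   with open(os.path.join(base, 'notes.txt'), 'w') as f:  # file data -> RADOS objects
       f.write('CephFS stores this data in the Ceph Storage Cluster\n')

   print(os.listdir(base))                                # directory listing -> MDS
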
1575 | ||
1576 | ||
1577 | ||
1578 | .. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf | |
1579 | .. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) | |
1580 | .. _Monitor Config Reference: ../rados/configuration/mon-config-ref | |
1581 | .. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg | |
1582 | .. _Heartbeats: ../rados/configuration/mon-osd-interaction | |
1583 | .. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds | |
1584 | .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf | |
1585 | .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing |
1586 | .. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure | |
1587 | .. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure | |
1588 | .. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/ | |
1589 | .. _Hardware Recommendations: ../start/hardware-recommendations | |
1590 | .. _Network Config Reference: ../rados/configuration/network-config-ref | |
1592 | .. _striping: http://en.wikipedia.org/wiki/Data_striping | |
1593 | .. _RAID: http://en.wikipedia.org/wiki/RAID | |
1594 | .. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0 | |
1595 | .. _Ceph Object Storage: ../radosgw/ | |
1596 | .. _RESTful: http://en.wikipedia.org/wiki/RESTful | |
1597 | .. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst | |
1598 | .. _Cache Tiering: ../rados/operations/cache-tiering | |
1599 | .. _Set Pool Values: ../rados/operations/pools#set-pool-values | |
1600 | .. _Kerberos: http://en.wikipedia.org/wiki/Kerberos_(protocol) | |
1601 | .. _Cephx Config Guide: ../rados/configuration/auth-config-ref | |
1602 | .. _User Management: ../rados/operations/user-management |