==============
 Architecture
==============

:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability--thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically.

.. image:: images/stack.png

.. _arch-ceph-storage-cluster:

The Ceph Storage Cluster
========================

Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters`_.

A Ceph Storage Cluster consists of multiple types of daemons:

- :term:`Ceph Monitor`
- :term:`Ceph OSD Daemon`
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`

.. _arch_monitor:

Ceph Monitors maintain the master copy of the cluster map, which they provide
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
availability in the event that one of the monitor daemons or its host fails.
The Ceph monitor provides copies of the cluster map to storage cluster clients.

A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.

A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.

A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.

Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to compute information about data location. This means that clients and OSDs
are not bottlenecked by a central lookup table. Ceph's high-level features
include a native interface to the Ceph Storage Cluster via ``librados``, and a
number of service interfaces built on top of ``librados``.

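For a sense of what that native interface looks like, here is a minimal sketch
using the Python ``librados`` bindings. It assumes a reachable cluster
configured in ``/etc/ceph/ceph.conf`` and an existing pool; the pool and object
names are illustrative.

.. code-block:: python

   import rados

   # Connect to the cluster using the local ceph.conf and default keyring.
   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       # Open an I/O context on an existing pool (illustrative name).
       ioctx = cluster.open_ioctx('mypool')
       try:
           # Write a RADOS object and read it back.
           ioctx.write_full('hello-object', b'hello from librados')
           print(ioctx.read('hello-object'))
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()
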
Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System`, or a custom implementation that you create by using
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
objects. Each object is stored on an :term:`Object Storage Device` (this is
also called an "OSD"). Ceph OSDs control read, write, and replication
operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.

67.. ditaa::
68
69 /------\ +-----+ +-----+
70 | obj |------>| {d} |------>| {s} |
71 \------/ +-----+ +-----+
7c673cae 72
f67539c2 73 Object OSD Drive
7c673cae 74

Ceph OSD Daemons store data as objects in a flat namespace. This means that
objects are not stored in a hierarchy of directories. An object has an
identifier, binary data, and metadata consisting of name/value pairs.
:term:`Ceph Client`\s determine the semantics of the object data. For example,
CephFS uses metadata to store file attributes such as the file owner, the
created date, and the last modified date.

83.. ditaa::
84
85 /------+------------------------------+----------------\
86 | ID | Binary Data | Metadata |
87 +------+------------------------------+----------------+
88 | 1234 | 0101010101010100110101010010 | name1 = value1 |
89 | | 0101100001010100110101010010 | name2 = value2 |
90 | | 0101100001010100110101010010 | nameN = valueN |
91 \------+------------------------------+----------------/

.. note:: An object ID is unique across the entire cluster, not just the local
   filesystem.


.. index:: architecture; high availability, scalability

.. _arch_scalability_and_high_availability:

Scalability and High Availability
---------------------------------

In traditional architectures, clients talk to a centralized component. This
centralized component might be a gateway, a broker, an API, or a facade. A
centralized component of this kind acts as a single point of entry to a complex
subsystem. Architectures that rely upon such a centralized component have a
single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.

Ceph eliminates this centralized component. This enables clients to interact
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
to ensure data safety and high availability. Ceph also uses a cluster of
monitors to ensure high availability. To eliminate centralization, Ceph uses an
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.


.. index:: CRUSH; architecture

CRUSH Introduction
~~~~~~~~~~~~~~~~~~

Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to compute information about
object location instead of relying upon a central lookup table. CRUSH provides
a better data management mechanism than do older approaches, and CRUSH enables
massive scale by distributing the work to all the OSD daemons in the cluster
and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale
storage. The following sections provide additional details on how CRUSH works.
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Decentralized Placement of Replicated Data`_.

.. index:: architecture; cluster map

.. _architecture_cluster_map:

Cluster Map
~~~~~~~~~~~

In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
must have current information about the cluster's topology. Current information
is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:

#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
   the address, and the TCP port of each monitor. The monitor map specifies the
   current epoch, the time of the monitor map's creation, and the time of the
   monitor map's last modification. To view a monitor map, run ``ceph mon
   dump``.

#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
   creation, the time of the OSD map's last modification, a list of pools, a
   list of replica sizes, a list of PG numbers, and a list of OSDs and their
   statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
   osd dump``.

#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
   epoch, the full ratios, and the details of each placement group. This
   includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
   example, ``active + clean``), and data usage statistics for each pool.

#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
   hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
   and rules for traversing the hierarchy when storing data. To view a CRUSH
   map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
   running ``crushtool -d {comp-crushmap-filename} -o
   {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
   decompiled map.

#. **The MDS Map:** Contains the current MDS map epoch, when the map was
   created, and the last time it changed. It also contains the pool for
   storing metadata, a list of metadata servers, and which metadata servers
   are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.

Each map maintains a history of changes to its operating state. Ceph Monitors
maintain a master copy of the cluster map. This master copy includes the
cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
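
As a hedged illustration of inspecting these maps programmatically, the sketch
below asks the monitors for the monitor map and the OSD map in JSON form
through the Python ``librados`` bindings (``mon_command``). The configuration
path is an assumption; the equivalent CLI commands are the ``ceph mon dump``
and ``ceph osd dump`` invocations mentioned above.

.. code-block:: python

   import json
   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       # Ask the monitors for the monitor map and the OSD map as JSON.
       for prefix in ('mon dump', 'osd dump'):
           cmd = json.dumps({'prefix': prefix, 'format': 'json'})
           ret, outbuf, errs = cluster.mon_command(cmd, b'')
           if ret == 0:
               current_map = json.loads(outbuf)
               print(prefix, '-> epoch', current_map.get('epoch'))
   finally:
       cluster.shutdown()
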

.. index:: high availability; monitor architecture

High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~

A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
cluster map in order to read data from or to write data to the Ceph cluster.

It is possible for a Ceph cluster to function properly with only a single
monitor, but a Ceph cluster that has only a single monitor has a single point
of failure: if the monitor goes down, Ceph clients will be unable to read data
from or write data to the cluster.

Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).

See the `Monitor Config Reference`_ for more detail on configuring monitors.
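
The majority rule above is simple arithmetic; the snippet below just makes it
explicit for the cluster sizes used in the example.

.. code-block:: python

   # Paxos needs a strict majority of monitors to form quorum.
   def monitors_required_for_quorum(total_monitors: int) -> int:
       """Smallest majority of a monitor cluster of the given size."""
       return total_monitors // 2 + 1

   for n in (1, 3, 5, 6):
       print(n, "monitors ->", monitors_required_for_quorum(n), "needed for quorum")
   # 1 -> 1, 3 -> 2, 5 -> 3, 6 -> 4, matching the examples above.
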

.. index:: architecture; high availability authentication

.. _arch_high_availability_authentication:

High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``cephx`` authentication system is used by Ceph to authenticate users and
daemons and to protect against man-in-the-middle attacks.

.. note:: The ``cephx`` protocol does not address data encryption in transport
   (for example, SSL/TLS) or encryption at rest.

``cephx`` uses shared secret keys for authentication. This means that both the
client and the monitor cluster keep a copy of the client's secret key.

The ``cephx`` protocol makes it possible for each party to prove to the other
that it has a copy of the key without revealing it. This provides mutual
authentication and allows the cluster to confirm (1) that the user has the
secret key and (2) that the user can be confident that the cluster has a copy
of the secret key.

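The following toy sketch illustrates the general idea of proving possession of
a shared secret without transmitting it. It is **not** the cephx wire protocol;
the challenge/HMAC construction here is only an analogy.

.. code-block:: python

   import hashlib
   import hmac
   import os

   # Toy illustration only: each side answers a random challenge with an HMAC
   # keyed by the shared secret, so the secret itself is never sent.
   secret = os.urandom(32)          # provisioned out of band on both sides

   def respond(challenge: bytes, key: bytes) -> bytes:
       return hmac.new(key, challenge, hashlib.sha256).digest()

   # The "monitor" challenges the "client".
   challenge = os.urandom(16)
   client_answer = respond(challenge, secret)
   assert hmac.compare_digest(client_answer, respond(challenge, secret))

   # The client can challenge the cluster the same way, giving mutual proof.
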
As stated in :ref:`Scalability and High Availability
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
interface between clients and the Ceph object store. By avoiding such a
centralized interface, Ceph avoids the bottlenecks that attend such centralized
interfaces. However, this means that clients must interact directly with OSDs.
Direct interactions between Ceph clients and OSDs require authenticated
connections. The ``cephx`` authentication system establishes and sustains these
authenticated connections.

The ``cephx`` protocol operates in a manner similar to `Kerberos`_.

A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.

Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.

An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.

.. note:: The ``client.admin`` user must provide the user ID and
   secret key to the user in a secure manner.

270.. ditaa::
271
272 +---------+ +---------+
273 | Client | | Monitor |
274 +---------+ +---------+
275 | request to |
276 | create a user |
277 |-------------->|----------+ create user
278 | | | and
279 |<--------------|<---------+ store key
280 | transmit key |
281 | |
282

Here is how a client authenticates with a monitor. The client passes the user
name to the monitor. The monitor generates a session key that is encrypted with
the secret key associated with the ``username``. The monitor transmits the
encrypted ticket to the client. The client uses the shared secret key to
decrypt the payload. The session key identifies the user, and this act of
identification will last for the duration of the session. The client requests
a ticket for the user, and the ticket is signed with the session key. The
monitor generates a ticket and uses the user's secret key to encrypt it. The
encrypted ticket is transmitted to the client. The client decrypts the ticket
and uses it to sign requests to OSDs and to metadata servers in the cluster.

294.. ditaa::
295
296 +---------+ +---------+
297 | Client | | Monitor |
298 +---------+ +---------+
299 | authenticate |
300 |-------------->|----------+ generate and
301 | | | encrypt
302 |<--------------|<---------+ session key
303 | transmit |
304 | encrypted |
305 | session key |
306 | |
307 |-----+ decrypt |
308 | | session |
309 |<----+ key |
310 | |
311 | req. ticket |
312 |-------------->|----------+ generate and
313 | | | encrypt
314 |<--------------|<---------+ ticket
315 | recv. ticket |
316 | |
317 |-----+ decrypt |
318 | | ticket |
319 |<----+ |
320
321

The ``cephx`` protocol authenticates ongoing communications between the clients
and Ceph daemons. After initial authentication, each message sent between a
client and a daemon is signed using a ticket that can be verified by monitors,
OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.

328.. ditaa::
329
330 +---------+ +---------+ +-------+ +-------+
331 | Client | | Monitor | | MDS | | OSD |
332 +---------+ +---------+ +-------+ +-------+
333 | request to | | |
334 | create a user | | |
335 |-------------->| mon and | |
336 |<--------------| client share | |
337 | receive | a secret. | |
338 | shared secret | | |
339 | |<------------>| |
340 | |<-------------+------------>|
341 | | mon, mds, | |
342 | authenticate | and osd | |
343 |-------------->| share | |
344 |<--------------| a secret | |
345 | session key | | |
346 | | | |
347 | req. ticket | | |
348 |-------------->| | |
349 |<--------------| | |
350 | recv. ticket | | |
351 | | | |
352 | make request (CephFS only) | |
353 |----------------------------->| |
354 |<-----------------------------| |
355 | receive response (CephFS only) |
356 | |
357 | make request |
358 |------------------------------------------->|
359 |<-------------------------------------------|
360 receive response
361

This authentication protects only the connections between Ceph clients and Ceph
daemons. The authentication is not extended beyond the Ceph client. If a user
accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.

See `Cephx Config Guide`_ for more on configuration details.

See `User Management`_ for more on user management.

See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.

.. index:: architecture; smart daemons and scalability

Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.

Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.

Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
easily perform tasks that a cluster with a centralized interface would struggle
to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:

#. **OSDs Service Clients Directly:** Network devices can support only a
   limited number of concurrent connections. Because Ceph clients contact
   Ceph OSD daemons directly without first connecting to a central interface,
   Ceph enjoys improved performance and increased system capacity relative to
   storage redundancy strategies that include a central interface. Ceph clients
   maintain sessions only when needed, and maintain those sessions with only
   particular Ceph OSD daemons, not with a centralized interface.

#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
   report their status. At the lowest level, the Ceph OSD Daemon status is
   ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
   able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
   ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
   Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
   the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
   OSDs periodically send messages to the Ceph Monitor (in releases prior to
   Luminous, this was done by means of ``MPGStats``, and beginning with the
   Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
   Monitors receive no such message after a configurable period of time,
   then they mark the OSD ``down``. This mechanism is a failsafe, however.
   Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
   report it to the Ceph Monitors. This contributes to making Ceph Monitors
   lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
   additional details.

#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
   RADOS objects. Ceph OSD Daemons compare the metadata of their own local
   objects against the metadata of the replicas of those objects, which are
   stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
   mismatches in object size and finds metadata mismatches, and is usually
   performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
   data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
   bad sectors on drives that are not detectable with light scrubs. See `Data
   Scrubbing`_ for details on configuring scrubbing.

#. **Replication:** Data replication involves a collaboration between Ceph
   Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
   determine the storage location of object replicas. Ceph clients use the
   CRUSH algorithm to determine the storage location of an object, then the
   object is mapped to a pool and to a placement group, and then the client
   consults the CRUSH map to identify the placement group's primary OSD.

   After identifying the target placement group, the client writes the object
   to the identified placement group's primary OSD. The primary OSD then
   consults its own copy of the CRUSH map to identify secondary and tertiary
   OSDs, replicates the object to the placement groups in those secondary and
   tertiary OSDs, confirms that the object was stored successfully in the
   secondary and tertiary OSDs, and reports to the client that the object
   was stored successfully.

450.. ditaa::
451
452 +----------+
453 | Client |
454 | |
455 +----------+
456 * ^
457 Write (1) | | Ack (6)
458 | |
459 v *
460 +-------------+
461 | Primary OSD |
462 | |
463 +-------------+
464 * ^ ^ *
465 Write (2) | | | | Write (3)
466 +------+ | | +------+
467 | +------+ +------+ |
468 | | Ack (4) Ack (5)| |
469 v * * v
470 +---------------+ +---------------+
471 | Secondary OSD | | Tertiary OSD |
472 | | | |
473 +---------------+ +---------------+
474

By performing this act of data replication, Ceph OSD Daemons relieve Ceph
clients of the burden of replicating data.

Dynamic Cluster Management
--------------------------

In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place and rebalance data
adaptively and to recover from faults.

.. index:: architecture; pools

About Pools
~~~~~~~~~~~

The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.

Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
objects to pools. The way that Ceph places the data in the pools is determined
by the pool's ``size`` (its number of replicas), the CRUSH rule, and the number
of placement groups in the pool.

501.. ditaa::
502
503 +--------+ Retrieves +---------------+
504 | Client |------------>| Cluster Map |
505 +--------+ +---------------+
506 |
507 v Writes
508 /-----\
509 | obj |
510 \-----/
511 | To
512 v
513 +--------+ +---------------+
b32b8144 514 | Pool |---------->| CRUSH Rule |
7c673cae
FG
515 +--------+ Selects +---------------+
516
517
Pools set at least the following parameters:

- Ownership/Access to Objects
- The Number of Placement Groups, and
- The CRUSH Rule to Use.

See `Set Pool Values`_ for details.
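
As a sketch of these parameters in practice, the snippet below creates a pool
through the Python ``librados`` bindings and then asks the monitors to set its
replica count. The pool name and replica value are illustrative, and the
``osd pool set`` JSON arguments are an assumption based on the usual monitor
command format.

.. code-block:: python

   import json
   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       if not cluster.pool_exists('mypool'):      # illustrative pool name
           cluster.create_pool('mypool')

       # Set the pool's replica count ("size") to 3 via the monitors.
       cmd = json.dumps({'prefix': 'osd pool set',
                         'pool': 'mypool', 'var': 'size', 'val': '3'})
       cluster.mon_command(cmd, b'')
   finally:
       cluster.shutdown()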


.. index:: architecture; placement group mapping

Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~

Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
object to a PG.

This mapping of RADOS objects to PGs implements an abstraction and indirection
layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
be able to grow (or shrink) and redistribute data adaptively when the internal
topology changes.

If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
RADOS object to a placement group and then maps each placement group to one or
more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.

550.. ditaa::
551
552 /-----\ /-----\ /-----\ /-----\ /-----\
553 | obj | | obj | | obj | | obj | | obj |
554 \-----/ \-----/ \-----/ \-----/ \-----/
555 | | | | |
556 +--------+--------+ +---+----+
557 | |
558 v v
559 +-----------------------+ +-----------------------+
560 | Placement Group #1 | | Placement Group #2 |
561 | | | |
562 +-----------------------+ +-----------------------+
563 | |
564 | +-----------------------+---+
565 +------+------+-------------+ |
566 | | | |
567 v v v v
568 /----------\ /----------\ /----------\ /----------\
569 | | | | | | | |
570 | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
571 | | | | | | | |
572 \----------/ \----------/ \----------/ \----------/
573

The client uses its copy of the cluster map and the CRUSH algorithm to compute
precisely which OSD it will use when reading or writing a particular object.

.. index:: architecture; calculating PG IDs

Calculating PG IDs
~~~~~~~~~~~~~~~~~~

When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
the `Cluster Map`_. When a client has been equipped with a copy of the cluster
map, it is aware of all the monitors, OSDs, and metadata servers in the
cluster. **However, even equipped with a copy of the latest version of the
cluster map, the client doesn't know anything about object locations.**

**Object locations must be computed.**

The client requires only the object ID and the name of the pool in order to
compute the object location.

Ceph stores data in named pools (for example, "liverpool"). When a client
stores a named object (for example, "john", "paul", "george", or "ringo") it
calculates a placement group by using the object name, a hash code, the number
of PGs in the pool, and the pool name. Ceph clients use the following steps to
compute PG IDs.

#. The client inputs the pool name and the object ID. (for example: pool =
   "liverpool" and object-id = "john")
#. Ceph hashes the object ID.
#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
   get a PG ID.
#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
   ``4``)
#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).

It is much faster to compute object locations than to perform an object
location query over a chatty session. The :abbr:`CRUSH (Controlled Replication
Under Scalable Hashing)` algorithm allows a client to compute where objects are
expected to be stored, and enables the client to contact the primary OSD to
store or retrieve the objects.
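
The sketch below mirrors those steps. It substitutes CRC32 and a plain modulo
for Ceph's actual rjenkins hash and stable-mod placement, so the numbers it
produces are illustrative only.

.. code-block:: python

   import zlib

   def compute_pg_id(pool_id: int, object_id: str, pg_num: int) -> str:
       """Toy version of the PG calculation described above.

       Ceph hashes the object name with rjenkins and applies a "stable mod";
       CRC32 and a plain modulo are stand-ins here. Real PG IDs are
       conventionally displayed in hexadecimal.
       """
       obj_hash = zlib.crc32(object_id.encode('utf-8'))
       pg = obj_hash % pg_num                  # step 3: hash modulo pg_num
       return f"{pool_id}.{pg}"                # step 5: pool ID prepended

   # Pool "liverpool" assumed to have pool ID 4 and 58 PGs, as in the example.
   print(compute_pg_id(4, "john", 58))
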

.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.

.. Note:: PGs that agree on the state of the cluster do not necessarily have
   the current data yet.
7c673cae
FG
630
631The Ceph Storage Cluster was designed to store at least two copies of an object
aee94f69
TL
632(that is, ``size = 2``), which is the minimum requirement for data safety. For
633high availability, a Ceph Storage Cluster should store more than two copies of
634an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
635to run in a ``degraded`` state while maintaining data safety.
636
637.. warning:: Although we say here that R2 (replication with two copies) is the
638 minimum requirement for data safety, R3 (replication with three copies) is
639 recommended. On a long enough timeline, data stored with an R2 strategy will
640 be lost.
641
642As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
643name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
644etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
645convention, the *Primary* is the first OSD in the *Acting Set*, and is
646responsible for orchestrating the peering process for each placement group
647where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
648placement group that accepts client-initiated writes to objects.
649
650The set of OSDs that is responsible for a placement group is called the
651*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
652that are currently responsible for the placement group, or to the Ceph OSD
653Daemons that were responsible for a particular placement group as of some
7c673cae
FG
654epoch.
655
aee94f69
TL
656The Ceph OSD daemons that are part of an *Acting Set* might not always be
657``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
658The *Up Set* is an important distinction, because Ceph can remap PGs to other
659Ceph OSD Daemons when an OSD fails.
7c673cae 660
aee94f69
TL
661.. note:: Consider a hypothetical *Acting Set* for a PG that contains
662 ``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``), is the
663 *Primary*. If that OSD fails, the Secondary (``osd.32``), becomes the
664 *Primary*, and ``osd.25`` is removed from the *Up Set*.
7c673cae
FG
665
.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all, of the PGs migrate from existing OSDs
(OSD 1 and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. Many of the placement groups remain in their original configuration,
and each OSD gets some added capacity, so there are no load spikes on the
new OSD after rebalancing is complete.


683.. ditaa::
684
685 +--------+ +--------+
686 Before | OSD 1 | | OSD 2 |
687 +--------+ +--------+
688 | PG #1 | | PG #6 |
689 | PG #2 | | PG #7 |
690 | PG #3 | | PG #8 |
691 | PG #4 | | PG #9 |
692 | PG #5 | | PG #10 |
693 +--------+ +--------+
694
695 +--------+ +--------+ +--------+
696 After | OSD 1 | | OSD 2 | | OSD 3 |
697 +--------+ +--------+ +--------+
698 | PG #1 | | PG #7 | | PG #3 |
699 | PG #2 | | PG #8 | | PG #6 |
700 | PG #4 | | PG #10 | | PG #9 |
701 | PG #5 | | | | |
702 | | | | | |
703 +--------+ +--------+ +--------+
704
705
.. index:: architecture; Data Scrubbing

Data Consistency
~~~~~~~~~~~~~~~~

As part of maintaining data consistency and cleanliness, Ceph OSDs also scrub
objects within placement groups. That is, Ceph OSDs compare object metadata in
one placement group with its replicas in placement groups stored in other
OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
errors, often as a result of hardware issues. OSDs also perform deeper
scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (by default
performed weekly) finds bad blocks on a drive that weren't apparent in a light
scrub.

See `Data Scrubbing`_ for details on configuring scrubbing.


.. index:: erasure coding

Erasure Coding
--------------

An erasure coded pool stores each object as ``K+M`` chunks. It is divided into
``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size
of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of
the chunk is stored as an attribute of the object.

For instance, an erasure coded pool can be created to use five OSDs
(``K+M = 5``) and sustain the loss of two of them (``M = 2``).

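A toy sketch of the chunking step is shown below. Ceph's erasure code plugins
(jerasure by default) compute ``M`` coding chunks with a real code; this sketch
pads the payload, splits it into ``K`` data chunks, and adds a single XOR
parity chunk purely for illustration.

.. code-block:: python

   from functools import reduce

   def encode_toy(data: bytes, k: int) -> list[bytes]:
       """Split ``data`` into K data chunks plus one XOR parity chunk (M = 1).

       Ceph's real erasure code plugins support larger M; this single-parity
       toy only illustrates the chunking and padding idea.
       """
       chunk_len = -(-len(data) // k)                    # ceil(len / k)
       padded = data.ljust(k * chunk_len, b'\x00')       # pad to a multiple of K
       chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
       parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
       return chunks + [parity]                          # shards, rank 0..k

   # The NYAN example from the next section: ABCDEFGHI -> ABC, DEF, GHI + parity.
   print(encode_toy(b'ABCDEFGHI', 3))
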
Reading and Writing Encoded Chunks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the
erasure encoding function splits the content into three data chunks simply by
dividing the content in three: the first contains ``ABC``, the second ``DEF``
and the last ``GHI``. The content will be padded if the content length is not a
multiple of ``K``. The function also creates two coding chunks: the fourth with
``YXY`` and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting
set. The chunks are stored in objects that have the same name (**NYAN**) but
reside on different OSDs. The order in which the chunks were created must be
preserved and is stored as an attribute of the object (``shard_t``), in addition
to its name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4
contains ``YXY`` and is stored on **OSD3**.

754
755.. ditaa::
f91f0fd5 756
757 +-------------------+
758 name | NYAN |
759 +-------------------+
760 content | ABCDEFGHI |
761 +--------+----------+
762 |
763 |
764 v
765 +------+------+
766 +---------------+ encode(3,2) +-----------+
767 | +--+--+---+---+ |
768 | | | | |
769 | +-------+ | +-----+ |
770 | | | | |
771 +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
772 name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
773 +------+ +------+ +------+ +------+ +------+
774 shard | 1 | | 2 | | 3 | | 4 | | 5 |
775 +------+ +------+ +------+ +------+ +------+
776 content | ABC | | DEF | | GHI | | YXY | | QGC |
777 +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
778 | | | | |
779 | | v | |
780 | | +--+---+ | |
781 | | | OSD1 | | |
782 | | +------+ | |
783 | | | |
784 | | +------+ | |
785 | +------>| OSD2 | | |
786 | +------+ | |
787 | | |
788 | +------+ | |
789 | | OSD3 |<----+ |
790 | +------+ |
791 | |
792 | +------+ |
793 | | OSD4 |<--------------+
794 | +------+
795 |
796 | +------+
797 +----------------->| OSD5 |
798 +------+
799

When the object **NYAN** is read from the erasure coded pool, the decoding
function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
``GHI`` and chunk 4 containing ``YXY``. Then it rebuilds the original content
of the object, ``ABCDEFGHI``. The decoding function is informed that chunks 2
and 5 are missing (they are called 'erasures'). Chunk 5 could not be read
because **OSD4** is out. The decoding function can be called as soon as three
chunks are read: **OSD2** was the slowest and its chunk was not taken into
account.

810.. ditaa::
f91f0fd5 811
812 +-------------------+
813 name | NYAN |
814 +-------------------+
815 content | ABCDEFGHI |
816 +---------+---------+
817 ^
818 |
819 |
820 +-------+-------+
821 | decode(3,2) |
822 +------------->+ erasures 2,5 +<-+
823 | | | |
824 | +-------+-------+ |
825 | ^ |
826 | | |
827 | | |
828 +--+---+ +------+ +---+--+ +---+--+
829 name | NYAN | | NYAN | | NYAN | | NYAN |
830 +------+ +------+ +------+ +------+
831 shard | 1 | | 2 | | 3 | | 4 |
832 +------+ +------+ +------+ +------+
833 content | ABC | | DEF | | GHI | | YXY |
834 +--+---+ +--+---+ +--+---+ +--+---+
835 ^ . ^ ^
836 | TOO . | |
837 | SLOW . +--+---+ |
838 | ^ | OSD1 | |
839 | | +------+ |
840 | | |
841 | | +------+ |
842 | +-------| OSD2 | |
843 | +------+ |
844 | |
845 | +------+ |
846 | | OSD3 |------+
847 | +------+
848 |
849 | +------+
850 | | OSD4 | OUT
851 | +------+
852 |
853 | +------+
854 +------------------| OSD5 |
855 +------+


Interrupted Full Writes
~~~~~~~~~~~~~~~~~~~~~~~

In an erasure coded pool, the primary OSD in the up set receives all write
operations. It is responsible for encoding the payload into ``K+M`` chunks and
sending them to the other OSDs. It is also responsible for maintaining an
authoritative version of the placement group logs.

In the following diagram, an erasure coded placement group has been created with
``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one for
``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and
**OSD 3**. An object has been encoded and stored in the OSDs: the chunk
``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
version 1).

875
876.. ditaa::
f91f0fd5 877
878 Primary OSD
879
880 +-------------+
881 | OSD 1 | +-------------+
882 | log | Write Full | |
883 | +----+ |<------------+ Ceph Client |
884 | |D1v1| 1,1 | v1 | |
885 | +----+ | +-------------+
886 +------+------+
887 |
888 |
889 | +-------------+
890 | | OSD 2 |
891 | | log |
892 +--------->+ +----+ |
893 | | |D2v1| 1,1 |
894 | | +----+ |
895 | +-------------+
896 |
897 | +-------------+
898 | | OSD 3 |
899 | | log |
900 +--------->| +----+ |
901 | |C1v1| 1,1 |
902 | +----+ |
903 +-------------+
904
**OSD 1** is the primary and receives a **WRITE FULL** from a client, which
means the payload is to replace the object entirely instead of overwriting a
portion of it. Version 2 (v2) of the object is created to override version 1
(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
chunk number 1, version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
``C1v2`` (i.e. Coding chunk number 1, version 2) on **OSD 3**. Each chunk is
sent to the target OSD, including the primary OSD, which is responsible for
storing chunks in addition to handling write operations and maintaining an
authoritative version of the placement group logs. When an OSD receives the
message instructing it to write the chunk, it also creates a new entry in the
placement group logs to reflect the change. For instance, as soon as **OSD 3**
stores ``C1v2``, it adds the entry ``1,2`` (i.e. epoch 1, version 2) to its
logs. Because the OSDs work asynchronously, some chunks may still be in flight
(such as ``D2v2``) while others are acknowledged and persisted to storage
drives (such as ``C1v1`` and ``D1v1``).

921.. ditaa::
922
923 Primary OSD
924
925 +-------------+
926 | OSD 1 |
927 | log |
928 | +----+ | +-------------+
929 | |D1v2| 1,2 | Write Full | |
930 | +----+ +<------------+ Ceph Client |
931 | | v2 | |
932 | +----+ | +-------------+
933 | |D1v1| 1,1 |
934 | +----+ |
935 +------+------+
936 |
937 |
938 | +------+------+
939 | | OSD 2 |
940 | +------+ | log |
941 +->| D2v2 | | +----+ |
942 | +------+ | |D2v1| 1,1 |
943 | | +----+ |
944 | +-------------+
945 |
946 | +-------------+
947 | | OSD 3 |
948 | | log |
949 | | +----+ |
950 | | |C1v2| 1,2 |
951 +---------->+ +----+ |
952 | |
953 | +----+ |
954 | |C1v1| 1,1 |
955 | +----+ |
956 +-------------+


If all goes well, the chunks are acknowledged on each OSD in the acting set and
the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.

962.. ditaa::
963
964 Primary OSD
965
966 +-------------+
967 | OSD 1 |
968 | log |
969 | +----+ | +-------------+
970 | |D1v2| 1,2 | Write Full | |
971 | +----+ +<------------+ Ceph Client |
972 | | v2 | |
973 | +----+ | +-------------+
974 | |D1v1| 1,1 |
975 | +----+ |
976 +------+------+
977 |
978 | +-------------+
979 | | OSD 2 |
980 | | log |
981 | | +----+ |
982 | | |D2v2| 1,2 |
983 +---------->+ +----+ |
984 | | |
985 | | +----+ |
986 | | |D2v1| 1,1 |
987 | | +----+ |
988 | +-------------+
989 |
990 | +-------------+
991 | | OSD 3 |
992 | | log |
993 | | +----+ |
994 | | |C1v2| 1,2 |
995 +---------->+ +----+ |
996 | |
997 | +----+ |
998 | |C1v1| 1,1 |
999 | +----+ |
1000 +-------------+


Finally, the files used to store the chunks of the previous version of the
object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
on **OSD 3**.

1007.. ditaa::
f91f0fd5 1008
1009 Primary OSD
1010
1011 +-------------+
1012 | OSD 1 |
1013 | log |
1014 | +----+ |
1015 | |D1v2| 1,2 |
1016 | +----+ |
1017 +------+------+
1018 |
1019 |
1020 | +-------------+
1021 | | OSD 2 |
1022 | | log |
1023 +--------->+ +----+ |
1024 | | |D2v2| 1,2 |
1025 | | +----+ |
1026 | +-------------+
1027 |
1028 | +-------------+
1029 | | OSD 3 |
1030 | | log |
1031 +--------->| +----+ |
1032 | |C1v2| 1,2 |
1033 | +----+ |
1034 +-------------+
1035
1036
But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
the object's version 2 is partially written: **OSD 3** has one chunk but that is
not enough to recover. Two chunks were lost: ``D1v2`` and ``D2v2``, and the
erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks
are available to rebuild the third. **OSD 4** becomes the new primary and finds
that the ``last_complete`` log entry (i.e., all objects before this entry were
known to be available on all OSDs in the previous acting set) is ``1,1``, and
that will be the head of the new authoritative log.

1046.. ditaa::
f91f0fd5 1047
1048 +-------------+
1049 | OSD 1 |
1050 | (down) |
1051 | c333 |
1052 +------+------+
1053 |
1054 | +-------------+
1055 | | OSD 2 |
1056 | | log |
1057 | | +----+ |
1058 +---------->+ |D2v1| 1,1 |
1059 | | +----+ |
1060 | | |
1061 | +-------------+
1062 |
1063 | +-------------+
1064 | | OSD 3 |
1065 | | log |
1066 | | +----+ |
1067 | | |C1v2| 1,2 |
1068 +---------->+ +----+ |
1069 | |
1070 | +----+ |
1071 | |C1v1| 1,1 |
1072 | +----+ |
1073 +-------------+
1074 Primary OSD
1075 +-------------+
1076 | OSD 4 |
1077 | log |
1078 | |
1079 | 1,1 |
1080 | |
1081 +------+------+
1082
1083
1084
The log entry ``1,2`` found on **OSD 3** is divergent from the new authoritative
log provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
the erasure coding library during scrubbing and stored on the new primary
**OSD 4**.


1092.. ditaa::
f91f0fd5 1093
1094 Primary OSD
1095
1096 +-------------+
1097 | OSD 4 |
1098 | log |
1099 | +----+ |
1100 | |D1v1| 1,1 |
1101 | +----+ |
1102 +------+------+
1103 ^
1104 |
1105 | +-------------+
1106 | | OSD 2 |
1107 | | log |
1108 +----------+ +----+ |
1109 | | |D2v1| 1,1 |
1110 | | +----+ |
1111 | +-------------+
1112 |
1113 | +-------------+
1114 | | OSD 3 |
1115 | | log |
1116 +----------| +----+ |
1117 | |C1v1| 1,1 |
1118 | +----+ |
1119 +-------------+
1120
1121 +-------------+
1122 | OSD 1 |
1123 | (down) |
1124 | c333 |
1125 +-------------+
1126
See `Erasure Code Notes`_ for additional details.



Cache Tiering
-------------

.. note:: Cache tiering is deprecated in Reef.

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects, and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.


1147.. ditaa::
1148
1149 +-------------+
1150 | Ceph Client |
1151 +------+------+
1152 ^
1153 Tiering is |
1154 Transparent | Faster I/O
1155 to Ceph | +---------------+
1156 Client Ops | | |
1157 | +----->+ Cache Tier |
1158 | | | |
1159 | | +-----+---+-----+
1160 | | | ^
1161 v v | | Active Data in Cache Tier
1162 +------+----+--+ | |
1163 | Objecter | | |
1164 +-----------+--+ | |
1165 ^ | | Inactive Data in Storage Tier
1166 | v |
1167 | +-----+---+-----+
1168 | | |
1169 +----->| Storage Tier |
1170 | |
1171 +---------------+
1172 Slower I/O
1173

See `Cache Tiering`_ for additional details. Note that Cache Tiers can be
tricky and their use is now discouraged.


.. index:: Extensibility, Ceph Classes

Extending Ceph
--------------

You can extend Ceph by creating shared object classes called 'Ceph Classes'.
Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
can create new object methods that have the ability to call the native methods
in the Ceph Object Store, or other class methods you incorporate via libraries
or create yourself.

On writes, Ceph Classes can call native or class methods, perform any series of
operations on the inbound data and generate a resulting write transaction that
Ceph will apply atomically.

On reads, Ceph Classes can call native or class methods, perform any series of
operations on the outbound data and return the data to the client.

.. topic:: Ceph Class Example

   A Ceph class for a content management system that presents pictures of a
   particular size and aspect ratio could take an inbound bitmap image, crop it
   to a particular aspect ratio, resize it and embed an invisible copyright or
   watermark to help protect the intellectual property; then, save the
   resulting bitmap image to the object store.

See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
exemplary implementations.
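
Object class methods are invoked through ``librados``. The sketch below uses
the Python bindings to call the ``hello`` example class that ships in the Ceph
source tree (``src/cls/hello``); it assumes that class is loadable on the OSDs,
and the pool and object names are illustrative.

.. code-block:: python

   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       ioctx = cluster.open_ioctx('mypool')        # illustrative pool name
       try:
           ioctx.write_full('greeting', b'')       # the method runs against an object
           # Invoke the "say_hello" method of the "hello" object class on the OSD.
           ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'')
           print(ret, out)
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()
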


Summary
-------

Ceph Storage Clusters are dynamic--like a living organism. Although many storage
appliances do not fully utilize the CPU and RAM of a typical commodity server,
Ceph does. From heartbeats, to peering, to rebalancing the cluster or
recovering from faults, Ceph offloads work from clients (and from a centralized
gateway, which doesn't exist in the Ceph architecture) and uses the computing
power of the OSDs to perform the work. When referring to `Hardware
Recommendations`_ and the `Network Config Reference`_, be cognizant of the
foregoing concepts to understand how Ceph utilizes computing resources.

.. index:: Ceph Protocol, librados

Ceph Protocol
=============

Ceph Clients use the native protocol for interacting with the Ceph Storage
Cluster. Ceph packages this functionality into the ``librados`` library so that
you can create your own custom Ceph Clients. The following diagram depicts the
basic architecture.

f91f0fd5
TL
1231.. ditaa::
1232
1233 +---------------------------------+
1234 | Ceph Storage Cluster Protocol |
1235 | (librados) |
1236 +---------------------------------+
1237 +---------------+ +---------------+
1238 | OSDs | | Monitors |
1239 +---------------+ +---------------+
1240

Native Protocol and ``librados``
--------------------------------

Modern applications need a simple object storage interface with asynchronous
communication capability. The Ceph Storage Cluster provides a simple object
storage interface with asynchronous communication capability. The interface
provides direct, parallel access to objects throughout the cluster. Among other
capabilities, ``librados`` provides:

- Pool Operations
- Snapshots and Copy-on-write Cloning
- Read/Write Objects
 - Create or Remove
 - Entire Object or Byte Range
 - Append or Truncate
- Create/Set/Get/Remove XATTRs
- Create/Set/Get/Remove Key/Value Pairs
- Compound operations and dual-ack semantics
- Object Classes


.. index:: architecture; watch/notify

Object Watch/Notify
-------------------

A client can register a persistent interest with an object and keep a session to
the primary OSD open. The client can send a notification message and a payload to
all watchers and receive notification when the watchers receive the
notification. This enables a client to use any object as a
synchronization/communication channel.

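A sketch of the watch/notify pattern using the Python ``librados`` bindings is
shown below. The ``watch``/``notify`` call signatures reflect recent
python-rados releases and should be treated as assumptions, as should the pool
and object names.

.. code-block:: python

   import rados

   def on_notify(notify_id, notifier_id, watch_id, data):
       # Called for each notification delivered to this watcher.
       print("got notification:", data)

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       ioctx = cluster.open_ioctx('mypool')           # illustrative pool name
       try:
           ioctx.write_full('sync-object', b'')       # the rendezvous object
           watch = ioctx.watch('sync-object', on_notify)
           ioctx.notify('sync-object', 'hello watchers')   # fan out to watchers
           watch.close()
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()
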
1275.. ditaa::
1276
1277 +----------+ +----------+ +----------+ +---------------+
1278 | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID |
1279 +----------+ +----------+ +----------+ +---------------+
1280 | | | |
1281 | | | |
1282 | | Watch Object | |
1283 |--------------------------------------------------->|
1284 | | | |
1285 |<---------------------------------------------------|
1286 | | Ack/Commit | |
1287 | | | |
1288 | | Watch Object | |
1289 | |---------------------------------->|
1290 | | | |
1291 | |<----------------------------------|
1292 | | Ack/Commit | |
1293 | | | Watch Object |
1294 | | |----------------->|
1295 | | | |
1296 | | |<-----------------|
1297 | | | Ack/Commit |
1298 | | Notify | |
1299 |--------------------------------------------------->|
1300 | | | |
1301 |<---------------------------------------------------|
1302 | | Notify | |
1303 | | | |
1304 | |<----------------------------------|
1305 | | Notify | |
1306 | | |<-----------------|
1307 | | | Notify |
1308 | | Ack | |
1309 |----------------+---------------------------------->|
1310 | | | |
1311 | | Ack | |
1312 | +---------------------------------->|
1313 | | | |
1314 | | | Ack |
1315 | | |----------------->|
1316 | | | |
1317 |<---------------+----------------+------------------|
1318 | Complete
1319
.. index:: architecture; Striping

Data Striping
-------------

Storage devices have throughput limitations, which impact performance and
scalability. So storage systems often support `striping`_--storing sequential
pieces of information across multiple storage devices--to increase throughput
and performance. The most common form of data striping comes from `RAID`_.
The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
volume'. Ceph's striping offers the throughput of RAID 0 striping, the
reliability of n-way RAID mirroring, and faster recovery.

Ceph provides three types of clients: Ceph Block Device, Ceph File System, and
Ceph Object Storage. A Ceph Client converts its data from the representation
format it provides to its users (a block device image, RESTful objects, CephFS
filesystem directories) into objects for storage in the Ceph Storage Cluster.

.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
   Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their
   data over multiple Ceph Storage Cluster objects. Ceph Clients that write
   directly to the Ceph Storage Cluster via ``librados`` must perform the
   striping (and parallel I/O) for themselves to obtain these benefits.

The simplest Ceph striping format involves a stripe count of 1 object. Ceph
Clients write stripe units to a Ceph Storage Cluster object until the object is
at its maximum capacity, and then create another object for additional stripes
of data. The simplest form of striping may be sufficient for small block device
images, S3 or Swift objects and CephFS files. However, this simple form doesn't
take maximum advantage of Ceph's ability to distribute data across placement
groups, and consequently doesn't improve performance very much. The following
diagram depicts the simplest form of striping:

1353.. ditaa::
1354
1355 +---------------+
1356 | Client Data |
1357 | Format |
1358 | cCCC |
1359 +---------------+
1360 |
1361 +--------+-------+
1362 | |
1363 v v
1364 /-----------\ /-----------\
1365 | Begin cCCC| | Begin cCCC|
1366 | Object 0 | | Object 1 |
1367 +-----------+ +-----------+
1368 | stripe | | stripe |
1369 | unit 1 | | unit 5 |
1370 +-----------+ +-----------+
1371 | stripe | | stripe |
1372 | unit 2 | | unit 6 |
1373 +-----------+ +-----------+
1374 | stripe | | stripe |
1375 | unit 3 | | unit 7 |
1376 +-----------+ +-----------+
1377 | stripe | | stripe |
1378 | unit 4 | | unit 8 |
1379 +-----------+ +-----------+
1380 | End cCCC | | End cCCC |
1381 | Object 0 | | Object 1 |
1382 \-----------/ \-----------/
1383

If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
or large CephFS directories, you may see considerable read/write performance
improvements by striping client data over multiple objects within an object set.
A significant write performance improvement occurs when the client writes the
stripe units to their corresponding objects in parallel. Since objects get
mapped to different placement groups and further mapped to different OSDs, each
write occurs in parallel at the maximum write speed. A write to a single drive
would be limited by the head movement (e.g. 6ms per seek) and bandwidth of that
one device (e.g. 100MB/s). By spreading that write over multiple objects (which
map to different placement groups and OSDs) Ceph can reduce the number of seeks
per drive and combine the throughput of multiple drives to achieve much faster
write (or read) speeds.

.. note:: Striping is independent of object replicas. Since CRUSH
   replicates objects across OSDs, stripes get replicated automatically.

In the following diagram, client data gets striped across an object set
(``object set 1`` in the following diagram) consisting of 4 objects, where the
first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
client determines if the object set is full. If the object set is not full, the
client begins writing a stripe to the first object again (``object 0`` in the
following diagram). If the object set is full, the client creates a new object
set (``object set 2`` in the following diagram), and begins writing to the first
stripe (``stripe unit 16``) in the first object in the new object set (``object
4`` in the diagram below).

.. ditaa::

1414 +---------------+
1415 | Client Data |
1416 | Format |
1417 | cCCC |
1418 +---------------+
1419 |
1420 +-----------------+--------+--------+-----------------+
1421 | | | | +--\
1422 v v v v |
1423 /-----------\ /-----------\ /-----------\ /-----------\ |
1424 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1425 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1426 +-----------+ +-----------+ +-----------+ +-----------+ |
1427 | stripe | | stripe | | stripe | | stripe | |
1428 | unit 0 | | unit 1 | | unit 2 | | unit 3 | |
1429 +-----------+ +-----------+ +-----------+ +-----------+ |
1430 | stripe | | stripe | | stripe | | stripe | +-\
1431 | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object
1432 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1433 | stripe | | stripe | | stripe | | stripe | | 1
1434 | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/
1435 +-----------+ +-----------+ +-----------+ +-----------+ |
1436 | stripe | | stripe | | stripe | | stripe | |
1437 | unit 12 | | unit 13 | | unit 14 | | unit 15 | |
1438 +-----------+ +-----------+ +-----------+ +-----------+ |
1439 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1440 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1441 \-----------/ \-----------/ \-----------/ \-----------/ |
1442 |
1443 +--/
1444
1445 +--\
1446 |
1447 /-----------\ /-----------\ /-----------\ /-----------\ |
1448 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1449 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1450 +-----------+ +-----------+ +-----------+ +-----------+ |
1451 | stripe | | stripe | | stripe | | stripe | |
1452 | unit 16 | | unit 17 | | unit 18 | | unit 19 | |
1453 +-----------+ +-----------+ +-----------+ +-----------+ |
1454 | stripe | | stripe | | stripe | | stripe | +-\
1455 | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object
1456 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1457 | stripe | | stripe | | stripe | | stripe | | 2
1458 | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/
1459 +-----------+ +-----------+ +-----------+ +-----------+ |
1460 | stripe | | stripe | | stripe | | stripe | |
1461 | unit 28 | | unit 29 | | unit 30 | | unit 31 | |
1462 +-----------+ +-----------+ +-----------+ +-----------+ |
1463 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1464 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1465 \-----------/ \-----------/ \-----------/ \-----------/ |
1466 |
1467 +--/
1468
Three important variables determine how Ceph stripes data (a short worked example follows the list below):
1470
- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
  configurable size (for example, 2 MB or 4 MB). The object size should be
  large enough to accommodate many stripe units, and it should be a multiple
  of the stripe unit.

- **Stripe Width:** Stripes have a configurable unit size (for example, 64 KB).
  The Ceph Client divides the data it will write to objects into equally
  sized stripe units, except for the last stripe unit. The stripe width
  should be a fraction of the object size so that an object can contain
  many stripe units.

- **Stripe Count:** The Ceph Client writes a sequence of stripe units
  over a series of objects determined by the stripe count. The series
  of objects is called an object set. After the Ceph Client writes to
  the last object in the object set, it returns to the first object in
  the object set.
1487
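As a concrete, purely illustrative worked example of how these three values
interact, the sketch below maps a byte offset in a client's data stream to an
object number and an offset inside that object. It assumes a 4 MB object size,
a 64 KB stripe unit, and a stripe count of 4; the formulae follow the striping
pattern shown in the diagrams above, and the specific sizes are assumptions,
not Ceph defaults:

.. code-block:: python

   # Illustrative only: maps a byte offset in the client's data stream to
   # (object number, offset inside that object) under the assumed parameters.
   OBJECT_SIZE = 4 * 1024 * 1024    # assumed object size (4 MB)
   STRIPE_UNIT = 64 * 1024          # assumed stripe unit  (64 KB)
   STRIPE_COUNT = 4                 # assumed objects per object set

   UNITS_PER_OBJECT = OBJECT_SIZE // STRIPE_UNIT     # 64 stripe units per object
   OBJECT_SET_SIZE = OBJECT_SIZE * STRIPE_COUNT      # 16 MB per object set

   def locate(byte_offset):
       unit = byte_offset // STRIPE_UNIT             # global stripe unit number
       object_set = byte_offset // OBJECT_SET_SIZE   # which object set
       object_in_set = unit % STRIPE_COUNT           # which object in that set
       object_number = object_set * STRIPE_COUNT + object_in_set
       # how far into the object this stripe unit starts
       offset_in_object = ((unit // STRIPE_COUNT) % UNITS_PER_OBJECT) * STRIPE_UNIT
       return object_number, offset_in_object + byte_offset % STRIPE_UNIT

   # Byte 5 MB of the stream -> (0, 1310720): object 0, 1.25 MB into that object.
   print(locate(5 * 1024 * 1024))
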
.. important:: Test the performance of your striping configuration before
   putting your cluster into production. You CANNOT change these striping
   parameters after you stripe the data and write it to objects.
1491
Once the Ceph Client has striped data to stripe units and mapped the stripe
units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
and the placement groups to Ceph OSD Daemons, before the objects are stored on
a storage drive.
1496
.. note:: Because a client writes to a single pool, all data striped into
   objects is mapped to placement groups in the same pool. Consequently, it
   uses the same CRUSH map and the same access controls.
1500
1501
.. index:: architecture; Ceph Clients
1503
Ceph Clients
============
1506
Ceph Clients include a number of service interfaces. These include:

- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
  provides resizable, thin-provisioned block devices that can be snapshotted
  and cloned. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
  that uses ``librbd`` directly--avoiding the kernel object overhead for
  virtualized systems.

- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
  provides RESTful APIs with interfaces that are compatible with Amazon S3
  and OpenStack Swift.

- **Filesystem**: The :term:`Ceph File System` (CephFS) service provides
  a POSIX-compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).
1523
Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
and high availability. The following diagram depicts the high-level
architecture.
1527
.. ditaa::

7c673cae 1530 +--------------+ +----------------+ +-------------+
91327a77 1531 | Block Device | | Object Storage | | CephFS |
1532 +--------------+ +----------------+ +-------------+
1533
1534 +--------------+ +----------------+ +-------------+
1535 | librbd | | librgw | | libcephfs |
1536 +--------------+ +----------------+ +-------------+
1537
1538 +---------------------------------------------------+
1539 | Ceph Storage Cluster Protocol (librados) |
1540 +---------------------------------------------------+
1541
1542 +---------------+ +---------------+ +---------------+
1543 | OSDs | | MDSs | | Monitors |
1544 +---------------+ +---------------+ +---------------+
1545
1546
.. index:: architecture; Ceph Object Storage
1548
Ceph Object Storage
-------------------
1551
The Ceph Object Storage daemon, ``radosgw``, is an HTTP service that provides
a RESTful_ API to store objects and metadata. It layers on top of the Ceph
Storage Cluster with its own data formats, and it maintains its own user
database, authentication, and access control. The RADOS Gateway uses a unified
namespace, which means that you can use either the OpenStack Swift-compatible
API or the Amazon S3-compatible API. For example, you can write data using the
S3-compatible API with one application and then read that data using the
Swift-compatible API with another application.
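
To make that last point concrete, here is a minimal sketch of such a
cross-protocol round trip using the third-party ``boto3`` (S3) and
``python-swiftclient`` (Swift) libraries, neither of which is part of Ceph. The
endpoint URL, credentials, bucket/container name, and key are placeholders you
would replace with values from your own RGW deployment:

.. code-block:: python

   import boto3
   from swiftclient.client import Connection

   # Write an object through the S3-compatible API (placeholder endpoint/keys).
   s3 = boto3.client(
       's3',
       endpoint_url='http://rgw.example.com:8080',
       aws_access_key_id='S3_ACCESS_KEY',
       aws_secret_access_key='S3_SECRET_KEY',
   )
   s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'written via S3')

   # Read the same object back through the Swift-compatible API; the unified
   # namespace means the S3 bucket is visible as a Swift container.
   swift = Connection(
       authurl='http://rgw.example.com:8080/auth/v1.0',
       user='demo:swift',
       key='SWIFT_SECRET_KEY',
   )
   headers, body = swift.get_object('demo-bucket', 'hello.txt')
   print(body)   # b'written via S3'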
1560
.. topic:: S3/Swift Objects and Storage Cluster Objects Compared
1562
1563 Ceph's Object Storage uses the term *object* to describe the data it stores.
1564 S3 and Swift objects are not the same as the objects that Ceph writes to the
1565 Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
1566 Cluster objects. The S3 and Swift objects do not necessarily
1567 correspond in a 1:1 manner with an object stored in the storage cluster. It
1568 is possible for an S3 or Swift object to map to multiple Ceph objects.
1569
See `Ceph Object Storage`_ for details.
1571
1572
.. index:: Ceph Block Device; block device; RBD; Rados Block Device
1574
Ceph Block Device
-----------------
1577
A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster, where each object gets mapped to a placement group and
distributed, and the placement groups are spread across separate ``ceph-osd``
daemons throughout the cluster.
1582
.. important:: Striping allows RBD block devices to perform better than a single
   server could!
1585
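As a sketch of how this striping can be controlled from client code, the Python
``rbd`` binding lets you set the stripe unit and stripe count when an image is
created. The pool name, image name, and sizes below are assumptions chosen for
illustration, and the ``stripe_unit``/``stripe_count`` arguments are optional
(omitting them gives the simplest striping format described earlier):

.. code-block:: python

   import rados
   import rbd

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   ioctx = cluster.open_ioctx('rbd')          # hypothetical pool name

   # Create a 1 GiB image striped with a 64 KiB stripe unit over 4 objects.
   rbd.RBD().create(
       ioctx,
       'striped-image',                       # hypothetical image name
       1024 ** 3,                             # image size in bytes
       stripe_unit=64 * 1024,
       stripe_count=4,
   )

   with rbd.Image(ioctx, 'striped-image') as image:
       image.write(b'hello', 0)               # librbd performs the striping

   ioctx.close()
   cluster.shutdown()
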
Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
virtualization and cloud computing. In virtual machine scenarios, people
typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
``libvirt`` to support OpenStack and CloudStack among other solutions.
1593
While we do not provide ``librbd`` support with other hypervisors at this time,
you may also use Ceph Block Device kernel objects to provide a block device to
a client. Other virtualization technologies, such as Xen, can access the Ceph
Block Device kernel object(s). This is done with the command-line tool ``rbd``.
1598
1599
.. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds

.. _arch-cephfs:

Ceph File System
----------------

The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
service that is layered on top of the object-based Ceph Storage Cluster.
CephFS files get mapped to objects that Ceph stores in the Ceph Storage
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
a Filesystem in User Space (FUSE).
1612
.. ditaa::

1615 +-----------------------+ +------------------------+
1616 | CephFS Kernel Object | | CephFS FUSE |
1617 +-----------------------+ +------------------------+
1618
1619 +---------------------------------------------------+
91327a77 1620 | CephFS Library (libcephfs) |
1621 +---------------------------------------------------+
1622
1623 +---------------------------------------------------+
1624 | Ceph Storage Cluster Protocol (librados) |
1625 +---------------------------------------------------+
1626
1627 +---------------+ +---------------+ +---------------+
1628 | OSDs | | MDSs | | Monitors |
1629 +---------------+ +---------------+ +---------------+
1630
1631
The Ceph File System service includes the Ceph Metadata Server (MDS), which is
deployed with the Ceph Storage Cluster. The purpose of the MDS is to store all
of the filesystem metadata (directories, file ownership, access modes, and so
on) in high-availability Ceph Metadata Servers, where the metadata resides in
memory. The MDS (a daemon called ``ceph-mds``) is needed because simple
filesystem operations, such as listing a directory or changing into a directory
(``ls``, ``cd``), would otherwise tax the Ceph OSD Daemons unnecessarily.
Separating the metadata from the data means that the Ceph File System can
provide high-performance services without taxing the Ceph Storage Cluster.
1641
CephFS separates the metadata from the data, storing the metadata in the MDS
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph File System aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed across multiple physical machines,
either for high availability or for scalability.
1647
1648- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
1649 ready to take over the duties of any failed ``ceph-mds`` that was
1650 `active`. This is easy because all the data, including the journal, is
1651 stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
1652
1653- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
1654 will split the directory tree into subtrees (and shards of a single
1655 busy directory), effectively balancing the load amongst all `active`
1656 servers.
1657
Combinations of `standby` and `active` MDS daemons are possible, for example
running three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.
1661
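Beyond the kernel and FUSE clients, the ``libcephfs`` layer shown in the
diagram above can also be used directly from applications. The following sketch
uses the Python ``cephfs`` binding; the configuration path, directory, and file
names are illustrative assumptions:

.. code-block:: python

   import os
   import cephfs

   fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
   fs.mount()                                   # attach to the default filesystem

   fs.mkdir('/demo', 0o755)                     # metadata operation -> handled by the MDS
   fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
   fs.write(fd, b'stored as RADOS objects', 0)  # file data -> objects in the data pool
   fs.close(fd)

   fs.unmount()
   fs.shutdown()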
1662
1663
.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.io/assets/pdfs/weil-rados-pdsw07.pdf
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../start/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: https://en.wikipedia.org/wiki/Data_striping
.. _RAID: https://en.wikipedia.org/wiki/RAID
.. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: https://en.wikipedia.org/wiki/RESTful
.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
.. _Cache Tiering: ../rados/operations/cache-tiering
.. _Set Pool Values: ../rados/operations/pools#set-pool-values
.. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol)
.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
.. _User Management: ../rados/operations/user-management