1==============
2 Architecture
3==============
4
:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability: thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically.
13
14.. image:: images/stack.png
15
16
17The Ceph Storage Cluster
18========================
19
20Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
21:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
22about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
23Storage Clusters`_.
24
25A Ceph Storage Cluster consists of two types of daemons:
26
27- :term:`Ceph Monitor`
28- :term:`Ceph OSD Daemon`
29
30.. ditaa:: +---------------+ +---------------+
31 | OSDs | | Monitors |
32 +---------------+ +---------------+
33
34A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
35monitors ensures high availability should a monitor daemon fail. Storage cluster
36clients retrieve a copy of the cluster map from the Ceph Monitor.
37
38A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
39back to monitors.
40
41Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
42to efficiently compute information about data location, instead of having to
43depend on a central lookup table. Ceph's high-level features include providing a
44native interface to the Ceph Storage Cluster via ``librados``, and a number of
45service interfaces built on top of ``librados``.
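
For example, a client built on the Python ``rados`` bindings (one of the
``librados`` language bindings) can connect to the cluster and store or
retrieve objects directly. The following is a minimal sketch, assuming a
reachable cluster, a readable ``/etc/ceph/ceph.conf``, and an existing pool
named ``mypool``:

.. code-block:: python

    import rados

    # Connect using a local ceph.conf (the path is an assumption of this sketch).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('mypool')   # pool assumed to exist
        try:
            # Store an object, then read it back through librados.
            ioctx.write_full('hello-object', b'hello from librados')
            print(ioctx.read('hello-object'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()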
46
47
48
49Storing Data
50------------
51
52The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
53comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
54:term:`Ceph Filesystem` or a custom implementation you create using
55``librados``--and it stores the data as objects. Each object corresponds to a
56file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
57OSD Daemons handle the read/write operations on the storage disks.
58
59.. ditaa:: /-----\ +-----+ +-----+
60 | obj |------>| {d} |------>| {s} |
61 \-----/ +-----+ +-----+
62
63 Object File Disk
64
65Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no
66hierarchy of directories). An object has an identifier, binary data, and
67metadata consisting of a set of name/value pairs. The semantics are completely
68up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
69attributes such as the file owner, created date, last modified date, and so
70forth.
71
72
73.. ditaa:: /------+------------------------------+----------------\
74 | ID | Binary Data | Metadata |
75 +------+------------------------------+----------------+
76 | 1234 | 0101010101010100110101010010 | name1 = value1 |
77 | | 0101100001010100110101010010 | name2 = value2 |
78 | | 0101100001010100110101010010 | nameN = valueN |
79 \------+------------------------------+----------------/
80
81.. note:: An object ID is unique across the entire cluster, not just the local
82 filesystem.
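
Clients attach such name/value metadata to an object as extended attributes
(xattrs). A short sketch using the Python ``rados`` bindings; the pool name,
object name, and attribute names are illustrative only:

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')       # illustrative pool name
    try:
        # Binary data plus name/value metadata; the semantics of the
        # attributes are entirely up to the client.
        ioctx.write_full('1234', b'\x01\x01\x01\x01\x00')
        ioctx.set_xattr('1234', 'name1', b'value1')
        ioctx.set_xattr('1234', 'name2', b'value2')
        print(ioctx.get_xattr('1234', 'name1'))
    finally:
        ioctx.close()
        cluster.shutdown()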
83
84
85.. index:: architecture; high availability, scalability
86
87Scalability and High Availability
88---------------------------------
89
90In traditional architectures, clients talk to a centralized component (e.g., a
91gateway, broker, API, facade, etc.), which acts as a single point of entry to a
92complex subsystem. This imposes a limit to both performance and scalability,
93while introducing a single point of failure (i.e., if the centralized component
94goes down, the whole system goes down, too).
95
96Ceph eliminates the centralized gateway to enable clients to interact with
97Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
98Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
99of monitors to ensure high availability. To eliminate centralization, Ceph
100uses an algorithm called CRUSH.
101
102
103.. index:: CRUSH; architecture
104
105CRUSH Introduction
106~~~~~~~~~~~~~~~~~~
107
108Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
109Replication Under Scalable Hashing)` algorithm to efficiently compute
110information about object location, instead of having to depend on a
111central lookup table. CRUSH provides a better data management mechanism compared
112to older approaches, and enables massive scale by cleanly distributing the work
113to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
114replication to ensure resiliency, which is better suited to hyper-scale storage.
115The following sections provide additional details on how CRUSH works. For a
116detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
117Placement of Replicated Data`_.
118
119.. index:: architecture; cluster map
120
121Cluster Map
122~~~~~~~~~~~
123
Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
cluster topology, which comprises five maps collectively referred to as the
"Cluster Map":

#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name,
   address, and port of each monitor. It also indicates the current epoch,
   when the map was created, and the last time it changed. To view a monitor
   map, execute ``ceph mon dump``.
132
133#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
134 last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
135 and their status (e.g., ``up``, ``in``). To view an OSD map, execute
136 ``ceph osd dump``.
137
138#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
139 map epoch, the full ratios, and details on each placement group such as
140 the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
141 ``active + clean``), and data usage statistics for each pool.
142
143#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
144 hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
145 traversing the hierarchy when storing data. To view a CRUSH map, execute
146 ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
147 ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
148 You can view the decompiled map in a text editor or with ``cat``.
149
150#. **The MDS Map:** Contains the current MDS map epoch, when the map was
151 created, and the last time it changed. It also contains the pool for
152 storing metadata, a list of metadata servers, and which metadata servers
153 are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
154
155Each map maintains an iterative history of its operating state changes. Ceph
156Monitors maintain a master copy of the cluster map including the cluster
157members, state, changes, and the overall health of the Ceph Storage Cluster.
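
The same maps can also be retrieved programmatically. As a sketch, the Python
``rados`` bindings expose the monitor command interface, so a client with an
admin keyring can fetch the monitor map and OSD map that ``ceph mon dump`` and
``ceph osd dump`` display:

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Ask the monitors for the monitor map and the OSD map.
        for prefix in ('mon dump', 'osd dump'):
            cmd = json.dumps({'prefix': prefix, 'format': 'json'})
            ret, outbuf, outs = cluster.mon_command(cmd, b'')
            if ret == 0:
                print(prefix, 'epoch:', json.loads(outbuf)['epoch'])
    finally:
        cluster.shutdown()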
158
159.. index:: high availability; monitor architecture
160
161High Availability Monitors
162~~~~~~~~~~~~~~~~~~~~~~~~~~
163
164Before Ceph Clients can read or write data, they must contact a Ceph Monitor
165to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
166can operate with a single monitor; however, this introduces a single
167point of failure (i.e., if the monitor goes down, Ceph Clients cannot
168read or write data).
169
For added reliability and fault tolerance, Ceph supports a cluster of monitors.
In a cluster of monitors, latency and other faults can cause one or more
monitors to fall behind the current state of the cluster. For this reason, Ceph
must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1; 2 out of 3; 3 out of
5; 4 out of 6; etc.) and the `Paxos`_ algorithm to establish a consensus among
the monitors about the current state of the cluster.
177
178For details on configuring monitors, see the `Monitor Config Reference`_.
179
180.. index:: architecture; high availability authentication
181
182High Availability Authentication
183~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
184
185To identify users and protect against man-in-the-middle attacks, Ceph provides
186its ``cephx`` authentication system to authenticate users and daemons.
187
188.. note:: The ``cephx`` protocol does not address data encryption in transport
189 (e.g., SSL/TLS) or encryption at rest.
190
191Cephx uses shared secret keys for authentication, meaning both the client and
192the monitor cluster have a copy of the client's secret key. The authentication
193protocol is such that both parties are able to prove to each other they have a
194copy of the key without actually revealing it. This provides mutual
195authentication, which means the cluster is sure the user possesses the secret
196key, and the user is sure that the cluster has a copy of the secret key.
197
A key scalability feature of Ceph is to avoid a centralized interface to the
Ceph object store, which means that Ceph clients must be able to interact with
OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
system, which authenticates users operating Ceph clients. The ``cephx`` protocol
behaves in a manner similar to `Kerberos`_.
203
204A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
205monitor can authenticate users and distribute keys, so there is no single point
206of failure or bottleneck when using ``cephx``. The monitor returns an
207authentication data structure similar to a Kerberos ticket that contains a
208session key for use in obtaining Ceph services. This session key is itself
209encrypted with the user's permanent secret key, so that only the user can
210request services from the Ceph monitor(s). The client then uses the session key
211to request its desired services from the monitor, and the monitor provides the
212client with a ticket that will authenticate the client to the OSDs that actually
213handle data. Ceph monitors and OSDs share a secret, so the client can use the
214ticket provided by the monitor with any OSD or metadata server in the cluster.
215Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
216ticket or session key obtained surreptitiously. This form of authentication will
217prevent attackers with access to the communications medium from either creating
218bogus messages under another user's identity or altering another user's
219legitimate messages, as long as the user's secret key is not divulged before it
220expires.
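
The following Python sketch illustrates only the underlying idea (proving
possession of a shared secret without transmitting it, then issuing a
time-limited ticket). It is a conceptual illustration, not the actual
``cephx`` message format or key hierarchy:

.. code-block:: python

    import hashlib
    import hmac
    import os
    import time

    # Both parties hold the user's secret key, provisioned out of band.
    shared_secret = os.urandom(32)

    def prove_possession(secret, challenge):
        # An HMAC over a random challenge proves possession of the key
        # without ever sending the key itself.
        return hmac.new(secret, challenge, hashlib.sha256).digest()

    # "Monitor" issues a challenge; "client" answers; monitor verifies.
    challenge = os.urandom(16)
    client_proof = prove_possession(shared_secret, challenge)
    authenticated = hmac.compare_digest(
        client_proof, prove_possession(shared_secret, challenge))

    # On success a time-limited ticket (session key plus expiry) is issued,
    # mirroring the way cephx tickets expire.
    ticket = {'session_key': os.urandom(16), 'expires': time.time() + 3600}
    print(authenticated, ticket['expires'] > time.time())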
221
222To use ``cephx``, an administrator must set up users first. In the following
223diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
224the command line to generate a username and secret key. Ceph's ``auth``
225subsystem generates the username and key, stores a copy with the monitor(s) and
226transmits the user's secret back to the ``client.admin`` user. This means that
227the client and the monitor share a secret key.
228
229.. note:: The ``client.admin`` user must provide the user ID and
230 secret key to the user in a secure manner.
231
232.. ditaa:: +---------+ +---------+
233 | Client | | Monitor |
234 +---------+ +---------+
235 | request to |
236 | create a user |
237 |-------------->|----------+ create user
238 | | | and
239 |<--------------|<---------+ store key
240 | transmit key |
241 | |
242
243
244To authenticate with the monitor, the client passes in the user name to the
245monitor, and the monitor generates a session key and encrypts it with the secret
246key associated to the user name. Then, the monitor transmits the encrypted
247ticket back to the client. The client then decrypts the payload with the shared
248secret key to retrieve the session key. The session key identifies the user for
249the current session. The client then requests a ticket on behalf of the user
250signed by the session key. The monitor generates a ticket, encrypts it with the
251user's secret key and transmits it back to the client. The client decrypts the
252ticket and uses it to sign requests to OSDs and metadata servers throughout the
253cluster.
254
255.. ditaa:: +---------+ +---------+
256 | Client | | Monitor |
257 +---------+ +---------+
258 | authenticate |
259 |-------------->|----------+ generate and
260 | | | encrypt
261 |<--------------|<---------+ session key
262 | transmit |
263 | encrypted |
264 | session key |
265 | |
266 |-----+ decrypt |
267 | | session |
268 |<----+ key |
269 | |
270 | req. ticket |
271 |-------------->|----------+ generate and
272 | | | encrypt
273 |<--------------|<---------+ ticket
274 | recv. ticket |
275 | |
276 |-----+ decrypt |
277 | | ticket |
278 |<----+ |
279
280
281The ``cephx`` protocol authenticates ongoing communications between the client
282machine and the Ceph servers. Each message sent between a client and server,
283subsequent to the initial authentication, is signed using a ticket that the
284monitors, OSDs and metadata servers can verify with their shared secret.
285
286.. ditaa:: +---------+ +---------+ +-------+ +-------+
287 | Client | | Monitor | | MDS | | OSD |
288 +---------+ +---------+ +-------+ +-------+
289 | request to | | |
290 | create a user | | |
291 |-------------->| mon and | |
292 |<--------------| client share | |
293 | receive | a secret. | |
294 | shared secret | | |
295 | |<------------>| |
296 | |<-------------+------------>|
297 | | mon, mds, | |
298 | authenticate | and osd | |
299 |-------------->| share | |
300 |<--------------| a secret | |
301 | session key | | |
302 | | | |
303 | req. ticket | | |
304 |-------------->| | |
305 |<--------------| | |
306 | recv. ticket | | |
307 | | | |
308 | make request (CephFS only) | |
309 |----------------------------->| |
310 |<-----------------------------| |
311 | receive response (CephFS only) |
312 | |
313 | make request |
314 |------------------------------------------->|
315 |<-------------------------------------------|
316 receive response
317
318The protection offered by this authentication is between the Ceph client and the
319Ceph server hosts. The authentication is not extended beyond the Ceph client. If
320the user accesses the Ceph client from a remote host, Ceph authentication is not
321applied to the connection between the user's host and the client host.
322
323
324For configuration details, see `Cephx Config Guide`_. For user management
325details, see `User Management`_.
326
327
328.. index:: architecture; smart daemons and scalability
329
330Smart Daemons Enable Hyperscale
331~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
332
333In many clustered architectures, the primary purpose of cluster membership is
334so that a centralized interface knows which nodes it can access. Then the
335centralized interface provides services to the client through a double
336dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
337
338Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
339aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
340Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
341other Ceph OSD Daemons and Ceph monitors. Additionally, it enables Ceph Clients
342to interact directly with Ceph OSD Daemons.
343
344The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
345each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
346nodes to easily perform tasks that would bog down a centralized server. The
347ability to leverage this computing power leads to several major benefits:
348
349#. **OSDs Service Clients Directly:** Since any network device has a limit to
350 the number of concurrent connections it can support, a centralized system
351 has a low physical limit at high scales. By enabling Ceph Clients to contact
352 Ceph OSD Daemons directly, Ceph increases both performance and total system
353 capacity simultaneously, while removing a single point of failure. Ceph
354 Clients can maintain a session when they need to, and with a particular Ceph
355 OSD Daemon instead of a centralized server.
356
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
   on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
   or ``down``, reflecting whether or not it is running and able to service
   Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
   Storage Cluster, this status may indicate the failure of the Ceph OSD
   Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
   Daemon cannot notify the Ceph Monitor that it is ``down``. The Ceph Monitor
   can ping a Ceph OSD Daemon periodically to ensure that it is running.
   However, Ceph also empowers Ceph OSD Daemons to determine whether a
   neighboring OSD is ``down``, to update the cluster map, and to report it to
   the Ceph monitor(s). This means that Ceph monitors can remain lightweight
   processes. See `Monitoring OSDs`_ and `Heartbeats`_ for additional details.
369
370#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
371 Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph
372 OSD Daemons can compare object metadata in one placement group with its
373 replicas in placement groups stored on other OSDs. Scrubbing (usually
374 performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also
375 perform deeper scrubbing by comparing data in objects bit-for-bit. Deep
376 scrubbing (usually performed weekly) finds bad sectors on a drive that
377 weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
378 configuring scrubbing.
379
380#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
381 algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
382 objects should be stored (and for rebalancing). In a typical write scenario,
383 a client uses the CRUSH algorithm to compute where to store an object, maps
384 the object to a pool and placement group, then looks at the CRUSH map to
385 identify the primary OSD for the placement group.
386
387 The client writes the object to the identified placement group in the
388 primary OSD. Then, the primary OSD with its own copy of the CRUSH map
389 identifies the secondary and tertiary OSDs for replication purposes, and
390 replicates the object to the appropriate placement groups in the secondary
391 and tertiary OSDs (as many OSDs as additional replicas), and responds to the
392 client once it has confirmed the object was stored successfully.
393
394.. ditaa::
395 +----------+
396 | Client |
397 | |
398 +----------+
399 * ^
400 Write (1) | | Ack (6)
401 | |
402 v *
403 +-------------+
404 | Primary OSD |
405 | |
406 +-------------+
407 * ^ ^ *
408 Write (2) | | | | Write (3)
409 +------+ | | +------+
410 | +------+ +------+ |
411 | | Ack (4) Ack (5)| |
412 v * * v
413 +---------------+ +---------------+
414 | Secondary OSD | | Tertiary OSD |
415 | | | |
416 +---------------+ +---------------+
417
418With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
419clients from that duty, while ensuring high data availability and data safety.
420
421
422Dynamic Cluster Management
423--------------------------
424
425In the `Scalability and High Availability`_ section, we explained how Ceph uses
426CRUSH, cluster awareness and intelligent daemons to scale and maintain high
427availability. Key to Ceph's design is the autonomous, self-healing, and
428intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
429enable modern cloud storage infrastructures to place data, rebalance the cluster
430and recover from faults dynamically.
431
432.. index:: architecture; pools
433
434About Pools
435~~~~~~~~~~~
436
437The Ceph storage system supports the notion of 'Pools', which are logical
438partitions for storing objects.
439
440Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
441pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the
442number of placement groups determine how Ceph will place the data.
443
444.. ditaa::
445 +--------+ Retrieves +---------------+
446 | Client |------------>| Cluster Map |
447 +--------+ +---------------+
448 |
449 v Writes
450 /-----\
451 | obj |
452 \-----/
453 | To
454 v
455 +--------+ +---------------+
456 | Pool |---------->| CRUSH Ruleset |
457 +--------+ Selects +---------------+
458
459
460Pools set at least the following parameters:
461
462- Ownership/Access to Objects
463- The Number of Placement Groups, and
464- The CRUSH Ruleset to Use.
465
466See `Set Pool Values`_ for details.
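
As a sketch, pools can also be listed and created through ``librados``; the
pool name and the replica count below are chosen for illustration, and the
``osd pool set`` call is equivalent to running
``ceph osd pool set liverpool size 3``:

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        if not cluster.pool_exists('liverpool'):
            cluster.create_pool('liverpool')      # default pg_num and ruleset
        print(cluster.list_pools())

        # Pool parameters such as size (the replica count) are changed
        # through the monitors.
        cmd = json.dumps({'prefix': 'osd pool set', 'pool': 'liverpool',
                          'var': 'size', 'val': '3'})
        print(cluster.mon_command(cmd, b''))
    finally:
        cluster.shutdown()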
467
468
469.. index: architecture; placement group mapping
470
471Mapping PGs to OSDs
472~~~~~~~~~~~~~~~~~~~
473
474Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
475When a Ceph Client stores objects, CRUSH will map each object to a placement
476group.
477
478Mapping objects to placement groups creates a layer of indirection between the
479Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
480grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
481Client "knew" which Ceph OSD Daemon had which object, that would create a tight
482coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
483algorithm maps each object to a placement group and then maps each placement
484group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
485rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
486come online. The following diagram depicts how CRUSH maps objects to placement
487groups, and placement groups to OSDs.
488
489.. ditaa::
490 /-----\ /-----\ /-----\ /-----\ /-----\
491 | obj | | obj | | obj | | obj | | obj |
492 \-----/ \-----/ \-----/ \-----/ \-----/
493 | | | | |
494 +--------+--------+ +---+----+
495 | |
496 v v
497 +-----------------------+ +-----------------------+
498 | Placement Group #1 | | Placement Group #2 |
499 | | | |
500 +-----------------------+ +-----------------------+
501 | |
502 | +-----------------------+---+
503 +------+------+-------------+ |
504 | | | |
505 v v v v
506 /----------\ /----------\ /----------\ /----------\
507 | | | | | | | |
508 | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
509 | | | | | | | |
510 \----------/ \----------/ \----------/ \----------/
511
512With a copy of the cluster map and the CRUSH algorithm, the client can compute
513exactly which OSD to use when reading or writing a particular object.
514
515.. index:: architecture; calculating PG IDs
516
517Calculating PG IDs
518~~~~~~~~~~~~~~~~~~
519
520When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
521`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
522OSDs, and metadata servers in the cluster. **However, it doesn't know anything
523about object locations.**
524
525.. epigraph::
526
527 Object locations get computed.
528
529
530The only input required by the client is the object ID and the pool.
531It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
532wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
533it calculates a placement group using the object name, a hash code, the
534number of PGs in the pool and the pool name. Ceph clients use the following
535steps to compute PG IDs.
536
#. The client inputs the pool ID and the object ID (e.g., pool = "liverpool"
   and object-id = "john").
#. Ceph takes the object ID and hashes it.
#. Ceph calculates the hash modulo the number of PGs (e.g., ``58``) to get
   a PG ID.
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
544
545Computing object locations is much faster than performing object location query
546over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
547Hashing)` algorithm allows a client to compute where objects *should* be stored,
548and enables the client to contact the primary OSD to store or retrieve the
549objects.
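
The shape of that calculation can be sketched in a few lines of Python. Note
that the real client uses Ceph's ``rjenkins`` object-name hash and a "stable
mod" rather than the stand-in CRC below, so the resulting PG IDs will differ
from a real cluster's:

.. code-block:: python

    import zlib

    def compute_pg_id(pool_id, object_name, pg_num):
        # Stand-in hash; Ceph itself uses the rjenkins hash.
        obj_hash = zlib.crc32(object_name.encode('utf-8'))
        pg = obj_hash % pg_num                  # hash modulo the number of PGs
        return '{0}.{1:x}'.format(pool_id, pg)  # pool ID prepended to PG ID

    # e.g., pool "liverpool" with pool ID 4 and 128 PGs (illustrative values)
    print(compute_pg_id(4, 'john', 128))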
550
551.. index:: architecture; PG Peering
552
553Peering and Sets
554~~~~~~~~~~~~~~~~
555
In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
do is called 'peering', which is the process of bringing all of the OSDs that
store a Placement Group (PG) into agreement about the state of all of the
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
themselves; however, if the problem persists, you may need to refer to the
`Troubleshooting Peering Failure`_ section.
564
565.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
566
567The Ceph Storage Cluster was designed to store at least two copies of an object
568(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
569availability, a Ceph Storage Cluster should store more than two copies of an object
570(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
571``degraded`` state while maintaining data safety.
572
Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*, is responsible for
coordinating the peering process for each placement group where it acts as
the *Primary*, and is the **ONLY** OSD that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.

When a series of OSDs is responsible for a placement group, we refer to that
series of OSDs as an *Acting Set*. An *Acting Set* may refer to the Ceph
OSD Daemons that are currently responsible for the placement group, or the Ceph
OSD Daemons that were responsible for a particular placement group as of some
epoch.
586
587The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
588When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
589Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
590Daemons when an OSD fails.
591
592.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
593 ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
594 the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
595 removed from the *Up Set*.
596
597
598.. index:: architecture; Rebalancing
599
600Rebalancing
601~~~~~~~~~~~
602
603When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
604updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
605the cluster map. Consequently, it changes object placement, because it changes
606an input for the calculations. The following diagram depicts the rebalancing
607process (albeit rather crudely, since it is substantially less impactful with
608large clusters) where some, but not all of the PGs migrate from existing OSDs
609(OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
610stable. Many of the placement groups remain in their original configuration,
611and each OSD gets some added capacity, so there are no load spikes on the
612new OSD after rebalancing is complete.
613
614
615.. ditaa::
616 +--------+ +--------+
617 Before | OSD 1 | | OSD 2 |
618 +--------+ +--------+
619 | PG #1 | | PG #6 |
620 | PG #2 | | PG #7 |
621 | PG #3 | | PG #8 |
622 | PG #4 | | PG #9 |
623 | PG #5 | | PG #10 |
624 +--------+ +--------+
625
626 +--------+ +--------+ +--------+
627 After | OSD 1 | | OSD 2 | | OSD 3 |
628 +--------+ +--------+ +--------+
629 | PG #1 | | PG #7 | | PG #3 |
630 | PG #2 | | PG #8 | | PG #6 |
631 | PG #4 | | PG #10 | | PG #9 |
632 | PG #5 | | | | |
633 | | | | | |
634 +--------+ +--------+ +--------+
635
636
637.. index:: architecture; Data Scrubbing
638
639Data Consistency
640~~~~~~~~~~~~~~~~
641
642As part of maintaining data consistency and cleanliness, Ceph OSDs can also
643scrub objects within placement groups. That is, Ceph OSDs can compare object
644metadata in one placement group with its replicas in placement groups stored in
645other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
646errors. OSDs can also perform deeper scrubbing by comparing data in objects
647bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a
648disk that weren't apparent in a light scrub.
649
650See `Data Scrubbing`_ for details on configuring scrubbing.
651
652
653
654
655
656.. index:: erasure coding
657
658Erasure Coding
659--------------
660
An erasure coded pool stores each object as ``K+M`` chunks: the object is
divided into ``K`` data chunks and ``M`` coding chunks. The pool is configured
to have a size of ``K+M`` so that each chunk is stored in an OSD in the acting
set. The rank of the chunk is stored as an attribute of the object.

For instance, an erasure coded pool can be created to use five OSDs (``K+M = 5``)
and sustain the loss of two of them (``M = 2``).
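
To make the arithmetic concrete, here is a toy Python sketch with ``K = 3``
data chunks and a single XOR coding chunk (``M = 1``). Real erasure coded
pools use an erasure code plugin (Reed-Solomon via ``jerasure`` by default)
that supports arbitrary ``M``; only the padding, splitting and
rebuild-from-survivors pattern carries over:

.. code-block:: python

    from functools import reduce

    def xor_chunks(chunks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    def encode_k_plus_1(data, k):
        """Split data into K chunks and add one XOR coding chunk (M = 1)."""
        chunk_len = -(-len(data) // k)               # ceiling division
        padded = data.ljust(k * chunk_len, b'\x00')  # pad to a multiple of K
        chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
        return chunks + [xor_chunks(chunks)]         # K data + 1 coding chunk

    def recover_missing(shards, missing):
        """Rebuild one missing chunk by XOR-ing the surviving chunks."""
        return xor_chunks([s for i, s in enumerate(shards) if i != missing])

    shards = encode_k_plus_1(b'ABCDEFGHI', k=3)      # -> ABC, DEF, GHI, parity
    shards[1] = None                                 # lose the chunk with DEF
    print(recover_missing(shards, missing=1))        # -> b'DEF'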
668
669Reading and Writing Encoded Chunks
670~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
671
When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the
erasure encoding function splits the content into three data chunks simply by
dividing the content in three: the first contains ``ABC``, the second ``DEF``
and the last ``GHI``. The content will be padded if the content length is not a
multiple of ``K``. The function also creates two coding chunks: the fourth with
``YXY`` and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting
set. The chunks are stored in objects that have the same name (**NYAN**) but
reside on different OSDs. The order in which the chunks were created must be
preserved and is stored as an attribute of the object (``shard_t``), in addition
to its name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4
contains ``YXY`` and is stored on **OSD3**.
683
684
685.. ditaa::
686 +-------------------+
687 name | NYAN |
688 +-------------------+
689 content | ABCDEFGHI |
690 +--------+----------+
691 |
692 |
693 v
694 +------+------+
695 +---------------+ encode(3,2) +-----------+
696 | +--+--+---+---+ |
697 | | | | |
698 | +-------+ | +-----+ |
699 | | | | |
700 +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
701 name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
702 +------+ +------+ +------+ +------+ +------+
703 shard | 1 | | 2 | | 3 | | 4 | | 5 |
704 +------+ +------+ +------+ +------+ +------+
705 content | ABC | | DEF | | GHI | | YXY | | QGC |
706 +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
707 | | | | |
708 | | v | |
709 | | +--+---+ | |
710 | | | OSD1 | | |
711 | | +------+ | |
712 | | | |
713 | | +------+ | |
714 | +------>| OSD2 | | |
715 | +------+ | |
716 | | |
717 | +------+ | |
718 | | OSD3 |<----+ |
719 | +------+ |
720 | |
721 | +------+ |
722 | | OSD4 |<--------------+
723 | +------+
724 |
725 | +------+
726 +----------------->| OSD5 |
727 +------+
728
729
730When the object **NYAN** is read from the erasure coded pool, the decoding
731function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
732``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content
733of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2
734and 5 are missing (they are called 'erasures'). The chunk 5 could not be read
735because the **OSD4** is out. The decoding function can be called as soon as
736three chunks are read: **OSD2** was the slowest and its chunk was not taken into
737account.
738
739.. ditaa::
740 +-------------------+
741 name | NYAN |
742 +-------------------+
743 content | ABCDEFGHI |
744 +---------+---------+
745 ^
746 |
747 |
748 +-------+-------+
749 | decode(3,2) |
750 +------------->+ erasures 2,5 +<-+
751 | | | |
752 | +-------+-------+ |
753 | ^ |
754 | | |
755 | | |
756 +--+---+ +------+ +---+--+ +---+--+
757 name | NYAN | | NYAN | | NYAN | | NYAN |
758 +------+ +------+ +------+ +------+
759 shard | 1 | | 2 | | 3 | | 4 |
760 +------+ +------+ +------+ +------+
761 content | ABC | | DEF | | GHI | | YXY |
762 +--+---+ +--+---+ +--+---+ +--+---+
763 ^ . ^ ^
764 | TOO . | |
765 | SLOW . +--+---+ |
766 | ^ | OSD1 | |
767 | | +------+ |
768 | | |
769 | | +------+ |
770 | +-------| OSD2 | |
771 | +------+ |
772 | |
773 | +------+ |
774 | | OSD3 |------+
775 | +------+
776 |
777 | +------+
778 | | OSD4 | OUT
779 | +------+
780 |
781 | +------+
782 +------------------| OSD5 |
783 +------+
784
785
786Interrupted Full Writes
787~~~~~~~~~~~~~~~~~~~~~~~
788
789In an erasure coded pool, the primary OSD in the up set receives all write
790operations. It is responsible for encoding the payload into ``K+M`` chunks and
791sends them to the other OSDs. It is also responsible for maintaining an
792authoritative version of the placement group logs.
793
In the following diagram, an erasure coded placement group has been created with
``K = 2`` and ``M = 1`` and is supported by three OSDs, two for ``K`` and one
for ``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2**
and **OSD 3**. An object has been encoded and stored in the OSDs: the chunk
``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
version 1).
802
803
804.. ditaa::
805 Primary OSD
806
807 +-------------+
808 | OSD 1 | +-------------+
809 | log | Write Full | |
810 | +----+ |<------------+ Ceph Client |
811 | |D1v1| 1,1 | v1 | |
812 | +----+ | +-------------+
813 +------+------+
814 |
815 |
816 | +-------------+
817 | | OSD 2 |
818 | | log |
819 +--------->+ +----+ |
820 | | |D2v1| 1,1 |
821 | | +----+ |
822 | +-------------+
823 |
824 | +-------------+
825 | | OSD 3 |
826 | | log |
827 +--------->| +----+ |
828 | |C1v1| 1,1 |
829 | +----+ |
830 +-------------+
831
832**OSD 1** is the primary and receives a **WRITE FULL** from a client, which
833means the payload is to replace the object entirely instead of overwriting a
834portion of it. Version 2 (v2) of the object is created to override version 1
835(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
836chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
837``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent
838to the target OSD, including the primary OSD which is responsible for storing
839chunks in addition to handling write operations and maintaining an authoritative
version of the placement group logs. When an OSD receives the message
instructing it to write the chunk, it also creates a new entry in the placement
group logs to reflect the change. For instance, as soon as **OSD 3** stores
``C1v2``, it adds the entry ``1,2`` (i.e. epoch 1, version 2) to its logs.
Because the OSDs work asynchronously, some chunks may still be in flight (such
as ``D2v2``) while others are acknowledged and on disk (such as ``C1v1`` and
``D1v1``).
847
848.. ditaa::
849
850 Primary OSD
851
852 +-------------+
853 | OSD 1 |
854 | log |
855 | +----+ | +-------------+
856 | |D1v2| 1,2 | Write Full | |
857 | +----+ +<------------+ Ceph Client |
858 | | v2 | |
859 | +----+ | +-------------+
860 | |D1v1| 1,1 |
861 | +----+ |
862 +------+------+
863 |
864 |
865 | +------+------+
866 | | OSD 2 |
867 | +------+ | log |
868 +->| D2v2 | | +----+ |
869 | +------+ | |D2v1| 1,1 |
870 | | +----+ |
871 | +-------------+
872 |
873 | +-------------+
874 | | OSD 3 |
875 | | log |
876 | | +----+ |
877 | | |C1v2| 1,2 |
878 +---------->+ +----+ |
879 | |
880 | +----+ |
881 | |C1v1| 1,1 |
882 | +----+ |
883 +-------------+
884
885
886If all goes well, the chunks are acknowledged on each OSD in the acting set and
887the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.
888
889.. ditaa::
890
891 Primary OSD
892
893 +-------------+
894 | OSD 1 |
895 | log |
896 | +----+ | +-------------+
897 | |D1v2| 1,2 | Write Full | |
898 | +----+ +<------------+ Ceph Client |
899 | | v2 | |
900 | +----+ | +-------------+
901 | |D1v1| 1,1 |
902 | +----+ |
903 +------+------+
904 |
905 | +-------------+
906 | | OSD 2 |
907 | | log |
908 | | +----+ |
909 | | |D2v2| 1,2 |
910 +---------->+ +----+ |
911 | | |
912 | | +----+ |
913 | | |D2v1| 1,1 |
914 | | +----+ |
915 | +-------------+
916 |
917 | +-------------+
918 | | OSD 3 |
919 | | log |
920 | | +----+ |
921 | | |C1v2| 1,2 |
922 +---------->+ +----+ |
923 | |
924 | +----+ |
925 | |C1v1| 1,1 |
926 | +----+ |
927 +-------------+
928
929
930Finally, the files used to store the chunks of the previous version of the
931object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
932on **OSD 3**.
933
934.. ditaa::
935 Primary OSD
936
937 +-------------+
938 | OSD 1 |
939 | log |
940 | +----+ |
941 | |D1v2| 1,2 |
942 | +----+ |
943 +------+------+
944 |
945 |
946 | +-------------+
947 | | OSD 2 |
948 | | log |
949 +--------->+ +----+ |
950 | | |D2v2| 1,2 |
951 | | +----+ |
952 | +-------------+
953 |
954 | +-------------+
955 | | OSD 3 |
956 | | log |
957 +--------->| +----+ |
958 | |C1v2| 1,2 |
959 | +----+ |
960 +-------------+
961
962
But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
the object's version 2 is partially written: **OSD 3** has one chunk, but that
is not enough to recover. Two chunks were lost: ``D1v2`` and ``D2v2``, and the
erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks
be available to rebuild the third. **OSD 4** becomes the new primary and finds
that the ``last_complete`` log entry (i.e., all objects before this entry were
known to be available on all OSDs in the previous acting set) is ``1,1``, and
that will be the head of the new authoritative log.
971
972.. ditaa::
973 +-------------+
974 | OSD 1 |
975 | (down) |
976 | c333 |
977 +------+------+
978 |
979 | +-------------+
980 | | OSD 2 |
981 | | log |
982 | | +----+ |
983 +---------->+ |D2v1| 1,1 |
984 | | +----+ |
985 | | |
986 | +-------------+
987 |
988 | +-------------+
989 | | OSD 3 |
990 | | log |
991 | | +----+ |
992 | | |C1v2| 1,2 |
993 +---------->+ +----+ |
994 | |
995 | +----+ |
996 | |C1v1| 1,1 |
997 | +----+ |
998 +-------------+
999 Primary OSD
1000 +-------------+
1001 | OSD 4 |
1002 | log |
1003 | |
1004 | 1,1 |
1005 | |
1006 +------+------+
1007
1008
1009
The log entry ``1,2`` found on **OSD 3** is divergent from the new authoritative
log provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
the erasure coding library during scrubbing and stored on the new primary
**OSD 4**.
1015
1016
1017.. ditaa::
1018 Primary OSD
1019
1020 +-------------+
1021 | OSD 4 |
1022 | log |
1023 | +----+ |
1024 | |D1v1| 1,1 |
1025 | +----+ |
1026 +------+------+
1027 ^
1028 |
1029 | +-------------+
1030 | | OSD 2 |
1031 | | log |
1032 +----------+ +----+ |
1033 | | |D2v1| 1,1 |
1034 | | +----+ |
1035 | +-------------+
1036 |
1037 | +-------------+
1038 | | OSD 3 |
1039 | | log |
1040 +----------| +----+ |
1041 | |C1v1| 1,1 |
1042 | +----+ |
1043 +-------------+
1044
1045 +-------------+
1046 | OSD 1 |
1047 | (down) |
1048 | c333 |
1049 +-------------+
1050
1051See `Erasure Code Notes`_ for additional details.
1052
1053
1054
1055Cache Tiering
1056-------------
1057
1058A cache tier provides Ceph Clients with better I/O performance for a subset of
1059the data stored in a backing storage tier. Cache tiering involves creating a
1060pool of relatively fast/expensive storage devices (e.g., solid state drives)
1061configured to act as a cache tier, and a backing pool of either erasure-coded
1062or relatively slower/cheaper devices configured to act as an economical storage
1063tier. The Ceph objecter handles where to place the objects and the tiering
1064agent determines when to flush objects from the cache to the backing storage
1065tier. So the cache tier and the backing storage tier are completely transparent
1066to Ceph clients.
1067
1068
1069.. ditaa::
1070 +-------------+
1071 | Ceph Client |
1072 +------+------+
1073 ^
1074 Tiering is |
1075 Transparent | Faster I/O
1076 to Ceph | +---------------+
1077 Client Ops | | |
1078 | +----->+ Cache Tier |
1079 | | | |
1080 | | +-----+---+-----+
1081 | | | ^
1082 v v | | Active Data in Cache Tier
1083 +------+----+--+ | |
1084 | Objecter | | |
1085 +-----------+--+ | |
1086 ^ | | Inactive Data in Storage Tier
1087 | v |
1088 | +-----+---+-----+
1089 | | |
1090 +----->| Storage Tier |
1091 | |
1092 +---------------+
1093 Slower I/O
1094
1095See `Cache Tiering`_ for additional details.
1096
1097
1098.. index:: Extensibility, Ceph Classes
1099
1100Extending Ceph
1101--------------
1102
1103You can extend Ceph by creating shared object classes called 'Ceph Classes'.
1104Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
1105(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
1106can create new object methods that have the ability to call the native methods
1107in the Ceph Object Store, or other class methods you incorporate via libraries
1108or create yourself.
1109
1110On writes, Ceph Classes can call native or class methods, perform any series of
1111operations on the inbound data and generate a resulting write transaction that
1112Ceph will apply atomically.
1113
1114On reads, Ceph Classes can call native or class methods, perform any series of
1115operations on the outbound data and return the data to the client.
1116
1117.. topic:: Ceph Class Example
1118
1119 A Ceph class for a content management system that presents pictures of a
1120 particular size and aspect ratio could take an inbound bitmap image, crop it
1121 to a particular aspect ratio, resize it and embed an invisible copyright or
1122 watermark to help protect the intellectual property; then, save the
1123 resulting bitmap image to the object store.
1124
1125See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
1126exemplary implementations.
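
Once a class is loaded, any ``librados`` client can invoke its methods. The
following is an illustrative sketch only: it assumes a python-rados build that
exposes ``Ioctx.execute`` and an OSD whose class load list includes the sample
``hello`` class shipped in ``src/cls/hello``:

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')        # illustrative pool
    try:
        ioctx.write_full('greeting-obj', b'')   # object the class method runs on
        # Invoke the "say_hello" method of the "hello" object class on the OSD.
        ret, out = ioctx.execute('greeting-obj', 'hello', 'say_hello', b'')
        print(out)
    finally:
        ioctx.close()
        cluster.shutdown()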
1127
1128
1129Summary
1130-------
1131
Ceph Storage Clusters are dynamic--like a living organism. Whereas many storage
appliances do not fully utilize the CPU and RAM of a typical commodity server,
Ceph does. From heartbeats, to peering, to rebalancing the cluster, to
recovering from faults, Ceph offloads work from clients (and from a centralized
gateway, which doesn't exist in the Ceph architecture) and uses the computing
power of the OSDs to perform the work. When referring to `Hardware
Recommendations`_ and the `Network Config Reference`_, be cognizant of the
foregoing concepts to understand how Ceph utilizes computing resources.
1140
1141.. index:: Ceph Protocol, librados
1142
1143Ceph Protocol
1144=============
1145
1146Ceph Clients use the native protocol for interacting with the Ceph Storage
1147Cluster. Ceph packages this functionality into the ``librados`` library so that
1148you can create your own custom Ceph Clients. The following diagram depicts the
1149basic architecture.
1150
1151.. ditaa::
1152 +---------------------------------+
1153 | Ceph Storage Cluster Protocol |
1154 | (librados) |
1155 +---------------------------------+
1156 +---------------+ +---------------+
1157 | OSDs | | Monitors |
1158 +---------------+ +---------------+
1159
1160
1161Native Protocol and ``librados``
1162--------------------------------
1163
1164Modern applications need a simple object storage interface with asynchronous
1165communication capability. The Ceph Storage Cluster provides a simple object
1166storage interface with asynchronous communication capability. The interface
1167provides direct, parallel access to objects throughout the cluster.
1168
1169
1170- Pool Operations
1171- Snapshots and Copy-on-write Cloning
1172- Read/Write Objects
1173 - Create or Remove
1174 - Entire Object or Byte Range
1175 - Append or Truncate
1176- Create/Set/Get/Remove XATTRs
1177- Create/Set/Get/Remove Key/Value Pairs
1178- Compound operations and dual-ack semantics
1179- Object Classes
1180
1181
1182.. index:: architecture; watch/notify
1183
1184Object Watch/Notify
1185-------------------
1186
1187A client can register a persistent interest with an object and keep a session to
1188the primary OSD open. The client can send a notification message and a payload to
1189all watchers and receive notification when the watchers receive the
1190notification. This enables a client to use any object as a
1191synchronization/communication channel.
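
A sketch of this pattern with the Python ``rados`` bindings follows; it
assumes a python-rados build that exposes the watch/notify calls
(``Ioctx.watch`` and ``Ioctx.notify``) and uses an illustrative pool and
object name:

.. code-block:: python

    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')
    try:
        ioctx.write_full('sync-object', b'')    # the rendezvous object

        def on_notify(notify_id, notifier_id, watch_id, data):
            # Invoked for every notification sent on 'sync-object'.
            print('got notification:', data)

        watch = ioctx.watch('sync-object', on_notify)   # register interest
        ioctx.notify('sync-object', 'hello watchers')   # returns after acks
        time.sleep(1)                                   # let the callback run
        watch.close()
    finally:
        ioctx.close()
        cluster.shutdown()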
1192
1193
1194.. ditaa:: +----------+ +----------+ +----------+ +---------------+
1195 | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID |
1196 +----------+ +----------+ +----------+ +---------------+
1197 | | | |
1198 | | | |
1199 | | Watch Object | |
1200 |--------------------------------------------------->|
1201 | | | |
1202 |<---------------------------------------------------|
1203 | | Ack/Commit | |
1204 | | | |
1205 | | Watch Object | |
1206 | |---------------------------------->|
1207 | | | |
1208 | |<----------------------------------|
1209 | | Ack/Commit | |
1210 | | | Watch Object |
1211 | | |----------------->|
1212 | | | |
1213 | | |<-----------------|
1214 | | | Ack/Commit |
1215 | | Notify | |
1216 |--------------------------------------------------->|
1217 | | | |
1218 |<---------------------------------------------------|
1219 | | Notify | |
1220 | | | |
1221 | |<----------------------------------|
1222 | | Notify | |
1223 | | |<-----------------|
1224 | | | Notify |
1225 | | Ack | |
1226 |----------------+---------------------------------->|
1227 | | | |
1228 | | Ack | |
1229 | +---------------------------------->|
1230 | | | |
1231 | | | Ack |
1232 | | |----------------->|
1233 | | | |
1234 |<---------------+----------------+------------------|
1235 | Complete
1236
1237.. index:: architecture; Striping
1238
1239Data Striping
1240-------------
1241
1242Storage devices have throughput limitations, which impact performance and
1243scalability. So storage systems often support `striping`_--storing sequential
1244pieces of information across multiple storage devices--to increase throughput
1245and performance. The most common form of data striping comes from `RAID`_.
1246The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
1247volume'. Ceph's striping offers the throughput of RAID 0 striping, the
1248reliability of n-way RAID mirroring and faster recovery.
1249
1250Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and
1251Ceph Object Storage. A Ceph Client converts its data from the representation
1252format it provides to its users (a block device image, RESTful objects, CephFS
1253filesystem directories) into objects for storage in the Ceph Storage Cluster.
1254
1255.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
1256 Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their
1257 data over multiple Ceph Storage Cluster objects. Ceph Clients that write
1258 directly to the Ceph Storage Cluster via ``librados`` must perform the
1259 striping (and parallel I/O) for themselves to obtain these benefits.
1260
1261The simplest Ceph striping format involves a stripe count of 1 object. Ceph
1262Clients write stripe units to a Ceph Storage Cluster object until the object is
1263at its maximum capacity, and then create another object for additional stripes
1264of data. The simplest form of striping may be sufficient for small block device
1265images, S3 or Swift objects and CephFS files. However, this simple form doesn't
1266take maximum advantage of Ceph's ability to distribute data across placement
1267groups, and consequently doesn't improve performance very much. The following
1268diagram depicts the simplest form of striping:
1269
1270.. ditaa::
1271 +---------------+
1272 | Client Data |
1273 | Format |
1274 | cCCC |
1275 +---------------+
1276 |
1277 +--------+-------+
1278 | |
1279 v v
1280 /-----------\ /-----------\
1281 | Begin cCCC| | Begin cCCC|
1282 | Object 0 | | Object 1 |
1283 +-----------+ +-----------+
1284 | stripe | | stripe |
1285 | unit 1 | | unit 5 |
1286 +-----------+ +-----------+
1287 | stripe | | stripe |
1288 | unit 2 | | unit 6 |
1289 +-----------+ +-----------+
1290 | stripe | | stripe |
1291 | unit 3 | | unit 7 |
1292 +-----------+ +-----------+
1293 | stripe | | stripe |
1294 | unit 4 | | unit 8 |
1295 +-----------+ +-----------+
1296 | End cCCC | | End cCCC |
1297 | Object 0 | | Object 1 |
1298 \-----------/ \-----------/
1299
1300
If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
or large CephFS directories, you may see considerable read/write performance
improvements by striping client data over multiple objects within an object set.
Significant write performance gains occur when the client writes the stripe
units to their corresponding objects in parallel. Since objects get mapped to
different placement groups and further mapped to different OSDs, each write
occurs in parallel at the maximum write speed. A write to a single disk would be
limited by the head movement (e.g. 6ms per seek) and bandwidth of that one
device (e.g. 100MB/s). By spreading that write over multiple objects (which map
to different placement groups and OSDs) Ceph can reduce the number of seeks per
drive and combine the throughput of multiple drives to achieve much faster write
(or read) speeds.
1313
1314.. note:: Striping is independent of object replicas. Since CRUSH
1315 replicates objects across OSDs, stripes get replicated automatically.
1316
1317In the following diagram, client data gets striped across an object set
1318(``object set 1`` in the following diagram) consisting of 4 objects, where the
1319first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
1320unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
1321client determines if the object set is full. If the object set is not full, the
1322client begins writing a stripe to the first object again (``object 0`` in the
1323following diagram). If the object set is full, the client creates a new object
1324set (``object set 2`` in the following diagram), and begins writing to the first
1325stripe (``stripe unit 16``) in the first object in the new object set (``object
13264`` in the diagram below).
1327
1328.. ditaa::
1329 +---------------+
1330 | Client Data |
1331 | Format |
1332 | cCCC |
1333 +---------------+
1334 |
1335 +-----------------+--------+--------+-----------------+
1336 | | | | +--\
1337 v v v v |
1338 /-----------\ /-----------\ /-----------\ /-----------\ |
1339 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1340 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1341 +-----------+ +-----------+ +-----------+ +-----------+ |
1342 | stripe | | stripe | | stripe | | stripe | |
1343 | unit 0 | | unit 1 | | unit 2 | | unit 3 | |
1344 +-----------+ +-----------+ +-----------+ +-----------+ |
1345 | stripe | | stripe | | stripe | | stripe | +-\
1346 | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object
1347 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1348 | stripe | | stripe | | stripe | | stripe | | 1
1349 | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/
1350 +-----------+ +-----------+ +-----------+ +-----------+ |
1351 | stripe | | stripe | | stripe | | stripe | |
1352 | unit 12 | | unit 13 | | unit 14 | | unit 15 | |
1353 +-----------+ +-----------+ +-----------+ +-----------+ |
1354 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1355 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1356 \-----------/ \-----------/ \-----------/ \-----------/ |
1357 |
1358 +--/
1359
1360 +--\
1361 |
1362 /-----------\ /-----------\ /-----------\ /-----------\ |
1363 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1364 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1365 +-----------+ +-----------+ +-----------+ +-----------+ |
1366 | stripe | | stripe | | stripe | | stripe | |
1367 | unit 16 | | unit 17 | | unit 18 | | unit 19 | |
1368 +-----------+ +-----------+ +-----------+ +-----------+ |
1369 | stripe | | stripe | | stripe | | stripe | +-\
1370 | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object
1371 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1372 | stripe | | stripe | | stripe | | stripe | | 2
1373 | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/
1374 +-----------+ +-----------+ +-----------+ +-----------+ |
1375 | stripe | | stripe | | stripe | | stripe | |
1376 | unit 28 | | unit 29 | | unit 30 | | unit 31 | |
1377 +-----------+ +-----------+ +-----------+ +-----------+ |
1378 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1379 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1380 \-----------/ \-----------/ \-----------/ \-----------/ |
1381 |
1382 +--/
1383
1384Three important variables determine how Ceph stripes data:
1385
1386- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
1387 configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
1388 enough to accommodate many stripe units, and should be a multiple of
1389 the stripe unit.
1390
- **Stripe Width:** Stripes have a configurable unit size (e.g., 64 KB).
  The Ceph Client divides the data it will write to objects into equally
  sized stripe units, except for the last stripe unit. A stripe width
  should be a fraction of the Object Size so that an object may contain
  many stripe units.
1396
1397- **Stripe Count:** The Ceph Client writes a sequence of stripe units
1398 over a series of objects determined by the stripe count. The series
1399 of objects is called an object set. After the Ceph Client writes to
1400 the last object in the object set, it returns to the first object in
1401 the object set.
1402
1403.. important:: Test the performance of your striping configuration before
1404 putting your cluster into production. You CANNOT change these striping
1405 parameters after you stripe the data and write it to objects.
1406
1407Once the Ceph Client has striped data to stripe units and mapped the stripe
1408units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
1409and the placement groups to Ceph OSD Daemons before the objects are stored as
1410files on a storage disk.
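
The address arithmetic implied by the variables above can be sketched as
follows. This is not the client code itself, just the layout math: given an
object size, stripe unit and stripe count, it maps a logical byte offset to
the object (and offset within that object) that holds it, matching the object
set diagrams above:

.. code-block:: python

    def locate(offset, stripe_unit, stripe_count, object_size):
        """Map a logical byte offset to (object number, offset in object)."""
        su_per_object = object_size // stripe_unit  # stripe units per object
        block = offset // stripe_unit               # global stripe unit index
        stripe_no = block // stripe_count           # which stripe (row)
        object_set = stripe_no // su_per_object     # which object set
        obj_in_set = block % stripe_count           # which object in the set
        obj_no = object_set * stripe_count + obj_in_set
        off_in_obj = ((stripe_no % su_per_object) * stripe_unit
                      + offset % stripe_unit)
        return obj_no, off_in_obj

    # 4 MB objects, 1 MB stripe units, 4 objects per object set (illustrative):
    # stripe unit 5 lands in object 1, stripe unit 16 starts object set 2.
    print(locate(5 * 2**20, 2**20, 4, 4 * 2**20))   # -> (1, 1048576)
    print(locate(16 * 2**20, 2**20, 4, 4 * 2**20))  # -> (4, 0)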
1411
.. note:: Since a client writes to a single pool, all data striped into objects
   gets mapped to placement groups in the same pool, so it uses the same CRUSH
   map and the same access controls.
1415
1416
1417.. index:: architecture; Ceph Clients
1418
1419Ceph Clients
1420============
1421
1422Ceph Clients include a number of service interfaces. These include:
1423
1424- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
1425 provides resizable, thin-provisioned block devices with snapshotting and
1426 cloning. Ceph stripes a block device across the cluster for high
1427 performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
1428 that uses ``librbd`` directly--avoiding the kernel object overhead for
1429 virtualized systems.
1430
1431- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
1432 provides RESTful APIs with interfaces that are compatible with Amazon S3
1433 and OpenStack Swift.
1434
- **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides
  a POSIX compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).
1438
1439Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
1440and high availability. The following diagram depicts the high-level
1441architecture.
1442
1443.. ditaa::
1444 +--------------+ +----------------+ +-------------+
1445 | Block Device | | Object Storage | | Ceph FS |
1446 +--------------+ +----------------+ +-------------+
1447
1448 +--------------+ +----------------+ +-------------+
1449 | librbd | | librgw | | libcephfs |
1450 +--------------+ +----------------+ +-------------+
1451
1452 +---------------------------------------------------+
1453 | Ceph Storage Cluster Protocol (librados) |
1454 +---------------------------------------------------+
1455
1456 +---------------+ +---------------+ +---------------+
1457 | OSDs | | MDSs | | Monitors |
1458 +---------------+ +---------------+ +---------------+
1459
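As the diagram shows, each of these interfaces ultimately speaks the Ceph Storage
Cluster Protocol through ``librados``. The following minimal sketch uses the
``python-rados`` binding; the pool name ``mypool``, the object name, and the path
to ``ceph.conf`` are assumptions for illustration, not fixed names.

.. code-block:: python

    import rados

    # Connect to the cluster using an assumed configuration file and keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool that is assumed to exist already.
    ioctx = cluster.open_ioctx('mypool')

    # Write an object and read it back through librados.
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()
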
1460
1461.. index:: architecture; Ceph Object Storage
1462
1463Ceph Object Storage
1464-------------------
1465
1466The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides
1467a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph
1468Storage Cluster with its own data formats, and maintains its own user database,
1469authentication, and access control. The RADOS Gateway uses a unified namespace,
1470which means you can use either the OpenStack Swift-compatible API or the Amazon
1471S3-compatible API. For example, you can write data using the S3-compatible API
1472with one application and then read data using the Swift-compatible API with
1473another application.
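
Because both APIs address the same underlying namespace, an object written through
one interface can be read through the other. The sketch below is illustrative only;
the endpoint URL, credentials, and bucket/container name are assumptions, and the
``boto3`` and ``python-swiftclient`` libraries are third-party clients, not part of
Ceph itself.

.. code-block:: python

    import boto3
    from swiftclient.client import Connection

    # Write an object through the S3-compatible API (endpoint and keys assumed).
    s3 = boto3.client('s3',
                      endpoint_url='http://rgw.example.com:8080',
                      aws_access_key_id='S3_ACCESS_KEY',
                      aws_secret_access_key='S3_SECRET_KEY')
    s3.create_bucket(Bucket='demo')
    s3.put_object(Bucket='demo', Key='hello.txt', Body=b'written via S3')

    # Read the same object back through the Swift-compatible API.
    swift = Connection(authurl='http://rgw.example.com:8080/auth/v1.0',
                       user='demo:swift', key='SWIFT_SECRET_KEY')
    headers, body = swift.get_object('demo', 'hello.txt')
    print(body)  # b'written via S3'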
1474
1475.. topic:: S3/Swift Objects and Storage Cluster Objects Compared
1476
1477 Ceph's Object Storage uses the term *object* to describe the data it stores.
1478 S3 and Swift objects are not the same as the objects that Ceph writes to the
1479 Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
1480 Cluster objects. The S3 and Swift objects do not necessarily
1481 correspond in a 1:1 manner with an object stored in the storage cluster. It
1482 is possible for an S3 or Swift object to map to multiple Ceph objects.
1483
1484See `Ceph Object Storage`_ for details.
1485
1486
1487.. index:: Ceph Block Device; block device; RBD; Rados Block Device
1488
1489Ceph Block Device
1490-----------------
1491
1492A Ceph Block Device stripes a block device image over multiple objects in the
1493Ceph Storage Cluster, where each object gets mapped to a placement group and
1494distributed, and the placement groups are spread across separate ``ceph-osd``
1495daemons throughout the cluster.
1496
1497.. important:: Striping allows RBD block devices to perform better than a single
1498 server could!
1499
1500Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
1501virtualization and cloud computing. In virtual machine scenarios, people
1502typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
1503QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
1504service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
1505with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
1506``libvirt`` to support OpenStack and CloudStack among other solutions.
1507
1508While we do not provide ``librbd`` support with other hypervisors at this time,
1509you may also use Ceph Block Device kernel objects to provide a block device to a
1510client. Other virtualization technologies such as Xen can access the Ceph Block
1511Device kernel object(s). This is done with the command-line tool ``rbd``.
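
For illustration, thin-provisioned images can also be created and written
programmatically through the Python binding for ``librbd``. This is a minimal
sketch, assuming the ``python-rados`` and ``python-rbd`` packages are installed
and a pool named ``rbd`` exists; the image name and size are arbitrary.

.. code-block:: python

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')   # assumed pool name

    # Create a 4 GiB thin-provisioned image; space is consumed only as data is written.
    rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)

    # Write a few bytes at offset 0; librbd stripes the image across RADOS objects.
    with rbd.Image(ioctx, 'demo-image') as image:
        image.write(b'hello block device', 0)

    ioctx.close()
    cluster.shutdown()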
1512
1513
1514.. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds
1515
1516Ceph Filesystem
1517---------------
1518
1519The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a
1520service that is layered on top of the object-based Ceph Storage Cluster.
1521Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage
1522Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
1523a Filesystem in User Space (FUSE).
1524
1525.. ditaa::
1526 +-----------------------+ +------------------------+
1527 | CephFS Kernel Object | | CephFS FUSE |
1528 +-----------------------+ +------------------------+
1529
1530 +---------------------------------------------------+
1531 | Ceph FS Library (libcephfs) |
1532 +---------------------------------------------------+
1533
1534 +---------------------------------------------------+
1535 | Ceph Storage Cluster Protocol (librados) |
1536 +---------------------------------------------------+
1537
1538 +---------------+ +---------------+ +---------------+
1539 | OSDs | | MDSs | | Monitors |
1540 +---------------+ +---------------+ +---------------+
1541
1542
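File I/O against CephFS can also be driven through ``libcephfs`` directly, which
both the kernel client and the FUSE client wrap. This is a minimal sketch using
the ``python-cephfs`` binding; the configuration path and file name are
assumptions, and error handling is omitted.

.. code-block:: python

    import cephfs

    fs = cephfs.LibCephFS()
    fs.conf_read_file('/etc/ceph/ceph.conf')  # assumed config path
    fs.mount()

    # Create a file and write to it; the data becomes objects in the Ceph
    # Storage Cluster while the MDS records the metadata.
    fd = fs.open('/hello.txt', 'w', 0o644)
    fs.write(fd, b'hello cephfs', 0)
    fs.close(fd)

    fs.unmount()
    fs.shutdown()
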
1543The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed
1544with the Ceph Storage Cluster. The purpose of the MDS is to store all the
1545filesystem metadata (directories, file ownership, access modes, etc.) in
1546high-availability Ceph Metadata Servers where the metadata resides in memory.
1547The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
1548operations like listing a directory or changing a directory (``ls``, ``cd``)
1549would tax the Ceph OSD Daemons unnecessarily. Separating the metadata from the
1550data means that the Ceph Filesystem can provide high-performance services
1551without taxing the Ceph Storage Cluster.
1552
1553Ceph FS separates the metadata from the data, storing the metadata in the MDS,
1554and storing the file data in one or more objects in the Ceph Storage Cluster.
1555The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
1556single process, or it can be distributed out to multiple physical machines,
1557either for high availability or for scalability.
1558
1559- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
1560 ready to take over the duties of any failed ``ceph-mds`` that was
1561 `active`. This is easy because all the data, including the journal, is
1562 stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
1563
1564- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
1565 will split the directory tree into subtrees (and shards of a single
1566 busy directory), effectively balancing the load amongst all `active`
1567 servers.
1568
1569Combinations of `standby` and `active` MDS daemons are possible, for example
1570running three `active` ``ceph-mds`` instances for scaling and one `standby`
1571instance for high availability.
1572
1573
1574
1575
1576.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
1577.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
1578.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
1579.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
1580.. _Heartbeats: ../rados/configuration/mon-osd-interaction
1581.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
1582.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf
1583.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
1584.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
1585.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
1586.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
1587.. _Hardware Recommendations: ../start/hardware-recommendations
1588.. _Network Config Reference: ../rados/configuration/network-config-ref
1590.. _striping: http://en.wikipedia.org/wiki/Data_striping
1591.. _RAID: http://en.wikipedia.org/wiki/RAID
1592.. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
1593.. _Ceph Object Storage: ../radosgw/
1594.. _RESTful: http://en.wikipedia.org/wiki/RESTful
1595.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
1596.. _Cache Tiering: ../rados/operations/cache-tiering
1597.. _Set Pool Values: ../rados/operations/pools#set-pool-values
1598.. _Kerberos: http://en.wikipedia.org/wiki/Kerberos_(protocol)
1599.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
1600.. _User Management: ../rados/operations/user-management