1 ==============
2 Architecture
3 ==============
4
5 :term:`Ceph` uniquely delivers **object, block, and file storage** in one
6 unified system. Ceph is highly reliable, easy to manage, and free. The power of
7 Ceph can transform your company's IT infrastructure and your ability to manage
8 vast amounts of data. Ceph delivers extraordinary scalability--thousands of
9 clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
10 commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
11 accommodates large numbers of nodes, which communicate with each other to
12 replicate and redistribute data dynamically.
13
14 .. image:: images/stack.png
15
16
17 The Ceph Storage Cluster
18 ========================
19
20 Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
21 :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
22 about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
23 Storage Clusters`_.
24
25 A Ceph Storage Cluster consists of two types of daemons:
26
27 - :term:`Ceph Monitor`
28 - :term:`Ceph OSD Daemon`
29
30 .. ditaa::
31
32 +---------------+ +---------------+
33 | OSDs | | Monitors |
34 +---------------+ +---------------+
35
36 A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
37 monitors ensures high availability should a monitor daemon fail. Storage cluster
38 clients retrieve a copy of the cluster map from the Ceph Monitor.
39
40 A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
41 back to monitors.
42
43 Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
44 to efficiently compute information about data location, instead of having to
45 depend on a central lookup table. Ceph's high-level features include providing a
46 native interface to the Ceph Storage Cluster via ``librados``, and a number of
47 service interfaces built on top of ``librados``.
48
49
50
51 Storing Data
52 ------------
53
54 The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
55 comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
56 :term:`Ceph File System` or a custom implementation you create using
57 ``librados``--and it stores the data as objects. Each object corresponds to a
58 file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
59 OSD Daemons handle the read/write operations on the storage disks.
60
61 .. ditaa::
62
63 /-----\ +-----+ +-----+
64 | obj |------>| {d} |------>| {s} |
65 \-----/ +-----+ +-----+
66
67 Object File Disk
68
69 Ceph OSD Daemons store all data as objects in a flat namespace (i.e., no
70 hierarchy of directories). An object has an identifier, binary data, and
71 metadata consisting of a set of name/value pairs. The semantics are completely
72 up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
73 attributes such as the file owner, created date, last modified date, and so
74 forth.
75
76
77 .. ditaa::
78
79 /------+------------------------------+----------------\
80 | ID | Binary Data | Metadata |
81 +------+------------------------------+----------------+
82 | 1234 | 0101010101010100110101010010 | name1 = value1 |
83 | | 0101100001010100110101010010 | name2 = value2 |
84 | | 0101100001010100110101010010 | nameN = valueN |
85 \------+------------------------------+----------------/
86
87 .. note:: An object ID is unique across the entire cluster, not just the local
88 filesystem.
89
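Below is a minimal sketch of how a client might store such an object through the
Python binding of ``librados``. The pool name ``mypool`` and the object name are
arbitrary examples, and a reachable cluster with a local ``ceph.conf`` is assumed.

.. code-block:: python

    import rados

    # Connect to the cluster using a local ceph.conf (assumed to exist).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # 'mypool' is an example pool name; it must already exist.
    ioctx = cluster.open_ioctx('mypool')

    # An object: an identifier, binary data, and name/value metadata.
    ioctx.write_full('hello-object', b'binary payload')
    ioctx.set_xattr('hello-object', 'owner', b'example-user')

    ioctx.close()
    cluster.shutdown()
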
90
91 .. index:: architecture; high availability, scalability
92
93 Scalability and High Availability
94 ---------------------------------
95
96 In traditional architectures, clients talk to a centralized component (e.g., a
97 gateway, broker, API, facade, etc.), which acts as a single point of entry to a
98 complex subsystem. This imposes a limit on both performance and scalability,
99 while introducing a single point of failure (i.e., if the centralized component
100 goes down, the whole system goes down, too).
101
102 Ceph eliminates the centralized gateway to enable clients to interact with
103 Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
104 Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
105 of monitors to ensure high availability. To eliminate centralization, Ceph
106 uses an algorithm called CRUSH.
107
108
109 .. index:: CRUSH; architecture
110
111 CRUSH Introduction
112 ~~~~~~~~~~~~~~~~~~
113
114 Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
115 Replication Under Scalable Hashing)` algorithm to efficiently compute
116 information about object location, instead of having to depend on a
117 central lookup table. CRUSH provides a better data management mechanism compared
118 to older approaches, and enables massive scale by cleanly distributing the work
119 to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
120 replication to ensure resiliency, which is better suited to hyper-scale storage.
121 The following sections provide additional details on how CRUSH works. For a
122 detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
123 Placement of Replicated Data`_.
124
125 .. index:: architecture; cluster map
126
127 Cluster Map
128 ~~~~~~~~~~~
129
130 Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
131 cluster topology, which includes five maps collectively referred to as the
132 "Cluster Map":
133
134 #. **The Monitor Map:** Contains the cluster ``fsid``, the position, name,
135 address, and port of each monitor. It also indicates the current epoch,
136 when the map was created, and the last time it changed. To view a monitor
137 map, execute ``ceph mon dump``.
138
139 #. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
140 last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
141 and their status (e.g., ``up``, ``in``). To view an OSD map, execute
142 ``ceph osd dump``.
143
144 #. **The PG Map:** Contains the PG version, its time stamp, the last OSD
145 map epoch, the full ratios, and details on each placement group such as
146 the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
147 ``active + clean``), and data usage statistics for each pool.
148
149 #. **The CRUSH Map:** Contains a list of storage devices, the failure domain
150 hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
151 traversing the hierarchy when storing data. To view a CRUSH map, execute
152 ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
153 ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
154 You can view the decompiled map in a text editor or with ``cat``.
155
156 #. **The MDS Map:** Contains the current MDS map epoch, when the map was
157 created, and the last time it changed. It also contains the pool for
158 storing metadata, a list of metadata servers, and which metadata servers
159 are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
160
161 Each map maintains an iterative history of its operating state changes. Ceph
162 Monitors maintain a master copy of the cluster map including the cluster
163 members, state, changes, and the overall health of the Ceph Storage Cluster.
164
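The same information is available programmatically. The sketch below uses the
``librados`` Python binding's ``mon_command()`` call to fetch the OSD map as JSON
(the equivalent of ``ceph osd dump``); the specific JSON fields printed are
assumptions about the map's layout.

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for the OSD map, as 'ceph osd dump' does on the CLI.
    cmd = json.dumps({'prefix': 'osd dump', 'format': 'json'})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    osd_map = json.loads(outbuf)

    print('osdmap epoch:', osd_map.get('epoch'))
    print('pools:', [p.get('pool_name') for p in osd_map.get('pools', [])])

    cluster.shutdown()
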
165 .. index:: high availability; monitor architecture
166
167 High Availability Monitors
168 ~~~~~~~~~~~~~~~~~~~~~~~~~~
169
170 Before Ceph Clients can read or write data, they must contact a Ceph Monitor
171 to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
172 can operate with a single monitor; however, this introduces a single
173 point of failure (i.e., if the monitor goes down, Ceph Clients cannot
174 read or write data).
175
176 For added reliability and fault tolerance, Ceph supports a cluster of monitors.
177 In a cluster of monitors, latency and other faults can cause one or more
178 monitors to fall behind the current state of the cluster. For this reason, Ceph
179 must have agreement among various monitor instances regarding the state of the
180 cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
181 and the `Paxos`_ algorithm to establish a consensus among the monitors about the
182 current state of the cluster.
183
184 For details on configuring monitors, see the `Monitor Config Reference`_.
185
186 .. index:: architecture; high availability authentication
187
188 High Availability Authentication
189 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
190
191 To identify users and protect against man-in-the-middle attacks, Ceph provides
192 its ``cephx`` authentication system to authenticate users and daemons.
193
194 .. note:: The ``cephx`` protocol does not address data encryption in transport
195 (e.g., SSL/TLS) or encryption at rest.
196
197 Cephx uses shared secret keys for authentication, meaning both the client and
198 the monitor cluster have a copy of the client's secret key. The authentication
199 protocol is such that both parties are able to prove to each other they have a
200 copy of the key without actually revealing it. This provides mutual
201 authentication, which means the cluster is sure the user possesses the secret
202 key, and the user is sure that the cluster has a copy of the secret key.
203
204 A key scalability feature of Ceph is to avoid a centralized interface to the
205 Ceph object store, which means that Ceph clients must be able to interact with
206 OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
207 system, which authenticates users operating Ceph clients. The ``cephx`` protocol
208 operates in a manner similar to `Kerberos`_.
209
210 A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
211 monitor can authenticate users and distribute keys, so there is no single point
212 of failure or bottleneck when using ``cephx``. The monitor returns an
213 authentication data structure similar to a Kerberos ticket that contains a
214 session key for use in obtaining Ceph services. This session key is itself
215 encrypted with the user's permanent secret key, so that only the user can
216 request services from the Ceph Monitor(s). The client then uses the session key
217 to request its desired services from the monitor, and the monitor provides the
218 client with a ticket that will authenticate the client to the OSDs that actually
219 handle data. Ceph Monitors and OSDs share a secret, so the client can use the
220 ticket provided by the monitor with any OSD or metadata server in the cluster.
221 Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
222 ticket or session key obtained surreptitiously. This form of authentication will
223 prevent attackers with access to the communications medium from either creating
224 bogus messages under another user's identity or altering another user's
225 legitimate messages, as long as the user's secret key is not divulged before it
226 expires.
227
228 To use ``cephx``, an administrator must set up users first. In the following
229 diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
230 the command line to generate a username and secret key. Ceph's ``auth``
231 subsystem generates the username and key, stores a copy with the monitor(s) and
232 transmits the user's secret back to the ``client.admin`` user. This means that
233 the client and the monitor share a secret key.
234
235 .. note:: The ``client.admin`` user must provide the user ID and
236 secret key to the user in a secure manner.
237
238 .. ditaa::
239
240 +---------+ +---------+
241 | Client | | Monitor |
242 +---------+ +---------+
243 | request to |
244 | create a user |
245 |-------------->|----------+ create user
246 | | | and
247 |<--------------|<---------+ store key
248 | transmit key |
249 | |
250
251
252 To authenticate with the monitor, the client passes in the user name to the
253 monitor, and the monitor generates a session key and encrypts it with the secret
254 key associated with the user name. Then, the monitor transmits the encrypted
255 ticket back to the client. The client then decrypts the payload with the shared
256 secret key to retrieve the session key. The session key identifies the user for
257 the current session. The client then requests a ticket on behalf of the user
258 signed by the session key. The monitor generates a ticket, encrypts it with the
259 user's secret key and transmits it back to the client. The client decrypts the
260 ticket and uses it to sign requests to OSDs and metadata servers throughout the
261 cluster.
262
263 .. ditaa::
264
265 +---------+ +---------+
266 | Client | | Monitor |
267 +---------+ +---------+
268 | authenticate |
269 |-------------->|----------+ generate and
270 | | | encrypt
271 |<--------------|<---------+ session key
272 | transmit |
273 | encrypted |
274 | session key |
275 | |
276 |-----+ decrypt |
277 | | session |
278 |<----+ key |
279 | |
280 | req. ticket |
281 |-------------->|----------+ generate and
282 | | | encrypt
283 |<--------------|<---------+ ticket
284 | recv. ticket |
285 | |
286 |-----+ decrypt |
287 | | ticket |
288 |<----+ |
289
290
291 The ``cephx`` protocol authenticates ongoing communications between the client
292 machine and the Ceph servers. Each message sent between a client and server,
293 subsequent to the initial authentication, is signed using a ticket that the
294 monitors, OSDs and metadata servers can verify with their shared secret.
295
296 .. ditaa::
297
298 +---------+ +---------+ +-------+ +-------+
299 | Client | | Monitor | | MDS | | OSD |
300 +---------+ +---------+ +-------+ +-------+
301 | request to | | |
302 | create a user | | |
303 |-------------->| mon and | |
304 |<--------------| client share | |
305 | receive | a secret. | |
306 | shared secret | | |
307 | |<------------>| |
308 | |<-------------+------------>|
309 | | mon, mds, | |
310 | authenticate | and osd | |
311 |-------------->| share | |
312 |<--------------| a secret | |
313 | session key | | |
314 | | | |
315 | req. ticket | | |
316 |-------------->| | |
317 |<--------------| | |
318 | recv. ticket | | |
319 | | | |
320 | make request (CephFS only) | |
321 |----------------------------->| |
322 |<-----------------------------| |
323 | receive response (CephFS only) |
324 | |
325 | make request |
326 |------------------------------------------->|
327 |<-------------------------------------------|
328 receive response
329
330 The protection offered by this authentication is between the Ceph client and the
331 Ceph server hosts. The authentication is not extended beyond the Ceph client. If
332 the user accesses the Ceph client from a remote host, Ceph authentication is not
333 applied to the connection between the user's host and the client host.
334
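As a concrete example, the sketch below shows a ``librados`` client authenticating
to the cluster as a named cephx user. The user ``client.alice`` and the keyring
path are hypothetical; the key must already have been created (e.g., with
``ceph auth get-or-create``) and delivered to the client host.

.. code-block:: python

    import rados

    # Connect as a specific cephx user. The cephx handshake described above
    # happens inside connect(); the secret key itself is never sent over the wire.
    cluster = rados.Rados(
        name='client.alice',                                  # hypothetical user
        conffile='/etc/ceph/ceph.conf',
        conf={'keyring': '/etc/ceph/ceph.client.alice.keyring'},
    )
    cluster.connect()
    print('connected to cluster', cluster.get_fsid())
    cluster.shutdown()
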
335
336 For configuration details, see `Cephx Config Guide`_. For user management
337 details, see `User Management`_.
338
339
340 .. index:: architecture; smart daemons and scalability
341
342 Smart Daemons Enable Hyperscale
343 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
344
345 In many clustered architectures, the primary purpose of cluster membership is
346 so that a centralized interface knows which nodes it can access. Then the
347 centralized interface provides services to the client through a double
348 dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
349
350 Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
351 aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
352 Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
353 other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
354 to interact directly with Ceph OSD Daemons.
355
356 The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
357 each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
358 nodes to easily perform tasks that would bog down a centralized server. The
359 ability to leverage this computing power leads to several major benefits:
360
361 #. **OSDs Service Clients Directly:** Since any network device has a limit to
362 the number of concurrent connections it can support, a centralized system
363 has a low physical limit at high scales. By enabling Ceph Clients to contact
364 Ceph OSD Daemons directly, Ceph increases both performance and total system
365 capacity simultaneously, while removing a single point of failure. Ceph
366 Clients can maintain a session when they need to, and do so with a particular
367 Ceph OSD Daemon instead of a centralized server.
368
369 #. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
370 on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
371 or ``down`` reflecting whether or not it is running and able to service
372 Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
373 Storage Cluster, this status may indicate the failure of the Ceph OSD
374 Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
375 Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
376 periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
377 and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
378 message after a configurable period of time, then it marks the OSD down.
379 This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
380 determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
381 This assures that Ceph Monitors are lightweight processes. See `Monitoring
382 OSDs`_ and `Heartbeats`_ for additional details.
383
384 #. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
385 Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph
386 OSD Daemons can compare object metadata in one placement group with its
387 replicas in placement groups stored on other OSDs. Scrubbing (usually
388 performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also
389 perform deeper scrubbing by comparing data in objects bit-for-bit. Deep
390 scrubbing (usually performed weekly) finds bad sectors on a drive that
391 weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
392 configuring scrubbing.
393
394 #. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
395 algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
396 objects should be stored (and for rebalancing). In a typical write scenario,
397 a client uses the CRUSH algorithm to compute where to store an object, maps
398 the object to a pool and placement group, then looks at the CRUSH map to
399 identify the primary OSD for the placement group.
400
401 The client writes the object to the identified placement group in the
402 primary OSD. Then, the primary OSD with its own copy of the CRUSH map
403 identifies the secondary and tertiary OSDs for replication purposes, and
404 replicates the object to the appropriate placement groups in the secondary
405 and tertiary OSDs (as many OSDs as additional replicas), and responds to the
406 client once it has confirmed the object was stored successfully.
407
408 .. ditaa::
409
410 +----------+
411 | Client |
412 | |
413 +----------+
414 * ^
415 Write (1) | | Ack (6)
416 | |
417 v *
418 +-------------+
419 | Primary OSD |
420 | |
421 +-------------+
422 * ^ ^ *
423 Write (2) | | | | Write (3)
424 +------+ | | +------+
425 | +------+ +------+ |
426 | | Ack (4) Ack (5)| |
427 v * * v
428 +---------------+ +---------------+
429 | Secondary OSD | | Tertiary OSD |
430 | | | |
431 +---------------+ +---------------+
432
433 With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
434 clients from that duty, while ensuring high data availability and data safety.
435
436
437 Dynamic Cluster Management
438 --------------------------
439
440 In the `Scalability and High Availability`_ section, we explained how Ceph uses
441 CRUSH, cluster awareness and intelligent daemons to scale and maintain high
442 availability. Key to Ceph's design is the autonomous, self-healing, and
443 intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
444 enable modern cloud storage infrastructures to place data, rebalance the cluster
445 and recover from faults dynamically.
446
447 .. index:: architecture; pools
448
449 About Pools
450 ~~~~~~~~~~~
451
452 The Ceph storage system supports the notion of 'Pools', which are logical
453 partitions for storing objects.
454
455 Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
456 pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
457 number of placement groups determine how Ceph will place the data.
458
459 .. ditaa::
460
461 +--------+ Retrieves +---------------+
462 | Client |------------>| Cluster Map |
463 +--------+ +---------------+
464 |
465 v Writes
466 /-----\
467 | obj |
468 \-----/
469 | To
470 v
471 +--------+ +---------------+
472 | Pool |---------->| CRUSH Rule |
473 +--------+ Selects +---------------+
474
475
476 Pools set at least the following parameters:
477
478 - Ownership/Access to Objects
479 - The Number of Placement Groups, and
480 - The CRUSH Rule to Use.
481
482 See `Set Pool Values`_ for details.
483
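As an illustrative sketch, the ``librados`` Python binding can create a pool and
query one of these parameters. The pool name is an example, and the ``osd pool
get`` monitor command is assumed to be available in this form.

.. code-block:: python

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create an example pool if it does not exist yet.
    if not cluster.pool_exists('example-pool'):
        cluster.create_pool('example-pool')

    # Query the pool's replica count ('size'), one of the parameters above.
    cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'example-pool',
                      'var': 'size', 'format': 'json'})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    print(json.loads(outbuf))

    cluster.shutdown()
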
484
485 .. index:: architecture; placement group mapping
486
487 Mapping PGs to OSDs
488 ~~~~~~~~~~~~~~~~~~~
489
490 Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
491 When a Ceph Client stores objects, CRUSH will map each object to a placement
492 group.
493
494 Mapping objects to placement groups creates a layer of indirection between the
495 Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
496 grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
497 Client "knew" which Ceph OSD Daemon had which object, that would create a tight
498 coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
499 algorithm maps each object to a placement group and then maps each placement
500 group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
501 rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
502 come online. The following diagram depicts how CRUSH maps objects to placement
503 groups, and placement groups to OSDs.
504
505 .. ditaa::
506
507 /-----\ /-----\ /-----\ /-----\ /-----\
508 | obj | | obj | | obj | | obj | | obj |
509 \-----/ \-----/ \-----/ \-----/ \-----/
510 | | | | |
511 +--------+--------+ +---+----+
512 | |
513 v v
514 +-----------------------+ +-----------------------+
515 | Placement Group #1 | | Placement Group #2 |
516 | | | |
517 +-----------------------+ +-----------------------+
518 | |
519 | +-----------------------+---+
520 +------+------+-------------+ |
521 | | | |
522 v v v v
523 /----------\ /----------\ /----------\ /----------\
524 | | | | | | | |
525 | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
526 | | | | | | | |
527 \----------/ \----------/ \----------/ \----------/
528
529 With a copy of the cluster map and the CRUSH algorithm, the client can compute
530 exactly which OSD to use when reading or writing a particular object.
531
532 .. index:: architecture; calculating PG IDs
533
534 Calculating PG IDs
535 ~~~~~~~~~~~~~~~~~~
536
537 When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
538 `Cluster Map`_. With the cluster map, the client knows about all of the monitors,
539 OSDs, and metadata servers in the cluster. **However, it doesn't know anything
540 about object locations.**
541
542 .. epigraph::
543
544 Object locations get computed.
545
546
547 The only inputs required by the client are the object ID and the pool.
548 It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
549 wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
550 it calculates a placement group using the object name, a hash code, the
551 number of PGs in the pool and the pool name. Ceph clients use the following
552 steps to compute PG IDs.
553
554 #. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
555 and object-id = "john")
556 #. Ceph takes the object ID and hashes it.
557 #. Ceph calculates the hash modulo the number of PGs (e.g., ``58``) to get
558 a PG ID.
559 #. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
560 #. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
561
562 Computing object locations is much faster than performing object location
563 queries over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
564 Hashing)` algorithm allows a client to compute where objects *should* be stored,
565 and enables the client to contact the primary OSD to store or retrieve the
566 objects.
567
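The following is a simplified sketch of that calculation. Real Ceph clients use the
rjenkins hash and a "stable modulo" so that PGs split cleanly, so the numbers here
will not match a real cluster; only the shape of the computation is illustrated.

.. code-block:: python

    import zlib

    def compute_pg_id(pool_id: int, pg_num: int, object_name: str) -> str:
        """Toy version of the steps above: hash the object name, take it
        modulo the number of PGs, then prepend the pool ID."""
        object_hash = zlib.crc32(object_name.encode())   # stand-in for Ceph's hash
        pg = object_hash % pg_num                        # hash modulo number of PGs
        return f"{pool_id}.{pg:x}"                       # PG ID in 'pool.pg' form

    # Pool "liverpool" with ID 4 and 128 PGs, object "john" (names from the text).
    print(compute_pg_id(4, 128, "john"))
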
568 .. index:: architecture; PG Peering
569
570 Peering and Sets
571 ~~~~~~~~~~~~~~~~
572
573 In previous sections, we noted that Ceph OSD Daemons check each other's
574 heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
575 do is called 'peering', which is the process of bringing all of the OSDs that
576 store a Placement Group (PG) into agreement about the state of all of the
577 objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
578 Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
579 themselves; however, if the problem persists, you may need to refer to the
580 `Troubleshooting Peering Failure`_ section.
581
582 .. note:: Agreeing on the state does not mean that the PGs have the latest contents.
583
584 The Ceph Storage Cluster was designed to store at least two copies of an object
585 (i.e., ``size = 2``), which is the minimum requirement for data safety. For high
586 availability, a Ceph Storage Cluster should store more than two copies of an object
587 (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
588 ``degraded`` state while maintaining data safety.
589
590 Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
591 name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
592 rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
593 the *Primary* is the first OSD in the *Acting Set*, and is responsible for
594 coordinating the peering process for each placement group where it acts as
595 the *Primary*, and is the **ONLY** OSD that will accept client-initiated
596 writes to objects for a given placement group where it acts as the *Primary*.
597
598 We refer to the series of OSDs that are responsible for a placement group as
599 an *Acting Set*. An *Acting Set* may refer to the Ceph
600 OSD Daemons that are currently responsible for the placement group, or the Ceph
601 OSD Daemons that were responsible for a particular placement group as of some
602 epoch.
603
604 The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
605 When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
606 Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
607 Daemons when an OSD fails.
608
609 .. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
610 ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
611 the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
612 removed from the *Up Set*.
613
614
615 .. index:: architecture; Rebalancing
616
617 Rebalancing
618 ~~~~~~~~~~~
619
620 When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
621 updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
622 the cluster map. Consequently, it changes object placement, because it changes
623 an input for the calculations. The following diagram depicts the rebalancing
624 process (albeit rather crudely, since it is substantially less impactful with
625 large clusters) where some, but not all of the PGs migrate from existing OSDs
626 (OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
627 stable. Many of the placement groups remain in their original configuration,
628 and each OSD gets some added capacity, so there are no load spikes on the
629 new OSD after rebalancing is complete.
630
631
632 .. ditaa::
633
634 +--------+ +--------+
635 Before | OSD 1 | | OSD 2 |
636 +--------+ +--------+
637 | PG #1 | | PG #6 |
638 | PG #2 | | PG #7 |
639 | PG #3 | | PG #8 |
640 | PG #4 | | PG #9 |
641 | PG #5 | | PG #10 |
642 +--------+ +--------+
643
644 +--------+ +--------+ +--------+
645 After | OSD 1 | | OSD 2 | | OSD 3 |
646 +--------+ +--------+ +--------+
647 | PG #1 | | PG #7 | | PG #3 |
648 | PG #2 | | PG #8 | | PG #6 |
649 | PG #4 | | PG #10 | | PG #9 |
650 | PG #5 | | | | |
651 | | | | | |
652 +--------+ +--------+ +--------+
653
654
655 .. index:: architecture; Data Scrubbing
656
657 Data Consistency
658 ~~~~~~~~~~~~~~~~
659
660 As part of maintaining data consistency and cleanliness, Ceph OSDs can also
661 scrub objects within placement groups. That is, Ceph OSDs can compare object
662 metadata in one placement group with its replicas in placement groups stored in
663 other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
664 errors. OSDs can also perform deeper scrubbing by comparing data in objects
665 bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a
666 disk that weren't apparent in a light scrub.
667
668 See `Data Scrubbing`_ for details on configuring scrubbing.
669
670
671
672
673
674 .. index:: erasure coding
675
676 Erasure Coding
677 --------------
678
679 An erasure coded pool stores each object as ``K+M`` chunks: the object is divided
680 into ``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size
681 of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of
682 the chunk is stored as an attribute of the object.
683
684 For instance, an erasure coded pool can be created to use five OSDs (``K+M = 5``)
685 and to sustain the loss of two of them (``M = 2``).
686
687 Reading and Writing Encoded Chunks
688 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
689
690 When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure
691 encoding function splits the content into three data chunks simply by dividing
692 the content in three: the first contains ``ABC``, the second ``DEF`` and the
693 last ``GHI``. The content will be padded if the content length is not a multiple
694 of ``K``. The function also creates two coding chunks: the fourth with ``YXY``
695 and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting set.
696 The chunks are stored in objects that have the same name (**NYAN**) but reside
697 on different OSDs. The order in which the chunks were created must be preserved
698 and is stored as an attribute of the object (``shard_t``), in addition to its
699 name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains
700 ``YXY`` and is stored on **OSD3**.
701
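The splitting and padding of the data chunks can be illustrated in a few lines of
Python. This sketch covers only the ``K`` data chunks, not the mathematics that
produces the ``M`` coding chunks (which is handled by an erasure code plugin such
as jerasure).

.. code-block:: python

    def split_into_data_chunks(content: bytes, k: int) -> list:
        """Divide an object's content into K equally sized data chunks,
        padding the content if its length is not a multiple of K."""
        chunk_len = -(-len(content) // k)                # ceiling division
        padded = content.ljust(chunk_len * k, b'\x00')   # pad with zero bytes
        return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

    # The NYAN example from the text: K = 3 data chunks.
    print(split_into_data_chunks(b'ABCDEFGHI', 3))       # [b'ABC', b'DEF', b'GHI']
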
702
703 .. ditaa::
704
705 +-------------------+
706 name | NYAN |
707 +-------------------+
708 content | ABCDEFGHI |
709 +--------+----------+
710 |
711 |
712 v
713 +------+------+
714 +---------------+ encode(3,2) +-----------+
715 | +--+--+---+---+ |
716 | | | | |
717 | +-------+ | +-----+ |
718 | | | | |
719 +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
720 name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
721 +------+ +------+ +------+ +------+ +------+
722 shard | 1 | | 2 | | 3 | | 4 | | 5 |
723 +------+ +------+ +------+ +------+ +------+
724 content | ABC | | DEF | | GHI | | YXY | | QGC |
725 +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
726 | | | | |
727 | | v | |
728 | | +--+---+ | |
729 | | | OSD1 | | |
730 | | +------+ | |
731 | | | |
732 | | +------+ | |
733 | +------>| OSD2 | | |
734 | +------+ | |
735 | | |
736 | +------+ | |
737 | | OSD3 |<----+ |
738 | +------+ |
739 | |
740 | +------+ |
741 | | OSD4 |<--------------+
742 | +------+
743 |
744 | +------+
745 +----------------->| OSD5 |
746 +------+
747
748
749 When the object **NYAN** is read from the erasure coded pool, the decoding
750 function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
751 ``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content
752 of the object ``ABCDEFGHI``. The decoding function is informed that chunks 2
753 and 5 are missing (they are called 'erasures'). Chunk 5 could not be read
754 because **OSD4** is out. The decoding function can be called as soon as
755 three chunks are read: **OSD2** was the slowest and its chunk was not taken into
756 account.
757
758 .. ditaa::
759
760 +-------------------+
761 name | NYAN |
762 +-------------------+
763 content | ABCDEFGHI |
764 +---------+---------+
765 ^
766 |
767 |
768 +-------+-------+
769 | decode(3,2) |
770 +------------->+ erasures 2,5 +<-+
771 | | | |
772 | +-------+-------+ |
773 | ^ |
774 | | |
775 | | |
776 +--+---+ +------+ +---+--+ +---+--+
777 name | NYAN | | NYAN | | NYAN | | NYAN |
778 +------+ +------+ +------+ +------+
779 shard | 1 | | 2 | | 3 | | 4 |
780 +------+ +------+ +------+ +------+
781 content | ABC | | DEF | | GHI | | YXY |
782 +--+---+ +--+---+ +--+---+ +--+---+
783 ^ . ^ ^
784 | TOO . | |
785 | SLOW . +--+---+ |
786 | ^ | OSD1 | |
787 | | +------+ |
788 | | |
789 | | +------+ |
790 | +-------| OSD2 | |
791 | +------+ |
792 | |
793 | +------+ |
794 | | OSD3 |------+
795 | +------+
796 |
797 | +------+
798 | | OSD4 | OUT
799 | +------+
800 |
801 | +------+
802 +------------------| OSD5 |
803 +------+
804
805
806 Interrupted Full Writes
807 ~~~~~~~~~~~~~~~~~~~~~~~
808
809 In an erasure coded pool, the primary OSD in the up set receives all write
810 operations. It is responsible for encoding the payload into ``K+M`` chunks and
811 for sending them to the other OSDs. It is also responsible for maintaining an
812 authoritative version of the placement group logs.
813
814 In the following diagram, an erasure coded placement group has been created with
815 ``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one for
816 ``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and
817 **OSD 3**. An object has been encoded and stored in the OSDs: the chunk
818 ``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
819 **OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
820 placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
821 version 1).
822
823
824 .. ditaa::
825
826 Primary OSD
827
828 +-------------+
829 | OSD 1 | +-------------+
830 | log | Write Full | |
831 | +----+ |<------------+ Ceph Client |
832 | |D1v1| 1,1 | v1 | |
833 | +----+ | +-------------+
834 +------+------+
835 |
836 |
837 | +-------------+
838 | | OSD 2 |
839 | | log |
840 +--------->+ +----+ |
841 | | |D2v1| 1,1 |
842 | | +----+ |
843 | +-------------+
844 |
845 | +-------------+
846 | | OSD 3 |
847 | | log |
848 +--------->| +----+ |
849 | |C1v1| 1,1 |
850 | +----+ |
851 +-------------+
852
853 **OSD 1** is the primary and receives a **WRITE FULL** from a client, which
854 means the payload is to replace the object entirely instead of overwriting a
855 portion of it. Version 2 (v2) of the object is created to override version 1
856 (v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
857 chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
858 ``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent
859 to the target OSD, including the primary OSD which is responsible for storing
860 chunks in addition to handling write operations and maintaining an authoritative
861 version of the placement group logs. When an OSD receives the message
862 instructing it to write the chunk, it also creates a new entry in the placement
863 group logs to reflect the change. For instance, as soon as **OSD 3** stores
864 ``C1v2``, it adds the entry ``1,2`` (i.e. epoch 1, version 2) to its logs.
865 Because the OSDs work asynchronously, some chunks may still be in flight (such
866 as ``D2v2``) while others are acknowledged and on disk (such as ``C1v1`` and
867 ``D1v1``).
868
869 .. ditaa::
870
871 Primary OSD
872
873 +-------------+
874 | OSD 1 |
875 | log |
876 | +----+ | +-------------+
877 | |D1v2| 1,2 | Write Full | |
878 | +----+ +<------------+ Ceph Client |
879 | | v2 | |
880 | +----+ | +-------------+
881 | |D1v1| 1,1 |
882 | +----+ |
883 +------+------+
884 |
885 |
886 | +------+------+
887 | | OSD 2 |
888 | +------+ | log |
889 +->| D2v2 | | +----+ |
890 | +------+ | |D2v1| 1,1 |
891 | | +----+ |
892 | +-------------+
893 |
894 | +-------------+
895 | | OSD 3 |
896 | | log |
897 | | +----+ |
898 | | |C1v2| 1,2 |
899 +---------->+ +----+ |
900 | |
901 | +----+ |
902 | |C1v1| 1,1 |
903 | +----+ |
904 +-------------+
905
906
907 If all goes well, the chunks are acknowledged on each OSD in the acting set and
908 the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.
909
910 .. ditaa::
911
912 Primary OSD
913
914 +-------------+
915 | OSD 1 |
916 | log |
917 | +----+ | +-------------+
918 | |D1v2| 1,2 | Write Full | |
919 | +----+ +<------------+ Ceph Client |
920 | | v2 | |
921 | +----+ | +-------------+
922 | |D1v1| 1,1 |
923 | +----+ |
924 +------+------+
925 |
926 | +-------------+
927 | | OSD 2 |
928 | | log |
929 | | +----+ |
930 | | |D2v2| 1,2 |
931 +---------->+ +----+ |
932 | | |
933 | | +----+ |
934 | | |D2v1| 1,1 |
935 | | +----+ |
936 | +-------------+
937 |
938 | +-------------+
939 | | OSD 3 |
940 | | log |
941 | | +----+ |
942 | | |C1v2| 1,2 |
943 +---------->+ +----+ |
944 | |
945 | +----+ |
946 | |C1v1| 1,1 |
947 | +----+ |
948 +-------------+
949
950
951 Finally, the files used to store the chunks of the previous version of the
952 object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
953 on **OSD 3**.
954
955 .. ditaa::
956
957 Primary OSD
958
959 +-------------+
960 | OSD 1 |
961 | log |
962 | +----+ |
963 | |D1v2| 1,2 |
964 | +----+ |
965 +------+------+
966 |
967 |
968 | +-------------+
969 | | OSD 2 |
970 | | log |
971 +--------->+ +----+ |
972 | | |D2v2| 1,2 |
973 | | +----+ |
974 | +-------------+
975 |
976 | +-------------+
977 | | OSD 3 |
978 | | log |
979 +--------->| +----+ |
980 | |C1v2| 1,2 |
981 | +----+ |
982 +-------------+
983
984
985 But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
986 the object's version 2 is partially written: **OSD 3** has one chunk but that is
987 not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2``, and the
988 erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are
989 available to rebuild the third. **OSD 4** becomes the new primary and finds that
990 the ``last_complete`` log entry (i.e., all objects before this entry were known
991 to be available on all OSDs in the previous acting set) is ``1,1`` and that
992 will be the head of the new authoritative log.
993
994 .. ditaa::
995
996 +-------------+
997 | OSD 1 |
998 | (down) |
999 | c333 |
1000 +------+------+
1001 |
1002 | +-------------+
1003 | | OSD 2 |
1004 | | log |
1005 | | +----+ |
1006 +---------->+ |D2v1| 1,1 |
1007 | | +----+ |
1008 | | |
1009 | +-------------+
1010 |
1011 | +-------------+
1012 | | OSD 3 |
1013 | | log |
1014 | | +----+ |
1015 | | |C1v2| 1,2 |
1016 +---------->+ +----+ |
1017 | |
1018 | +----+ |
1019 | |C1v1| 1,1 |
1020 | +----+ |
1021 +-------------+
1022 Primary OSD
1023 +-------------+
1024 | OSD 4 |
1025 | log |
1026 | |
1027 | 1,1 |
1028 | |
1029 +------+------+
1030
1031
1032
1033 The log entry ``1,2`` found on **OSD 3** is divergent from the new authoritative log
1034 provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
1035 chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
1036 the erasure coding library during scrubbing and stored on the new primary
1037 **OSD 4**.
1038
1039
1040 .. ditaa::
1041
1042 Primary OSD
1043
1044 +-------------+
1045 | OSD 4 |
1046 | log |
1047 | +----+ |
1048 | |D1v1| 1,1 |
1049 | +----+ |
1050 +------+------+
1051 ^
1052 |
1053 | +-------------+
1054 | | OSD 2 |
1055 | | log |
1056 +----------+ +----+ |
1057 | | |D2v1| 1,1 |
1058 | | +----+ |
1059 | +-------------+
1060 |
1061 | +-------------+
1062 | | OSD 3 |
1063 | | log |
1064 +----------| +----+ |
1065 | |C1v1| 1,1 |
1066 | +----+ |
1067 +-------------+
1068
1069 +-------------+
1070 | OSD 1 |
1071 | (down) |
1072 | c333 |
1073 +-------------+
1074
1075 See `Erasure Code Notes`_ for additional details.
1076
1077
1078
1079 Cache Tiering
1080 -------------
1081
1082 A cache tier provides Ceph Clients with better I/O performance for a subset of
1083 the data stored in a backing storage tier. Cache tiering involves creating a
1084 pool of relatively fast/expensive storage devices (e.g., solid state drives)
1085 configured to act as a cache tier, and a backing pool of either erasure-coded
1086 or relatively slower/cheaper devices configured to act as an economical storage
1087 tier. The Ceph objecter handles where to place the objects and the tiering
1088 agent determines when to flush objects from the cache to the backing storage
1089 tier. So the cache tier and the backing storage tier are completely transparent
1090 to Ceph clients.
1091
1092
1093 .. ditaa::
1094
1095 +-------------+
1096 | Ceph Client |
1097 +------+------+
1098 ^
1099 Tiering is |
1100 Transparent | Faster I/O
1101 to Ceph | +---------------+
1102 Client Ops | | |
1103 | +----->+ Cache Tier |
1104 | | | |
1105 | | +-----+---+-----+
1106 | | | ^
1107 v v | | Active Data in Cache Tier
1108 +------+----+--+ | |
1109 | Objecter | | |
1110 +-----------+--+ | |
1111 ^ | | Inactive Data in Storage Tier
1112 | v |
1113 | +-----+---+-----+
1114 | | |
1115 +----->| Storage Tier |
1116 | |
1117 +---------------+
1118 Slower I/O
1119
1120 See `Cache Tiering`_ for additional details.
1121
1122
1123 .. index:: Extensibility, Ceph Classes
1124
1125 Extending Ceph
1126 --------------
1127
1128 You can extend Ceph by creating shared object classes called 'Ceph Classes'.
1129 Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
1130 (i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
1131 can create new object methods that have the ability to call the native methods
1132 in the Ceph Object Store, or other class methods you incorporate via libraries
1133 or create yourself.
1134
1135 On writes, Ceph Classes can call native or class methods, perform any series of
1136 operations on the inbound data and generate a resulting write transaction that
1137 Ceph will apply atomically.
1138
1139 On reads, Ceph Classes can call native or class methods, perform any series of
1140 operations on the outbound data and return the data to the client.
1141
1142 .. topic:: Ceph Class Example
1143
1144 A Ceph class for a content management system that presents pictures of a
1145 particular size and aspect ratio could take an inbound bitmap image, crop it
1146 to a particular aspect ratio, resize it and embed an invisible copyright or
1147 watermark to help protect the intellectual property; then, save the
1148 resulting bitmap image to the object store.
1149
1150 See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
1151 exemplary implementations.
1152
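From the client side, object class methods are invoked through ``librados``. The
sketch below uses the Python binding's ``execute()`` helper and assumes that the
demonstration ``hello`` object class and its ``say_hello`` method are available on
the OSDs; the pool and object names are examples.

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('example-pool')      # example pool name

    # Make sure the target object exists, then invoke a class method on it.
    ioctx.write_full('greeting', b'')
    ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'')
    print(out)

    ioctx.close()
    cluster.shutdown()
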
1153
1154 Summary
1155 -------
1156
1157 Ceph Storage Clusters are dynamic--like a living organism. Whereas many storage
1158 appliances do not fully utilize the CPU and RAM of a typical commodity server,
1159 Ceph does. From heartbeats, to peering, to rebalancing the cluster or
1160 recovering from faults, Ceph offloads work from clients (and from a centralized
1161 gateway which doesn't exist in the Ceph architecture) and uses the computing
1162 power of the OSDs to perform the work. When referring to `Hardware
1163 Recommendations`_ and the `Network Config Reference`_, be cognizant of the
1164 foregoing concepts to understand how Ceph utilizes computing resources.
1165
1166 .. index:: Ceph Protocol, librados
1167
1168 Ceph Protocol
1169 =============
1170
1171 Ceph Clients use the native protocol for interacting with the Ceph Storage
1172 Cluster. Ceph packages this functionality into the ``librados`` library so that
1173 you can create your own custom Ceph Clients. The following diagram depicts the
1174 basic architecture.
1175
1176 .. ditaa::
1177
1178 +---------------------------------+
1179 | Ceph Storage Cluster Protocol |
1180 | (librados) |
1181 +---------------------------------+
1182 +---------------+ +---------------+
1183 | OSDs | | Monitors |
1184 +---------------+ +---------------+
1185
1186
1187 Native Protocol and ``librados``
1188 --------------------------------
1189
1190 Modern applications need a simple object storage interface with asynchronous
1191 communication capability. The Ceph Storage Cluster provides such an interface,
1192 with direct, parallel access to objects throughout the cluster.
1194
1195
1196 - Pool Operations
1197 - Snapshots and Copy-on-write Cloning
1198 - Read/Write Objects
1199 - Create or Remove
1200 - Entire Object or Byte Range
1201 - Append or Truncate
1202 - Create/Set/Get/Remove XATTRs
1203 - Create/Set/Get/Remove Key/Value Pairs
1204 - Compound operations and dual-ack semantics
1205 - Object Classes
1206
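As a short sketch of the asynchronous side of this interface, the Python binding
exposes ``aio_*`` calls that complete via callbacks; the pool and object names
below are examples.

.. code-block:: python

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('example-pool')

    def on_complete(completion):
        # Called once the cluster has acknowledged the write.
        print('write complete')

    completion = ioctx.aio_write_full('async-object', b'payload',
                                      oncomplete=on_complete)
    completion.wait_for_complete()   # block here only for the sake of the example

    ioctx.close()
    cluster.shutdown()
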
1207
1208 .. index:: architecture; watch/notify
1209
1210 Object Watch/Notify
1211 -------------------
1212
1213 A client can register a persistent interest with an object and keep a session to
1214 the primary OSD open. A client can send a notification message and a payload to
1215 all watchers of an object and receive an acknowledgement when the watchers have
1216 received the notification. This enables a client to use any object as a
1217 synchronization/communication channel.
1218
1219
1220 .. ditaa::
1221
1222 +----------+ +----------+ +----------+ +---------------+
1223 | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID |
1224 +----------+ +----------+ +----------+ +---------------+
1225 | | | |
1226 | | | |
1227 | | Watch Object | |
1228 |--------------------------------------------------->|
1229 | | | |
1230 |<---------------------------------------------------|
1231 | | Ack/Commit | |
1232 | | | |
1233 | | Watch Object | |
1234 | |---------------------------------->|
1235 | | | |
1236 | |<----------------------------------|
1237 | | Ack/Commit | |
1238 | | | Watch Object |
1239 | | |----------------->|
1240 | | | |
1241 | | |<-----------------|
1242 | | | Ack/Commit |
1243 | | Notify | |
1244 |--------------------------------------------------->|
1245 | | | |
1246 |<---------------------------------------------------|
1247 | | Notify | |
1248 | | | |
1249 | |<----------------------------------|
1250 | | Notify | |
1251 | | |<-----------------|
1252 | | | Notify |
1253 | | Ack | |
1254 |----------------+---------------------------------->|
1255 | | | |
1256 | | Ack | |
1257 | +---------------------------------->|
1258 | | | |
1259 | | | Ack |
1260 | | |----------------->|
1261 | | | |
1262 |<---------------+----------------+------------------|
1263 | Complete
1264
1265 .. index:: architecture; Striping
1266
1267 Data Striping
1268 -------------
1269
1270 Storage devices have throughput limitations, which impact performance and
1271 scalability. So storage systems often support `striping`_--storing sequential
1272 pieces of information across multiple storage devices--to increase throughput
1273 and performance. The most common form of data striping comes from `RAID`_.
1274 The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
1275 volume'. Ceph's striping offers the throughput of RAID 0 striping, the
1276 reliability of n-way RAID mirroring and faster recovery.
1277
1278 Ceph provides three types of clients: Ceph Block Device, Ceph File System, and
1279 Ceph Object Storage. A Ceph Client converts its data from the representation
1280 format it provides to its users (a block device image, RESTful objects, CephFS
1281 filesystem directories) into objects for storage in the Ceph Storage Cluster.
1282
1283 .. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
1284 Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their
1285 data over multiple Ceph Storage Cluster objects. Ceph Clients that write
1286 directly to the Ceph Storage Cluster via ``librados`` must perform the
1287 striping (and parallel I/O) for themselves to obtain these benefits.
1288
1289 The simplest Ceph striping format involves a stripe count of 1 object. Ceph
1290 Clients write stripe units to a Ceph Storage Cluster object until the object is
1291 at its maximum capacity, and then create another object for additional stripes
1292 of data. The simplest form of striping may be sufficient for small block device
1293 images, S3 or Swift objects and CephFS files. However, this simple form doesn't
1294 take maximum advantage of Ceph's ability to distribute data across placement
1295 groups, and consequently doesn't improve performance very much. The following
1296 diagram depicts the simplest form of striping:
1297
1298 .. ditaa::
1299
1300 +---------------+
1301 | Client Data |
1302 | Format |
1303 | cCCC |
1304 +---------------+
1305 |
1306 +--------+-------+
1307 | |
1308 v v
1309 /-----------\ /-----------\
1310 | Begin cCCC| | Begin cCCC|
1311 | Object 0 | | Object 1 |
1312 +-----------+ +-----------+
1313 | stripe | | stripe |
1314 | unit 1 | | unit 5 |
1315 +-----------+ +-----------+
1316 | stripe | | stripe |
1317 | unit 2 | | unit 6 |
1318 +-----------+ +-----------+
1319 | stripe | | stripe |
1320 | unit 3 | | unit 7 |
1321 +-----------+ +-----------+
1322 | stripe | | stripe |
1323 | unit 4 | | unit 8 |
1324 +-----------+ +-----------+
1325 | End cCCC | | End cCCC |
1326 | Object 0 | | Object 1 |
1327 \-----------/ \-----------/
1328
1329
1330 If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
1331 or large CephFS directories, you may see considerable read/write performance
1332 improvements by striping client data over multiple objects within an object set.
1333 Significant write performance gains occur when the client writes the stripe units to
1334 their corresponding objects in parallel. Since objects get mapped to different
1335 placement groups and further mapped to different OSDs, each write occurs in
1336 parallel at the maximum write speed. A write to a single disk would be limited
1337 by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g.
1338 100MB/s). By spreading that write over multiple objects (which map to different
1339 placement groups and OSDs) Ceph can reduce the number of seeks per drive and
1340 combine the throughput of multiple drives to achieve much faster write (or read)
1341 speeds.
1342
1343 .. note:: Striping is independent of object replicas. Since CRUSH
1344 replicates objects across OSDs, stripes get replicated automatically.
1345
1346 In the following diagram, client data gets striped across an object set
1347 (``object set 1`` in the following diagram) consisting of 4 objects, where the
1348 first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
1349 unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
1350 client determines if the object set is full. If the object set is not full, the
1351 client begins writing a stripe to the first object again (``object 0`` in the
1352 following diagram). If the object set is full, the client creates a new object
1353 set (``object set 2`` in the following diagram), and begins writing to the first
1354 stripe (``stripe unit 16``) in the first object in the new object set (``object
1355 4`` in the diagram below).
1356
1357 .. ditaa::
1358
1359 +---------------+
1360 | Client Data |
1361 | Format |
1362 | cCCC |
1363 +---------------+
1364 |
1365 +-----------------+--------+--------+-----------------+
1366 | | | | +--\
1367 v v v v |
1368 /-----------\ /-----------\ /-----------\ /-----------\ |
1369 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1370 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1371 +-----------+ +-----------+ +-----------+ +-----------+ |
1372 | stripe | | stripe | | stripe | | stripe | |
1373 | unit 0 | | unit 1 | | unit 2 | | unit 3 | |
1374 +-----------+ +-----------+ +-----------+ +-----------+ |
1375 | stripe | | stripe | | stripe | | stripe | +-\
1376 | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object
1377 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1378 | stripe | | stripe | | stripe | | stripe | | 1
1379 | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/
1380 +-----------+ +-----------+ +-----------+ +-----------+ |
1381 | stripe | | stripe | | stripe | | stripe | |
1382 | unit 12 | | unit 13 | | unit 14 | | unit 15 | |
1383 +-----------+ +-----------+ +-----------+ +-----------+ |
1384 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1385 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1386 \-----------/ \-----------/ \-----------/ \-----------/ |
1387 |
1388 +--/
1389
1390 +--\
1391 |
1392 /-----------\ /-----------\ /-----------\ /-----------\ |
1393 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1394 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1395 +-----------+ +-----------+ +-----------+ +-----------+ |
1396 | stripe | | stripe | | stripe | | stripe | |
1397 | unit 16 | | unit 17 | | unit 18 | | unit 19 | |
1398 +-----------+ +-----------+ +-----------+ +-----------+ |
1399 | stripe | | stripe | | stripe | | stripe | +-\
1400 | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object
1401 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1402 | stripe | | stripe | | stripe | | stripe | | 2
1403 | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/
1404 +-----------+ +-----------+ +-----------+ +-----------+ |
1405 | stripe | | stripe | | stripe | | stripe | |
1406 | unit 28 | | unit 29 | | unit 30 | | unit 31 | |
1407 +-----------+ +-----------+ +-----------+ +-----------+ |
1408 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1409 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1410 \-----------/ \-----------/ \-----------/ \-----------/ |
1411 |
1412 +--/
1413
1414 Three important variables determine how Ceph stripes data:
1415
1416 - **Object Size:** Objects in the Ceph Storage Cluster have a maximum
1417 configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
1418 enough to accommodate many stripe units, and should be a multiple of
1419 the stripe unit.
1420
1421 - **Stripe Width:** Stripes have a configurable unit size (e.g., 64 KB).
1422 The Ceph Client divides the data it will write to objects into equally
1423 sized stripe units, except for the last stripe unit. The stripe width
1424 should be a fraction of the Object Size so that an object may contain
1425 many stripe units.
1426
1427 - **Stripe Count:** The Ceph Client writes a sequence of stripe units
1428 over a series of objects determined by the stripe count. The series
1429 of objects is called an object set. After the Ceph Client writes to
1430 the last object in the object set, it returns to the first object in
1431 the object set.
1432
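A small sketch of how these three variables interact when mapping a byte offset to
a stripe unit, assuming the simple round-robin layout shown in the diagrams above
(the function name and the concrete numbers are illustrative, not a Ceph API):

.. code-block:: python

    def locate_stripe_unit(offset, object_size, stripe_unit, stripe_count):
        """Map a byte offset in the client's data stream to (object number,
        offset within that object), following the round-robin layout above."""
        units_per_object = object_size // stripe_unit
        units_per_set = units_per_object * stripe_count   # one object set
        unit_index = offset // stripe_unit                # which stripe unit overall

        object_set = unit_index // units_per_set
        unit_in_set = unit_index % units_per_set
        object_in_set = unit_in_set % stripe_count        # round robin across objects
        object_number = object_set * stripe_count + object_in_set

        offset_in_object = ((unit_in_set // stripe_count) * stripe_unit
                            + offset % stripe_unit)
        return object_number, offset_in_object

    # With 4 stripe units per object and a stripe count of 4 (as in the diagram),
    # stripe unit 16 starts a new object set in object 4.
    print(locate_stripe_unit(16 * 65536, 4 * 65536, 65536, 4))   # (4, 0)
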
1433 .. important:: Test the performance of your striping configuration before
1434 putting your cluster into production. You CANNOT change these striping
1435 parameters after you stripe the data and write it to objects.
1436
1437 Once the Ceph Client has striped data to stripe units and mapped the stripe
1438 units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
1439 and the placement groups to Ceph OSD Daemons before the objects are stored as
1440 files on a storage disk.
1441
1442 .. note:: Since a client writes to a single pool, all data striped into objects
1443 gets mapped to placement groups in the same pool, using the same CRUSH
1444 map and the same access controls.
1445
1446
1447 .. index:: architecture; Ceph Clients
1448
1449 Ceph Clients
1450 ============
1451
1452 Ceph Clients include a number of service interfaces. These include:
1453
1454 - **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
1455 provides resizable, thin-provisioned block devices with snapshotting and
1456 cloning. Ceph stripes a block device across the cluster for high
1457 performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
1458 that uses ``librbd`` directly--avoiding the kernel object overhead for
1459 virtualized systems.
1460
1461 - **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
1462 provides RESTful APIs with interfaces that are compatible with Amazon S3
1463 and OpenStack Swift.
1464
- **Filesystem**: The :term:`Ceph File System` (CephFS) service provides
  a POSIX-compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).
1468
1469 Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
1470 and high availability. The following diagram depicts the high-level
1471 architecture.
1472
1473 .. ditaa::
1474
1475 +--------------+ +----------------+ +-------------+
1476 | Block Device | | Object Storage | | CephFS |
1477 +--------------+ +----------------+ +-------------+
1478
1479 +--------------+ +----------------+ +-------------+
1480 | librbd | | librgw | | libcephfs |
1481 +--------------+ +----------------+ +-------------+
1482
1483 +---------------------------------------------------+
1484 | Ceph Storage Cluster Protocol (librados) |
1485 +---------------------------------------------------+
1486
1487 +---------------+ +---------------+ +---------------+
1488 | OSDs | | MDSs | | Monitors |
1489 +---------------+ +---------------+ +---------------+
1490
1491
1492 .. index:: architecture; Ceph Object Storage
1493
1494 Ceph Object Storage
1495 -------------------
1496
The Ceph Object Storage daemon, ``radosgw``, is an HTTP service that provides
a RESTful_ API to store objects and metadata. It layers on top of the Ceph
1499 Storage Cluster with its own data formats, and maintains its own user database,
1500 authentication, and access control. The RADOS Gateway uses a unified namespace,
1501 which means you can use either the OpenStack Swift-compatible API or the Amazon
1502 S3-compatible API. For example, you can write data using the S3-compatible API
1503 with one application and then read data using the Swift-compatible API with
1504 another application.
1505
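The cross-API access described above can be sketched as follows: write an
object through the S3-compatible API with ``boto3``, then read it back
through the Swift-compatible API with ``python-swiftclient``. The endpoint
URL, credentials, bucket/container name, and subuser name are placeholders to
be replaced with values from your own gateway.

.. code-block:: python

   # Sketch of cross-protocol access through radosgw; the endpoint,
   # credentials, and names below are placeholders, not real values.
   import boto3
   import swiftclient

   ENDPOINT = 'http://rgw.example.com:8080'            # placeholder endpoint

   # Write an object with the S3-compatible API.
   s3 = boto3.client('s3',
                     endpoint_url=ENDPOINT,
                     aws_access_key_id='S3_ACCESS_KEY',      # placeholder
                     aws_secret_access_key='S3_SECRET_KEY')  # placeholder
   s3.create_bucket(Bucket='demo-bucket')
   s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'written via S3')

   # Read the same object back with the Swift-compatible API; radosgw exposes
   # the bucket as a Swift container in the same unified namespace.
   swift = swiftclient.Connection(authurl=ENDPOINT + '/auth/v1.0',
                                  user='demo:swift',         # placeholder subuser
                                  key='SWIFT_SECRET_KEY')    # placeholder
   headers, body = swift.get_object('demo-bucket', 'hello.txt')
   print(body)                                         # b'written via S3'
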
.. topic:: S3/Swift Objects and Storage Cluster Objects Compared
1507
1508 Ceph's Object Storage uses the term *object* to describe the data it stores.
1509 S3 and Swift objects are not the same as the objects that Ceph writes to the
1510 Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
1511 Cluster objects. The S3 and Swift objects do not necessarily
1512 correspond in a 1:1 manner with an object stored in the storage cluster. It
1513 is possible for an S3 or Swift object to map to multiple Ceph objects.
1514
1515 See `Ceph Object Storage`_ for details.
1516
1517
1518 .. index:: Ceph Block Device; block device; RBD; Rados Block Device
1519
1520 Ceph Block Device
1521 -----------------
1522
1523 A Ceph Block Device stripes a block device image over multiple objects in the
1524 Ceph Storage Cluster, where each object gets mapped to a placement group and
1525 distributed, and the placement groups are spread across separate ``ceph-osd``
1526 daemons throughout the cluster.
1527
1528 .. important:: Striping allows RBD block devices to perform better than a single
1529 server could!
1530
1531 Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
1532 virtualization and cloud computing. In virtual machine scenarios, people
1533 typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
1534 QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
1535 service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
1536 with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
1537 ``libvirt`` to support OpenStack and CloudStack among other solutions.
1538
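As a small illustration of the ``librbd`` path described above, the following
sketch creates a thin-provisioned image, writes to it, and takes a snapshot
using the ``rados`` and ``rbd`` Python bindings. The pool name, image name,
and size are placeholders, and the snippet assumes a reachable cluster and a
readable ``ceph.conf``.

.. code-block:: python

   # Minimal librbd sketch: create a thin-provisioned image, write to it,
   # and snapshot it.  Pool name, image name, and size are placeholders.
   import rados
   import rbd

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   try:
       ioctx = cluster.open_ioctx('rbd')                       # placeholder pool
       try:
           rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)  # 4 GiB, thin
           image = rbd.Image(ioctx, 'demo-image')
           try:
               image.write(b'hello from librbd', 0)            # write at offset 0
               image.create_snap('first-snap')                 # point-in-time snapshot
           finally:
               image.close()
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()
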
1539 While we do not provide ``librbd`` support with other hypervisors at this time,
1540 you may also use Ceph Block Device kernel objects to provide a block device to a
1541 client. Other virtualization technologies such as Xen can access the Ceph Block
1542 Device kernel object(s). This is done with the command-line tool ``rbd``.
1543
1544
1545 .. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds
1546
1547 .. _arch-cephfs:
1548
1549 Ceph File System
1550 ----------------
1551
1552 The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
1553 service that is layered on top of the object-based Ceph Storage Cluster.
1554 CephFS files get mapped to objects that Ceph stores in the Ceph Storage
1555 Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
1556 a Filesystem in User Space (FUSE).
1557
1558 .. ditaa::
1559
1560 +-----------------------+ +------------------------+
1561 | CephFS Kernel Object | | CephFS FUSE |
1562 +-----------------------+ +------------------------+
1563
1564 +---------------------------------------------------+
1565 | CephFS Library (libcephfs) |
1566 +---------------------------------------------------+
1567
1568 +---------------------------------------------------+
1569 | Ceph Storage Cluster Protocol (librados) |
1570 +---------------------------------------------------+
1571
1572 +---------------+ +---------------+ +---------------+
1573 | OSDs | | MDSs | | Monitors |
1574 +---------------+ +---------------+ +---------------+
1575
1576
The Ceph File System service includes the Ceph Metadata Server (MDS) deployed
with the Ceph Storage Cluster. The purpose of the MDS is to store all the
filesystem metadata (directories, file ownership, access modes, etc.) in
high-availability Ceph Metadata Servers, where the metadata resides in memory.
The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
operations like listing a directory or changing a directory (``ls``, ``cd``)
would otherwise tax the Ceph OSD Daemons unnecessarily. Separating the metadata
from the data means that the Ceph File System can provide high-performance
services without taxing the Ceph Storage Cluster.
1586
CephFS separates the metadata from the data, storing the metadata in the MDS,
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph File System aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed across multiple physical machines,
either for high availability or for scalability.
1592
1593 - **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
1594 ready to take over the duties of any failed ``ceph-mds`` that was
1595 `active`. This is easy because all the data, including the journal, is
1596 stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
1597
1598 - **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
1599 will split the directory tree into subtrees (and shards of a single
1600 busy directory), effectively balancing the load amongst all `active`
1601 servers.
1602
Combinations of `standby` and `active` daemons are possible, for example
running three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.
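
To show what the client side of this looks like in practice, here is a short
sketch using the ``cephfs`` Python binding (libcephfs). The configuration
path, directory, and file names are placeholders, and the snippet assumes a
running cluster with at least one `active` MDS.

.. code-block:: python

   # Minimal libcephfs sketch: mount CephFS, create a directory and a file,
   # read the file back, and unmount.  Paths and names are placeholders.
   import os
   import cephfs

   fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
   fs.mount()                                    # mount the default filesystem
   try:
       fs.mkdir('/demo', 0o755)                  # metadata handled by the MDS
       fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
       fs.write(fd, b'hello from libcephfs', 0)  # data stored as RADOS objects
       fs.close(fd)

       fd = fs.open('/demo/hello.txt', os.O_RDONLY, 0o644)
       print(fs.read(fd, 0, 1024))               # read up to 1024 bytes at offset 0
       fs.close(fd)
   finally:
       fs.unmount()
       fs.shutdown()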
1606
1607
1608
1609
1610 .. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
1611 .. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
1612 .. _Monitor Config Reference: ../rados/configuration/mon-config-ref
1613 .. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
1614 .. _Heartbeats: ../rados/configuration/mon-osd-interaction
1615 .. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
1616 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
1617 .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
1618 .. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
1619 .. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
1620 .. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
1621 .. _Hardware Recommendations: ../start/hardware-recommendations
1622 .. _Network Config Reference: ../rados/configuration/network-config-ref
1624 .. _striping: https://en.wikipedia.org/wiki/Data_striping
1625 .. _RAID: https://en.wikipedia.org/wiki/RAID
1626 .. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0
1627 .. _Ceph Object Storage: ../radosgw/
1628 .. _RESTful: https://en.wikipedia.org/wiki/RESTful
1629 .. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
1630 .. _Cache Tiering: ../rados/operations/cache-tiering
1631 .. _Set Pool Values: ../rados/operations/pools#set-pool-values
1632 .. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol)
1633 .. _Cephx Config Guide: ../rados/configuration/auth-config-ref
1634 .. _User Management: ../rados/operations/user-management