1 ==============
2 Architecture
3 ==============
4
5 :term:`Ceph` uniquely delivers **object, block, and file storage** in one
6 unified system. Ceph is highly reliable, easy to manage, and free. The power of
7 Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability: thousands of
9 clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
10 commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
11 accommodates large numbers of nodes, which communicate with each other to
12 replicate and redistribute data dynamically.
13
14 .. image:: images/stack.png
15
16
17 The Ceph Storage Cluster
18 ========================
19
20 Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
21 :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
22 about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
23 Storage Clusters`_.
24
25 A Ceph Storage Cluster consists of multiple types of daemons:
26
27 - :term:`Ceph Monitor`
28 - :term:`Ceph OSD Daemon`
29 - :term:`Ceph Manager`
30 - :term:`Ceph Metadata Server`
31
32 .. ditaa::
33
34 +---------------+ +---------------+ +---------------+ +---------------+
35 | OSDs | | Monitors | | Managers | | MDS |
36 +---------------+ +---------------+ +---------------+ +---------------+
37
38 A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
39 monitors ensures high availability should a monitor daemon fail. Storage cluster
40 clients retrieve a copy of the cluster map from the Ceph Monitor.
41
42 A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
43 back to monitors.
44
45 A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in
46 modules.
47
48 A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
49 provide file services.
50
51 Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
52 to efficiently compute information about data location, instead of having to
53 depend on a central lookup table. Ceph's high-level features include a
54 native interface to the Ceph Storage Cluster via ``librados``, and a number of
55 service interfaces built on top of ``librados``.
56
57
58
59 Storing Data
60 ------------
61
The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System`, or a custom implementation you create using
``librados``--and that data is stored as RADOS objects. Each object is stored on
an :term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and
replication operations on storage drives. With the older Filestore back end,
each RADOS object was stored as a separate file on a conventional filesystem
(usually XFS). With the default BlueStore back end, objects are stored in a
monolithic, database-like fashion.
71
72 .. ditaa::
73
74 /-----\ +-----+ +-----+
75 | obj |------>| {d} |------>| {s} |
76 \-----/ +-----+ +-----+
77
78 Object OSD Drive
79
Ceph OSD Daemons store data as objects in a flat namespace (that is, there is
no hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are entirely
up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
attributes such as the file owner, the creation date, and the last modification
date.
86
87
88 .. ditaa::
89
90 /------+------------------------------+----------------\
91 | ID | Binary Data | Metadata |
92 +------+------------------------------+----------------+
93 | 1234 | 0101010101010100110101010010 | name1 = value1 |
94 | | 0101100001010100110101010010 | name2 = value2 |
95 | | 0101100001010100110101010010 | nameN = valueN |
96 \------+------------------------------+----------------/
97
98 .. note:: An object ID is unique across the entire cluster, not just the local
99 filesystem.
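
To make this concrete, here is a minimal python-rados sketch that stores an
object and attaches name/value metadata as extended attributes. The pool name
``mypool``, the object name, and the attribute values are placeholders, and the
client must be able to reach a running cluster.

.. code-block:: python

    # Minimal sketch: store an object (ID + binary data) plus name/value metadata.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')        # placeholder pool name

    ioctx.write_full('john', b'0101010101010100110101010010')   # object ID + binary data
    ioctx.set_xattr('john', 'name1', b'value1')                  # metadata name/value pair
    ioctx.set_xattr('john', 'name2', b'value2')

    print(ioctx.get_xattr('john', 'name1'))                      # b'value1'

    ioctx.close()
    cluster.shutdown()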
100
101
102 .. index:: architecture; high availability, scalability
103
104 Scalability and High Availability
105 ---------------------------------
106
107 In traditional architectures, clients talk to a centralized component (e.g., a
108 gateway, broker, API, facade, etc.), which acts as a single point of entry to a
complex subsystem. This imposes a limit on both performance and scalability,
110 while introducing a single point of failure (i.e., if the centralized component
111 goes down, the whole system goes down, too).
112
113 Ceph eliminates the centralized gateway to enable clients to interact with
114 Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
115 Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
116 of monitors to ensure high availability. To eliminate centralization, Ceph
117 uses an algorithm called CRUSH.
118
119
120 .. index:: CRUSH; architecture
121
122 CRUSH Introduction
123 ~~~~~~~~~~~~~~~~~~
124
125 Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
126 Replication Under Scalable Hashing)` algorithm to efficiently compute
127 information about object location, instead of having to depend on a
128 central lookup table. CRUSH provides a better data management mechanism compared
129 to older approaches, and enables massive scale by cleanly distributing the work
130 to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
131 replication to ensure resiliency, which is better suited to hyper-scale storage.
132 The following sections provide additional details on how CRUSH works. For a
133 detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
134 Placement of Replicated Data`_.
135
136 .. index:: architecture; cluster map
137
138 Cluster Map
139 ~~~~~~~~~~~
140
141 Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
cluster topology, which includes five maps that are collectively referred to as
the "Cluster Map":
144
#. **The Monitor Map:** Contains the cluster ``fsid``, and the position, name,
   address, and port of each monitor. It also indicates the current epoch,
147 when the map was created, and the last time it changed. To view a monitor
148 map, execute ``ceph mon dump``.
149
150 #. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
151 last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
152 and their status (e.g., ``up``, ``in``). To view an OSD map, execute
153 ``ceph osd dump``.
154
155 #. **The PG Map:** Contains the PG version, its time stamp, the last OSD
156 map epoch, the full ratios, and details on each placement group such as
157 the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
158 ``active + clean``), and data usage statistics for each pool.
159
160 #. **The CRUSH Map:** Contains a list of storage devices, the failure domain
161 hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
162 traversing the hierarchy when storing data. To view a CRUSH map, execute
163 ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
164 ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
165 You can view the decompiled map in a text editor or with ``cat``.
166
167 #. **The MDS Map:** Contains the current MDS map epoch, when the map was
168 created, and the last time it changed. It also contains the pool for
169 storing metadata, a list of metadata servers, and which metadata servers
170 are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
171
172 Each map maintains an iterative history of its operating state changes. Ceph
173 Monitors maintain a master copy of the cluster map including the cluster
174 members, state, changes, and the overall health of the Ceph Storage Cluster.
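
The cluster map can also be inspected programmatically. The following is a
hedged sketch that uses the ``mon_command()`` call in python-rados to fetch the
monitor map in JSON form, roughly equivalent to running ``ceph mon dump
--format=json``; the configuration path is a placeholder.

.. code-block:: python

    # Sketch: fetch the monitor map as JSON through python-rados.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    cmd = json.dumps({'prefix': 'mon dump', 'format': 'json'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    monmap = json.loads(outbuf)
    print(monmap['epoch'], [m['name'] for m in monmap['mons']])

    cluster.shutdown()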
175
176 .. index:: high availability; monitor architecture
177
178 High Availability Monitors
179 ~~~~~~~~~~~~~~~~~~~~~~~~~~
180
181 Before Ceph Clients can read or write data, they must contact a Ceph Monitor
182 to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
183 can operate with a single monitor; however, this introduces a single
184 point of failure (i.e., if the monitor goes down, Ceph Clients cannot
185 read or write data).
186
187 For added reliability and fault tolerance, Ceph supports a cluster of monitors.
188 In a cluster of monitors, latency and other faults can cause one or more
189 monitors to fall behind the current state of the cluster. For this reason, Ceph
190 must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1 of 1, 2 of 3, 3 of 5, 4 of 6, etc.)
192 and the `Paxos`_ algorithm to establish a consensus among the monitors about the
193 current state of the cluster.
194
195 For details on configuring monitors, see the `Monitor Config Reference`_.
196
197 .. index:: architecture; high availability authentication
198
199 High Availability Authentication
200 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
201
202 To identify users and protect against man-in-the-middle attacks, Ceph provides
203 its ``cephx`` authentication system to authenticate users and daemons.
204
205 .. note:: The ``cephx`` protocol does not address data encryption in transport
206 (e.g., SSL/TLS) or encryption at rest.
207
208 Cephx uses shared secret keys for authentication, meaning both the client and
209 the monitor cluster have a copy of the client's secret key. The authentication
210 protocol is such that both parties are able to prove to each other they have a
211 copy of the key without actually revealing it. This provides mutual
212 authentication, which means the cluster is sure the user possesses the secret
213 key, and the user is sure that the cluster has a copy of the secret key.
214
215 A key scalability feature of Ceph is to avoid a centralized interface to the
216 Ceph object store, which means that Ceph clients must be able to interact with
217 OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
218 system, which authenticates users operating Ceph clients. The ``cephx`` protocol
operates in a manner similar to `Kerberos`_.
220
221 A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
222 monitor can authenticate users and distribute keys, so there is no single point
223 of failure or bottleneck when using ``cephx``. The monitor returns an
224 authentication data structure similar to a Kerberos ticket that contains a
225 session key for use in obtaining Ceph services. This session key is itself
226 encrypted with the user's permanent secret key, so that only the user can
227 request services from the Ceph Monitor(s). The client then uses the session key
228 to request its desired services from the monitor, and the monitor provides the
229 client with a ticket that will authenticate the client to the OSDs that actually
230 handle data. Ceph Monitors and OSDs share a secret, so the client can use the
231 ticket provided by the monitor with any OSD or metadata server in the cluster.
232 Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
233 ticket or session key obtained surreptitiously. This form of authentication will
234 prevent attackers with access to the communications medium from either creating
235 bogus messages under another user's identity or altering another user's
236 legitimate messages, as long as the user's secret key is not divulged before it
237 expires.
238
239 To use ``cephx``, an administrator must set up users first. In the following
240 diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
241 the command line to generate a username and secret key. Ceph's ``auth``
242 subsystem generates the username and key, stores a copy with the monitor(s) and
243 transmits the user's secret back to the ``client.admin`` user. This means that
244 the client and the monitor share a secret key.
245
246 .. note:: The ``client.admin`` user must provide the user ID and
247 secret key to the user in a secure manner.
248
249 .. ditaa::
250
251 +---------+ +---------+
252 | Client | | Monitor |
253 +---------+ +---------+
254 | request to |
255 | create a user |
256 |-------------->|----------+ create user
257 | | | and
258 |<--------------|<---------+ store key
259 | transmit key |
260 | |
261
262
263 To authenticate with the monitor, the client passes in the user name to the
264 monitor, and the monitor generates a session key and encrypts it with the secret
key associated with the user name. Then the monitor transmits the encrypted
266 ticket back to the client. The client then decrypts the payload with the shared
267 secret key to retrieve the session key. The session key identifies the user for
268 the current session. The client then requests a ticket on behalf of the user
269 signed by the session key. The monitor generates a ticket, encrypts it with the
270 user's secret key and transmits it back to the client. The client decrypts the
271 ticket and uses it to sign requests to OSDs and metadata servers throughout the
272 cluster.
273
274 .. ditaa::
275
276 +---------+ +---------+
277 | Client | | Monitor |
278 +---------+ +---------+
279 | authenticate |
280 |-------------->|----------+ generate and
281 | | | encrypt
282 |<--------------|<---------+ session key
283 | transmit |
284 | encrypted |
285 | session key |
286 | |
287 |-----+ decrypt |
288 | | session |
289 |<----+ key |
290 | |
291 | req. ticket |
292 |-------------->|----------+ generate and
293 | | | encrypt
294 |<--------------|<---------+ ticket
295 | recv. ticket |
296 | |
297 |-----+ decrypt |
298 | | ticket |
299 |<----+ |
300
301
302 The ``cephx`` protocol authenticates ongoing communications between the client
303 machine and the Ceph servers. Each message sent between a client and server,
304 subsequent to the initial authentication, is signed using a ticket that the
305 monitors, OSDs and metadata servers can verify with their shared secret.
306
307 .. ditaa::
308
309 +---------+ +---------+ +-------+ +-------+
310 | Client | | Monitor | | MDS | | OSD |
311 +---------+ +---------+ +-------+ +-------+
312 | request to | | |
313 | create a user | | |
314 |-------------->| mon and | |
315 |<--------------| client share | |
316 | receive | a secret. | |
317 | shared secret | | |
318 | |<------------>| |
319 | |<-------------+------------>|
320 | | mon, mds, | |
321 | authenticate | and osd | |
322 |-------------->| share | |
323 |<--------------| a secret | |
324 | session key | | |
325 | | | |
326 | req. ticket | | |
327 |-------------->| | |
328 |<--------------| | |
329 | recv. ticket | | |
330 | | | |
331 | make request (CephFS only) | |
332 |----------------------------->| |
333 |<-----------------------------| |
334 | receive response (CephFS only) |
335 | |
336 | make request |
337 |------------------------------------------->|
338 |<-------------------------------------------|
339 receive response
340
341 The protection offered by this authentication is between the Ceph client and the
342 Ceph server hosts. The authentication is not extended beyond the Ceph client. If
343 the user accesses the Ceph client from a remote host, Ceph authentication is not
344 applied to the connection between the user's host and the client host.
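
From a client's point of view, using ``cephx`` mostly amounts to supplying a
user name and the matching secret key when connecting. The following is a
hedged python-rados sketch; the user name and keyring path are placeholders for
credentials that an administrator created beforehand (for example with
``ceph auth get-or-create``).

.. code-block:: python

    # Sketch: connect as a specific cephx user; the name and keyring path are
    # placeholders for credentials created by an administrator.
    import rados

    cluster = rados.Rados(
        name='client.foo',
        conffile='/etc/ceph/ceph.conf',
        conf={'keyring': '/etc/ceph/ceph.client.foo.keyring'},
    )
    cluster.connect()           # the cephx handshake with the monitors happens here
    print(cluster.get_fsid())   # a successful call shows the session is established
    cluster.shutdown()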
345
346
347 For configuration details, see `Cephx Config Guide`_. For user management
348 details, see `User Management`_.
349
350
351 .. index:: architecture; smart daemons and scalability
352
353 Smart Daemons Enable Hyperscale
354 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
355
In many clustered architectures, the primary purpose of cluster membership is
to let a centralized interface know which nodes it can access. Then the
358 centralized interface provides services to the client through a double
359 dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
360
361 Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
362 aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
363 Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
364 other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
365 to interact directly with Ceph OSD Daemons.
366
367 The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
368 each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
369 nodes to easily perform tasks that would bog down a centralized server. The
370 ability to leverage this computing power leads to several major benefits:
371
372 #. **OSDs Service Clients Directly:** Since any network device has a limit to
373 the number of concurrent connections it can support, a centralized system
374 has a low physical limit at high scales. By enabling Ceph Clients to contact
375 Ceph OSD Daemons directly, Ceph increases both performance and total system
376 capacity simultaneously, while removing a single point of failure. Ceph
   Clients can maintain a session when they need to, and they maintain it with a
   particular Ceph OSD Daemon instead of with a centralized server.
379
380 #. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
381 on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
382 or ``down`` reflecting whether or not it is running and able to service
383 Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
384 Storage Cluster, this status may indicate the failure of the Ceph OSD
385 Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
386 Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
387 periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
   and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor does not see that
   message within a configurable period of time, it marks the OSD ``down``.
390 This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
391 determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
392 This assures that Ceph Monitors are lightweight processes. See `Monitoring
393 OSDs`_ and `Heartbeats`_ for additional details.
394
395 #. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
396 Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
   the metadata of their local objects with the metadata of those objects' replicas
   stored on other OSDs. Scrubbing happens on a per-placement-group basis. Scrubbing
   (usually performed daily) catches mismatches in size and other metadata. Ceph OSD
   Daemons also perform deeper
400 scrubbing by comparing data in objects bit-for-bit with their checksums.
401 Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
402 weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
403 configuring scrubbing.
404
405 #. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
406 algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
407 objects should be stored (and for rebalancing). In a typical write scenario,
408 a client uses the CRUSH algorithm to compute where to store an object, maps
409 the object to a pool and placement group, then looks at the CRUSH map to
410 identify the primary OSD for the placement group.
411
412 The client writes the object to the identified placement group in the
   primary OSD. Then the primary OSD, using its own copy of the CRUSH map,
414 identifies the secondary and tertiary OSDs for replication purposes, and
415 replicates the object to the appropriate placement groups in the secondary
416 and tertiary OSDs (as many OSDs as additional replicas), and responds to the
417 client once it has confirmed the object was stored successfully.
418
419 .. ditaa::
420
421 +----------+
422 | Client |
423 | |
424 +----------+
425 * ^
426 Write (1) | | Ack (6)
427 | |
428 v *
429 +-------------+
430 | Primary OSD |
431 | |
432 +-------------+
433 * ^ ^ *
434 Write (2) | | | | Write (3)
435 +------+ | | +------+
436 | +------+ +------+ |
437 | | Ack (4) Ack (5)| |
438 v * * v
439 +---------------+ +---------------+
440 | Secondary OSD | | Tertiary OSD |
441 | | | |
442 +---------------+ +---------------+
443
With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
Clients of that duty, while ensuring high data availability and data safety.
446
447
448 Dynamic Cluster Management
449 --------------------------
450
451 In the `Scalability and High Availability`_ section, we explained how Ceph uses
452 CRUSH, cluster awareness and intelligent daemons to scale and maintain high
453 availability. Key to Ceph's design is the autonomous, self-healing, and
454 intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
455 enable modern cloud storage infrastructures to place data, rebalance the cluster
456 and recover from faults dynamically.
457
458 .. index:: architecture; pools
459
460 About Pools
461 ~~~~~~~~~~~
462
463 The Ceph storage system supports the notion of 'Pools', which are logical
464 partitions for storing objects.
465
466 Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
467 pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
468 number of placement groups determine how Ceph will place the data.
469
470 .. ditaa::
471
472 +--------+ Retrieves +---------------+
473 | Client |------------>| Cluster Map |
474 +--------+ +---------------+
475 |
476 v Writes
477 /-----\
478 | obj |
479 \-----/
480 | To
481 v
482 +--------+ +---------------+
483 | Pool |---------->| CRUSH Rule |
484 +--------+ Selects +---------------+
485
486
487 Pools set at least the following parameters:
488
489 - Ownership/Access to Objects
490 - The Number of Placement Groups, and
491 - The CRUSH Rule to Use.
492
493 See `Set Pool Values`_ for details.
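
Pools can also be listed and created through ``librados``. The following is a
hedged python-rados sketch; the pool name is a placeholder, and pool creation in
practice usually also involves choosing placement group counts and a CRUSH rule
through the ``ceph osd pool`` commands.

.. code-block:: python

    # Sketch: list pools, create one if needed, and write into it.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    print(cluster.list_pools())            # e.g. ['rbd', 'mypool', ...]
    if not cluster.pool_exists('mypool'):
        cluster.create_pool('mypool')      # uses the cluster's defaults

    ioctx = cluster.open_ioctx('mypool')   # I/O context bound to the pool
    ioctx.write_full('obj', b'data')
    ioctx.close()
    cluster.shutdown()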
494
495
496 .. index: architecture; placement group mapping
497
498 Mapping PGs to OSDs
499 ~~~~~~~~~~~~~~~~~~~
500
501 Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
502 When a Ceph Client stores objects, CRUSH will map each object to a placement
503 group.
504
505 Mapping objects to placement groups creates a layer of indirection between the
506 Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
507 grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
508 Client "knew" which Ceph OSD Daemon had which object, that would create a tight
509 coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
510 algorithm maps each object to a placement group and then maps each placement
511 group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
512 rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
513 come online. The following diagram depicts how CRUSH maps objects to placement
514 groups, and placement groups to OSDs.
515
516 .. ditaa::
517
518 /-----\ /-----\ /-----\ /-----\ /-----\
519 | obj | | obj | | obj | | obj | | obj |
520 \-----/ \-----/ \-----/ \-----/ \-----/
521 | | | | |
522 +--------+--------+ +---+----+
523 | |
524 v v
525 +-----------------------+ +-----------------------+
526 | Placement Group #1 | | Placement Group #2 |
527 | | | |
528 +-----------------------+ +-----------------------+
529 | |
530 | +-----------------------+---+
531 +------+------+-------------+ |
532 | | | |
533 v v v v
534 /----------\ /----------\ /----------\ /----------\
535 | | | | | | | |
536 | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
537 | | | | | | | |
538 \----------/ \----------/ \----------/ \----------/
539
540 With a copy of the cluster map and the CRUSH algorithm, the client can compute
541 exactly which OSD to use when reading or writing a particular object.
542
543 .. index:: architecture; calculating PG IDs
544
545 Calculating PG IDs
546 ~~~~~~~~~~~~~~~~~~
547
548 When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
549 `Cluster Map`_. With the cluster map, the client knows about all of the monitors,
550 OSDs, and metadata servers in the cluster. **However, it doesn't know anything
551 about object locations.**
552
553 .. epigraph::
554
555 Object locations get computed.
556
557
The only inputs required by the client are the object ID and the pool.
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
wants to store a named object (e.g., "john", "paul", "george", "ringo", etc.),
it calculates a placement group using the object name, a hash code, the
number of PGs in the pool, and the pool name. Ceph clients use the following
steps to compute PG IDs (a simplified sketch follows the list).
564
#. The client inputs the pool name and the object ID (e.g., pool = "liverpool",
   object-id = "john").
#. Ceph takes the object ID and hashes it.
#. Ceph calculates the hash, modulo the number of PGs (e.g., ``58``), to get
   a PG ID.
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
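
The following Python sketch mirrors these steps. It is illustrative only: a
real Ceph client uses the rjenkins object-name hash and a stable-modulo
function from ``librados``, whereas this sketch substitutes a generic CRC32
hash, so the resulting values differ from Ceph's.

.. code-block:: python

    # Simplified sketch of the PG ID calculation; the hash function is a
    # stand-in for Ceph's rjenkins hash, so the output is illustrative only.
    import zlib

    def compute_pg_id(pool_id, pg_num, object_name):
        obj_hash = zlib.crc32(object_name.encode('utf-8'))   # hash the object ID
        pg = obj_hash % pg_num                               # hash modulo the number of PGs
        return '{0}.{1:x}'.format(pool_id, pg)               # prepend the pool ID

    print(compute_pg_id(4, 128, 'john'))   # e.g. '4.3a' (pool ID 4, PG in hex)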
572
Computing object locations is much faster than performing object location queries
574 over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
575 Hashing)` algorithm allows a client to compute where objects *should* be stored,
576 and enables the client to contact the primary OSD to store or retrieve the
577 objects.
578
579 .. index:: architecture; PG Peering
580
581 Peering and Sets
582 ~~~~~~~~~~~~~~~~
583
In previous sections, we noted that Ceph OSD Daemons check each other's
585 heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
586 do is called 'peering', which is the process of bringing all of the OSDs that
587 store a Placement Group (PG) into agreement about the state of all of the
588 objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
589 Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
590 themselves; however, if the problem persists, you may need to refer to the
591 `Troubleshooting Peering Failure`_ section.
592
593 .. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
594
595 The Ceph Storage Cluster was designed to store at least two copies of an object
596 (i.e., ``size = 2``), which is the minimum requirement for data safety. For high
597 availability, a Ceph Storage Cluster should store more than two copies of an object
598 (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
599 ``degraded`` state while maintaining data safety.
600
601 Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
602 name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
603 rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
604 the *Primary* is the first OSD in the *Acting Set*, and is responsible for
605 coordinating the peering process for each placement group where it acts as
the *Primary*, and is the **ONLY** OSD that will accept client-initiated
607 writes to objects for a given placement group where it acts as the *Primary*.
608
When a series of OSDs is responsible for a placement group, we refer to that
series of OSDs as an *Acting Set*. An *Acting Set* may refer to the Ceph
611 OSD Daemons that are currently responsible for the placement group, or the Ceph
612 OSD Daemons that were responsible for a particular placement group as of some
613 epoch.
614
615 The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
616 When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
617 Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
618 Daemons when an OSD fails.
619
620 .. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
621 ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
622 the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
623 removed from the *Up Set*.
624
625
626 .. index:: architecture; Rebalancing
627
628 Rebalancing
629 ~~~~~~~~~~~
630
631 When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
632 updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
633 the cluster map. Consequently, it changes object placement, because it changes
634 an input for the calculations. The following diagram depicts the rebalancing
635 process (albeit rather crudely, since it is substantially less impactful with
636 large clusters) where some, but not all of the PGs migrate from existing OSDs
637 (OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
638 stable. Many of the placement groups remain in their original configuration,
639 and each OSD gets some added capacity, so there are no load spikes on the
640 new OSD after rebalancing is complete.
641
642
643 .. ditaa::
644
645 +--------+ +--------+
646 Before | OSD 1 | | OSD 2 |
647 +--------+ +--------+
648 | PG #1 | | PG #6 |
649 | PG #2 | | PG #7 |
650 | PG #3 | | PG #8 |
651 | PG #4 | | PG #9 |
652 | PG #5 | | PG #10 |
653 +--------+ +--------+
654
655 +--------+ +--------+ +--------+
656 After | OSD 1 | | OSD 2 | | OSD 3 |
657 +--------+ +--------+ +--------+
658 | PG #1 | | PG #7 | | PG #3 |
659 | PG #2 | | PG #8 | | PG #6 |
660 | PG #4 | | PG #10 | | PG #9 |
661 | PG #5 | | | | |
662 | | | | | |
663 +--------+ +--------+ +--------+
664
665
666 .. index:: architecture; Data Scrubbing
667
668 Data Consistency
669 ~~~~~~~~~~~~~~~~
670
671 As part of maintaining data consistency and cleanliness, Ceph OSDs also scrub
672 objects within placement groups. That is, Ceph OSDs compare object metadata in
673 one placement group with its replicas in placement groups stored in other
674 OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
675 errors, often as a result of hardware issues. OSDs also perform deeper
676 scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (by default
677 performed weekly) finds bad blocks on a drive that weren't apparent in a light
678 scrub.
679
680 See `Data Scrubbing`_ for details on configuring scrubbing.
681
682
683
684
685
686 .. index:: erasure coding
687
688 Erasure Coding
689 --------------
690
An erasure coded pool stores each object as ``K+M`` chunks: the object is divided
into ``K`` data chunks and ``M`` coding chunks. The pool is configured to have a
size of ``K+M`` so that each chunk is stored on an OSD in the acting set. The rank
of the chunk is stored as an attribute of the object.

For instance, an erasure coded pool can be created to use five OSDs (``K+M = 5``)
and to sustain the loss of two of them (``M = 2``).
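
The following Python sketch illustrates only the first half of the process:
splitting an object's content into ``K`` equally sized data chunks, padding the
content when its length is not a multiple of ``K``. The ``M`` coding chunks are
produced by an erasure code plugin (for example *jerasure*) and are not
computed here.

.. code-block:: python

    # Sketch: split content into K data chunks, padding to a multiple of K.
    # Coding chunks come from an erasure code plugin and are not computed here.

    def split_into_data_chunks(content, k):
        chunk_len = -(-len(content) // k)                 # ceil(len(content) / K)
        padded = content.ljust(k * chunk_len, b'\0')      # pad if needed
        return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

    print(split_into_data_chunks(b'ABCDEFGHI', 3))        # [b'ABC', b'DEF', b'GHI']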
698
699 Reading and Writing Encoded Chunks
700 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
701
702 When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure
703 encoding function splits the content into three data chunks simply by dividing
704 the content in three: the first contains ``ABC``, the second ``DEF`` and the
705 last ``GHI``. The content will be padded if the content length is not a multiple
706 of ``K``. The function also creates two coding chunks: the fourth with ``YXY``
707 and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting set.
708 The chunks are stored in objects that have the same name (**NYAN**) but reside
709 on different OSDs. The order in which the chunks were created must be preserved
710 and is stored as an attribute of the object (``shard_t``), in addition to its
711 name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains
712 ``YXY`` and is stored on **OSD3**.
713
714
715 .. ditaa::
716
717 +-------------------+
718 name | NYAN |
719 +-------------------+
720 content | ABCDEFGHI |
721 +--------+----------+
722 |
723 |
724 v
725 +------+------+
726 +---------------+ encode(3,2) +-----------+
727 | +--+--+---+---+ |
728 | | | | |
729 | +-------+ | +-----+ |
730 | | | | |
731 +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
732 name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
733 +------+ +------+ +------+ +------+ +------+
734 shard | 1 | | 2 | | 3 | | 4 | | 5 |
735 +------+ +------+ +------+ +------+ +------+
736 content | ABC | | DEF | | GHI | | YXY | | QGC |
737 +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
738 | | | | |
739 | | v | |
740 | | +--+---+ | |
741 | | | OSD1 | | |
742 | | +------+ | |
743 | | | |
744 | | +------+ | |
745 | +------>| OSD2 | | |
746 | +------+ | |
747 | | |
748 | +------+ | |
749 | | OSD3 |<----+ |
750 | +------+ |
751 | |
752 | +------+ |
753 | | OSD4 |<--------------+
754 | +------+
755 |
756 | +------+
757 +----------------->| OSD5 |
758 +------+
759
760
761 When the object **NYAN** is read from the erasure coded pool, the decoding
762 function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
763 ``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content
of the object ``ABCDEFGHI``. The decoding function is informed that chunks 2
and 5 are missing (they are called 'erasures'). Chunk 5 could not be read
because **OSD4** is out. The decoding function can be called as soon as
767 three chunks are read: **OSD2** was the slowest and its chunk was not taken into
768 account.
769
770 .. ditaa::
771
772 +-------------------+
773 name | NYAN |
774 +-------------------+
775 content | ABCDEFGHI |
776 +---------+---------+
777 ^
778 |
779 |
780 +-------+-------+
781 | decode(3,2) |
782 +------------->+ erasures 2,5 +<-+
783 | | | |
784 | +-------+-------+ |
785 | ^ |
786 | | |
787 | | |
788 +--+---+ +------+ +---+--+ +---+--+
789 name | NYAN | | NYAN | | NYAN | | NYAN |
790 +------+ +------+ +------+ +------+
791 shard | 1 | | 2 | | 3 | | 4 |
792 +------+ +------+ +------+ +------+
793 content | ABC | | DEF | | GHI | | YXY |
794 +--+---+ +--+---+ +--+---+ +--+---+
795 ^ . ^ ^
796 | TOO . | |
797 | SLOW . +--+---+ |
798 | ^ | OSD1 | |
799 | | +------+ |
800 | | |
801 | | +------+ |
802 | +-------| OSD2 | |
803 | +------+ |
804 | |
805 | +------+ |
806 | | OSD3 |------+
807 | +------+
808 |
809 | +------+
810 | | OSD4 | OUT
811 | +------+
812 |
813 | +------+
814 +------------------| OSD5 |
815 +------+
816
817
818 Interrupted Full Writes
819 ~~~~~~~~~~~~~~~~~~~~~~~
820
821 In an erasure coded pool, the primary OSD in the up set receives all write
operations. It is responsible for encoding the payload into ``K+M`` chunks and
for sending them to the other OSDs. It is also responsible for maintaining an
824 authoritative version of the placement group logs.
825
826 In the following diagram, an erasure coded placement group has been created with
827 ``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one for
828 ``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and
**OSD 3**. An object has been encoded and stored in the OSDs: the chunk
830 ``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
831 **OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
832 placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
833 version 1).
834
835
836 .. ditaa::
837
838 Primary OSD
839
840 +-------------+
841 | OSD 1 | +-------------+
842 | log | Write Full | |
843 | +----+ |<------------+ Ceph Client |
844 | |D1v1| 1,1 | v1 | |
845 | +----+ | +-------------+
846 +------+------+
847 |
848 |
849 | +-------------+
850 | | OSD 2 |
851 | | log |
852 +--------->+ +----+ |
853 | | |D2v1| 1,1 |
854 | | +----+ |
855 | +-------------+
856 |
857 | +-------------+
858 | | OSD 3 |
859 | | log |
860 +--------->| +----+ |
861 | |C1v1| 1,1 |
862 | +----+ |
863 +-------------+
864
865 **OSD 1** is the primary and receives a **WRITE FULL** from a client, which
866 means the payload is to replace the object entirely instead of overwriting a
867 portion of it. Version 2 (v2) of the object is created to override version 1
868 (v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
869 chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
870 ``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent
871 to the target OSD, including the primary OSD which is responsible for storing
872 chunks in addition to handling write operations and maintaining an authoritative
873 version of the placement group logs. When an OSD receives the message
874 instructing it to write the chunk, it also creates a new entry in the placement
875 group logs to reflect the change. For instance, as soon as **OSD 3** stores
``C1v2``, it adds the entry ``1,2`` (i.e., epoch 1, version 2) to its logs.
Because the OSDs work asynchronously, some chunks may still be in flight (such
as ``D2v2``) while others are acknowledged and persisted to storage drives
(such as ``C1v1`` and ``D1v1``).
880
881 .. ditaa::
882
883 Primary OSD
884
885 +-------------+
886 | OSD 1 |
887 | log |
888 | +----+ | +-------------+
889 | |D1v2| 1,2 | Write Full | |
890 | +----+ +<------------+ Ceph Client |
891 | | v2 | |
892 | +----+ | +-------------+
893 | |D1v1| 1,1 |
894 | +----+ |
895 +------+------+
896 |
897 |
898 | +------+------+
899 | | OSD 2 |
900 | +------+ | log |
901 +->| D2v2 | | +----+ |
902 | +------+ | |D2v1| 1,1 |
903 | | +----+ |
904 | +-------------+
905 |
906 | +-------------+
907 | | OSD 3 |
908 | | log |
909 | | +----+ |
910 | | |C1v2| 1,2 |
911 +---------->+ +----+ |
912 | |
913 | +----+ |
914 | |C1v1| 1,1 |
915 | +----+ |
916 +-------------+
917
918
919 If all goes well, the chunks are acknowledged on each OSD in the acting set and
920 the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.
921
922 .. ditaa::
923
924 Primary OSD
925
926 +-------------+
927 | OSD 1 |
928 | log |
929 | +----+ | +-------------+
930 | |D1v2| 1,2 | Write Full | |
931 | +----+ +<------------+ Ceph Client |
932 | | v2 | |
933 | +----+ | +-------------+
934 | |D1v1| 1,1 |
935 | +----+ |
936 +------+------+
937 |
938 | +-------------+
939 | | OSD 2 |
940 | | log |
941 | | +----+ |
942 | | |D2v2| 1,2 |
943 +---------->+ +----+ |
944 | | |
945 | | +----+ |
946 | | |D2v1| 1,1 |
947 | | +----+ |
948 | +-------------+
949 |
950 | +-------------+
951 | | OSD 3 |
952 | | log |
953 | | +----+ |
954 | | |C1v2| 1,2 |
955 +---------->+ +----+ |
956 | |
957 | +----+ |
958 | |C1v1| 1,1 |
959 | +----+ |
960 +-------------+
961
962
963 Finally, the files used to store the chunks of the previous version of the
964 object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
965 on **OSD 3**.
966
967 .. ditaa::
968
969 Primary OSD
970
971 +-------------+
972 | OSD 1 |
973 | log |
974 | +----+ |
975 | |D1v2| 1,2 |
976 | +----+ |
977 +------+------+
978 |
979 |
980 | +-------------+
981 | | OSD 2 |
982 | | log |
983 +--------->+ +----+ |
984 | | |D2v2| 1,2 |
985 | | +----+ |
986 | +-------------+
987 |
988 | +-------------+
989 | | OSD 3 |
990 | | log |
991 +--------->| +----+ |
992 | |C1v2| 1,2 |
993 | +----+ |
994 +-------------+
995
996
997 But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
998 the object's version 2 is partially written: **OSD 3** has one chunk but that is
not enough to recover. Two chunks were lost: ``D1v2`` and ``D2v2``, and the
erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks be
available to rebuild the third. **OSD 4** becomes the new primary and finds that
the ``last_complete`` log entry (i.e., all objects before this entry were known
to be available on all OSDs in the previous acting set) is ``1,1``, and that
will be the head of the new authoritative log.
1005
1006 .. ditaa::
1007
1008 +-------------+
1009 | OSD 1 |
1010 | (down) |
1011 | c333 |
1012 +------+------+
1013 |
1014 | +-------------+
1015 | | OSD 2 |
1016 | | log |
1017 | | +----+ |
1018 +---------->+ |D2v1| 1,1 |
1019 | | +----+ |
1020 | | |
1021 | +-------------+
1022 |
1023 | +-------------+
1024 | | OSD 3 |
1025 | | log |
1026 | | +----+ |
1027 | | |C1v2| 1,2 |
1028 +---------->+ +----+ |
1029 | |
1030 | +----+ |
1031 | |C1v1| 1,1 |
1032 | +----+ |
1033 +-------------+
1034 Primary OSD
1035 +-------------+
1036 | OSD 4 |
1037 | log |
1038 | |
1039 | 1,1 |
1040 | |
1041 +------+------+
1042
1043
1044
The log entry ``1,2`` found on **OSD 3** is divergent from the new authoritative log
1046 provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
1047 chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
1048 the erasure coding library during scrubbing and stored on the new primary
1049 **OSD 4**.
1050
1051
1052 .. ditaa::
1053
1054 Primary OSD
1055
1056 +-------------+
1057 | OSD 4 |
1058 | log |
1059 | +----+ |
1060 | |D1v1| 1,1 |
1061 | +----+ |
1062 +------+------+
1063 ^
1064 |
1065 | +-------------+
1066 | | OSD 2 |
1067 | | log |
1068 +----------+ +----+ |
1069 | | |D2v1| 1,1 |
1070 | | +----+ |
1071 | +-------------+
1072 |
1073 | +-------------+
1074 | | OSD 3 |
1075 | | log |
1076 +----------| +----+ |
1077 | |C1v1| 1,1 |
1078 | +----+ |
1079 +-------------+
1080
1081 +-------------+
1082 | OSD 1 |
1083 | (down) |
1084 | c333 |
1085 +-------------+
1086
1087 See `Erasure Code Notes`_ for additional details.
1088
1089
1090
1091 Cache Tiering
1092 -------------
1093
1094 A cache tier provides Ceph Clients with better I/O performance for a subset of
1095 the data stored in a backing storage tier. Cache tiering involves creating a
1096 pool of relatively fast/expensive storage devices (e.g., solid state drives)
1097 configured to act as a cache tier, and a backing pool of either erasure-coded
1098 or relatively slower/cheaper devices configured to act as an economical storage
1099 tier. The Ceph objecter handles where to place the objects and the tiering
1100 agent determines when to flush objects from the cache to the backing storage
1101 tier. So the cache tier and the backing storage tier are completely transparent
1102 to Ceph clients.
1103
1104
1105 .. ditaa::
1106
1107 +-------------+
1108 | Ceph Client |
1109 +------+------+
1110 ^
1111 Tiering is |
1112 Transparent | Faster I/O
1113 to Ceph | +---------------+
1114 Client Ops | | |
1115 | +----->+ Cache Tier |
1116 | | | |
1117 | | +-----+---+-----+
1118 | | | ^
1119 v v | | Active Data in Cache Tier
1120 +------+----+--+ | |
1121 | Objecter | | |
1122 +-----------+--+ | |
1123 ^ | | Inactive Data in Storage Tier
1124 | v |
1125 | +-----+---+-----+
1126 | | |
1127 +----->| Storage Tier |
1128 | |
1129 +---------------+
1130 Slower I/O
1131
1132 See `Cache Tiering`_ for additional details. Note that Cache Tiers can be
1133 tricky and their use is now discouraged.
1134
1135
1136 .. index:: Extensibility, Ceph Classes
1137
1138 Extending Ceph
1139 --------------
1140
1141 You can extend Ceph by creating shared object classes called 'Ceph Classes'.
1142 Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
1143 (i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
1144 can create new object methods that have the ability to call the native methods
1145 in the Ceph Object Store, or other class methods you incorporate via libraries
1146 or create yourself.
1147
1148 On writes, Ceph Classes can call native or class methods, perform any series of
1149 operations on the inbound data and generate a resulting write transaction that
1150 Ceph will apply atomically.
1151
1152 On reads, Ceph Classes can call native or class methods, perform any series of
1153 operations on the outbound data and return the data to the client.
1154
1155 .. topic:: Ceph Class Example
1156
1157 A Ceph class for a content management system that presents pictures of a
1158 particular size and aspect ratio could take an inbound bitmap image, crop it
1159 to a particular aspect ratio, resize it and embed an invisible copyright or
1160 watermark to help protect the intellectual property; then, save the
1161 resulting bitmap image to the object store.
1162
1163 See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
1164 exemplary implementations.
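
Clients invoke object class methods through ``librados``. The following is a
hedged sketch that assumes a class named ``hello`` exposing a ``say_hello``
method (as in Ceph's bundled ``cls_hello`` example) and the ``Ioctx.execute()``
wrapper available in recent python-rados releases; the pool and object names
are placeholders.

.. code-block:: python

    # Hedged sketch: call an OSD object class method from a client. The class
    # and method names assume the bundled cls_hello example; the execute()
    # wrapper and its return format depend on the python-rados version.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')                 # placeholder pool

    result = ioctx.execute('greeting', 'hello', 'say_hello', b'')
    print(result)                                        # output of the class method

    ioctx.close()
    cluster.shutdown()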
1165
1166
1167 Summary
1168 -------
1169
Ceph Storage Clusters are dynamic--like a living organism. Although many storage
1171 appliances do not fully utilize the CPU and RAM of a typical commodity server,
1172 Ceph does. From heartbeats, to peering, to rebalancing the cluster or
1173 recovering from faults, Ceph offloads work from clients (and from a centralized
1174 gateway which doesn't exist in the Ceph architecture) and uses the computing
1175 power of the OSDs to perform the work. When referring to `Hardware
1176 Recommendations`_ and the `Network Config Reference`_, be cognizant of the
1177 foregoing concepts to understand how Ceph utilizes computing resources.
1178
1179 .. index:: Ceph Protocol, librados
1180
1181 Ceph Protocol
1182 =============
1183
1184 Ceph Clients use the native protocol for interacting with the Ceph Storage
1185 Cluster. Ceph packages this functionality into the ``librados`` library so that
1186 you can create your own custom Ceph Clients. The following diagram depicts the
1187 basic architecture.
1188
1189 .. ditaa::
1190
1191 +---------------------------------+
1192 | Ceph Storage Cluster Protocol |
1193 | (librados) |
1194 +---------------------------------+
1195 +---------------+ +---------------+
1196 | OSDs | | Monitors |
1197 +---------------+ +---------------+
1198
1199
1200 Native Protocol and ``librados``
1201 --------------------------------
1202
Modern applications need a simple object storage interface with asynchronous
communication capability. The Ceph Storage Cluster provides just such an
interface, offering direct, parallel access to objects throughout the cluster,
including the following operations (a sketch of asynchronous I/O follows the
list):
1207
1208
1209 - Pool Operations
1210 - Snapshots and Copy-on-write Cloning
1211 - Read/Write Objects
1212 - Create or Remove
1213 - Entire Object or Byte Range
1214 - Append or Truncate
1215 - Create/Set/Get/Remove XATTRs
1216 - Create/Set/Get/Remove Key/Value Pairs
1217 - Compound operations and dual-ack semantics
1218 - Object Classes
1219
1220
1221 .. index:: architecture; watch/notify
1222
1223 Object Watch/Notify
1224 -------------------
1225
A client can register a persistent interest with an object and keep a session to
the primary OSD open. The client can send a notification message and a payload to
all watchers and receive acknowledgement when the watchers have received the
notification. This enables a client to use any object as a
synchronization/communication channel.
1231
1232
1233 .. ditaa::
1234
1235 +----------+ +----------+ +----------+ +---------------+
1236 | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID |
1237 +----------+ +----------+ +----------+ +---------------+
1238 | | | |
1239 | | | |
1240 | | Watch Object | |
1241 |--------------------------------------------------->|
1242 | | | |
1243 |<---------------------------------------------------|
1244 | | Ack/Commit | |
1245 | | | |
1246 | | Watch Object | |
1247 | |---------------------------------->|
1248 | | | |
1249 | |<----------------------------------|
1250 | | Ack/Commit | |
1251 | | | Watch Object |
1252 | | |----------------->|
1253 | | | |
1254 | | |<-----------------|
1255 | | | Ack/Commit |
1256 | | Notify | |
1257 |--------------------------------------------------->|
1258 | | | |
1259 |<---------------------------------------------------|
1260 | | Notify | |
1261 | | | |
1262 | |<----------------------------------|
1263 | | Notify | |
1264 | | |<-----------------|
1265 | | | Notify |
1266 | | Ack | |
1267 |----------------+---------------------------------->|
1268 | | | |
1269 | | Ack | |
1270 | +---------------------------------->|
1271 | | | |
1272 | | | Ack |
1273 | | |----------------->|
1274 | | | |
1275 |<---------------+----------------+------------------|
1276 | Complete
1277
1278 .. index:: architecture; Striping
1279
1280 Data Striping
1281 -------------
1282
1283 Storage devices have throughput limitations, which impact performance and
1284 scalability. So storage systems often support `striping`_--storing sequential
1285 pieces of information across multiple storage devices--to increase throughput
1286 and performance. The most common form of data striping comes from `RAID`_.
1287 The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
1288 volume'. Ceph's striping offers the throughput of RAID 0 striping, the
1289 reliability of n-way RAID mirroring and faster recovery.
1290
1291 Ceph provides three types of clients: Ceph Block Device, Ceph File System, and
1292 Ceph Object Storage. A Ceph Client converts its data from the representation
1293 format it provides to its users (a block device image, RESTful objects, CephFS
1294 filesystem directories) into objects for storage in the Ceph Storage Cluster.
1295
1296 .. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
1297 Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their
1298 data over multiple Ceph Storage Cluster objects. Ceph Clients that write
1299 directly to the Ceph Storage Cluster via ``librados`` must perform the
1300 striping (and parallel I/O) for themselves to obtain these benefits.
1301
1302 The simplest Ceph striping format involves a stripe count of 1 object. Ceph
1303 Clients write stripe units to a Ceph Storage Cluster object until the object is
1304 at its maximum capacity, and then create another object for additional stripes
1305 of data. The simplest form of striping may be sufficient for small block device
1306 images, S3 or Swift objects and CephFS files. However, this simple form doesn't
1307 take maximum advantage of Ceph's ability to distribute data across placement
1308 groups, and consequently doesn't improve performance very much. The following
1309 diagram depicts the simplest form of striping:
1310
1311 .. ditaa::
1312
1313 +---------------+
1314 | Client Data |
1315 | Format |
1316 | cCCC |
1317 +---------------+
1318 |
1319 +--------+-------+
1320 | |
1321 v v
1322 /-----------\ /-----------\
1323 | Begin cCCC| | Begin cCCC|
1324 | Object 0 | | Object 1 |
1325 +-----------+ +-----------+
1326 | stripe | | stripe |
1327 | unit 1 | | unit 5 |
1328 +-----------+ +-----------+
1329 | stripe | | stripe |
1330 | unit 2 | | unit 6 |
1331 +-----------+ +-----------+
1332 | stripe | | stripe |
1333 | unit 3 | | unit 7 |
1334 +-----------+ +-----------+
1335 | stripe | | stripe |
1336 | unit 4 | | unit 8 |
1337 +-----------+ +-----------+
1338 | End cCCC | | End cCCC |
1339 | Object 0 | | Object 1 |
1340 \-----------/ \-----------/
1341
1342
If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
or large CephFS directories, you may see considerable read/write performance
improvements by striping client data over multiple objects within an object set.
Significant write performance gains occur when the client writes the stripe units to
1347 their corresponding objects in parallel. Since objects get mapped to different
1348 placement groups and further mapped to different OSDs, each write occurs in
1349 parallel at the maximum write speed. A write to a single drive would be limited
1350 by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g.
1351 100MB/s). By spreading that write over multiple objects (which map to different
1352 placement groups and OSDs) Ceph can reduce the number of seeks per drive and
1353 combine the throughput of multiple drives to achieve much faster write (or read)
1354 speeds.
1355
1356 .. note:: Striping is independent of object replicas. Since CRUSH
1357 replicates objects across OSDs, stripes get replicated automatically.
1358
1359 In the following diagram, client data gets striped across an object set
1360 (``object set 1`` in the following diagram) consisting of 4 objects, where the
1361 first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
1362 unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
1363 client determines if the object set is full. If the object set is not full, the
1364 client begins writing a stripe to the first object again (``object 0`` in the
1365 following diagram). If the object set is full, the client creates a new object
1366 set (``object set 2`` in the following diagram), and begins writing to the first
1367 stripe (``stripe unit 16``) in the first object in the new object set (``object
1368 4`` in the diagram below).
1369
1370 .. ditaa::
1371
1372 +---------------+
1373 | Client Data |
1374 | Format |
1375 | cCCC |
1376 +---------------+
1377 |
1378 +-----------------+--------+--------+-----------------+
1379 | | | | +--\
1380 v v v v |
1381 /-----------\ /-----------\ /-----------\ /-----------\ |
1382 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1383 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1384 +-----------+ +-----------+ +-----------+ +-----------+ |
1385 | stripe | | stripe | | stripe | | stripe | |
1386 | unit 0 | | unit 1 | | unit 2 | | unit 3 | |
1387 +-----------+ +-----------+ +-----------+ +-----------+ |
1388 | stripe | | stripe | | stripe | | stripe | +-\
1389 | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object
1390 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1391 | stripe | | stripe | | stripe | | stripe | | 1
1392 | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/
1393 +-----------+ +-----------+ +-----------+ +-----------+ |
1394 | stripe | | stripe | | stripe | | stripe | |
1395 | unit 12 | | unit 13 | | unit 14 | | unit 15 | |
1396 +-----------+ +-----------+ +-----------+ +-----------+ |
1397 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1398 | Object 0 | | Object 1 | | Object 2 | | Object 3 | |
1399 \-----------/ \-----------/ \-----------/ \-----------/ |
1400 |
1401 +--/
1402
1403 +--\
1404 |
1405 /-----------\ /-----------\ /-----------\ /-----------\ |
1406 | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
1407 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1408 +-----------+ +-----------+ +-----------+ +-----------+ |
1409 | stripe | | stripe | | stripe | | stripe | |
1410 | unit 16 | | unit 17 | | unit 18 | | unit 19 | |
1411 +-----------+ +-----------+ +-----------+ +-----------+ |
1412 | stripe | | stripe | | stripe | | stripe | +-\
1413 | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object
1414 +-----------+ +-----------+ +-----------+ +-----------+ +- Set
1415 | stripe | | stripe | | stripe | | stripe | | 2
1416 | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/
1417 +-----------+ +-----------+ +-----------+ +-----------+ |
1418 | stripe | | stripe | | stripe | | stripe | |
1419 | unit 28 | | unit 29 | | unit 30 | | unit 31 | |
1420 +-----------+ +-----------+ +-----------+ +-----------+ |
1421 | End cCCC | | End cCCC | | End cCCC | | End cCCC | |
1422 | Object 4 | | Object 5 | | Object 6 | | Object 7 | |
1423 \-----------/ \-----------/ \-----------/ \-----------/ |
1424 |
1425 +--/
1426
Three important variables determine how Ceph stripes data (a sketch of the
resulting offset-to-object mapping follows the list):
1428
1429 - **Object Size:** Objects in the Ceph Storage Cluster have a maximum
1430 configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
1431 enough to accommodate many stripe units, and should be a multiple of
1432 the stripe unit.
1433
- **Stripe Width:** Stripes have a configurable unit size (e.g., 64 KB).
  The Ceph Client divides the data it will write to objects into equally
  sized stripe units, except for the last stripe unit. The stripe width
  should be a fraction of the object size so that an object may contain
  many stripe units.
1439
1440 - **Stripe Count:** The Ceph Client writes a sequence of stripe units
1441 over a series of objects determined by the stripe count. The series
1442 of objects is called an object set. After the Ceph Client writes to
1443 the last object in the object set, it returns to the first object in
1444 the object set.
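
The following Python sketch expresses the layout arithmetic implied by these
three variables. The sizes are illustrative (a 64 KB stripe unit, a stripe
count of 4, and a 256 KB object size, so that each object holds four stripe
units as in the diagrams above); the real computation is performed by the Ceph
clients themselves.

.. code-block:: python

    # Sketch of the striping arithmetic with illustrative sizes.
    STRIPE_UNIT = 64 * 1024          # stripe width
    STRIPE_COUNT = 4                 # objects per object set
    OBJECT_SIZE = 256 * 1024         # a multiple of the stripe unit

    def locate(offset):
        """Map a logical byte offset to (object number, offset within object)."""
        units_per_object = OBJECT_SIZE // STRIPE_UNIT
        unit_no = offset // STRIPE_UNIT              # which stripe unit overall
        stripe_no = unit_no // STRIPE_COUNT          # which stripe (row)
        object_in_set = unit_no % STRIPE_COUNT       # which object within the set
        object_set = stripe_no // units_per_object
        object_no = object_set * STRIPE_COUNT + object_in_set
        obj_offset = (stripe_no % units_per_object) * STRIPE_UNIT + offset % STRIPE_UNIT
        return object_no, obj_offset

    # Stripe unit 16 is the first unit of object 4, as in the second object set above.
    print(locate(16 * STRIPE_UNIT))   # (4, 0)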
1445
1446 .. important:: Test the performance of your striping configuration before
1447 putting your cluster into production. You CANNOT change these striping
1448 parameters after you stripe the data and write it to objects.
1449
Once the Ceph Client has striped data into stripe units and mapped the stripe
units to objects, Ceph's CRUSH algorithm maps the objects to placement groups
and the placement groups to Ceph OSD Daemons before the objects are stored on
a storage drive.
1454
.. note:: Because a client writes to a single pool, all data striped into
   objects is mapped to placement groups in the same pool, so it is subject to
   the same CRUSH map and the same access controls.
1458
1459
1460 .. index:: architecture; Ceph Clients
1461
1462 Ceph Clients
1463 ============
1464
1465 Ceph Clients include a number of service interfaces. These include:
1466
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
  provides resizable, thin-provisioned block devices with snapshotting and
  cloning. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and QEMU hypervisors
  that use ``librbd`` directly, avoiding the kernel object overhead for
  virtualized systems.
1473
1474 - **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
1475 provides RESTful APIs with interfaces that are compatible with Amazon S3
1476 and OpenStack Swift.
1477
- **Filesystem:** The :term:`Ceph File System` (CephFS) service provides
  a POSIX-compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).
1481
1482 Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
1483 and high availability. The following diagram depicts the high-level
1484 architecture.
1485
1486 .. ditaa::
1487
1488 +--------------+ +----------------+ +-------------+
1489 | Block Device | | Object Storage | | CephFS |
1490 +--------------+ +----------------+ +-------------+
1491
1492 +--------------+ +----------------+ +-------------+
1493 | librbd | | librgw | | libcephfs |
1494 +--------------+ +----------------+ +-------------+
1495
1496 +---------------------------------------------------+
1497 | Ceph Storage Cluster Protocol (librados) |
1498 +---------------------------------------------------+
1499
1500 +---------------+ +---------------+ +---------------+
1501 | OSDs | | MDSs | | Monitors |
1502 +---------------+ +---------------+ +---------------+
1503
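All three of these interfaces are built on ``librados``, and applications can
also use ``librados`` directly. The following is a minimal sketch using the
``rados`` Python binding, assuming a reachable cluster, a standard
``/etc/ceph/ceph.conf`` and keyring on the local host, and an existing pool
named ``mypool`` (the pool name is an assumption for illustration):

.. code-block:: python

    import rados

    # Connect using the cluster configuration and keyring found on this host.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # An I/O context is bound to a single pool; 'mypool' is assumed to exist.
        ioctx = cluster.open_ioctx('mypool')
        try:
            ioctx.write_full('hello-object', b'stored as a RADOS object')
            print(ioctx.read('hello-object'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()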
1504
1505 .. index:: architecture; Ceph Object Storage
1506
1507 Ceph Object Storage
1508 -------------------
1509
The Ceph Object Storage daemon, ``radosgw``, is an HTTP service that provides
a RESTful_ API for storing objects and metadata. It layers on top of the Ceph
Storage Cluster with its own data formats, and it maintains its own user
database, authentication, and access control. The RADOS Gateway uses a
unified namespace,
1514 which means you can use either the OpenStack Swift-compatible API or the Amazon
1515 S3-compatible API. For example, you can write data using the S3-compatible API
1516 with one application and then read data using the Swift-compatible API with
1517 another application.
1518
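Because the S3-compatible API is ordinary HTTP, any standard S3 client can
talk to ``radosgw``. The following is a minimal sketch using ``boto3``; the
endpoint URL and credentials are placeholders for a gateway and an RGW user
you would create yourself:

.. code-block:: python

    import boto3

    # Placeholder endpoint and credentials for an RGW S3 user.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt',
                  Body=b'written via the S3-compatible API')

    # Because radosgw uses a unified namespace, the same object can also be
    # read back through the Swift-compatible API with a Swift client.
    obj = s3.get_object(Bucket='demo-bucket', Key='hello.txt')
    print(obj['Body'].read())
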
.. topic:: S3/Swift Objects and Storage Cluster Objects Compared
1520
1521 Ceph's Object Storage uses the term *object* to describe the data it stores.
1522 S3 and Swift objects are not the same as the objects that Ceph writes to the
1523 Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
1524 Cluster objects. The S3 and Swift objects do not necessarily
1525 correspond in a 1:1 manner with an object stored in the storage cluster. It
1526 is possible for an S3 or Swift object to map to multiple Ceph objects.
1527
1528 See `Ceph Object Storage`_ for details.
1529
1530
1531 .. index:: Ceph Block Device; block device; RBD; Rados Block Device
1532
1533 Ceph Block Device
1534 -----------------
1535
A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster. Each object is mapped to a placement group, and the
placement groups are spread across separate ``ceph-osd`` daemons throughout
the cluster.
1540
1541 .. important:: Striping allows RBD block devices to perform better than a single
1542 server could!
1543
1544 Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
1545 virtualization and cloud computing. In virtual machine scenarios, people
1546 typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
1547 QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
1548 service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
1549 with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
1550 ``libvirt`` to support OpenStack and CloudStack among other solutions.
1551
1552 While we do not provide ``librbd`` support with other hypervisors at this time,
1553 you may also use Ceph Block Device kernel objects to provide a block device to a
1554 client. Other virtualization technologies such as Xen can access the Ceph Block
1555 Device kernel object(s). This is done with the command-line tool ``rbd``.
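
The same images can also be created and written programmatically. The
following is a minimal sketch using the ``rbd`` and ``rados`` Python bindings,
assuming an existing pool named ``rbd`` (the pool and image names are
illustrative):

.. code-block:: python

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')    # pool assumed to exist
        try:
            # Create a 1 GiB, thin-provisioned image striped over RADOS objects.
            rbd.RBD().create(ioctx, 'demo-image', 1024 ** 3)
            image = rbd.Image(ioctx, 'demo-image')
            try:
                image.write(b'hello block device', 0)   # write at offset 0
                print(image.read(0, 18))                # read those bytes back
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()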
1556
1557
1558 .. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds
1559
1560 .. _arch-cephfs:
1561
1562 Ceph File System
1563 ----------------
1564
1565 The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
1566 service that is layered on top of the object-based Ceph Storage Cluster.
1567 CephFS files get mapped to objects that Ceph stores in the Ceph Storage
1568 Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
1569 a Filesystem in User Space (FUSE).
1570
1571 .. ditaa::
1572
1573 +-----------------------+ +------------------------+
1574 | CephFS Kernel Object | | CephFS FUSE |
1575 +-----------------------+ +------------------------+
1576
1577 +---------------------------------------------------+
1578 | CephFS Library (libcephfs) |
1579 +---------------------------------------------------+
1580
1581 +---------------------------------------------------+
1582 | Ceph Storage Cluster Protocol (librados) |
1583 +---------------------------------------------------+
1584
1585 +---------------+ +---------------+ +---------------+
1586 | OSDs | | MDSs | | Monitors |
1587 +---------------+ +---------------+ +---------------+
1588
1589
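Because CephFS is POSIX-compliant, a mounted CephFS behaves like any local
filesystem from an application's point of view. A minimal sketch, assuming
CephFS is already mounted at ``/mnt/cephfs`` (the mount point is an
assumption):

.. code-block:: python

    import os

    MOUNT = '/mnt/cephfs'   # assumed CephFS mount point (kernel client or FUSE)

    # Ordinary POSIX calls work: the MDS serves the metadata operations while
    # file data is read from and written to objects in the Ceph Storage Cluster.
    os.makedirs(os.path.join(MOUNT, 'demo'), exist_ok=True)
    path = os.path.join(MOUNT, 'demo', 'hello.txt')

    with open(path, 'w') as f:
        f.write('stored in CephFS\n')

    print(os.stat(path).st_size)   # metadata served by the MDS
    with open(path) as f:
        print(f.read())            # data served by the OSDs
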
The Ceph File System service includes the Ceph Metadata Server (MDS) deployed
with the Ceph Storage Cluster. The purpose of the MDS is to store all the
filesystem metadata (directories, file ownership, access modes, etc.) in
high-availability Ceph Metadata Servers, where the metadata resides in memory.
The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
operations like listing a directory or changing a directory (``ls``, ``cd``)
would tax the Ceph OSD Daemons unnecessarily. Separating the metadata from the
data means that the Ceph File System can provide high-performance services
without taxing the Ceph Storage Cluster.
1599
1600 CephFS separates the metadata from the data, storing the metadata in the MDS,
1601 and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph File System aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed across multiple physical machines,
either for high availability or for scalability.
1605
1606 - **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
1607 ready to take over the duties of any failed ``ceph-mds`` that was
1608 `active`. This is easy because all the data, including the journal, is
1609 stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
1610
1611 - **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
1612 will split the directory tree into subtrees (and shards of a single
1613 busy directory), effectively balancing the load amongst all `active`
1614 servers.
1615
Combinations of `standby` and `active` daemons are possible: for example,
running three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.
1619
1620
1621
1622 .. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
1623 .. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
1624 .. _Monitor Config Reference: ../rados/configuration/mon-config-ref
1625 .. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
1626 .. _Heartbeats: ../rados/configuration/mon-osd-interaction
1627 .. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
1628 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
1629 .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
1630 .. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
1631 .. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
1632 .. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
1633 .. _Hardware Recommendations: ../start/hardware-recommendations
1634 .. _Network Config Reference: ../rados/configuration/network-config-ref
1636 .. _striping: https://en.wikipedia.org/wiki/Data_striping
1637 .. _RAID: https://en.wikipedia.org/wiki/RAID
1638 .. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0
1639 .. _Ceph Object Storage: ../radosgw/
1640 .. _RESTful: https://en.wikipedia.org/wiki/RESTful
1641 .. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
1642 .. _Cache Tiering: ../rados/operations/cache-tiering
1643 .. _Set Pool Values: ../rados/operations/pools#set-pool-values
1644 .. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol)
1645 .. _Cephx Config Guide: ../rados/configuration/auth-config-ref
1646 .. _User Management: ../rados/operations/user-management