[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some of the advantages of Ceph are:
- Easy setup and management with CLI and GUI support on Proxmox VE
- Thin provisioning
- Snapshot support
- Self healing
- No single point of failure
- Scalable to the exabyte level
- Set up pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Easy management
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of several daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with Ceph's vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available; see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, i.e. to combine
independent disks into one or more logical units. Their caching methods and
algorithms (RAID modes, incl. JBOD), as well as their disk or read/write
optimisations, are targeted towards those logical units and not towards Ceph.

WARNING: Avoid RAID controllers, use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
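
The exact repository entry depends on the Debian and Ceph release in use; as a
quick, optional sanity check you can simply inspect the file that was created:

[source,bash]
----
# Show the repository entry added by 'pveceph install'; its contents vary
# with the configured Debian/Ceph release
cat /etc/apt/sources.list.d/ceph.list
----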


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After the installation of packages, you need to create an initial Ceph
configuration on just one node, based on the network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes, using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.
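
A quick way to verify that the configuration is in place (purely an optional
check, assuming a default setup):

[source,bash]
----
# The symbolic link should point to the cluster-wide file managed by pmxcfs
ls -l /etc/ceph/ceph.conf
cat /etc/pve/ceph.conf
----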


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:


[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
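
Once monitors exist on at least three nodes, you may want to confirm that they
have formed a quorum (an optional check using standard Ceph commands):

[source,bash]
----
# List the monitors and show whether a quorum has been established
ceph mon stat
ceph quorum_status --format json-pretty
----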


[[pve_ceph_manager]]
Creating Ceph Manager
----------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----
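
Afterwards, the cluster status should report one active manager and the others
as standbys (a quick, optional check):

[source,bash]
----
# The 'mgr:' line of the status output lists the active and standby managers
ceph -s | grep mgr
----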


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. ZFS/RAID/OSD), the following commands should
be sufficient to remove the partition table, boot sector and any other OSD
leftover:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create one with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.
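
If you want to double-check which devices back a particular OSD's data and
DB/WAL, you can inspect the OSD's metadata (an optional check; `0` is just an
example OSD id):

[source,bash]
----
# Print the device and partition paths recorded for OSD 0
ceph osd metadata 0 | grep -i -e device -e path
----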


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when backed
by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create one with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partitions), creates the
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.
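
Whichever OSD backend you use, you can verify that the newly created OSDs have
joined the cluster and are up and in (an optional check):

[source,bash]
----
# Show the OSD tree and per-OSD usage
ceph osd tree
ceph osd df
----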


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
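
The PG calculator's commonly cited rule of thumb (an approximation, not part of
this guide) is to aim for roughly 100 PGs per OSD, divided by the replica count
and rounded to the nearest power of two:

[source,bash]
----
# Hypothetical example: 12 OSDs, pool size (replica count) of 3
echo $(( (12 * 100) / 3 ))   # prints 400 -> choose 512, the nearest power of two
----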


You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.
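
For example, a pool with explicit replication and PG settings might be created
like this (a sketch; option names such as '--pg_num', '--size' and '--min_size'
are assumed to be supported by your pveceph version, and the values are only
illustrative):

[source,bash]
----
# Create a pool with 128 PGs, 3 replicas (2 required to serve I/O),
# and automatically add it as a storage definition
pveceph createpool mypool --pg_num 128 --size 3 --min_size 2 --add_storages
----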

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data to and where to retrieve it from; this has
the advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
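
Putting both steps together with concrete (hypothetical) names, restricting a
pool called 'nvme-pool' to OSDs of the 'nvme' device class could look like this:

[source, bash]
----
# Create a replicated rule on the default root, failure domain 'host',
# restricted to the 'nvme' device class, then assign it to the pool
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd pool set nvme-pool crush_rule replicated_nvme
----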

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
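
For orientation, a matching (purely hypothetical) storage entry in
`/etc/pve/storage.cfg` for an external cluster could look like the following;
the exact options depend on your setup and are described in the storage
documentation:

----
rbd: my-ceph-storage
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        content images
        username admin
----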


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]