[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached
storage (NAS) disappear. With the integration of Ceph, an open source
software-defined storage platform, {pve} has the ability to run and manage
Ceph storage directly on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup of pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn’t improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
those of Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
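
If you want to verify the result, you can afterwards check the repository file
and the installed Ceph version (an optional sanity check; the exact repository
line and version string will differ between releases):

[source,bash]
----
# show the repository that pveceph configured
cat /etc/apt/sources.list.d/ceph.list

# check which Ceph version is now installed
ceph --version
----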


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of the packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.
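
As an optional check, you can confirm that the symbolic link and the generated
configuration are in place:

[source,bash]
----
# the symlink should point to the cluster-wide configuration on pmxcfs
ls -l /etc/ceph/ceph.conf

# inspect the generated configuration
cat /etc/pve/ceph.conf
----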


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability, you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:


[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
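
Once monitors exist on at least three nodes, you can optionally verify that
they have formed a quorum, for example with:

[source,bash]
----
# overall cluster state, including the monitor quorum
ceph -s

# list the monitors and the current quorum members
ceph mon stat
----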


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----
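
To check that a manager is active, and that any additional managers are
registered as standbys, the following optional commands can be used:

[source,bash]
----
# the 'mgr:' line in the status output shows the active manager and any standbys
ceph -s

# the full manager map, including standby daemons
ceph mgr dump
----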


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your (at least three) nodes (4 OSDs on each node).

If the disk was in use before (e.g. for ZFS/RAID/OSD), the following commands
should be sufficient to remove the partition table, boot sector and any other
OSD leftover.

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, and to be more failsafe, the disk
needs to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.
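
If you want to see how the data, DB and WAL partitions were laid out on your
devices, the 'ceph-disk' utility used above can also list them (an optional
check; the output format may vary between Ceph releases):

[source,bash]
----
# list disks and the Ceph partitions (data, block.db, block.wal) on them
ceph-disk list
----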


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD. Afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions, or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.
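
After creating OSDs, whether Bluestore or Filestore, you can optionally verify
that they joined the cluster and are reported as 'up' and 'in':

[source,bash]
----
# all OSDs should show up as 'up' and 'in'
ceph osd tree

# per-OSD usage and weight overview
ceph osd df
----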


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
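
As a rough illustration of the rule of thumb behind the PG calculator
(targeting about 100 PGs per OSD, divided by the replica count and rounded up
to the next power of two; the numbers below are examples only):

[source,bash]
----
# rule of thumb used by the PG calculator:
#   total PGs ~ (number of OSDs * 100) / pool size, rounded up to a power of two
# example: 12 OSDs and size 3  ->  12 * 100 / 3 = 400  ->  next power of two = 512
echo $(( 12 * 100 / 3 ))
----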


You can create pools through the command line or on the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would like to automatically also get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' on pool creation.
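
For example, a pool with a non-default PG count plus an automatic storage
definition could be created in one step. The '--pg_num' option used here is an
assumption; check 'pveceph help createpool' for the options available on your
installation:

[source,bash]
----
# assumed example: 512 PGs and an automatically added storage definition
pveceph createpool mypool --pg_num 512 --add_storages
----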

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data to and retrieve it from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.
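
If you want to inspect the full CRUSH map directly, it can be extracted and
decompiled with the standard Ceph tools, for example:

[source,bash]
----
# dump the binary CRUSH map and decompile it into a readable text file
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
less crushmap.txt
----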

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset for the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
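
As a concrete sketch using the 'nvme' class from the example output above (the
rule and pool names are made up for illustration):

[source, bash]
----
# create a replicated rule that only uses OSDs with device class 'nvme'
ceph osd crush rule create-replicated nvme-only default host nvme

# let an existing pool place its data according to that rule
ceph osd pool set mypool crush_rule nvme-only
----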

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
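
For reference, the resulting entry in `/etc/pve/storage.cfg` might look roughly
like the following sketch; the storage ID and pool name are only examples, and
an external cluster would additionally need a 'monhost' line:

----
rbd: my-ceph-storage
        pool rbd
        content images,rootdir
        krbd 0
----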

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]