[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualization, combining independent
disks into one or more logical units. Their caching methods, algorithms (RAID
modes, including JBOD) and read/write optimizations are targeted at those
logical units and not at Ceph.

WARNING: Avoid RAID controllers, use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

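The repository file contains a single `deb` line. A sketch of its expected
content, assuming Ceph Luminous on Debian Stretch (the exact suite and release
name depend on your installation):

----
deb http://download.proxmox.com/debian/ceph-luminous stretch main
----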

Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes using xref:chapter_pmxcfs[pmxcfs].
The command also creates a symbolic link from `/etc/ceph/ceph.conf` pointing to
that file, so you can simply run Ceph commands without the need to specify a
configuration file.

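You can quickly check that the configuration file and the symbolic link are in
place, for example with:

[source,bash]
----
ls -l /etc/ceph/ceph.conf /etc/pve/ceph.conf
----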

[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

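Once the monitors are running, you can check that they have formed a quorum,
for example with:

[source,bash]
----
ceph mon stat
----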

[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----

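You can verify that a manager is active (and which ones are on standby) in the
cluster status output, for example:

[source,bash]
----
ceph -s
----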

[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

Ceph OSDs can be created via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size of at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. for ZFS, RAID or as an OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any other OSD leftover:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

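After the OSDs have been created, you can verify that they show up in the CRUSH
tree and are reported as `up`, for example with:

[source,bash]
----
ceph osd tree
----
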
Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, and to be more failsafe, the disk
needs to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create one with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create one with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as a journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions, or you can
place the journal on a dedicated SSD. Using an SSD journal disk is highly
recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.

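As a rough rule of thumb (a commonly cited guideline, not part of the {pve}
tooling), the total PG count for a pool can be estimated as
(number of OSDs * 100) / pool size, rounded up to the next power of two. For
example, with 12 OSDs and a pool size of 3 this gives 12 * 100 / 3 = 400,
rounded up to **512 PGs**.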

You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.

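For example, to create a pool with a non-default PG count and have the matching
storage definition added in one go (a sketch; check `pveceph help createpool`
for the options available in your installed version):

[source,bash]
----
pveceph createpool mypool -pg_num 128 -size 3 -min_size 2 -add_storages
----
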
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. across failure domains), while maintaining the
desired distribution.

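If you want to inspect the full map, you can dump it from the cluster and
decompile it into a readable text file with the `crushtool` utility shipped
with Ceph (the file names below are arbitrary examples):

[source, bash]
----
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
----
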
A common use case is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to first create a ruleset with the specific class.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

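For example, to restrict a (hypothetical) pool named `nvme-pool` to OSDs backed
by NVMe devices, as in the tree output above:

[source, bash]
----
ceph osd crush rule create-replicated nvme-rule default host nvme
ceph osd pool set nvme-pool crush_rule nvme-rule
----
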
TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the {pve} nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

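For reference, the matching entry in `/etc/pve/storage.cfg` could look similar
to the following sketch (monitor addresses, pool and content types depend on
your setup):

----
rbd: my-ceph-storage
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        content images
----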

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]