==================
Cephadm Operations
==================

.. _watching_cephadm_logs:

Watching cephadm log messages
=============================

Cephadm writes logs to the ``cephadm`` cluster log channel. You can
monitor Ceph's activity in real time by reading the logs as they fill
up. Run the following command to see the logs in real time:

.. prompt:: bash #

   ceph -W cephadm

By default, this command shows info-level events and above. To see
debug-level messages as well as info-level events, run the following
commands:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/log_to_cluster_level debug
   ceph -W cephadm --watch-debug

.. warning::

   The debug messages are very verbose!

You can see recent events by running the following command:

.. prompt:: bash #

   ceph log last cephadm

These events are also logged to the ``ceph.cephadm.log`` file on
monitor hosts as well as to the monitor daemons' stderr.
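
Depending on your Ceph release, ``ceph log last`` also accepts a message
count, severity, and channel. Treat the exact argument order below as an
assumption and check ``ceph log last --help`` on your version; a command
along these lines limits the output to the most recent cephadm events:

.. prompt:: bash #

   ceph log last 25 info cephadm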


.. _cephadm-logs:

Ceph daemon logs
================

Logging to journald
-------------------

Ceph daemons traditionally wrote logs to ``/var/log/ceph``. Under cephadm,
Ceph daemons instead log to journald by default, and the logs are captured
by the container runtime environment. They are accessible via ``journalctl``.

.. note:: Prior to Quincy, ceph daemons logged to stderr.

Example of logging to journald
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For example, to view the logs for the daemon ``mon.foo`` for a cluster
with ID ``5c5a50ae-272a-455d-99e9-32c6a013e694``, the command would be
something like:

.. prompt:: bash #

   journalctl -u ceph-5c5a50ae-272a-455d-99e9-32c6a013e694@mon.foo

This works well for normal operations when logging levels are low.
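
To follow the same log in real time, add journalctl's standard ``-f``
(follow) flag to the command above:

.. prompt:: bash #

   journalctl -u ceph-5c5a50ae-272a-455d-99e9-32c6a013e694@mon.foo -f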

Logging to files
----------------

You can also configure Ceph daemons to log to files instead of to
journald if you prefer logs to appear in files (as they did in earlier,
pre-cephadm, pre-Octopus versions of Ceph). When Ceph logs to files,
the logs appear in ``/var/log/ceph/<cluster-fsid>``. If you choose to
configure Ceph to log to files instead of to journald, remember to
configure Ceph so that it will not log to journald (the commands for
this are covered below).

Enabling logging to files
~~~~~~~~~~~~~~~~~~~~~~~~~

To enable logging to files, run the following commands:

.. prompt:: bash #

   ceph config set global log_to_file true
   ceph config set global mon_cluster_log_to_file true

Disabling logging to journald
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you choose to log to files, we recommend disabling logging to journald or else
everything will be logged twice. Run the following commands to disable logging
to stderr and to journald:

.. prompt:: bash #

   ceph config set global log_to_stderr false
   ceph config set global mon_cluster_log_to_stderr false
   ceph config set global log_to_journald false
   ceph config set global mon_cluster_log_to_journald false

.. note:: You can change the default by passing ``--log-to-file`` when
          bootstrapping a new cluster.

Modifying the log retention schedule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, cephadm sets up log rotation on each host to rotate these
files. You can configure the logging retention schedule by modifying
``/etc/logrotate.d/ceph.<cluster-fsid>``.
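
To preview what the current policy would do without actually rotating any
files, you can use logrotate's standard dry-run flag on that file:

.. prompt:: bash #

   logrotate -d /etc/logrotate.d/ceph.<cluster-fsid>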


Data location
=============

Cephadm stores daemon data and logs in different locations than did
older, pre-cephadm (pre-Octopus) versions of Ceph:

* ``/var/log/ceph/<cluster-fsid>`` contains all cluster logs. By
  default, cephadm logs via stderr and the container runtime. These
  logs will not exist unless you have enabled logging to files as
  described in `cephadm-logs`_.
* ``/var/lib/ceph/<cluster-fsid>`` contains all cluster daemon data
  (besides logs).
* ``/var/lib/ceph/<cluster-fsid>/<daemon-name>`` contains all data for
  an individual daemon.
* ``/var/lib/ceph/<cluster-fsid>/crash`` contains crash reports for
  the cluster.
* ``/var/lib/ceph/<cluster-fsid>/removed`` contains old daemon
  data directories for stateful daemons (e.g., monitor, prometheus)
  that have been removed by cephadm.

Disk usage
----------

Because a few Ceph daemons (notably, the monitors and prometheus) store a
large amount of data in ``/var/lib/ceph``, we recommend moving this
directory to its own disk, partition, or logical volume so that it does not
fill up the root file system.
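
To see how much space the cluster's daemons are currently consuming on a
host, a quick check with ``du`` is usually enough:

.. prompt:: bash #

   du -sh /var/lib/ceph/<cluster-fsid>/*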


Health checks
=============

The cephadm module provides additional health checks to supplement the
default health checks provided by the cluster. These additional health
checks fall into two categories:

- **cephadm operations**: Health checks in this category are always
  executed when the cephadm module is active.
- **cluster configuration**: These health checks are *optional*, and
  focus on the configuration of the hosts in the cluster.

CEPHADM Operations
------------------

CEPHADM_PAUSED
~~~~~~~~~~~~~~

This indicates that cephadm background work has been paused with
``ceph orch pause``. Cephadm continues to perform passive monitoring
activities (like checking host and daemon status), but it will not
make any changes (like deploying or removing daemons).

Resume cephadm work by running the following command:

.. prompt:: bash #

   ceph orch resume
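
You can check whether background work is currently paused with ``ceph orch
status``; in recent releases the output includes a paused indicator while
``ceph orch pause`` is in effect:

.. prompt:: bash #

   ceph orch status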

.. _cephadm-stray-host:

CEPHADM_STRAY_HOST
~~~~~~~~~~~~~~~~~~

This indicates that one or more hosts have Ceph daemons that are
running, but are not registered as hosts managed by *cephadm*. This
means that those services cannot currently be managed by cephadm
(e.g., restarted, upgraded, included in ``ceph orch ps``).

* You can manage the host(s) by running the following command:

  .. prompt:: bash #

     ceph orch host add *<hostname>*

  .. note::

     You might need to configure SSH access to the remote host
     before this will work.

* See :ref:`cephadm-fqdn` for more information about host names and
  domain names.

* Alternatively, you can manually connect to the host and ensure that
  services on that host are removed or migrated to a host that is
  managed by *cephadm*.

* This warning can be disabled entirely by running the following
  command:

  .. prompt:: bash #

     ceph config set mgr mgr/cephadm/warn_on_stray_hosts false

CEPHADM_STRAY_DAEMON
~~~~~~~~~~~~~~~~~~~~

One or more Ceph daemons are running but are not managed by
*cephadm*. This may be because they were deployed using a different
tool, or because they were started manually. Those
services cannot currently be managed by cephadm (e.g., restarted,
upgraded, or included in ``ceph orch ps``).
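
To see which daemons cephadm does manage on a given host, and compare that
list against what is actually running there, you can run:

.. prompt:: bash #

   ceph orch ps <hostname>

The hostname argument is optional; omit it to list the daemons that cephadm
manages across all hosts.
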
* If the daemon is a stateful one (monitor or OSD), it should be adopted
  by cephadm; see :ref:`cephadm-adoption`. For stateless daemons, it is
  usually easiest to provision a new daemon with the ``ceph orch apply``
  command and then stop the unmanaged daemon.

* If the stray daemon(s) are running on hosts not managed by cephadm,
  you can manage the host(s) by running the following command:

  .. prompt:: bash #

     ceph orch host add *<hostname>*

  .. note::

     You might need to configure SSH access to the remote host
     before this will work.

* See :ref:`cephadm-fqdn` for more information about host names and
  domain names.

* This warning can be disabled entirely by running the following command:

  .. prompt:: bash #

     ceph config set mgr mgr/cephadm/warn_on_stray_daemons false

CEPHADM_HOST_CHECK_FAILED
~~~~~~~~~~~~~~~~~~~~~~~~~

One or more hosts have failed the basic cephadm host check, which verifies
that (1) the host is reachable and cephadm can be executed there, and (2)
that the host satisfies basic prerequisites, like a working container
runtime (podman or docker) and working time synchronization.
If this test fails, cephadm will not be able to manage services on that host.

You can manually run this check by running the following command:

.. prompt:: bash #

   ceph cephadm check-host *<hostname>*

You can remove a broken host from management by running the following command:

.. prompt:: bash #

   ceph orch host rm *<hostname>*

You can disable this health warning by running the following command:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/warn_on_failed_host_check false

Cluster Configuration Checks
----------------------------

Cephadm periodically scans each of the hosts in the cluster in order
to understand the state of the OS, disks, NICs, etc. These facts can
then be analyzed for consistency across the hosts in the cluster to
identify any configuration anomalies.

Enabling Cluster Configuration Checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The configuration checks are an **optional** feature, and are enabled
by running the following command:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/config_checks_enabled true

States Returned by Cluster Configuration Checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The configuration checks are triggered after each host scan (every 1m). The
cephadm log entries will show the current state and outcome of the
configuration checks as follows:

Disabled state (``config_checks_enabled`` is ``false``):

.. code-block:: bash

   ALL cephadm checks are disabled, use 'ceph config set mgr mgr/cephadm/config_checks_enabled true' to enable

Enabled state (``config_checks_enabled`` is ``true``):

.. code-block:: bash

   CEPHADM 8/8 checks enabled and executed (0 bypassed, 0 disabled). No issues detected

Managing Configuration Checks (subcommands)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The configuration checks themselves are managed through several cephadm subcommands.

To determine whether the configuration checks are enabled, run the following command:

.. prompt:: bash #

   ceph cephadm config-check status

This command returns the status of the configuration checker as either "Enabled" or "Disabled".

To list all the configuration checks and their current states, run the following command:

.. code-block:: console

   # ceph cephadm config-check ls

   NAME             HEALTHCHECK                      STATUS   DESCRIPTION
   kernel_security  CEPHADM_CHECK_KERNEL_LSM         enabled  checks SELINUX/Apparmor profiles are consistent across cluster hosts
   os_subscription  CEPHADM_CHECK_SUBSCRIPTION       enabled  checks subscription states are consistent for all cluster hosts
   public_network   CEPHADM_CHECK_PUBLIC_MEMBERSHIP  enabled  check that all hosts have a NIC on the Ceph public_network
   osd_mtu_size     CEPHADM_CHECK_MTU                enabled  check that OSD hosts share a common MTU setting
   osd_linkspeed    CEPHADM_CHECK_LINKSPEED          enabled  check that OSD hosts share a common linkspeed
   network_missing  CEPHADM_CHECK_NETWORK_MISSING    enabled  checks that the cluster/public networks defined exist on the Ceph hosts
   ceph_release     CEPHADM_CHECK_CEPH_RELEASE       enabled  check for Ceph version consistency - ceph daemons should be on the same release (unless upgrade is active)
   kernel_version   CEPHADM_CHECK_KERNEL_VERSION     enabled  checks that the MAJ.MIN of the kernel on Ceph hosts is consistent

The name of each configuration check can be used to enable or disable a
specific check by running a command of the following form:

.. prompt:: bash #

   ceph cephadm config-check disable <name>

For example:

.. prompt:: bash #

   ceph cephadm config-check disable kernel_security
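
To turn a check back on, the corresponding ``enable`` subcommand can be used
in the same way:

.. prompt:: bash #

   ceph cephadm config-check enable kernel_security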

CEPHADM_CHECK_KERNEL_LSM
~~~~~~~~~~~~~~~~~~~~~~~~

Each host within the cluster is expected to operate within the same Linux
Security Module (LSM) state. For example, if the majority of the hosts are
running with SELINUX in enforcing mode, any host not running in this mode is
flagged as an anomaly and a health check (WARNING) state is raised.
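
To see which mode a given host is in, the standard SELinux tooling can be run
directly on that host (``aa-status`` is the AppArmor counterpart):

.. prompt:: bash #

   getenforce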

CEPHADM_CHECK_SUBSCRIPTION
~~~~~~~~~~~~~~~~~~~~~~~~~~

This check relates to the status of the vendor subscription. This check is
performed only for hosts using RHEL, but helps to confirm that all hosts are
covered by an active subscription, which ensures that patches and updates are
available.

CEPHADM_CHECK_PUBLIC_MEMBERSHIP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All members of the cluster should have NICs configured on at least one of the
public network subnets. Hosts that are not on the public network will rely on
routing, which may affect performance.

CEPHADM_CHECK_MTU
~~~~~~~~~~~~~~~~~

The MTU of the NICs on OSD hosts can be a key factor in consistent
performance. This check examines hosts that are running OSD services to
ensure that the MTU is configured consistently within the cluster. This is
determined by establishing the MTU setting that the majority of hosts are
using. Any anomalies result in a Ceph health check.
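
The MTU that a host is actually using can be confirmed with the standard
``ip`` tool; for example, to show the MTU of a specific interface (``eth0``
here is just a placeholder):

.. prompt:: bash #

   ip link show eth0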

CEPHADM_CHECK_LINKSPEED
~~~~~~~~~~~~~~~~~~~~~~~

This check is similar to the MTU check. Linkspeed consistency is a factor in
consistent cluster performance, just as the MTU of the NICs on the OSD hosts
is. This check determines the linkspeed shared by the majority of OSD hosts,
and a health check is raised for any hosts that are set at a lower linkspeed
rate.

CEPHADM_CHECK_NETWORK_MISSING
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``public_network`` and ``cluster_network`` settings support subnet
definitions for IPv4 and IPv6. If the subnets defined by these settings are
not found on any host in the cluster, a health check is raised.

CEPHADM_CHECK_CEPH_RELEASE
~~~~~~~~~~~~~~~~~~~~~~~~~~

Under normal operations, the Ceph cluster runs daemons that are all under the
same Ceph release (for example, all daemons run Octopus). This check
determines the active release for each daemon, and reports any anomalies as a
health check. *This check is bypassed if an upgrade process is active within
the cluster.*

CEPHADM_CHECK_KERNEL_VERSION
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The OS kernel version (maj.min) is checked for consistency across the hosts.
The kernel version of the majority of the hosts is used as the basis for
identifying anomalies.

.. _client_keyrings_and_configs:

Client keyrings and configs
===========================

Cephadm can distribute copies of the ``ceph.conf`` file and client keyring
files to hosts. It is usually a good idea to store a copy of the config and
``client.admin`` keyring on any host used to administer the cluster via the
CLI. By default, cephadm does this for any nodes that have the ``_admin``
label (which normally includes the bootstrap host).

When a client keyring is placed under management, cephadm will:

- build a list of target hosts based on the specified placement spec (see
  :ref:`orchestrator-cli-placement-spec`)
- store a copy of the ``/etc/ceph/ceph.conf`` file on the specified host(s)
- store a copy of the keyring file on the specified host(s)
- update the ``ceph.conf`` file as needed (e.g., due to a change in the cluster monitors)
- update the keyring file if the entity's key is changed (e.g., via ``ceph
  auth ...`` commands)
- ensure that the keyring file has the specified ownership and specified mode
- remove the keyring file when client keyring management is disabled
- remove the keyring file from old hosts if the keyring placement spec is
  updated (as needed)

Listing Client Keyrings
-----------------------

To see the list of client keyrings that are currently under management, run the following command:

.. prompt:: bash #

   ceph orch client-keyring ls

Putting a Keyring Under Management
----------------------------------

To put a keyring under management, run a command of the following form:

.. prompt:: bash #

   ceph orch client-keyring set <entity> <placement> [--mode=<mode>] [--owner=<uid>.<gid>] [--path=<path>]

- By default, the *path* is ``/etc/ceph/client.{entity}.keyring``, which is
  where Ceph looks by default. Be careful when specifying alternate locations,
  as existing files may be overwritten.
- A placement of ``*`` (all hosts) is common.
- The mode defaults to ``0600`` and ownership to ``0:0`` (user root, group root).

For example, to create a ``client.rbd`` key, deploy it to hosts with the
``rbd-client`` label, and make it group readable by uid/gid 107 (qemu), run
the following commands:

.. prompt:: bash #

   ceph auth get-or-create-key client.rbd mon 'profile rbd' mgr 'profile rbd' osd 'profile rbd pool=my_rbd_pool'
   ceph orch client-keyring set client.rbd label:rbd-client --owner 107:107 --mode 640

The resulting keyring file is:

.. code-block:: console

   -rw-r-----. 1 qemu qemu 156 Apr 21 08:47 /etc/ceph/client.client.rbd.keyring
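
To confirm that the deployed key matches what is stored in the cluster, you
can display the entity's key and capabilities with:

.. prompt:: bash #

   ceph auth get client.rbd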

Disabling Management of a Keyring File
--------------------------------------

To disable management of a keyring file, run a command of the following form:

.. prompt:: bash #

   ceph orch client-keyring rm <entity>

.. note::

   This deletes any keyring files for this entity that were previously written
   to cluster nodes.

.. _etc_ceph_conf_distribution:

/etc/ceph/ceph.conf
===================

Distributing ceph.conf to hosts that have no keyrings
-----------------------------------------------------

It might be useful to distribute ``ceph.conf`` files to hosts without an
associated client keyring file. By default, cephadm deploys only a
``ceph.conf`` file to hosts where a client keyring is also distributed (see
above). To write config files to hosts without client keyrings, run the
following command:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf true
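
You can verify the current value of this setting (or of any of the
``mgr/cephadm/*`` options used on this page) with ``ceph config get``:

.. prompt:: bash #

   ceph config get mgr mgr/cephadm/manage_etc_ceph_ceph_conf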

Using Placement Specs to specify which hosts get config files
--------------------------------------------------------------

By default, the configs are written to all hosts (i.e., those listed by ``ceph
orch host ls``). To specify which hosts get a ``ceph.conf``, run a command of
the following form:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf_hosts <placement spec>

Distributing ceph.conf to hosts tagged with bare_config
--------------------------------------------------------

For example, to distribute configs to hosts with the ``bare_config`` label, run the following command:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf_hosts label:bare_config

(See :ref:`orchestrator-cli-placement-spec` for more information about placement specs.)

Purging a cluster
=================

.. danger:: THIS OPERATION WILL DESTROY ALL DATA STORED IN THIS CLUSTER

In order to destroy a cluster and delete all data stored in that cluster,
first pause cephadm so that it does not deploy any new daemons:

.. prompt:: bash #

   ceph orch pause

Then verify the FSID of the cluster:

.. prompt:: bash #

   ceph fsid

Then purge the Ceph daemons from all hosts in the cluster:

.. prompt:: bash #

   # For each host:
   cephadm rm-cluster --force --zap-osds --fsid <fsid>
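
If SSH access to the hosts is available, one way to apply this last step to
every host is a simple shell loop. The hostnames below are placeholders for
your own hosts; capture the host list (for example, from ``ceph orch host
ls``) before the monitors are removed, because the cluster will stop
responding to ``ceph`` commands once purging begins:

.. prompt:: bash #

   for host in host1 host2 host3; do
       ssh root@$host cephadm rm-cluster --force --zap-osds --fsid <fsid>
   done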