Troubleshooting
===============

You might need to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons as containers, which means that
troubleshooting those containerized daemons works differently
than troubleshooting daemons that were installed the traditional
way, outside of containers.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.

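
To undo these steps later, a paused orchestrator can be resumed with
``ceph orch resume``; if you disabled cephadm entirely, it can be
re-enabled roughly as follows (a sketch that simply reverses the two
commands above):

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm
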
See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.

Per-service and per-daemon events
---------------------------------

To help with debugging failed daemon deployments, cephadm stores
events per service and per daemon. These events often contain
information relevant to troubleshooting your Ceph cluster.

Listing service events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing daemon events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking cephadm logs
---------------------

To learn how to monitor the cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there
will exist a cephadm log file called ``ceph.cephadm.log`` on all monitor
hosts (see :ref:`cephadm-logs` for a more complete explanation of this).

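
As a quick reminder, cephadm log messages can be followed in real time
with a command like the following (covered in
:ref:`watching_cephadm_logs`):

.. prompt:: bash #

   ceph -W cephadm
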
Gathering log files
-------------------

Use journalctl to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

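
Because these logs live in journald, you can also read them with
``journalctl`` directly; the unit name follows the
``ceph-<fsid>@<name-of-daemon>.service`` pattern used in the next
section. For example (substituting your own cluster ID and daemon
name)::

    journalctl -u ceph-<fsid>@<name-of-daemon>
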
To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons on a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"

Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.

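
For example, to inspect the exact container invocation for a daemon
(both placeholders stand for values from your own cluster), print the
wrapper and copy the run command from it::

    cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run
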
.. _cephadm-ssh-errors:

ssh errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

Things users can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the ssh config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

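
If the key is missing from ``authorized_keys``, one way to install it
(assuming root SSH access to the target host and the ``~/ceph.pub`` file
written above) is ``ssh-copy-id``::

  [root@mon1 ~]# ssh-copy-id -f -i ~/ceph.pub root@mon1
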
Failed to infer CIDR network error
----------------------------------

If you see this error::

  ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

  Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

  ceph config set mon public_network <mon_network>

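
For example, if your monitors sit on a hypothetical 10.1.2.0/24 subnet::

  ceph config set mon public_network 10.1.2.0/24
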
For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

  [root@mon1 ~]# cephadm enter --name <daemon-name>
  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

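
Once inside the container, the operations available on that daemon's
admin socket can be listed with the generic ``help`` command against the
same socket path::

  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help
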
Calling miscellaneous ceph tools
--------------------------------

To call miscellaneous tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, you can run them by calling
``cephadm shell --name <daemon-name>`` like so::

  root@myhostname # cephadm unit --name mon.myhostname stop
  root@myhostname # cephadm shell --name mon.myhostname
  [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
  [ceph: root@myhostname /]# monmaptool --print monmap
  monmaptool: monmap file monmap
  epoch 1
  fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
  last_changed 2021-11-01T20:57:19.755111+0000
  created 2021-11-01T20:57:19.755111+0000
  min_mon_release 17 (quincy)
  election_strategy: 1
  0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

This command sets up the environment in a way that is suitable
for extended daemon maintenance and running the daemon interactively.

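
When maintenance is finished, the daemon that was stopped above can be
started again with a command mirroring the ``stop`` invocation::

  root@myhostname # cephadm unit --name mon.myhostname start
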
.. _cephadm-restore-quorum:

Restoring the MON quorum
------------------------

In case the Ceph MONs cannot form a quorum, cephadm is not able
to manage the cluster until the quorum is restored.

In order to restore the MON quorum, remove unhealthy MONs
from the monmap by following these steps:

1. Stop all MONs. For each MON host::

     ssh {mon-host}
     cephadm unit --name mon.`hostname` stop


2. Identify a surviving monitor and log in to that host::

     ssh {mon-host}
     cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`

.. _cephadm-manually-deploy-mgr:

Manually deploying a MGR daemon
-------------------------------

cephadm requires a MGR daemon in order to manage the cluster. In case
the last MGR of a cluster was removed, follow these steps in order to deploy
a MGR ``mgr.hostname.smfvfd`` on a random host of your cluster manually.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new
MGR. See :ref:`cephadm-enable-cli`::

  ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new MGR::

  ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

  ceph config generate-minimal-conf

Get the container image::

  ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

  {
    "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
    "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
  }

Deploy the daemon::

  cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Analyzing core dumps
---------------------

In case a Ceph daemon crashes, cephadm supports analyzing core dumps. To
enable core dumps, run:

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

Now, wait for the crash to happen again. (To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``.)

Install debug packages by entering the cephadm shell and installing ``ceph-debuginfo``::

  # cephadm shell --mount /var/lib/systemd/coredump
  [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
  [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
  [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
  (gdb) bt
  #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
  #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
  #3 0x0000563085ca3d7e in main ()