Troubleshooting
===============

This section explains how to investigate why a cephadm command failed or why a
certain service no longer runs properly.

Cephadm deploys daemons within containers. Troubleshooting containerized
daemons requires a different process than does troubleshooting traditional
daemons that were installed by means of packages.

Here are some tools and commands to help you troubleshoot your Ceph
environment.

.. _cephadm-pause:

Pausing or Disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, pause most of the Ceph
cluster's background activity by running the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will still periodically
check hosts to refresh its inventory of daemons and devices. Disable cephadm
completely by running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all ``ceph orch ...`` CLI commands. All previously
deployed daemon containers continue to run and will start just as they were
before you ran these commands.
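
If you later want cephadm to resume managing the cluster, the module and the
orchestrator backend can be re-enabled; see :ref:`cephadm-enable-cli` for the
full procedure:

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm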

See :ref:`cephadm-spec-unmanaged` for more on disabling individual services.

Per-service and Per-daemon Events
---------------------------------

To make it easier to debug failed daemons, cephadm stores events per service
and per daemon. These events often contain information relevant to
the troubleshooting of your Ceph cluster.

Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a command of the
following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return information in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a command of the
following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking Cephadm Logs
---------------------

To learn how to monitor cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there will be
a ``ceph.cephadm.log`` file on all monitor hosts. See :ref:`cephadm-logs` for a
more complete explanation.

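For a quick look at recent cephadm activity without leaving the CLI, the
cluster log can be queried directly; this assumes the default cluster-log
settings are in place:

.. prompt:: bash #

   ceph log last cephadm
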
Gathering Log Files
-------------------

Use ``journalctl`` to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run a command of the following
form:

.. prompt:: bash

   cephadm logs --name <name-of-daemon>

.. note:: This works only when run on the same host that is running the daemon.
   To get the logs of a daemon that is running on a different host, add the
   ``--fsid`` option to the command, as in the following example:

   .. prompt:: bash

      cephadm logs --fsid <fsid> --name <name-of-daemon>

   In this example, ``<fsid>`` corresponds to the cluster ID returned by the
   ``ceph status`` command.

To fetch all log files of all daemons on a given host, run the following
for-loop::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

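Because cephadm stores daemon logs in journald, the same logs can also be read
directly with ``journalctl``. The unit name follows the
``ceph-<fsid>@<daemon-name>.service`` pattern used in the next section, for
example:

.. prompt:: bash

   journalctl -u "ceph-$(cephadm shell ceph fsid)@<name-of-daemon>.service"
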
Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run a command of the following form:

.. prompt:: bash

   systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons of a given host, run the following shell
script::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

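To get a quick overview of all Ceph units present on a host before inspecting a
specific one, a unit-name glob can be used; this is a convenience step, not
part of the procedure above:

.. prompt:: bash

   systemctl list-units 'ceph-*'
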

List all Downloaded Container Images
------------------------------------

To list all container images that are downloaded on a host, run the following
command:

.. prompt:: bash #

   podman ps -a --format json | jq '.[].Image'

Example output::

   "docker.io/library/centos:8"
   "registry.opensuse.org/opensuse/leap:15.2"

.. note:: ``Image`` might also be called ``ImageID``.


Manually Running Containers
---------------------------

Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the container
execution command.

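For example, to see exactly how a particular daemon's container is started,
print the corresponding ``unit.run`` file (replace the placeholders with your
cluster FSID and the daemon's service name):

.. prompt:: bash #

   cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run
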
.. _cephadm-ssh-errors:

SSH Errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

If you receive the above error message, try the following things to
troubleshoot the SSH connection between ``cephadm`` and the monitor:

1. Ensure that ``cephadm`` has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that it is possible to connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the ``authorized_keys`` file, run the
following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

Failed to Infer CIDR network error
----------------------------------

If you see this error::

   ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

   Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form:

.. prompt:: bash

   ceph config set mon public_network <mon_network>

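For example, if the monitors should listen on the ``10.1.2.0/24`` subnet (an
address used here purely for illustration), the command would be:

.. prompt:: bash

   ceph config set mon public_network 10.1.2.0/24
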
For more detail on operations of this kind, see
:ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that allows runtime option setting
and statistic reading. See :ref:`rados-monitoring-using-admin-socket`.

#. To access the admin socket, enter the daemon container on the host::

     [root@mon1 ~]# cephadm enter --name <daemon-name>

#. Run a command of the following forms to see the admin socket's configuration
   and other available actions::

     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help

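For instance, still inside the daemon container, the same socket can be used to
dump the daemon's performance counters; ``perf dump`` is one of the commands
listed by ``help``::

   [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok perf dump
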
Running Various Ceph Tools
--------------------------

To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable for
extended daemon maintenance and for the interactive running of daemons.

.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
be able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
   while connected to the Monitor's host use ``cephadm`` to stop the Monitor
   daemon:

   .. prompt:: bash

      ssh {mon-host}
      cephadm unit --name {mon.hostname} stop

2. Identify a surviving Monitor and log in to its host:

   .. prompt:: bash

      ssh {mon-host}
      cephadm enter --name {mon.hostname}

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one Manager (``mgr``) daemon is required by cephadm in order to manage
the cluster. If the last remaining Manager has been removed from the Ceph
cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.

#. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
   the new Manager. See :ref:`cephadm-enable-cli`:

   .. prompt:: bash #

      ceph config-key set mgr/cephadm/pause true

#. Retrieve or create the "auth entry" for the new Manager:

   .. prompt:: bash #

      ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

#. Retrieve the Monitor's configuration:

   .. prompt:: bash #

      ceph config generate-minimal-conf

#. Retrieve the container image:

   .. prompt:: bash #

      ceph config get "mgr.hostname.smfvfd" container_image

#. Create a file called ``config-json.json``, which contains the information
   necessary to deploy the daemon:

   .. code-block:: json

      {
        "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
        "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
      }

#. Deploy the Manager daemon:

   .. prompt:: bash #

      cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

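Once the daemon has started, a quick sanity check (not part of the original
procedure) is to confirm that a Manager is reported as active again:

.. prompt:: bash #

   ceph -s
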
Capturing Core Dumps
--------------------

A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.


To enable coredump handling, run the following command:

.. prompt:: bash #

   ulimit -c unlimited


.. note::

   Core dumps are not namespaced by the kernel. This means that core dumps are
   written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
   -c unlimited`` setting will persist only until the system is rebooted.

Wait for the crash to happen again. To simulate the crash of a daemon, run for
example ``killall -3 ceph-mon``.

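If the ``coredumpctl`` utility from systemd is installed on the host (an
assumption, not a requirement of this procedure), it can be used to confirm
that a core dump was captured and to inspect its metadata:

.. prompt:: bash #

   coredumpctl list
   coredumpctl info ceph-mon
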
Running the Debugger with cephadm
---------------------------------

Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initiate a debugging session by using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

#. Start the shell session:

   .. prompt:: bash #

      cephadm shell --mount /var/lib/systemd/coredump

#. From within the shell session, run the following commands:

   .. prompt:: bash #

      dnf install ceph-debuginfo gdb zstd

   .. prompt:: bash #

      unzstd /var/lib/systemd/coredump/core.ceph-*.zst

   .. prompt:: bash #

      gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*

#. Run debugger commands at gdb's prompt:

   .. prompt:: bash (gdb)

      bt

   ::

      #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
      #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
      #3 0x0000563085ca3d7e in main ()


Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm shell``
is invoked, but to save time and resources you can create a new container image
and use it for repeated debugging sessions.

In the following example, we create a simple file that constructs the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::

    cat >Containerfile <<EOF
    ARG BASE_IMG=quay.io/ceph/ceph:v18
    FROM \${BASE_IMG}
    # install ceph debuginfo packages, gdb and other potentially useful packages
    RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
    EOF
    podman build -t ceph:debugging -f Containerfile .
    # pass --build-arg=BASE_IMG=<your image> to customize the base image

The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node that is running other Ceph
containers. See the ``podman`` or ``docker`` documentation for more
information about the container workflow.

After the image has been built, it can be used to initiate repeat debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and the debuginfo packages every time you need to
run a debug session. To debug a core file using this image, in the same way as
previously described, run:

.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump


Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching it
to the same PID namespace in which the process to be debugged resides.

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins the shell to the PID
namespace of the running container. For example, assume we want to debug the
Manager and have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
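
One way to determine the container name is to list the running containers and
filter for the daemon of interest; the ``-mgr-`` filter below is only an
illustration and assumes that podman is the container engine in use:

.. prompt:: bash #

   podman ps --format '{{.Names}}' | grep -- '-mgr-'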

We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:

.. prompt:: bash (gdb)

   attach 2
   info threads
   bt