You might need to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons as containers. This means that
troubleshooting those containerized daemons might work
differently than you expect (and that is certainly true if
you expect this troubleshooting to work the way that
troubleshooting does when the daemons involved aren't
containerized).

Here are some tools and commands to help you troubleshoot
your Ceph environment.
Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command::

    ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands::

    ceph orch set backend ''
    ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.
See :ref:`cephadm-spec-unmanaged` for information on disabling
cephadm's management of individual services.
Per-service and per-daemon events
---------------------------------

In order to help with the process of debugging failed daemon
deployments, cephadm stores events per service and per daemon.
These events often contain information relevant to
troubleshooting your Ceph cluster.
Listing service events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form::

    ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form::

    service_type: alertmanager
    service_name: alertmanager
    events:
    - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
    - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
      place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'
Listing daemon events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form::

    ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form::

    daemon_id: cephfs.hostname.ppdhsz
    events:
    - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
      mds.cephfs.hostname.ppdhsz on host 'hostname'"
Checking cephadm logs
---------------------

To learn how to monitor the cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, a cephadm
log file called ``ceph.cephadm.log`` will exist on all monitor hosts (see
:ref:`cephadm-logs` for a more complete explanation).
Gathering log files
-------------------

Use journalctl to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.
To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this works only when run on the same host that is running the daemon. To
get the logs of a daemon that is running on a different host, add the ``--fsid``
option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

Here ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done
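The ``jq`` filter in the loop above pulls the ``name`` field out of every entry
of the JSON array that ``cephadm ls`` prints. A self-contained sketch of that
extraction, using a made-up two-daemon inventory in place of real ``cephadm ls``
output:

```shell
# Made-up, trimmed-down stand-in for `cephadm ls` output
# (real output has many more fields per daemon).
cat > /tmp/cephadm-ls-sample.json <<'EOF'
[
  {"name": "mon.host1", "style": "cephadm:v1"},
  {"name": "mgr.host1.abcdef", "style": "cephadm:v1"}
]
EOF

# Same filter as the loop above: -r prints raw strings without JSON
# quotes, and .[].name selects "name" from every array element.
jq -r '.[].name' /tmp/cephadm-ls-sample.json
```

This prints ``mon.host1`` and ``mgr.host1.abcdef``, which the loop then feeds
to ``cephadm logs`` one at a time.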
Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";

To fetch the state of all daemons of a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done
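Both commands target units whose names follow the pattern
``ceph-<fsid>@<daemon-name>.service``, where cephadm uses a systemd template
unit per cluster and a template instance per daemon. A small sketch of how that
name is assembled, with a made-up fsid and daemon name in place of real cluster
values:

```shell
# Illustrative values only; on a real host the fsid comes from
# `cephadm shell ceph fsid` and the name from `cephadm ls`.
fsid="f8edc08a-7f17-11ea-8707-000c2915dd98"
name="mon.host1"

# systemd template unit: the instance name goes after the '@'.
unit="ceph-${fsid}@${name}.service"
echo "$unit"
```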
List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host, run the
following command.

.. note:: ``Image`` might also be called ``ImageID``.

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"
Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.
.. _cephadm-ssh-errors:

SSH errors
----------

Error message::

    execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
    ...
    raise OrchestratorError(msg) from e
    orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key

If you receive this error message, troubleshoot the SSH connection between
cephadm and the host by performing the following steps:
1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or by importing an existing key::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -
2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that it is possible to connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1
Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

    [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
    [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys
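A self-contained sketch of the same check, using throwaway files in place of the
real ``~/ceph.pub`` and ``/root/.ssh/authorized_keys``. Note that ``grep -F``
is used here because public keys contain characters such as ``+`` and ``/``
that plain ``grep`` would interpret as a regular expression:

```shell
# Throwaway stand-ins for the real key files; the key material is fake.
tmpdir=$(mktemp -d)
echo "ssh-rsa AAAAB3NzaC1yc2E+fake/demo+key== ceph-demo" > "$tmpdir/ceph.pub"
cp "$tmpdir/ceph.pub" "$tmpdir/authorized_keys"

# -F: match the key as a fixed string, not as a regular expression.
# -q: report via exit status only, printing nothing.
if grep -qF "$(cat "$tmpdir/ceph.pub")" "$tmpdir/authorized_keys"; then
    echo "public key is authorized"
else
    echo "public key is MISSING"
fi
rm -r "$tmpdir"
```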
Failed to infer CIDR network error
----------------------------------

If you see this error::

    ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

or this error::

    Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

then you must run a command of this form::

    ceph config set mon public_network <mon_network>

For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.
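For context on what "infer CIDR network" means: bootstrap needs some configured
network that actually contains the mon IP, which boils down to a bitmask
comparison of the address against each candidate network. A hypothetical sketch
of that membership test in plain shell arithmetic (the function names are made
up; cephadm's actual implementation is in Python):

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if the IP in $1 lies inside the CIDR network in $2 (e.g. 10.0.0.0/24).
in_cidr() {
    local ip net bits mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 192.168.0.10 192.168.0.0/24 && echo "mon ip is in public_network"
in_cidr 10.0.0.5 192.168.0.0/24 || echo "mon ip is outside public_network"
```

If the mon IP fails this kind of test against every configured network, setting
``public_network`` to a CIDR that contains it resolves the error.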
Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the MONs (see
:ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

    [root@mon1 ~]# cephadm enter --name <daemon-name>
    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
Calling miscellaneous ceph tools
--------------------------------

To call miscellaneous tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, run them from within a shell started with
``cephadm shell --name <daemon-name>``, like so::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

This command sets up the environment in a way that is suitable
for extended daemon maintenance and for running daemons interactively.
.. _cephadm-restore-quorum:

Restoring the MON quorum
------------------------

If the Ceph MONs cannot form a quorum, cephadm will not be able
to manage the cluster until the quorum is restored.

In order to restore the MON quorum, remove unhealthy MONs
from the monmap by following these steps:

1. Stop all MONs. Run the following on each MON host::

     cephadm unit --name mon.`hostname` stop

2. Identify a surviving monitor and log in to its host::

     cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.
.. _cephadm-manually-deploy-mgr:

Manually deploying a MGR daemon
-------------------------------

cephadm requires a MGR daemon in order to manage the cluster. If the last MGR
daemon of a cluster was removed, follow these steps to manually deploy a MGR
daemon (here called ``mgr.hostname.smfvfd``) on a host of your cluster.

Disable the cephadm scheduler in order to prevent cephadm from removing the new
MGR. See :ref:`cephadm-enable-cli`::

    ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new MGR::

    ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the minimal ceph.conf::

    ceph config generate-minimal-conf

Get the container image::

    ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to
deploy the daemon::

    {
      "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
      "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
    }

Then deploy the daemon::

    cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
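The embedded ``\n`` and ``\t`` escapes in ``config-json.json`` are easy to get
wrong by hand. One way to generate the file is to let ``jq`` do the JSON string
escaping; this is a sketch using the example fsid and key from the text above,
not a cephadm-provided command:

```shell
# Write the minimal ceph.conf and the keyring as ordinary files first
# (fsid, mon_host and key are the example values from the text above).
cat > /tmp/minimal-ceph.conf <<'EOF'
# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f
[global]
	fsid = 8255263a-a97e-4934-822c-00bfe029b28f
	mon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]
EOF

cat > /tmp/mgr-keyring <<'EOF'
[mgr.hostname.smfvfd]
	key = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=
EOF

# --rawfile (jq >= 1.6) reads each file verbatim into a string variable;
# jq then emits it with correct JSON escaping for newlines and tabs.
jq -n --rawfile config /tmp/minimal-ceph.conf --rawfile keyring /tmp/mgr-keyring \
    '{config: $config, keyring: $keyring}' > /tmp/config-json.json
```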
Analyzing core dumps
--------------------

In case a Ceph daemon crashes, cephadm supports analyzing core dumps. To
enable core dumps, run::

    ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.
Now, wait for the crash to happen again. (To simulate the crash of a daemon,
run e.g. ``killall -3 ceph-mon``.)

Install debug packages by entering the cephadm shell and installing
``ceph-debuginfo``::

    # cephadm shell --mount /var/lib/systemd/coredump
    [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
    [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
    [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
    (gdb) bt
    #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
    #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
    #3  0x0000563085ca3d7e in main ()