Troubleshooting
===============

You may wish to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons within containers. This means that
troubleshooting those containerized daemons will require
a different process than traditional package-install daemons.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or Disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.

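If you later need to re-enable cephadm, restore the orchestrator backend by
running the following commands (a brief sketch; see :ref:`cephadm-enable-cli`
for the authoritative procedure):

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm

A cluster that was merely paused can be resumed by running ``ceph orch resume``.
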
See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.


Per-service and Per-daemon Events
---------------------------------

In order to facilitate debugging failed daemons,
cephadm stores events per service and per daemon.
These events often contain information relevant to
troubleshooting your Ceph cluster.

Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

  service_type: alertmanager
  service_name: alertmanager
  placement:
    hosts:
    - unknown_host
  status:
    ...
    running: 1
    size: 1
  events:
  - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
  - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
    place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

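To scan the events of all services at once for problems, you can filter the
YAML output with ordinary shell tools (a sketch, assuming ``grep`` is
available on the host)::

  ceph orch ls --format yaml | grep '\[ERROR\]'
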
Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

  daemon_type: mds
  daemon_id: cephfs.hostname.ppdhsz
  hostname: hostname
  status_desc: running
  ...
  events:
  - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
    mds.cephfs.hostname.ppdhsz on host 'hostname'"


Checking Cephadm Logs
---------------------

To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there will be a
``ceph.cephadm.log`` file on all monitor hosts (see
:ref:`cephadm-logs` for a more complete explanation).

Gathering Log Files
-------------------

Use journalctl to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

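Because cephadm stores daemon logs in journald, they can also be read
directly with ``journalctl``, using the ``ceph-<fsid>@<name-of-daemon>``
unit-naming pattern shown in the next section::

    journalctl -u ceph-<fsid>@<name-of-daemon>.service
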
Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons on a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done


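To get a quick overview of all Ceph units on a host, including any that have
failed, you can also match them by pattern (a sketch; unit names follow the
``ceph-<fsid>@...`` scheme shown above)::

    systemctl list-units 'ceph-*'
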
List all Downloaded Container Images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called `ImageID`

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"

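On hosts that run Docker instead of Podman, a roughly equivalent command
(a sketch, assuming a Docker CLI that supports Go-template formatting) is::

    docker ps -a --format '{{.Image}}'
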
Manually Running Containers
---------------------------

Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.

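To inspect the exact container invocation for a daemon, print the wrapper
(substituting your cluster fsid and service name)::

    cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run
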
.. _cephadm-ssh-errors:

SSH Errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

Things Ceph administrators can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

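If the ``grep`` command prints nothing, the key is missing from the file. One
way to fix this (a sketch, assuming you still have root access to the host by
other means) is to append the public key::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key >> /root/.ssh/authorized_keys
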
Failed to Infer CIDR network error
----------------------------------

If you see this error::

   ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

   Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

  ceph config set mon public_network <mon_network>

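For example, if your monitors sit on a hypothetical ``192.168.0.0/24``
subnet, you would run::

  ceph config set mon public_network 192.168.0.0/24
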
For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (See :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

  [root@mon1 ~]# cephadm enter --name <daemon-name>
  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

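Once inside the container, any other admin socket command works the same way.
To list everything the daemon supports, run (using the same ``<daemon-name>``
placeholder)::

  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help
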
Running Various Ceph Tools
--------------------------

To run Ceph tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable
for extended daemon maintenance and running daemons interactively.

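The same pattern applies to OSD tools. For example, to list the objects held
by a stopped OSD with ``ceph-objectstore-tool`` (a sketch, assuming a
hypothetical ``osd.0`` on this host)::

    root@myhostname # cephadm unit --name osd.0 stop
    root@myhostname # cephadm shell --name osd.0
    [ceph: root@myhostname /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list
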
.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph monitor daemons (mons) cannot form a quorum, cephadm will not be
able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all mons. For each mon host::

    ssh {mon-host}
    cephadm unit --name mon.`hostname` stop


2. Identify a surviving monitor and log in to that host::

    ssh {mon-host}
    cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`; a condensed
   sketch of those steps appears after this list.

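Run inside the surviving mon's container, the removal typically looks like
this (a sketch only; the mon ID and the unhealthy mon's name are
placeholders, and the linked procedure is authoritative)::

    [ceph: root@mon1 /]# ceph-mon -i `hostname` --extract-monmap /tmp/monmap
    [ceph: root@mon1 /]# monmaptool /tmp/monmap --rm <unhealthy-mon-id>
    [ceph: root@mon1 /]# ceph-mon -i `hostname` --inject-monmap /tmp/monmap
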
.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one manager (mgr) daemon is required by cephadm in order to manage the
cluster. If the last mgr in a cluster has been removed, follow these steps in
order to deploy a manager called (for example)
``mgr.hostname.smfvfd`` on a random host of your cluster manually.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new
manager. See :ref:`cephadm-enable-cli`::

  ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new manager::

  ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

  ceph config generate-minimal-conf

Get the container image::

  ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

  {
    "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
    "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
  }

Deploy the daemon::

  cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

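After the daemon has started, verify that the manager is up and re-enable the
cephadm scheduler that was paused above (a sketch)::

  ceph -s
  ceph orch resume
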
Analyzing Core Dumps
--------------------

When a Ceph daemon crashes, cephadm supports analyzing core dumps. To enable core dumps, run

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

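On hosts that use ``systemd-coredump``, you can list any dumps that have been
captured so far with ``coredumpctl`` (a sketch, assuming the tool is
installed on the container host)::

   coredumpctl list
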
Now, wait for the crash to happen again. To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``.

Install debug packages including ``ceph-debuginfo`` by entering the cephadm shell::

  # cephadm shell --mount /var/lib/systemd/coredump
  [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
  [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
  [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
  (gdb) bt
  #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
  #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
  #3  0x0000563085ca3d7e in main ()