Troubleshooting
===============

You might need to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons as containers, which means that
troubleshooting those containerized daemons works differently
than troubleshooting daemons that were installed the traditional
way, outside of containers.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.

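
To undo these steps later, a paused orchestrator can be resumed with
``ceph orch resume``; if you disabled cephadm entirely, it can be
re-enabled roughly as follows (a sketch that simply reverses the two
commands above):

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm
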
See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.

Per-service and per-daemon events
---------------------------------

To help with debugging failed daemon deployments, cephadm stores
events per service and per daemon. These events often contain
information relevant to troubleshooting your Ceph cluster.

Listing service events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing daemon events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking cephadm logs
---------------------

To learn how to monitor the cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there
will exist a cephadm log file called ``ceph.cephadm.log`` on all monitor
hosts (see :ref:`cephadm-logs` for a more complete explanation of this).

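
As a quick reminder, cephadm log messages can be followed in real time
with a command like the following (covered in
:ref:`watching_cephadm_logs`):

.. prompt:: bash #

   ceph -W cephadm
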
Gathering log files
-------------------

Use journalctl to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

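
Because these logs live in journald, you can also read them with
``journalctl`` directly; the unit name follows the
``ceph-<fsid>@<name-of-daemon>.service`` pattern used in the next
section. For example (substituting your own cluster ID and daemon
name)::

    journalctl -u ceph-<fsid>@<name-of-daemon>
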
To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons on a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"

Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.

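
For example, to inspect the exact container invocation for a daemon
(both placeholders stand for values from your own cluster), print the
wrapper and copy the run command from it::

    cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run
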
.. _cephadm-ssh-errors:

ssh errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

Things users can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the ssh config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

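
If the key is missing from ``authorized_keys``, one way to install it
(assuming root SSH access to the target host and the ``~/ceph.pub`` file
written above) is ``ssh-copy-id``::

  [root@mon1 ~]# ssh-copy-id -f -i ~/ceph.pub root@mon1
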
Failed to infer CIDR network error
----------------------------------

If you see this error::

  ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

  Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

  ceph config set mon public_network <mon_network>

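
For example, if your monitors sit on a hypothetical 10.1.2.0/24 subnet::

  ceph config set mon public_network 10.1.2.0/24
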
For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

  [root@mon1 ~]# cephadm enter --name <daemon-name>
  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

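
Once inside the container, the operations available on that daemon's
admin socket can be listed with the generic ``help`` command against the
same socket path::

  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help
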
Calling miscellaneous ceph tools
--------------------------------

To call miscellaneous tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, you can run them by calling
``cephadm shell --name <daemon-name>`` like so::

  root@myhostname # cephadm unit --name mon.myhostname stop
  root@myhostname # cephadm shell --name mon.myhostname
  [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
  [ceph: root@myhostname /]# monmaptool --print monmap
  monmaptool: monmap file monmap
  epoch 1
  fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
  last_changed 2021-11-01T20:57:19.755111+0000
  created 2021-11-01T20:57:19.755111+0000
  min_mon_release 17 (quincy)
  election_strategy: 1
  0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

This command sets up the environment in a way that is suitable
for extended daemon maintenance and running the daemon interactively.

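
When maintenance is finished, the daemon that was stopped above can be
started again with a command mirroring the ``stop`` invocation::

  root@myhostname # cephadm unit --name mon.myhostname start
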
.. _cephadm-restore-quorum:

Restoring the MON quorum
------------------------

In case the Ceph MONs cannot form a quorum, cephadm is not able
to manage the cluster until the quorum is restored.

In order to restore the MON quorum, remove unhealthy MONs
from the monmap by following these steps:

1. Stop all MONs. For each MON host::

     ssh {mon-host}
     cephadm unit --name mon.`hostname` stop


2. Identify a surviving monitor and log in to that host::

     ssh {mon-host}
     cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`

.. _cephadm-manually-deploy-mgr:

Manually deploying a MGR daemon
-------------------------------

cephadm requires a MGR daemon in order to manage the cluster. In case
the last MGR of a cluster was removed, follow these steps in order to deploy
a MGR ``mgr.hostname.smfvfd`` on a random host of your cluster manually.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new
MGR. See :ref:`cephadm-enable-cli`::

  ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new MGR::

  ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

  ceph config generate-minimal-conf

Get the container image::

  ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

  {
    "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
    "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
  }

Deploy the daemon::

  cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Analyzing core dumps
---------------------

In case a Ceph daemon crashes, cephadm supports analyzing core dumps. To
enable core dumps, run:

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

Now, wait for the crash to happen again. (To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``.)

Install debug packages by entering the cephadm shell and installing ``ceph-debuginfo``::

  # cephadm shell --mount /var/lib/systemd/coredump
  [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
  [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
  [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
  (gdb) bt
  #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
  #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
  #3 0x0000563085ca3d7e in main ()