Troubleshooting
===============

You might need to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons as containers, which means that
troubleshooting those containerized daemons can work
differently than troubleshooting daemons that were deployed
in the traditional, non-containerized way.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.
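
To undo this later, re-enable the cephadm module and set the orchestrator
backend back to cephadm. A minimal sketch of the re-enable sequence
(see :ref:`cephadm-enable-cli`):

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm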

See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.

Per-service and per-daemon events
---------------------------------

To help with debugging failed daemon deployments, cephadm stores
events per service and per daemon. These events often contain
information relevant to troubleshooting your Ceph cluster.

Listing service events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing daemon events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking cephadm logs
---------------------

To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there will be a
cephadm log file called ``ceph.cephadm.log`` on all monitor hosts (see
:ref:`cephadm-logs` for a more complete explanation).
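
As a quick check, the most recent cephadm events recorded in the cluster log
can be printed directly; a minimal sketch (see :ref:`watching_cephadm_logs`
for the full workflow)::

    ceph log last cephadm
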
Gathering log files
-------------------

Use ``journalctl`` to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.
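
Because the logs live in journald, you can also query them with ``journalctl``
directly; a sketch, assuming the default unit naming used by cephadm (the same
``ceph-<fsid>@<daemon-name>`` scheme shown in the next section)::

    journalctl -u ceph-<fsid>@<name-of-daemon>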

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons of a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"

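If the host uses Docker rather than Podman, a roughly equivalent listing is
possible with the Docker CLI; a sketch, assuming Docker is the container
engine on that host::

    docker ps -a --format '{{.Image}}'
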
Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.
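
To inspect the exact command line that is used to start a given container,
simply print that wrapper; a minimal sketch, using the same placeholders as
above::

    cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run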

.. _cephadm-ssh-errors:

SSH errors
----------

Error message::

    execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
    ...
    raise OrchestratorError(msg) from e
    orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    ...

Things users can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the ssh config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the ``authorized_keys`` file, run the following commands::

    [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
    [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

Failed to infer CIDR network error
----------------------------------

If you see this error::

    ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

    Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

    ceph config set mon public_network <mon_network>

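For example, if your monitors live on a 10.1.2.0/24 subnet (an illustrative
value, not one taken from your cluster), the command would be::

    ceph config set mon public_network 10.1.2.0/24
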
For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

    [root@mon1 ~]# cephadm enter --name <daemon-name>
    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

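From inside the container you can also list every command the socket supports;
a sketch using the standard ``help`` admin socket command::

    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help
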
Calling miscellaneous ceph tools
--------------------------------

To call miscellaneous tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, run them from within a
``cephadm shell --name <daemon-name>`` session like so::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

This command sets up the environment in a way that is suitable
for extended daemon maintenance and for running the daemon interactively.

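When the maintenance is finished, remember to start the daemon again; a
minimal sketch, mirroring the ``stop`` command above::

    root@myhostname # cephadm unit --name mon.myhostname start
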
.. _cephadm-restore-quorum:

Restoring the MON quorum
------------------------

If the Ceph MONs cannot form a quorum, cephadm is not able
to manage the cluster until the quorum is restored.

In order to restore the MON quorum, remove unhealthy MONs
from the monmap by following these steps:

1. Stop all MONs. For each MON host::

    ssh {mon-host}
    cephadm unit --name mon.`hostname` stop


2. Identify a surviving monitor and log in to that host::

    ssh {mon-host}
    cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually deploying a MGR daemon
-------------------------------

cephadm requires a MGR daemon in order to manage the cluster. If the last MGR
of a cluster was removed, follow these steps in order to manually deploy a MGR
``mgr.hostname.smfvfd`` on a random host of your cluster.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new
MGR. See :ref:`cephadm-enable-cli`::

    ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new MGR::

    ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

    ceph config generate-minimal-conf

Get the container image::

    ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

  {
    "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
    "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
  }

Deploy the daemon::

    cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
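
Once the new MGR daemon is up and the orchestrator responds again, let cephadm
resume its background activity; a minimal sketch (``ceph orch resume`` is the
counterpart of ``ceph orch pause`` and should clear the pause flag set above)::

    ceph orch resume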