Troubleshooting
===============

You may wish to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons within containers. This means that
troubleshooting those containerized daemons requires
a different process than it does for daemons installed from
traditional packages.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or Disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.

See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.

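Once the underlying problem has been fixed, the steps above can be reversed. A sketch of the reverse commands (this assumes a reachable cluster; see :ref:`cephadm-enable-cli`):

```shell
# Resume background activity after `ceph orch pause`:
ceph orch resume

# Re-enable a fully disabled cephadm module and orchestrator backend:
ceph mgr module enable cephadm
ceph orch set backend cephadm
```
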
Per-service and Per-daemon Events
---------------------------------

In order to facilitate debugging failed daemons,
cephadm stores events per service and per daemon.
These events often contain information relevant to
troubleshooting your Ceph cluster.

Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking Cephadm Logs
---------------------

To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there will be a
``ceph.cephadm.log`` file on all monitor hosts (see
:ref:`cephadm-logs` for a more complete explanation).

Gathering Log Files
-------------------

Use journalctl to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";

To fetch the state of all daemons of a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

154 | ||
39ae355f | 155 | List all Downloaded Container Images |
9f95a23c TL |
156 | ------------------------------------ |
157 | ||
158 | To list all container images that are downloaded on a host: | |
159 | ||
160 | .. note:: ``Image`` might also be called `ImageID` | |
161 | ||
162 | :: | |
163 | ||
164 | podman ps -a --format json | jq '.[].Image' | |
165 | "docker.io/library/centos:8" | |
166 | "registry.opensuse.org/opensuse/leap:15.2" | |
167 | ||
168 | ||
39ae355f | 169 | Manually Running Containers |
9f95a23c TL |
170 | --------------------------- |
171 | ||
39ae355f | 172 | Cephadm uses small wrappers when running containers. Refer to |
9f95a23c TL |
173 | ``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the |
174 | container execution command. | |
1911f103 | 175 | |
.. _cephadm-ssh-errors:

SSH Errors
----------

Error message::

    execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
    ...
    raise OrchestratorError(msg) from e
    orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    ...

Things Ceph administrators can do:

1. Ensure cephadm has an SSH identity key::

       [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
       INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
       INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
       [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

       [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

       [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

       [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

       [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

    [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
    [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

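String-matching the whole key can fail on trailing-whitespace or comment differences. As an alternative sketch, compare key fingerprints instead (the throwaway key below merely stands in for the cephadm key; on a real cluster you would run ``ssh-keygen -lf`` against ``~/ceph.pub`` and ``/root/.ssh/authorized_keys``):

```shell
# Build a throwaway key pair and authorized_keys file for illustration:
tmp="$(mktemp -d)"
ssh-keygen -q -t ed25519 -N '' -f "$tmp/ceph"
cp "$tmp/ceph.pub" "$tmp/authorized_keys"

# Compare fingerprints rather than raw key strings:
pub_fp="$(ssh-keygen -lf "$tmp/ceph.pub" | awk '{print $2}')"
auth_fps="$(ssh-keygen -lf "$tmp/authorized_keys" | awk '{print $2}')"
case "$auth_fps" in
  *"$pub_fp"*) echo "key authorized" ;;
  *)           echo "key missing" ;;
esac
# → key authorized
```
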
Failed to Infer CIDR network error
----------------------------------

If you see this error::

    ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

    Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

    ceph config set mon public_network <mon_network>

For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

    [root@mon1 ~]# cephadm enter --name <daemon-name>
    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

Running Various Ceph Tools
--------------------------

To run Ceph tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable
for extended daemon maintenance and running daemons interactively.

.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph monitor daemons (mons) cannot form a quorum, cephadm will not be
able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all mons. For each mon host::

       ssh {mon-host}
       cephadm unit --name mon.`hostname` stop

2. Identify a surviving monitor and log in to that host::

       ssh {mon-host}
       cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one manager (mgr) daemon is required by cephadm in order to manage the
cluster. If the last mgr in a cluster has been removed, follow these steps in
order to manually deploy a manager called (for example)
``mgr.hostname.smfvfd`` on a random host of your cluster.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new
manager. See :ref:`cephadm-enable-cli`::

    ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new manager::

    ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

    ceph config generate-minimal-conf

Get the container image::

    ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

    {
      "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
      "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
    }

Deploy the daemon::

    cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Analyzing Core Dumps
--------------------

When a Ceph daemon crashes, cephadm supports analyzing core dumps. To enable core dumps, run:

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

Now, wait for the crash to happen again. To simulate the crash of a daemon, run for example ``killall -3 ceph-mon``.

Install debug packages including ``ceph-debuginfo`` by entering the cephadm shell::

    # cephadm shell --mount /var/lib/systemd/coredump
    [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
    [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
    [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
    (gdb) bt
    #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
    #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
    #3  0x0000563085ca3d7e in main ()