Troubleshooting
===============

This section explains how to investigate why a cephadm command failed or why a
certain service no longer runs properly.

Cephadm deploys daemons within containers. Troubleshooting containerized
daemons requires a different process than does troubleshooting traditional
daemons that were installed by means of packages.

Here are some tools and commands to help you troubleshoot your Ceph
environment.

.. _cephadm-pause:

Pausing or Disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, pause most of the Ceph
cluster's background activity by running the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will still periodically
check hosts to refresh its inventory of daemons and devices. Disable cephadm
completely by running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all ``ceph orch ...`` CLI commands. All
previously deployed daemon containers continue to run and will start just as
they were before you ran these commands.
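
To re-enable cephadm after troubleshooting, reverse these steps by running the
following commands (see :ref:`cephadm-enable-cli`):

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm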

See :ref:`cephadm-spec-unmanaged` for more on disabling individual services.

Per-service and Per-daemon Events
---------------------------------

To make it easier to debug failed daemons, cephadm stores events per service
and per daemon. These events often contain information relevant to
the troubleshooting of your Ceph cluster.

Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a command of the
following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return information in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a command of the
following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking Cephadm Logs
---------------------

To learn how to monitor cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.
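
A quick way to follow these messages at the command line is:

.. prompt:: bash #

   ceph -W cephadm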

If your Ceph cluster has been configured to log events to files, there will be
a ``ceph.cephadm.log`` file on all monitor hosts. See :ref:`cephadm-logs` for a
more complete explanation.
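
Assuming the default file-logging location, this file can be followed on a
monitor host with a command of the following form:

.. prompt:: bash #

   tail -f /var/log/ceph/<fsid>/ceph.cephadm.log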

Gathering Log Files
-------------------

Use ``journalctl`` to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

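Because logs are stored in journald, they can also be read directly with
``journalctl``. Daemon units have names of the form
``ceph-<fsid>@<name-of-daemon>.service``, so, for example:

.. prompt:: bash #

   journalctl -u ceph-<fsid>@<name-of-daemon>.service
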
To read the log file of one specific daemon, run a command of the following
form:

.. prompt:: bash

   cephadm logs --name <name-of-daemon>

.. note:: This works only when run on the same host that is running the daemon.
   To get the logs of a daemon that is running on a different host, add the
   ``--fsid`` option to the command, as in the following example:

   .. prompt:: bash

      cephadm logs --fsid <fsid> --name <name-of-daemon>

   In this example, ``<fsid>`` corresponds to the cluster ID returned by the
   ``ceph status`` command.

To fetch all log files of all daemons on a given host, run the following
for-loop::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run a command of the following form:

.. prompt:: bash

   systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service"

To fetch the state of all daemons of a given host, run the following shell
script::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

List all Downloaded Container Images
------------------------------------

To list all container images that are downloaded on a host, run the following
command:

.. prompt:: bash #

   podman ps -a --format json | jq '.[].Image'

This will return output of the following form::

   "docker.io/library/centos:8"
   "registry.opensuse.org/opensuse/leap:15.2"

.. note:: ``Image`` might also be called ``ImageID``.

Manually Running Containers
---------------------------

Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the container
execution command.
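
For example, to see exactly how a hypothetical daemon named ``mon.myhostname``
is started, print its wrapper script:

.. prompt:: bash #

   cat /var/lib/ceph/<cluster-fsid>/mon.myhostname/unit.run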

.. _cephadm-ssh-errors:

SSH Errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

If you receive the above error message, try the following things to
troubleshoot the SSH connection between ``cephadm`` and the monitor:

1. Ensure that ``cephadm`` has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
     obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that it is possible to connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the ``authorized_keys`` file, run the
following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

Failed to Infer CIDR network error
----------------------------------

If you see this error::

  ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

  Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form:

.. prompt:: bash

   ceph config set mon public_network <mon_network>

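For example, if the monitors are on a hypothetical ``10.1.2.0/24`` subnet:

.. prompt:: bash

   ceph config set mon public_network 10.1.2.0/24
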
For more detail on operations of this kind, see
:ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that allows runtime option setting
and statistic reading. See :ref:`rados-monitoring-using-admin-socket`.

#. To access the admin socket, enter the daemon container on the host::

     [root@mon1 ~]# cephadm enter --name <daemon-name>

#. Run commands of the following forms to see the daemon's configuration and
   the other actions the admin socket makes available::

     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help

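For example, to read a single option rather than dumping the entire
configuration, run a command of the following form (``log_file`` is used here
purely as an illustration)::

   [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config get log_file
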
Running Various Ceph Tools
--------------------------

To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable for
extended daemon maintenance and for the interactive running of daemons.

.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
be able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
   while connected to the Monitor's host use ``cephadm`` to stop the Monitor
   daemon:

   .. prompt:: bash

      ssh {mon-host}
      cephadm unit --name {mon.hostname} stop

2. Identify a surviving Monitor and log in to its host:

   .. prompt:: bash

      ssh {mon-host}
      cephadm enter --name {mon.hostname}

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one Manager (``mgr``) daemon is required by cephadm in order to manage
the cluster. If the last remaining Manager has been removed from the Ceph
cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.

#. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
   the new Manager. See :ref:`cephadm-enable-cli`:

   .. prompt:: bash #

      ceph config-key set mgr/cephadm/pause true

#. Retrieve or create the "auth entry" for the new Manager:

   .. prompt:: bash #

      ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

#. Retrieve the Monitor's configuration:

   .. prompt:: bash #

      ceph config generate-minimal-conf

#. Retrieve the container image:

   .. prompt:: bash #

      ceph config get "mgr.hostname.smfvfd" container_image

#. Create a file called ``config-json.json``, which contains the information
   necessary to deploy the daemon:

   .. code-block:: json

      {
        "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
        "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
      }

#. Deploy the Manager daemon:

   .. prompt:: bash #

      cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

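Once the new Manager is up, re-enable the cephadm scheduler that was paused in
the first step. Assuming the pause key acts as a simple flag, removing it
reverses that step:

.. prompt:: bash #

   ceph config-key rm mgr/cephadm/pause
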
Capturing Core Dumps
--------------------

A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.

To enable coredump handling, run the following command:

.. prompt:: bash #

   ulimit -c unlimited

.. note::

   Core dumps are not namespaced by the kernel. This means that core dumps are
   written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
   -c unlimited`` setting will persist only until the system is rebooted.

Wait for the crash to happen again. To simulate the crash of a daemon, run for
example ``killall -3 ceph-mon``.

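After a daemon has crashed, ``coredumpctl`` (the companion tool of
``systemd-coredump``) can be used on the host to confirm that a core dump was
captured:

.. prompt:: bash #

   coredumpctl list
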
Running the Debugger with cephadm
---------------------------------

Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initiate a debugging session by using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

#. Start the shell session (the mounted directory will appear at
   ``/mnt/coredump`` inside the shell):

   .. prompt:: bash #

      cephadm shell --mount /var/lib/systemd/coredump

#. From within the shell session, run the following commands:

   .. prompt:: bash #

      dnf install ceph-debuginfo gdb zstd

   .. prompt:: bash #

      unzstd /mnt/coredump/core.ceph-*.zst

   .. prompt:: bash #

      gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*

#. Run debugger commands at gdb's prompt:

   .. prompt:: bash (gdb)

      bt

   ::

     #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
     #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
     #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
     #3  0x0000563085ca3d7e in main ()

Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm shell``
is invoked, but to save time and resources you can create a new container image
and use it for repeated debugging sessions.

In the following example, we create a simple file that constructs the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::

    cat >Containerfile <<EOF
    ARG BASE_IMG=quay.io/ceph/ceph:v18
    FROM \${BASE_IMG}
    # install ceph debuginfo packages, gdb and other potentially useful packages
    RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
    EOF
    podman build -t ceph:debugging -f Containerfile .
    # pass --build-arg=BASE_IMG=<your image> to customize the base image

The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node that is running other Ceph
containers. See the ``podman`` or ``docker`` documentation for more
information about the container workflow.

After the image has been built, it can be used to initiate repeat debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and the debuginfo packages every time you need to
run a debug session. To debug a core file using this image, in the same way as
previously described, run:

.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump

Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching it
to the same PID namespace in which the process to be debugged resides.

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins the namespace used by
the running container. For example, assume we want to debug the Manager
and have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
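
For example, assuming that ``--init`` appears exactly once in the generated
script, the edit could be made with ``sed``, substituting the hypothetical
container name from above:

.. prompt:: bash #

   sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh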

We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:

.. prompt:: bash (gdb)

   attach 2
   info threads
   bt