Troubleshooting
===============

This section explains how to investigate why a cephadm command failed or why a
certain service no longer runs properly.

Cephadm deploys daemons within containers. Troubleshooting containerized
daemons requires a different process than does troubleshooting traditional
daemons that were installed by means of packages.

Here are some tools and commands to help you troubleshoot your Ceph
environment.

.. _cephadm-pause:

Pausing or Disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, pause most of the Ceph
cluster's background activity by running the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will still periodically
check hosts to refresh its inventory of daemons and devices. Disable cephadm
completely by running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all ``ceph orch ...`` CLI commands. All previously
deployed daemon containers continue to run and will start just as they were
before you ran these commands.
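
If you later want cephadm to resume managing the cluster, the module and the
orchestrator backend can be re-enabled; see :ref:`cephadm-enable-cli` for the
full procedure:

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm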

See :ref:`cephadm-spec-unmanaged` for more on disabling individual services.

Per-service and Per-daemon Events
---------------------------------

To make it easier to debug failed daemons, cephadm stores events per service
and per daemon. These events often contain information relevant to
the troubleshooting of your Ceph cluster.

Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a command of the
following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return information in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a command of the
following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking Cephadm Logs
---------------------

To learn how to monitor cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.

If your Ceph cluster has been configured to log events to files, there will be
a ``ceph.cephadm.log`` file on all monitor hosts. See :ref:`cephadm-logs` for a
more complete explanation.

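For a quick look at recent cephadm activity without leaving the CLI, the
cluster log can be queried directly; this assumes the default cluster-log
settings are in place:

.. prompt:: bash #

   ceph log last cephadm
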
Gathering Log Files
-------------------

Use ``journalctl`` to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run a command of the following
form:

.. prompt:: bash

   cephadm logs --name <name-of-daemon>

.. note:: This works only when run on the same host that is running the daemon.
   To get the logs of a daemon that is running on a different host, add the
   ``--fsid`` option to the command, as in the following example:

   .. prompt:: bash

      cephadm logs --fsid <fsid> --name <name-of-daemon>

   In this example, ``<fsid>`` corresponds to the cluster ID returned by the
   ``ceph status`` command.

To fetch all log files of all daemons on a given host, run the following
for-loop::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

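Because cephadm stores daemon logs in journald, the same logs can also be read
directly with ``journalctl``. The unit name follows the
``ceph-<fsid>@<daemon-name>.service`` pattern used in the next section, for
example:

.. prompt:: bash

   journalctl -u "ceph-$(cephadm shell ceph fsid)@<name-of-daemon>.service"
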
Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run a command of the following form:

.. prompt:: bash

   systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons of a given host, run the following shell
script::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done

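To get a quick overview of all Ceph units present on a host before inspecting a
specific one, a unit-name glob can be used; this is a convenience step, not
part of the procedure above:

.. prompt:: bash

   systemctl list-units 'ceph-*'
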

List all Downloaded Container Images
------------------------------------

To list all container images that are downloaded on a host, run the following
command:

.. prompt:: bash #

   podman ps -a --format json | jq '.[].Image'

Example output::

   "docker.io/library/centos:8"
   "registry.opensuse.org/opensuse/leap:15.2"

.. note:: ``Image`` might also be called ``ImageID``.


Manually Running Containers
---------------------------

Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the container
execution command.

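For example, to see exactly how a particular daemon's container is started,
print the corresponding ``unit.run`` file (replace the placeholders with your
cluster FSID and the daemon's service name):

.. prompt:: bash #

   cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run
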
.. _cephadm-ssh-errors:

SSH Errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

If you receive the above error message, try the following things to
troubleshoot the SSH connection between ``cephadm`` and the monitor:

1. Ensure that ``cephadm`` has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that it is possible to connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the ``authorized_keys`` file, run the
following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

Failed to Infer CIDR network error
----------------------------------

If you see this error::

   ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

   Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form:

.. prompt:: bash

   ceph config set mon public_network <mon_network>

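For example, if the monitors should listen on the ``10.1.2.0/24`` subnet (an
address used here purely for illustration), the command would be:

.. prompt:: bash

   ceph config set mon public_network 10.1.2.0/24
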
For more detail on operations of this kind, see
:ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that allows runtime option setting
and statistic reading. See :ref:`rados-monitoring-using-admin-socket`.

#. To access the admin socket, enter the daemon container on the host::

     [root@mon1 ~]# cephadm enter --name <daemon-name>

#. Run a command of the following forms to see the admin socket's configuration
   and other available actions::

     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok help

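For instance, still inside the daemon container, the same socket can be used to
dump the daemon's performance counters; ``perf dump`` is one of the commands
listed by ``help``::

   [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok perf dump
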
Running Various Ceph Tools
--------------------------

To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable for
extended daemon maintenance and for the interactive running of daemons.

.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
be able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
   while connected to the Monitor's host use ``cephadm`` to stop the Monitor
   daemon:

   .. prompt:: bash

      ssh {mon-host}
      cephadm unit --name {mon.hostname} stop

2. Identify a surviving Monitor and log in to its host:

   .. prompt:: bash

      ssh {mon-host}
      cephadm enter --name {mon.hostname}

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one Manager (``mgr``) daemon is required by cephadm in order to manage
the cluster. If the last remaining Manager has been removed from the Ceph
cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.

#. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
   the new Manager. See :ref:`cephadm-enable-cli`:

   .. prompt:: bash #

      ceph config-key set mgr/cephadm/pause true

#. Retrieve or create the "auth entry" for the new Manager:

   .. prompt:: bash #

      ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

#. Retrieve the Monitor's configuration:

   .. prompt:: bash #

      ceph config generate-minimal-conf

#. Retrieve the container image:

   .. prompt:: bash #

      ceph config get "mgr.hostname.smfvfd" container_image

#. Create a file called ``config-json.json``, which contains the information
   necessary to deploy the daemon:

   .. code-block:: json

      {
        "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
        "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
      }

#. Deploy the Manager daemon:

   .. prompt:: bash #

      cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

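Once the daemon has started, a quick sanity check (not part of the original
procedure) is to confirm that a Manager is reported as active again:

.. prompt:: bash #

   ceph -s
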
Capturing Core Dumps
--------------------

A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.


To enable coredump handling, run the following command:

.. prompt:: bash #

   ulimit -c unlimited


.. note::

   Core dumps are not namespaced by the kernel. This means that core dumps are
   written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
   -c unlimited`` setting will persist only until the system is rebooted.

Wait for the crash to happen again. To simulate the crash of a daemon, run for
example ``killall -3 ceph-mon``.

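If the ``coredumpctl`` utility from systemd is installed on the host (an
assumption, not a requirement of this procedure), it can be used to confirm
that a core dump was captured and to inspect its metadata:

.. prompt:: bash #

   coredumpctl list
   coredumpctl info ceph-mon
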
Running the Debugger with cephadm
---------------------------------

Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initiate a debugging session by using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

#. Start the shell session:

   .. prompt:: bash #

      cephadm shell --mount /var/lib/systemd/coredump

#. From within the shell session, run the following commands:

   .. prompt:: bash #

      dnf install ceph-debuginfo gdb zstd

   .. prompt:: bash #

      unzstd /var/lib/systemd/coredump/core.ceph-*.zst

   .. prompt:: bash #

      gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*

#. Run debugger commands at gdb's prompt:

   .. prompt:: bash (gdb)

      bt

   ::

      #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
      #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
      #3 0x0000563085ca3d7e in main ()


Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm shell``
is invoked, but to save time and resources you can create a new container image
and use it for repeated debugging sessions.

In the following example, we create a simple file that constructs the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::

    cat >Containerfile <<EOF
    ARG BASE_IMG=quay.io/ceph/ceph:v18
    FROM \${BASE_IMG}
    # install ceph debuginfo packages, gdb and other potentially useful packages
    RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
    EOF
    podman build -t ceph:debugging -f Containerfile .
    # pass --build-arg=BASE_IMG=<your image> to customize the base image

The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node that is running other Ceph
containers. See the ``podman`` or ``docker`` documentation for more
information about the container workflow.

After the image has been built, it can be used to initiate repeat debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and the debuginfo packages every time you need to
run a debug session. To debug a core file using this image, in the same way as
previously described, run:

.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump


Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching it
to the same PID namespace in which the process to be debugged resides.

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins the shell to the PID
namespace of the running container. For example, assume we want to debug the
Manager and have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
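
One way to determine the container name is to list the running containers and
filter for the daemon of interest; the ``-mgr-`` filter below is only an
illustration and assumes that podman is the container engine in use:

.. prompt:: bash #

   podman ps --format '{{.Names}}' | grep -- '-mgr-'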

We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:

.. prompt:: bash (gdb)

   attach 2
   info threads
   bt