======================
 Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance.



Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).


Ceph Logs
---------

If you haven't changed the default path, you can find Ceph log files at
``/var/log/ceph``::

    ls /var/log/ceph
If you don't get enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details, and for how to ensure that Ceph performs
adequately under high logging volume.


Admin Socket
------------

Use the admin socket tool to retrieve runtime information. First, list
the sockets for your Ceph processes::

    ls /var/run/ceph

Then, execute the following, replacing ``{daemon-name}`` with an actual
daemon (e.g., ``osd.0``)::

    ceph daemon osd.0 help

Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::

    ceph daemon {socket-file} help


The admin socket, among other things, allows you to (see the examples after
this list):

- List your configuration at runtime
- Dump historic operations
- Dump the operation priority queue state
- Dump operations in flight
- Dump perfcounters
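
For example, the following commands exercise some of these capabilities
against a local daemon; ``osd.0`` is a placeholder for one of your own OSDs::

    ceph daemon osd.0 config show          # configuration at runtime
    ceph daemon osd.0 dump_historic_ops    # recently completed operations
    ceph daemon osd.0 dump_ops_in_flight   # operations currently in flight
    ceph daemon osd.0 perf dump            # perfcounters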


Display Freespace
-----------------

Filesystem issues may arise. To display your filesystem's free space, execute
``df``. ::

    df -h

Execute ``df --help`` for additional usage.


I/O Statistics
--------------

Use `iostat`_ to identify I/O-related issues. ::

    iostat -x
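
To watch for saturation as it happens, you can also sample extended
statistics on an interval, for example every five seconds; look for
consistently high ``%util`` or ``await`` values on the drives backing your
OSDs::

    iostat -x 5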


Diagnostic Messages
-------------------

To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

    dmesg | grep scsi


Stopping w/out Rebalancing
==========================

Periodically, you may need to perform maintenance on a subset of your cluster,
or resolve a problem that affects a failure domain (e.g., a rack). If you do not
want CRUSH to automatically rebalance the cluster as you stop OSDs for
maintenance, set the cluster to ``noout`` first::

    ceph osd set noout

Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

    stop ceph-osd id={num}

.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.

Once you have completed your maintenance, restart the OSDs. ::

    start ceph-osd id={num}

Finally, you must unset the ``noout`` flag. ::

    ceph osd unset noout
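
The ``stop``/``start`` commands above use upstart syntax. On a systemd-based
distribution, the equivalent commands would be along these lines (a sketch,
assuming the standard ``ceph-osd@`` unit template)::

    sudo systemctl stop ceph-osd@{num}
    sudo systemctl start ceph-osd@{num}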



.. _osd-not-running:

OSD Not Running
===============

Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
allow it to rejoin the cluster and recover.

An OSD Won't Start
------------------

If you start your cluster and an OSD won't start, check the following:

- **Configuration File:** If you were not able to get OSDs running from
  a new installation, check your configuration file to ensure it conforms
  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and journals. If you separate the OSD data from
  the journal data and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  journal on a block device, you should partition your journal disk and assign
  one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (e.g., usually 32k), especially
  during recovery. You can use ``sysctl`` to check whether raising the maximum
  number of threads to the largest allowed value (i.e., 4194303) helps. For
  example::

    sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in the
  ``/etc/sysctl.conf`` file. For example::

    kernel.pid_max = 4194303
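
  To check the current limit and how many threads are actually in use before
  changing anything, standard tools suffice (a quick diagnostic, not part of
  the official procedure)::

    sysctl kernel.pid_max
    ps -eLf | wc -l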

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google perftools). Check the `OS recommendations`_
  to ensure you have addressed any issues related to your kernel.

- **Segmentation Fault:** If there is a segmentation fault, turn your logging
  up (if it isn't already), and try again. If it segfaults again, contact the
  ceph-devel email list and provide your Ceph configuration file, your monitor
  output and the contents of your log file(s).



An OSD Failed
-------------

When a ``ceph-osd`` process dies, the monitor will learn about the failure
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ``ceph-osd``
processes that are marked ``in`` and ``down``. You can identify which
``ceph-osds`` are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ``ceph-osd`` from
functioning or restarting, an error message should be present in its log
file in ``/var/log/ceph``.
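
For example, to inspect the most recent log entries for the failed daemon
(assuming the default cluster name ``ceph``, with ``osd.0`` as a placeholder
for the OSD in question)::

    tail -n 200 /var/log/ceph/ceph-osd.0.log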

If the daemon stopped because of a heartbeat failure, the underlying
kernel file system may be unresponsive. Check ``dmesg`` output for disk
or other kernel errors.

If the problem is a software error (failed assertion or other
unexpected error), it should be reported to the `ceph-devel`_ email list.


No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity, at which point it stops clients from writing
data. The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity, at which point it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity, at
which point it generates a health warning.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node holds a high percentage of the
cluster's data, the cluster can easily exceed its nearfull and full ratios
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
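
To see per-OSD utilization at a glance, and, on releases that support the
``ceph osd set-full-ratio`` family of commands, to adjust the ratios at
runtime (the values below are only illustrative)::

    ceph osd df                          # per-OSD usage and weight
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95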

Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
    HEALTH_WARN 1 nearfull osd(s)

Or::

    ceph health detail
    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
    osd.3 is full at 97%
    osd.4 is backfill full at 91%
    osd.2 is near full at 87%

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by deleting
some placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD.

See `Monitor Config Reference`_ for additional details.


OSDs are Slow/Unresponsive
==========================

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your networks are working properly
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up system resources to the point that ``up`` and
   ``in`` OSDs become unavailable or otherwise slow.


Networking Issues
-----------------

Ceph is a distributed storage system, so it depends upon networks to peer with
OSDs, replicate objects, recover from faults and check heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or
listening. ::

    netstat -a | grep ceph
    netstat -l | grep ceph
    sudo netstat -p | grep ceph

Check network statistics. ::

    netstat -s
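
If you suspect the network itself, basic latency and MTU checks between OSD
hosts can rule out common problems; for example (``{osd-host}`` is a
placeholder, and the 8972-byte payload assumes 9000-byte jumbo frames)::

    ping {osd-host}                      # basic reachability and latency
    ping -M do -s 8972 {osd-host}        # verify jumbo frames pass unfragmented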


Drive Configuration
-------------------

A storage drive should only support one OSD. Sequential read and sequential
write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
option to accelerate the response time, particularly when using the ``XFS`` or
``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and journal
simultaneously.

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
   may help, but you should prefer a separate physical drive.


Bad Sectors / Fragmented Disk
-----------------------------

Check your disks for bad sectors and fragmentation. These can cause total
throughput to drop substantially.


Co-resident Monitors/OSDs
-------------------------

Monitors are generally light-weight processes, but they do lots of ``fsync()``,
which can interfere with other workloads, particularly if monitors run on the
same drive as your OSDs. Additionally, if you run monitors on the same host as
the OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running Argonaut with an old ``glibc``
- Running a kernel with no ``syncfs(2)`` syscall

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.


Co-resident Processes
---------------------

Spinning up co-resident processes such as a cloud-based solution, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing a host for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.


Logging Levels
--------------

If you turned logging levels up to track an issue and then forgot to turn
logging levels back down, the OSD may be putting a lot of logs onto the disk. If
you intend to keep logging levels high, you may consider mounting a drive to the
default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
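
To turn logging back down without restarting daemons, you can inject the
settings at runtime; the levels shown here are only an example::

    ceph tell osd.* injectargs '--debug-osd 0/5 --debug-ms 0/5'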


Recovery Throttling
-------------------

Depending upon your configuration, Ceph may reduce recovery rates to maintain
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.
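
Recovery activity shows up in the cluster status, and recovery traffic can be
throttled at runtime; the values below are illustrative, not recommendations::

    ceph -s                              # look for recovering/backfilling PGs
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'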


Kernel Version
--------------

Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.


Kernel Issues with SyncFS
-------------------------

Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.


Filesystem Issues
-----------------

Currently, we recommend deploying clusters with XFS. The btrfs
filesystem has many attractive features, but bugs in the filesystem may
lead to performance issues. We do not recommend ext4 because xattr size
limitations break our support for long object names (needed for RGW).

For more information, see `Filesystem Recommendations`_.

.. _Filesystem Recommendations: ../configuration/filesystem-recommendations


Insufficient RAM
----------------

We recommend 1GB of RAM per OSD daemon. You may notice that during normal
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
Unused RAM makes it tempting to use the excess RAM for co-resident applications,
VMs and so forth. However, when OSDs go into recovery mode, their memory
utilization spikes. If there is no RAM available, the OSD performance will slow
considerably.
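
To check how much memory each OSD daemon is actually using, standard tools
are sufficient; for example, the resident set size (RSS, in kilobytes) of
every ``ceph-osd`` process::

    ps -o pid,rss,cmd -C ceph-osd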


Old Requests or Slow Requests
-----------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
complaining about requests that are taking too long. The warning threshold
defaults to 30 seconds, and is configurable via the ``osd op complaint time``
option. When this happens, the cluster log will receive messages.
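
You can confirm the threshold a running daemon is using via the admin socket
(``osd.0`` is a placeholder; the option's internal name uses underscores)::

    ceph daemon osd.0 config get osd_op_complaint_time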

Legacy versions of Ceph complain about ``old requests``::

    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include (see below for a way to inspect stuck requests
directly):

- A bad drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon
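
To see exactly which requests are stuck on a given OSD, query its admin
socket (``osd.0`` is a placeholder for the complaining daemon)::

    ceph daemon osd.0 dump_ops_in_flight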

Possible solutions:

- Remove VMs and cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs


Flapping OSDs
=============

We recommend using both a public (front-end) network and a cluster (back-end)
network so that you can better meet the capacity requirements of object
replication. Another advantage is that you can run a cluster network such that
it isn't connected to the internet, thereby preventing some denial of service
attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

However, if the cluster (back-end) network fails or develops significant latency
while the public (front-end) network operates optimally, OSDs currently do not
handle this situation well. What happens is that OSDs mark each other ``down``
on the monitor, while marking themselves ``up``. We call this scenario
'flapping'.

If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to stop the flapping with::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure::

    ceph osd dump | grep flags
    flags no-up,no-down

You can clear the flags with::

    ceph osd unset noup
    ceph osd unset nodown

Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect OSDs
from eventually being marked ``out`` (regardless of what the current value for
``mon osd down out interval`` is).

.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
   sense that once the flags are cleared, the action they were blocking
   should occur shortly after. The ``noin`` flag, on the other hand,
   prevents OSDs from being marked ``in`` on boot, and any daemons that
   started while the flag was set will remain that way.
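
Because heartbeat failures are the usual trigger, the OSD logs often record
why peers were reported down; a simple search is a reasonable first check
(assuming default log paths, with ``osd.0`` as a placeholder)::

    grep -i heartbeat_check /var/log/ceph/ceph-osd.0.log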




.. _iostat: http://en.wikipedia.org/wiki/Iostat
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
.. _Logging and Debugging: ../log-and-debug
.. _Debugging and Logging: ../debug
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
.. _Monitor Config Reference: ../../configuration/mon-config-ref
.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
.. _OS recommendations: ../../../start/os-recommendations
.. _ceph-devel: ceph-devel@vger.kernel.org