write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.
-Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
-option to accelerate the response time--particularly when using the ``XFS`` or
-``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and journal
-simultaneously.
+Ceph acknowledges writes *after* journaling, so fast SSDs are an
+attractive option to accelerate the response time--particularly when
+using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs``
+filesystem can write and journal simultaneously. (Note, however, that
+we recommend against using ``btrfs`` for production deployments.)
.. note:: Partitioning a drive does not change its total throughput or
sequential read/write limits. Running a journal in a separate partition
Filesystem Issues
-----------------
-Currently, we recommend deploying clusters with XFS. The btrfs
-filesystem has many attractive features, but bugs in the filesystem may
-lead to performance issues. We do not recommend ext4 because xattr size
-limitations break our support for long object names (needed for RGW).
+Currently, we recommend deploying clusters with XFS.
+
+We recommend against using btrfs or ext4. The btrfs filesystem has
+many attractive features, but bugs in the filesystem may lead to
+performance issues and suprious ENOSPC errors. We do not recommend
+ext4 because xattr size limitations break our support for long object
+names (needed for RGW).
For more information, see `Filesystem Recommendations`_.
- Upgrade Ceph
- Restart OSDs
-
+Debugging Slow Requests
+-----------------------
+
+If you run "ceph daemon osd.<id> dump_historic_ops" or "dump_ops_in_flight",
+you will see a set of operations and a list of events each operation went
+through. These are briefly described below.
+
+Events from the Messenger layer:
+
+- header_read: when the messenger first started reading the message off the wire
+- throttled: when the messenger tried to acquire memory throttle space to read
+ the message into memory
+- all_read: when the messenger finished reading the message off the wire
+- dispatched: when the messenger gave the message to the OSD
+- Initiated: <This is identical to header_read. The existence of both is a
+ historical oddity.
+
+Events from the OSD as it prepares operations
+
+- queued_for_pg: the op has been put into the queue for processing by its PG
+- reached_pg: the PG has started doing the op
+- waiting for \*: the op is waiting for some other work to complete before it
+ can proceed (a new OSDMap; for its object target to scrub; for the PG to
+ finish peering; all as specified in the message)
+- started: the op has been accepted as something the OSD should actually do
+ (reasons not to do it: failed security/permission checks; out-of-date local
+ state; etc) and is now actually being performed
+- waiting for subops from: the op has been sent to replica OSDs
+
+Events from the FileStore
+
+- commit_queued_for_journal_write: the op has been given to the FileStore
+- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
+ to be persisted (as the next disk write)
+- journaled_completion_queued: the op was journaled to disk and its callback
+ queued for invocation
+
+Events from the OSD after stuff has been given to local disk
+
+- op_commit: the op has been committed (ie, written to journal) by the
+ primary OSD
+- op_applied: The op has been write()'en to the backing FS (ie, applied in
+ memory but not flushed out to disk) on the primary
+- sub_op_applied: op_applied, but for a replica's "subop"
+- sub_op_committed: op_commited, but for a replica's subop (only for EC pools)
+- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
+ hears about the above, but for a particular replica <X>
+- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)
+
+Many of these events are seemingly redundant, but cross important boundaries in
+the internal code (such as passing data across locks into new threads).
Flapping OSDs
=============