======================
Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, it means that the monitors have a quorum. If you don't have a
monitor quorum, or if there are errors with the monitor status, `address the
monitor issues first <../troubleshooting-mon>`_. Check your networks to ensure
they are running properly, because networks may have a significant impact on
OSD operation and performance.



Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).


Ceph Logs
---------

If you haven't changed the default path, you can find Ceph log files at
``/var/log/ceph``::

        ls /var/log/ceph

If you don't get enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details on how to ensure that Ceph performs
adequately under high logging volume.
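
If you need more detail from a single daemon while you troubleshoot, you can
also raise its debug level at runtime through the admin socket (described
below) instead of restarting it. The following is a minimal sketch that assumes
a daemon named ``osd.0`` and the ``debug_osd`` subsystem; substitute your own
daemon and subsystem, and remember to lower the level again when you are done::

        ceph daemon osd.0 config set debug_osd 0/20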


Admin Socket
------------

Use the admin socket tool to retrieve runtime information. To begin, list the
sockets for your Ceph processes::

        ls /var/run/ceph

Then, execute the following, replacing ``{daemon-name}`` with an actual
daemon (e.g., ``osd.0``)::

        ceph daemon osd.0 help

Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::

        ceph daemon {socket-file} help


The admin socket, among other things, allows you to:

- List your configuration at runtime
- Dump historic operations
- Dump the operation priority queue state
- Dump operations in flight
- Dump perfcounters
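
For example, the commands below exercise several of the capabilities listed
above. This is a minimal sketch that assumes a daemon named ``osd.0``;
substitute the name of the daemon you are investigating::

        ceph daemon osd.0 config show           # configuration at runtime
        ceph daemon osd.0 dump_historic_ops     # recently completed operations
        ceph daemon osd.0 dump_ops_in_flight    # operations currently in flight
        ceph daemon osd.0 perf dump             # perfcounters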


Display Freespace
-----------------

Filesystem issues may arise. To display your filesystem's free space, execute
``df``. ::

        df -h

Execute ``df --help`` for additional usage.


I/O Statistics
--------------

Use `iostat`_ to identify I/O-related issues. ::

        iostat -x


Diagnostic Messages
-------------------

To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

        dmesg | grep scsi


Stopping w/out Rebalancing
==========================

Periodically, you may need to perform maintenance on a subset of your cluster,
or resolve a problem that affects a failure domain (e.g., a rack). If you do not
want CRUSH to automatically rebalance the cluster as you stop OSDs for
maintenance, set the cluster to ``noout`` first::

        ceph osd set noout

Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

        stop ceph-osd id={num}

.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.

Once you have completed your maintenance, restart the OSDs. ::

        start ceph-osd id={num}

Finally, you must unset the cluster from ``noout``. ::

        ceph osd unset noout



.. _osd-not-running:

OSD Not Running
===============

Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
allow it to rejoin the cluster and recover.

An OSD Won't Start
------------------

If you start your cluster and an OSD won't start, check the following:

- **Configuration File:** If you were not able to get OSDs running from
  a new installation, check your configuration file to ensure it conforms
  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and journals. If you separate the OSD data from
  the journal data and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  journal on a block device, you should partition your journal disk and assign
  one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (usually 32k), especially
  during recovery. You can use ``sysctl`` to check whether raising the maximum
  number of threads to the largest allowed value (i.e., 4194303) helps. For
  example::

        sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in the
  ``/etc/sysctl.conf`` file. For example::

        kernel.pid_max = 4194303

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google perftools). Check the `OS recommendations`_
  to ensure you have addressed any issues related to your kernel.

- **Segmentation Fault:** If there is a segmentation fault, turn your logging up
  (if it isn't already), and try again. If it segfaults again,
  contact the ceph-devel email list and provide your Ceph configuration
  file, your monitor output and the contents of your log file(s).



An OSD Failed
-------------

When a ``ceph-osd`` process dies, the monitor will learn about the failure
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
command::

        ceph health
        HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ``ceph-osd``
processes that are marked ``in`` and ``down``. You can identify which
``ceph-osds`` are ``down`` with::

        ceph health detail
        HEALTH_WARN 1/3 in osds are down
        osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ``ceph-osd`` from
functioning or restarting, an error message should be present in its log file
in ``/var/log/ceph``.
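
For example, to look at the most recent log entries for the failed OSD shown
above, you might run the following. This is a sketch that assumes the default
cluster name ``ceph`` and the OSD id ``0``; adjust the filename to match your
cluster and daemon::

        tail -n 100 /var/log/ceph/ceph-osd.0.log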

If the daemon stopped because of a heartbeat failure, the underlying
kernel file system may be unresponsive. Check ``dmesg`` output for disk
or other kernel errors.

If the problem is a software error (failed assertion or other
unexpected error), it should be reported to the `ceph-devel`_ email list.


No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity, at which point it stops clients from writing
data. The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity, at which point it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity, at which
point it generates a health warning.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node has a high percentage of the
cluster's data, the cluster can easily exceed its nearfull and full ratios
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
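
On recent releases the ratios can also be adjusted at runtime with the
``ceph osd set-nearfull-ratio``, ``ceph osd set-backfillfull-ratio`` and
``ceph osd set-full-ratio`` commands. The following is a sketch only; the
values are illustrative, and you should restore sensible ratios once your
testing is complete::

        ceph osd set-nearfull-ratio 0.75
        ceph osd set-backfillfull-ratio 0.80
        ceph osd set-full-ratio 0.85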

Full ``ceph-osds`` will be reported by ``ceph health``::

        ceph health
        HEALTH_WARN 1 nearfull osd(s)

Or::

        ceph health detail
        HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
        osd.3 is full at 97%
        osd.4 is backfill full at 91%
        osd.2 is near full at 87%

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.
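
To see how close each OSD is to its ratios, and which OSDs are the most
heavily used, the per-OSD utilization report is usually the quickest check;
a sketch::

        ceph osd df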

If you cannot start an OSD because it is full, you may delete some data by
deleting some placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD.

See `Monitor Config Reference`_ for additional details.


OSDs are Slow/Unresponsive
==========================

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your networks are working properly
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up system resources to the point that ``up`` and
   ``in`` OSDs become unavailable or slow.


Networking Issues
-----------------

Ceph is a distributed storage system, so it depends upon networks to peer with
OSDs, replicate objects, recover from faults and check heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or
listening. ::

        netstat -a | grep ceph
        netstat -l | grep ceph
        sudo netstat -p | grep ceph

Check network statistics. ::

        netstat -s

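Also verify basic connectivity and MTU consistency between OSD hosts on both
the public and cluster networks. The command below is a sketch: it assumes a
peer host named ``osd-host-2`` and a 9000-byte MTU (8972 bytes of ICMP payload
once headers are subtracted); adjust the hostname and size to match your
environment::

        ping -c 3 -M do -s 8972 osd-host-2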


Drive Configuration
-------------------

A storage drive should only support one OSD. Sequential read and sequential
write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
option to accelerate the response time--particularly when using the ``XFS`` or
``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and journal
simultaneously.

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
   may help, but you should prefer a separate physical drive.


Bad Sectors / Fragmented Disk
-----------------------------

Check your disks for bad sectors and fragmentation. This can cause total
throughput to drop substantially.
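
One way to check for failing media is to read the drive's SMART data. This is
a sketch that assumes the ``smartmontools`` package is installed and that the
OSD's data device is ``/dev/sdb``; substitute the correct device for each OSD
you want to inspect::

        sudo smartctl -a /dev/sdb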


Co-resident Monitors/OSDs
-------------------------

Monitors are generally light-weight processes, but they do lots of ``fsync()``,
which can interfere with other workloads, particularly if monitors run on the
same drive as your OSDs. Additionally, if you run monitors on the same host as
the OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running Argonaut with an old ``glibc``
- Running a kernel with no ``syncfs(2)`` syscall

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.


Co-resident Processes
---------------------

Spinning up co-resident processes such as cloud-based solutions, virtual
machines and other applications that write data to Ceph while running on the
same hardware as your OSDs can introduce significant OSD latency. Generally, we
recommend optimizing a host for use with Ceph and using other hosts for other
processes. Separating Ceph operations from other applications may help improve
performance and may streamline troubleshooting and maintenance.


Logging Levels
--------------

If you turned logging levels up to track an issue and then forgot to turn
them back down, the OSD may be writing a large volume of logs to disk. If
you intend to keep logging levels high, consider mounting a drive at the
default logging path (i.e., ``/var/log/ceph/$cluster-$name.log``).
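
You can also lower the debug levels again at runtime without restarting the
daemons. The following is a minimal sketch using ``injectargs``; the
subsystems and levels shown are examples only, so substitute whichever
subsystems you had raised::

        ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'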


Recovery Throttling
-------------------

Depending upon your configuration, Ceph may reduce recovery rates to maintain
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.
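
A quick way to check is the cluster status and per-pool recovery statistics,
and, if recovery is overwhelming client I/O, to lower the recovery and
backfill limits at runtime. This is a sketch only; the values shown are
conservative examples, not recommendations for every cluster::

        ceph -s
        ceph osd pool stats
        ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'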


Kernel Version
--------------

Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.


Kernel Issues with SyncFS
-------------------------

Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.


Filesystem Issues
-----------------

Currently, we recommend deploying clusters with XFS. The btrfs
filesystem has many attractive features, but bugs in the filesystem may
lead to performance issues. We do not recommend ext4 because xattr size
limitations break our support for long object names (needed for RGW).

For more information, see `Filesystem Recommendations`_.

.. _Filesystem Recommendations: ../configuration/filesystem-recommendations


Insufficient RAM
----------------

We recommend 1GB of RAM per OSD daemon. You may notice that during normal
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
Unused RAM makes it tempting to use the excess RAM for co-resident
applications, VMs and so forth. However, when OSDs go into recovery mode,
their memory utilization spikes. If there is insufficient RAM available,
OSD performance will slow considerably.
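
To see how much memory an OSD is actually using, check the process itself
and, on recent releases, the daemon's internal memory pools. This is a
sketch that assumes a daemon named ``osd.0``; the ``dump_mempools`` command
may not exist on older releases::

        ps -o rss,vsz,cmd -C ceph-osd
        ceph daemon osd.0 dump_mempools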


Old Requests or Slow Requests
-----------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log
messages complaining about requests that are taking too long. The warning
threshold defaults to 30 seconds, and is configurable via the
``osd op complaint time`` option. When this happens, the cluster log will
receive messages.

Legacy versions of Ceph complain about ``old requests``::

        osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

        {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
        {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include:

- A bad drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon

Possible solutions:

- Remove VMs and cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs

Debugging Slow Requests
-----------------------

If you run ``ceph daemon osd.<id> dump_historic_ops`` or
``ceph daemon osd.<id> dump_ops_in_flight``, you will see a set of operations
and a list of events each operation went through. These are briefly described
below.

Events from the Messenger layer:

- header_read: when the messenger first started reading the message off the wire
- throttled: when the messenger tried to acquire memory throttle space to read
  the message into memory
- all_read: when the messenger finished reading the message off the wire
- dispatched: when the messenger gave the message to the OSD
- initiated: this is identical to header_read. The existence of both is a
  historical oddity.

Events from the OSD as it prepares operations:

- queued_for_pg: the op has been put into the queue for processing by its PG
- reached_pg: the PG has started doing the op
- waiting for \*: the op is waiting for some other work to complete before it
  can proceed (a new OSDMap; for its object target to scrub; for the PG to
  finish peering; all as specified in the message)
- started: the op has been accepted as something the OSD should actually do
  (reasons not to do it: failed security/permission checks; out-of-date local
  state; etc) and is now actually being performed
- waiting for subops from: the op has been sent to replica OSDs

Events from the FileStore:

- commit_queued_for_journal_write: the op has been given to the FileStore
- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
  to be persisted (as the next disk write)
- journaled_completion_queued: the op was journaled to disk and its callback
  queued for invocation

Events from the OSD after the op has been handed to the local disk:

- op_commit: the op has been committed (i.e., written to the journal) by the
  primary OSD
- op_applied: the op has been ``write()``'en to the backing FS (i.e., applied
  in memory but not flushed out to disk) on the primary
- sub_op_applied: op_applied, but for a replica's "subop"
- sub_op_committed: op_commit, but for a replica's subop (only for EC pools)
- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
  hears about the above, but for a particular replica <X>
- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)

Many of these events are seemingly redundant, but cross important boundaries in
the internal code (such as passing data across locks into new threads).
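
If you want to pick out only the slowest of the dumped operations, you can
post-process the JSON. This is a sketch that assumes a daemon named ``osd.0``,
that ``jq`` is installed, and that the output uses the ``ops`` array with a
per-op ``duration`` field found in recent releases; older releases may name
these fields differently::

        ceph daemon osd.0 dump_historic_ops | jq '.ops | sort_by(.duration) | reverse | .[0:5]'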

Flapping OSDs
=============

We recommend using both a public (front-end) network and a cluster (back-end)
network so that you can better meet the capacity requirements of object
replication. Another advantage is that you can run a cluster network such that
it isn't connected to the internet, thereby preventing some denial of service
attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

However, if the cluster (back-end) network fails or develops significant latency
while the public (front-end) network operates optimally, OSDs currently do not
handle this situation well. What happens is that OSDs mark each other ``down``
on the monitor, while marking themselves ``up``. We call this scenario
'flapping'.

If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to stop the flapping with::

        ceph osd set noup      # prevent OSDs from getting marked up
        ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure::

        ceph osd dump | grep flags
        flags no-up,no-down

You can clear the flags with::

        ceph osd unset noup
        ceph osd unset nodown

Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect OSDs
from eventually being marked ``out`` (regardless of what the current value for
``mon osd down out interval`` is).

.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
   sense that once the flags are cleared, the action they were blocking
   should occur shortly after. The ``noin`` flag, on the other hand,
   prevents OSDs from being marked ``in`` on boot, and any daemons that
   started while the flag was set will remain that way.




.. _iostat: http://en.wikipedia.org/wiki/Iostat
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
.. _Logging and Debugging: ../log-and-debug
.. _Debugging and Logging: ../debug
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
.. _Monitor Config Reference: ../../configuration/mon-config-ref
.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
.. _OS recommendations: ../../../start/os-recommendations
.. _ceph-devel: ceph-devel@vger.kernel.org