It turns out it's possible to get __remove_osd() called twice on the
same OSD. That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory. One scenario that I was able to reproduce is as follows:
A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().
Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: stable@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>