=======
crimson
=======

Crimson is the code name of ``crimson-osd``, the next-generation ``ceph-osd``.
It targets fast networking and fast storage devices, leveraging state-of-the-art
technologies like DPDK and SPDK for better performance. It keeps support for
HDDs and low-end SSDs via BlueStore, and it tries to be backward compatible
with the classic OSD.

.. highlight:: console

Building Crimson
================

Crimson is not enabled by default. To enable it::

  $ WITH_SEASTAR=true ./install-deps.sh
  $ mkdir build && cd build
  $ cmake -DWITH_SEASTAR=ON ..

Please note, `ASan`_ is enabled by default if crimson is built from a source
cloned using git.

.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer

Installing Crimson with ready-to-use images
===========================================

An alternative to building Crimson from source is to use container images built
by Ceph CI/CD and deploy them with one of the orchestrators: ``cephadm`` or ``Rook``.
This chapter documents the ``cephadm`` way.

NOTE: We know that this procedure is suboptimal, but it has passed internal
and external quality assurance::

  $ curl -L https://raw.githubusercontent.com/ceph/ceph-ci/wip-bharat-crimson/src/cephadm/cephadm -o cephadm
  $ cp cephadm /usr/sbin
  $ vi /usr/sbin/cephadm

In the file, change ``DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:master'``
to ``DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:<sha1>-crimson'``, where ``<sha1>``
is the commit ID built by the Ceph CI/CD. You may use
https://shaman.ceph.com/builds/ceph/ to monitor branches built by Ceph's Jenkins
and to discover those IDs.

An example::

  DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:1647216bf4ebac6bcf5ad7739e02b38569736cfd-crimson'

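The same change can be scripted instead of made by hand in an editor. The
following is a minimal sketch, not part of the documented procedure: the
function name ``set_crimson_image`` is hypothetical, GNU ``sed`` is assumed
for the ``-i`` option, and the ``DEFAULT_IMAGE`` assignment is assumed to start
at the beginning of a line, as quoted above.

.. code:: bash

   # Hypothetical helper: point cephadm's DEFAULT_IMAGE at a crimson CI build.
   # Usage: set_crimson_image <path-to-cephadm> <sha1>
   set_crimson_image() {
       # Rewrite the whole assignment, whatever image it currently names.
       # $1 is the file to edit, $2 is the commit ID of the CI build.
       sed -i "s|^DEFAULT_IMAGE = .*|DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:${2}-crimson'|" "$1"
   }

For instance, ``set_crimson_image /usr/sbin/cephadm <sha1>`` would replace the
manual ``vi`` step for a given ``<sha1>``.
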
When the edit is finished::

  chmod 777 cephadm
  podman pull quay.ceph.io/ceph-ci/ceph:<sha1>-crimson
  cephadm bootstrap --mon-ip 10.1.172.208 --allow-fqdn-hostname
  # Set "PermitRootLogin yes" for other nodes you want to use
  echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config
  systemctl restart sshd

  ssh-copy-id -f -i /etc/ceph/ceph.pub root@<nodename>
  cephadm shell
  ceph orch host add <nodename>
  ceph orch apply osd --all-available-devices

Running Crimson
===============

As you might expect, crimson is not yet on par with its predecessor in terms
of features.

object store backend
--------------------

At the moment, ``crimson-osd`` offers both native and alienized object store
backends. The native object store backends perform IO using the Seastar reactor.
They are:

.. describe:: cyanstore

   CyanStore is modeled after memstore in classic OSD.

.. describe:: seastore

   Seastore is still under active development.

The alienized object store backends are backed by a thread pool, which
is a proxy of the alien store adaptor running in Seastar. The proxy issues
requests to object stores running in alien threads, i.e., worker threads not
managed by the Seastar framework. They are:

.. describe:: memstore

   The memory backed object store

.. describe:: bluestore

   The object store used by classic OSD by default.

daemonize
---------

Unlike ``ceph-osd``, ``crimson-osd`` does not daemonize itself even if the
``daemonize`` option is enabled. This is because, to read this option,
``crimson-osd`` needs to ready its config sharded service, but this sharded
service lives in the Seastar reactor. If we fork a child process and exit the
parent after starting the Seastar engine, we are left with a single thread,
which is a replica of the thread that called `fork()`_. Tackling this problem
in crimson would unnecessarily complicate the code.

Since a lot of GNU/Linux distros are using systemd nowadays, which is able to
daemonize the application, there is no need to daemonize by ourselves. Those
who are using sysvinit can use ``start-stop-daemon`` to daemonize
``crimson-osd``. If this is not acceptable, we can whip up a helper utility
to do the trick.


.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

logging
-------

Currently, ``crimson-osd`` uses the logging utility offered by Seastar. See
``src/common/dout.h`` for the mapping from Ceph logging levels to
the severity levels in Seastar. For instance, the messages sent to ``derr``
will be printed using ``logger::error()``, and the messages with debug level
over ``20`` will be printed using ``logger::trace()``.

+---------+---------+
| ceph    | seastar |
+---------+---------+
| < 0     | error   |
+---------+---------+
| 0       | warn    |
+---------+---------+
| [1, 6)  | info    |
+---------+---------+
| [6, 20] | debug   |
+---------+---------+
| > 20    | trace   |
+---------+---------+

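The table can be read as a simple chain of range checks. The shell function
below is only an illustration of that mapping; ``ceph_to_seastar_level`` is a
hypothetical name, and the authoritative mapping is the one in
``src/common/dout.h``.

.. code:: bash

   # Map an integer Ceph debug level to the Seastar log level from the table.
   ceph_to_seastar_level() {
       if [ "$1" -lt 0 ]; then
           echo error          # < 0
       elif [ "$1" -eq 0 ]; then
           echo warn           # 0
       elif [ "$1" -lt 6 ]; then
           echo info           # [1, 6)
       elif [ "$1" -le 20 ]; then
           echo debug          # [6, 20]
       else
           echo trace          # > 20
       fi
   }

The interval boundaries are the part most easily misread: level ``6`` already
maps to ``debug``, while level ``20`` is still ``debug`` rather than ``trace``.
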
Please note, ``crimson-osd`` does not send logging messages to the specified
``log_file``. It writes them to stdout and/or syslog. This behavior can be
changed using the ``--log-to-stdout`` and ``--log-to-syslog`` command line
options. By default, ``log-to-stdout`` is enabled, and ``log-to-syslog`` is
disabled.


vstart.sh
---------

To facilitate the development of crimson, the following options are handy when
using ``vstart.sh``:

``--crimson``
  start ``crimson-osd`` instead of ``ceph-osd``

``--nodaemon``
  do not daemonize the service

``--redirect-output``
  redirect the stdout and stderr of service to ``out/$type.$num.stdout``.

``--osd-args``
  pass extra command line options to crimson-osd or ceph-osd. It's quite
  useful for passing Seastar options to crimson-osd. For instance, you could
  use ``--osd-args "--memory 2G"`` to set the memory to use. Please refer to
  the output of::

    crimson-osd --help-seastar

  for more Seastar specific command line options.

``--cyanstore``
  use the CyanStore as the object store backend.

``--bluestore``
  use the alienized BlueStore as the object store backend. This is the default
  setting, if not specified otherwise.

``--memstore``
  use the alienized MemStore as the object store backend.

So, a typical command to start a single-crimson-node cluster is::

  $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
    --without-dashboard --cyanstore \
    --crimson --redirect-output \
    --osd-args "--memory 4G"

Here we assign 4 GiB of memory and a single thread running on core 0 to
``crimson-osd``.

You could stop the vstart cluster using::

  $ ../src/stop.sh --crimson

Metrics and Tracing
===================

Crimson offers three ways to report the stats and metrics:

pg stats reported to mgr
------------------------

Crimson collects the per-pg, per-pool, and per-osd stats in an ``MPGStats``
message and sends it over to mgr, so that the mgr modules can query
them using the ``MgrModule.get()`` method.

asock command
-------------

An asock command is offered for dumping the metrics::

  $ ceph tell osd.0 dump_metrics
  $ ceph tell osd.0 dump_metrics reactor_utilization

Where ``reactor_utilization`` is an optional string allowing us to filter
the dumped metrics by prefix.

Prometheus text protocol
------------------------

The listening port and address can be configured using command line options
such as ``--prometheus_port``. See `Prometheus`_ for more details.

.. _Prometheus: https://github.com/scylladb/seastar/blob/master/doc/prometheus.md

Profiling Crimson
=================

fio
---

``crimson-store-nbd`` exposes configurable ``FuturizedStore`` internals as an
NBD server for use with fio.

To use fio to test ``crimson-store-nbd``,

#. You will need to install ``libnbd``, and compile fio like

   .. prompt:: bash $

      apt-get install libnbd-dev
      git clone git://git.kernel.dk/fio.git
      cd fio
      ./configure --enable-libnbd
      make

#. Build ``crimson-store-nbd``

   .. prompt:: bash $

      cd build
      ninja crimson-store-nbd

#. Run the ``crimson-store-nbd`` server with a block device. To test against a
   real block device, specify the path to the raw device, like
   ``/dev/nvme1n1``, in place of the created file.

   .. prompt:: bash $

      export disk_img=/tmp/disk.img
      export unix_socket=/tmp/store_nbd_socket.sock
      rm -f $disk_img $unix_socket
      truncate -s 512M $disk_img
      ./bin/crimson-store-nbd \
        --device-path $disk_img \
        --smp 1 \
        --mkfs true \
        --type transaction_manager \
        --uds-path ${unix_socket} &

   in which,

   ``--smp``
     how many CPU cores are used

   ``--mkfs``
     initialize the device first

   ``--type``
     which backend to use. If ``transaction_manager`` is specified, SeaStore's
     ``TransactionManager`` and ``BlockSegmentManager`` are used to emulate a
     block device. Otherwise, this option is used to choose a backend of
     ``FuturizedStore``, where the whole "device" is divided into multiple
     fixed-size objects whose size is specified by ``--object-size``. So, if
     you are only interested in testing the lower-level implementation of
     SeaStore, like the logical address translation layer and garbage
     collection, without the object store semantics, ``transaction_manager``
     would be a better choice.

#. Create an fio job file named ``nbd.fio``

   .. code:: ini

      [global]
      ioengine=nbd
      uri=nbd+unix:///?socket=${unix_socket}
      rw=randrw
      time_based
      runtime=120
      group_reporting
      iodepth=1
      size=512M

      [job0]
      offset=0

#. Test the crimson object store using the fio compiled just now

   .. prompt:: bash $

      ./fio nbd.fio

CBT
---

We can use `cbt`_ for performing perf tests::

  $ git checkout master
  $ make crimson-osd
  $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
  $ git checkout yet-another-pr
  $ make crimson-osd
  $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
  $ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
  19:48:23 - INFO     - cbt      - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0  => accepted
  19:48:23 - WARNING  - cbt      - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833  => rejected
  19:48:23 - INFO     - cbt      - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337  => accepted
  19:48:23 - WARNING  - cbt      - 1 tests failed out of 16

Here we compile and run the same test against two branches: ``master`` and
``yet-another-pr``. Then we compare the test results. Along with every test
case, a set of rules is defined to check whether we have performance
regressions when comparing two sets of test results. If a possible regression
is found, the rule and corresponding test results are highlighted.

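To make the rules in the output above easier to read, here is a rough sketch
of how a rule such as ``(or (greater) (near 0.05))`` could be evaluated on a
pair of values ``first/second``. It is not taken from cbt: the function names
are hypothetical, and it assumes that "near" means a relative difference of at
most the given tolerance with respect to the second value, which cbt may
define differently. Under that assumption,
``accept_less_or_near 10.4403 6.65833`` prints ``rejected`` and
``accept_greater_or_near 0.183165 0.186155`` prints ``accepted``, in line with
the log shown above.

.. code:: bash

   # ASSUMPTION: "near" is a relative tolerance against the second value.
   # Usage: accept_greater_or_near <first> <second> [tolerance]
   accept_greater_or_near() {
       awk -v a="$1" -v b="$2" -v tol="${3:-0.05}" 'BEGIN {
           d = a - b; if (d < 0) d = -d      # absolute difference
           r = (b < 0) ? -b : b              # magnitude of the second value
           print ((a > b || d <= tol * r) ? "accepted" : "rejected")
       }'
   }

   # Same idea for rules of the form "(or (less) (near 0.05))".
   accept_less_or_near() {
       awk -v a="$1" -v b="$2" -v tol="${3:-0.05}" 'BEGIN {
           d = a - b; if (d < 0) d = -d
           r = (b < 0) ? -b : b
           print ((a < b || d <= tol * r) ? "accepted" : "rejected")
       }'
   }
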
.. _cbt: https://github.com/ceph/cbt

Hacking Crimson
===============


Seastar Documents
-----------------

See `Seastar Tutorial <https://github.com/scylladb/seastar/blob/master/doc/tutorial.md>`_ .
Or build a browsable version and start an HTTP server::

  $ cd seastar
  $ ./configure.py --mode debug
  $ ninja -C build/debug docs
  $ python3 -m http.server -d build/debug/doc/html

You might want to install ``pandoc`` and other dependencies beforehand.

Debugging Crimson
=================

Debugging with GDB
------------------

The `tips`_ for debugging Scylla also apply to Crimson.

.. _tips: https://github.com/scylladb/scylla/blob/master/docs/guides/debugging.md#tips-and-tricks

Human-readable backtraces with addr2line
----------------------------------------

When a Seastar application crashes, it leaves us with a series of addresses, like::

  Segmentation fault.
  Backtrace:
    0x00000000108254aa
    0x00000000107f74b9
    0x00000000105366cc
    0x000000001053682c
    0x00000000105d2c2e
    0x0000000010629b96
    0x0000000010629c31
    0x00002a02ebd8272f
    0x00000000105d93ee
    0x00000000103eff59
    0x000000000d9c1d0a
    /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
    0x000000000d833ac9
  Segmentation fault

``seastar-addr2line`` offered by Seastar can be used to decipher these
addresses. After the script is started, it waits for input from stdin,
so we need to copy and paste the above addresses, then send EOF by typing
``control-D`` in the terminal::

  $ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd

    0x00000000108254aa
    0x00000000107f74b9
    0x00000000105366cc
    0x000000001053682c
    0x00000000105d2c2e
    0x0000000010629b96
    0x0000000010629c31
    0x00002a02ebd8272f
    0x00000000105d93ee
    0x00000000103eff59
    0x000000000d9c1d0a
    0x00000000108254aa
  [Backtrace #0]
  seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
  seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
  seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
  seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
  seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
  seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
  ?? ??:0
  seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
  seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
  main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)

Please note, ``seastar-addr2line`` is able to extract the addresses from
the input, so you can also paste the log messages like::

  2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
  2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e78dbc
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e7f0
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e8b8
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e985
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  /lib64/libpthread.so.0+0x0000000000012dbf

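To give an idea of what "extracting the addresses from the input" amounts to,
the sketch below pulls every hexadecimal address out of such log lines with
``grep``. It is not how ``seastar-addr2line`` is implemented, and the function
name ``extract_addresses`` is hypothetical.

.. code:: bash

   # Print every hexadecimal address found on stdin, one per line.
   # -o prints only the matching part, -E enables extended regexps.
   extract_addresses() {
       grep -oE '0x[0-9a-fA-F]+'
   }

Note that this simple filter drops the library name in front of an offset such
as ``/lib64/libpthread.so.0+0x0000000000012dbf``, so it illustrates the idea
only and is not a substitute for the script's own parsing.
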
Unlike classic OSD, crimson does not print a human-readable backtrace when it
handles fatal signals like ``SIGSEGV`` or ``SIGABRT``. It is even more
complicated when it comes to a stripped binary. So, until a signal handler for
those signals is planted in crimson, we can use ``script/ceph-debug-docker.sh``
to parse the addresses in the backtrace::

  # assuming you are under the source tree of ceph
  $ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
  ....
  [root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
  [root@3deb50a8ad51 ~]# dnf install -q -y file
  [root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
  # paste the backtrace here