=======
crimson
=======

Crimson is the code name of ``crimson-osd``, the next-generation ``ceph-osd``.
It targets fast networking and storage devices by leveraging state-of-the-art
technologies such as DPDK and SPDK for better performance. It will keep
supporting HDDs and low-end SSDs via BlueStore. Crimson will try to
be backward compatible with classic OSD.

.. highlight:: console

Building Crimson
================

Crimson is not enabled by default. To enable it::

  $ WITH_SEASTAR=true ./install-deps.sh
  $ mkdir build && cd build
  $ cmake -DWITH_SEASTAR=ON ..

Please note, `ASan`_ is enabled by default if crimson is built from a source
cloned using git.

.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer

Installing Crimson with ready-to-use images
===========================================

An alternative to building Crimson from source is to use container images built
by Ceph CI/CD and deploy them with one of the orchestrators: ``cephadm`` or ``Rook``.
This chapter documents the ``cephadm`` way.

NOTE: We know that this procedure is suboptimal, but it has passed internal and
external quality assurance::

  $ curl -L https://raw.githubusercontent.com/ceph/ceph-ci/wip-bharat-crimson/src/cephadm/cephadm -o cephadm
  $ cp cephadm /usr/sbin
  $ vi /usr/sbin/cephadm

In the file, change ``DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:master'``
to ``DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:<sha1>-crimson'``, where ``<sha1>``
is the commit ID built by the Ceph CI/CD. You may use
https://shaman.ceph.com/builds/ceph/ to monitor branches built by Ceph's Jenkins
and to discover those IDs.

An example::

  DEFAULT_IMAGE = 'quay.ceph.io/ceph-ci/ceph:1647216bf4ebac6bcf5ad7739e02b38569736cfd-crimson'

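The manual edit can also be scripted. The following is a minimal sketch and
not part of ``cephadm``: the helper name ``set_default_image`` is hypothetical,
and it assumes the script contains exactly one ``DEFAULT_IMAGE = '...'``
assignment at the start of a line.

.. code:: python

   import re


   def set_default_image(script_text: str, sha1: str) -> str:
       """Return script_text with DEFAULT_IMAGE pointing at the crimson image.

       Hypothetical helper; assumes one line of the form DEFAULT_IMAGE = '...'.
       """
       image = f"quay.ceph.io/ceph-ci/ceph:{sha1}-crimson"
       new_text, count = re.subn(
           r"^DEFAULT_IMAGE\s*=\s*'[^']*'",
           # A function replacement avoids any escape handling in the new value.
           lambda match: f"DEFAULT_IMAGE = '{image}'",
           script_text,
           flags=re.MULTILINE,
       )
       if count != 1:
           raise ValueError(f"expected one DEFAULT_IMAGE line, found {count}")
       return new_text

Reading ``/usr/sbin/cephadm``, passing its text through this function, and
writing the result back would replace the ``vi`` step. The function refuses to
touch a file whose layout does not match the assumption.
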
When the edit is finished::

  chmod 777 cephadm
  podman pull quay.ceph.io/ceph-ci/ceph:<sha1>-crimson
  cephadm bootstrap --mon-ip 10.1.172.208 --allow-fqdn-hostname
  # Set "PermitRootLogin yes" for other nodes you want to use
  echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config
  systemctl restart sshd

  ssh-copy-id -f -i /etc/ceph/ceph.pub root@<nodename>
  cephadm shell
  ceph orch host add <nodename>
  ceph orch apply osd --all-available-devices

Running Crimson
===============

As you might expect, Crimson is not yet on par with its predecessor feature-wise.

object store backend
--------------------

At the moment, ``crimson-osd`` offers both native and alienized object store
backends. The native object store backends perform IO using the Seastar reactor.
They are:

.. describe:: cyanstore

   CyanStore is modeled after memstore in classic OSD.

.. describe:: seastore

   Seastore is still under active development.

The alienized object store backends, in contrast, are backed by a thread pool,
which is a proxy of the alien store adaptor running in Seastar. The proxy issues
requests to object stores running in alien threads, i.e., worker threads not
managed by the Seastar framework. They are:

.. describe:: memstore

   The memory-backed object store.

.. describe:: bluestore

   The object store used by classic OSD by default.

daemonize
---------

Unlike ``ceph-osd``, ``crimson-osd`` does not daemonize itself even if the
``daemonize`` option is enabled. To read this option, ``crimson-osd`` needs to
ready its config sharded service, but this sharded service lives in the Seastar
reactor. If we forked a child process and exited the parent after starting the
Seastar engine, we would be left with a single thread: a replica of the thread
that called `fork()`_. Tackling this problem in Crimson would unnecessarily
complicate the code.

Since a lot of GNU/Linux distros use systemd nowadays, which is able to
daemonize the application, there is no need to daemonize by ourselves. Those
who use sysvinit can use ``start-stop-daemon`` to daemonize ``crimson-osd``.
If this is not acceptable, we can whip up a helper utility to do the trick.

.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

logging
-------

Currently, ``crimson-osd`` uses the logging utility offered by Seastar. See
``src/common/dout.h`` for the mapping between Ceph logging levels and
the severity levels in Seastar. For instance, the messages sent to ``derr``
will be printed using ``logger::error()``, and the messages with a debug level
over ``20`` will be printed using ``logger::trace()``.

+---------+---------+
| ceph    | seastar |
+---------+---------+
| < 0     | error   |
+---------+---------+
|   0     | warn    |
+---------+---------+
| [1, 6)  | info    |
+---------+---------+
| [6, 20] | debug   |
+---------+---------+
| >  20   | trace   |
+---------+---------+

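The table can be restated as a small function. This is only an illustrative
sketch: the name ``seastar_level`` is hypothetical, and the authoritative
mapping is the one in ``src/common/dout.h``.

.. code:: python

   def seastar_level(ceph_level: int) -> str:
       """Map a Ceph debug level to a Seastar severity, following the table."""
       if ceph_level < 0:
           return "error"
       if ceph_level == 0:
           return "warn"
       if ceph_level < 6:      # [1, 6)
           return "info"
       if ceph_level <= 20:    # [6, 20]
           return "debug"
       return "trace"          # > 20

For example, a message logged at Ceph level ``5`` maps to ``info``, while one
at level ``6`` maps to ``debug``.
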
Please note that ``crimson-osd`` does not send logging messages to the
specified ``log_file``. It writes them to stdout and/or syslog. This behavior
can be changed using the ``--log-to-stdout`` and ``--log-to-syslog`` command
line options. By default, ``log-to-stdout`` is enabled and ``log-to-syslog``
is disabled.

vstart.sh
---------

To facilitate the development of Crimson, the following options are handy when
using ``vstart.sh``:

``--crimson``
    start ``crimson-osd`` instead of ``ceph-osd``

``--nodaemon``
    do not daemonize the service

``--redirect-output``
    redirect the stdout and stderr of the service to ``out/$type.$num.stdout``.

``--osd-args``
    pass extra command line options to crimson-osd or ceph-osd. It's quite
    useful for passing Seastar options to crimson-osd. For instance, you could
    use ``--osd-args "--memory 2G"`` to set the memory to use. Please refer to
    the output of::

      crimson-osd --help-seastar

    for more Seastar specific command line options.

``--cyanstore``
    use CyanStore as the object store backend.

``--bluestore``
    use the alienized BlueStore as the object store backend. This is the default
    setting, if not specified otherwise.

``--memstore``
    use the alienized MemStore as the object store backend.

So, a typical command to start a single-crimson-node cluster is::

  $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
    --without-dashboard --cyanstore \
    --crimson --redirect-output \
    --osd-args "--memory 4G"

Here we assign 4 GiB of memory and a single thread running on core-0 to crimson-osd.

You could stop the vstart cluster using::

  $ ../src/stop.sh --crimson

Metrics and Tracing
===================

Crimson offers three ways to report the stats and metrics:

pg stats reported to mgr
------------------------

Crimson collects the per-pg, per-pool, and per-osd stats in an ``MPGStats``
message and sends it to the mgr, so that the mgr modules can query
them using the ``MgrModule.get()`` method.

asock command
-------------

An asock command is offered for dumping the metrics::

  $ ceph tell osd.0 dump_metrics
  $ ceph tell osd.0 dump_metrics reactor_utilization

Here ``reactor_utilization`` is an optional string that filters
the dumped metrics by prefix.

Prometheus text protocol
------------------------

The listening port and address can be configured using command line options
such as ``--prometheus_port``.
See `Prometheus`_ for more details.

.. _Prometheus: https://github.com/scylladb/seastar/blob/master/doc/prometheus.md

Profiling Crimson
=================

fio
---

``crimson-store-nbd`` exposes configurable ``FuturizedStore`` internals as an
NBD server for use with fio.

To use fio to test ``crimson-store-nbd``,

#. You will need to install ``libnbd``, and compile fio like

   .. prompt:: bash $

      apt-get install libnbd-dev
      git clone git://git.kernel.dk/fio.git
      cd fio
      ./configure --enable-libnbd
      make

#. Build ``crimson-store-nbd``

   .. prompt:: bash $

      cd build
      ninja crimson-store-nbd

#. Run the ``crimson-store-nbd`` server with a block device. To test with a
   real block device, specify the path to the raw device, like ``/dev/nvme1n1``,
   in place of the created file.

   .. prompt:: bash $

      export disk_img=/tmp/disk.img
      export unix_socket=/tmp/store_nbd_socket.sock
      rm -f $disk_img $unix_socket
      truncate -s 512M $disk_img
      ./bin/crimson-store-nbd \
        --device-path $disk_img \
        --smp 1 \
        --mkfs true \
        --type transaction_manager \
        --uds-path ${unix_socket} &

   in which,

   ``--smp``
     how many CPU cores are used

   ``--mkfs``
     initialize the device first

   ``--type``
     which backend to use. If ``transaction_manager`` is specified, SeaStore's
     ``TransactionManager`` and ``BlockSegmentManager`` are used to emulate a
     block device. Otherwise, this option is used to choose a backend of
     ``FuturizedStore``, where the whole "device" is divided into multiple
     fixed-size objects whose size is specified by ``--object-size``. So, if
     you are only interested in testing the lower-level implementation of
     SeaStore, like the logical address translation layer and garbage
     collection, without the object store semantics, ``transaction_manager``
     would be a better choice.

#. Create an fio job file named ``nbd.fio``

   .. code:: ini

      [global]
      ioengine=nbd
      uri=nbd+unix:///?socket=${unix_socket}
      rw=randrw
      time_based
      runtime=120
      group_reporting
      iodepth=1
      size=512M

      [job0]
      offset=0

#. Test the crimson object store using the fio compiled just now

   .. prompt:: bash $

      ./fio nbd.fio

CBT
---

We can use `cbt`_ for performing perf tests::

  $ git checkout master
  $ make crimson-osd
  $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
  $ git checkout yet-another-pr
  $ make crimson-osd
  $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
  $ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
  19:48:23 - INFO     - cbt      - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0  => accepted
  19:48:23 - WARNING  - cbt      - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833  => rejected
  19:48:23 - INFO     - cbt      - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495  => accepted
  19:48:23 - INFO     - cbt      - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
  19:48:23 - INFO     - cbt      - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337  => accepted
  19:48:23 - WARNING  - cbt      - 1 tests failed out of 16

Here we compile and run the same test against two branches: ``master`` and
``yet-another-pr``. Then we compare the test results. Along with every test
case, a set of rules is defined to check whether we have performance
regressions when comparing the two sets of test results. If a possible
regression is found, the rule and the corresponding test results are highlighted.

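To make the rule strings in the log easier to read, here is a minimal sketch
of how a rule such as ``(or (greater) (near 0.05))`` could be evaluated. It is
an interpretation of the output shown above, not cbt's implementation: the
function name ``accepted`` is hypothetical, the first number of each pair is
assumed to be the value under test and the second the reference, and "near" is
assumed to be a relative tolerance.

.. code:: python

   import math


   def accepted(value: float, reference: float, direction: str,
                tolerance: float = 0.05) -> bool:
       """Accept if value beats reference in the given direction, or is near it.

       Assumption: "near" means within a relative tolerance of the larger value.
       """
       if direction == "greater":
           better = value > reference
       elif direction == "less":
           better = value < reference
       else:
           raise ValueError(f"unknown direction: {direction!r}")
       near = math.isclose(value, reference, rel_tol=tolerance)
       return better or near

Under these assumptions the sketch agrees with the log: ``0.183165/0.186155``
passes the ``greater`` rule because the two values are within 5% of each other,
while ``10.4403/6.65833`` fails the ``less`` rule.
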
.. _cbt: https://github.com/ceph/cbt

Hacking Crimson
===============


Seastar Documents
-----------------

See `Seastar Tutorial <https://github.com/scylladb/seastar/blob/master/doc/tutorial.md>`_ .
Or build a browsable version and start an HTTP server::

  $ cd seastar
  $ ./configure.py --mode debug
  $ ninja -C build/debug docs
  $ python3 -m http.server -d build/debug/doc/html

You might want to install ``pandoc`` and other dependencies beforehand.

Debugging Crimson
=================

Debugging with GDB
------------------

The `tips`_ for debugging Scylla also apply to Crimson.

.. _tips: https://github.com/scylladb/scylla/blob/master/docs/guides/debugging.md#tips-and-tricks

Human-readable backtraces with addr2line
----------------------------------------

When a Seastar application crashes, it leaves us with a series of addresses, like::

  Segmentation fault.
  Backtrace:
  0x00000000108254aa
  0x00000000107f74b9
  0x00000000105366cc
  0x000000001053682c
  0x00000000105d2c2e
  0x0000000010629b96
  0x0000000010629c31
  0x00002a02ebd8272f
  0x00000000105d93ee
  0x00000000103eff59
  0x000000000d9c1d0a
  /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
  0x000000000d833ac9
  Segmentation fault

``seastar-addr2line`` offered by Seastar can be used to decipher these
addresses. After the script starts, it waits for input from stdin,
so we need to copy and paste the above addresses, then send EOF by pressing
``control-D`` in the terminal::

  $ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd

  0x00000000108254aa
  0x00000000107f74b9
  0x00000000105366cc
  0x000000001053682c
  0x00000000105d2c2e
  0x0000000010629b96
  0x0000000010629c31
  0x00002a02ebd8272f
  0x00000000105d93ee
  0x00000000103eff59
  0x000000000d9c1d0a
  0x00000000108254aa
  [Backtrace #0]
  seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
  seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
  seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
  seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
  seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
  seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
  ?? ??:0
  seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
  seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
  main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)

Please note that ``seastar-addr2line`` is able to extract the addresses from
its input, so you can also paste log messages like::

  2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
  2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e78dbc
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e7f0
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e8b8
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e985
  2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  /lib64/libpthread.so.0+0x0000000000012dbf

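To illustrate what "extracting the addresses" means for such log lines, the
sketch below pulls bare addresses and ``<path>+<offset>`` pairs out of
arbitrary text with a regular expression. It is not the code used by
``seastar-addr2line``; the pattern and the name ``extract_addresses`` are
assumptions made for this example.

.. code:: python

   import re

   # A bare hexadecimal address, optionally preceded by "<path>+" for frames
   # that live in a shared object.
   ADDRESS = re.compile(r"(?:(?P<path>/\S+)\+)?(?P<addr>0x[0-9a-fA-F]+)")


   def extract_addresses(lines):
       """Yield (path, address) pairs; path is None for bare addresses."""
       for line in lines:
           for match in ADDRESS.finditer(line):
               yield match.group("path"), match.group("addr")

Timestamps and logger prefixes are skipped because they contain no ``0x``
token, so only the frames of the backtrace are yielded.
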
Unlike classic OSD, Crimson does not print a human-readable backtrace when it
handles fatal signals like ``SIGSEGV`` or ``SIGABRT``. And it is more complicated
when it comes to a stripped binary. So before planting a signal handler for
those signals in Crimson, we can use ``script/ceph-debug-docker.sh`` to parse
the addresses in the backtrace::

  # assuming you are under the source tree of ceph
  $ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
  ....
  [root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
  [root@3deb50a8ad51 ~]# dnf install -q -y file
  [root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
  # paste the backtrace here