Notice

This document is for a development version of Ceph.

Crimson developer documentation

See Crimson’s User Guide for more information.

Building Crimson

Crimson is not enabled by default. Enable it at build time by running:

$ WITH_CRIMSON=true ./install-deps.sh
$ ./do_cmake.sh -DWITH_CRIMSON=ON

Please note, ASan is enabled by default if Crimson is built from a source cloned using git.

vstart.sh

Note

vstart.sh enables crimson Crimson Required Flags automatically when --crimson is used.

The following options can be used with vstart.sh.

--crimson

Start crimson-osd instead of ceph-osd.

--nodaemon

Do not daemonize the service.

--redirect-output

Redirect the stdout and stderr to out/$type.$num.stdout.

--osd-args

Pass extra command line options to crimson-osd or ceph-osd. This is useful for passing Seastar options to crimson-osd. For example, one can supply --osd-args "--memory 2G" to set the amount of memory to use. Please refer to the output of:

crimson-osd --help-seastar

for additional Seastar-specific command line options.

--crimson-smp

The number of cores to use for each OSD. If BlueStore is used, the balance of available cores (as determined by nproc) will be assigned to the object store.

--bluestore

Use the alienized BlueStore as the object store backend. This is the default (see above section on the Crimson’s Object Store Backends for more details)

--cyanstore

Use CyanStore as the object store backend.

--memstore

Use the alienized MemStore as the object store backend.

--seastore

Use SeaStore as the back end object store.

--seastore-devs

Specify the block device used by SeaStore.

--seastore-secondary-devs

Optional. SeaStore supports multiple devices. Enable this feature by passing the block device to this option.

--seastore-secondary-devs-type

Optional. Specify the type of secondary devices. When the secondary device is slower than main device passed to --seastore-devs, cold data on the faster device will be migrated to the slower devices over time. Valid types include HDD, SSD``(default), ``ZNS, and RANDOM_BLOCK_SSD Note secondary devices should not be faster than the main device.

To start a cluster with a single Crimson node, run:

$  MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh \
  --without-dashboard --bluestore --crimson \
  --redirect-output

Another SeaStore example:

$  MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
  --without-dashboard --seastore \
  --crimson --redirect-output \
  --seastore-devs /dev/sda \
  --seastore-secondary-devs /dev/sdb \
  --seastore-secondary-devs-type HDD

Stop this vstart cluster by running:

$ ../src/stop.sh --crimson

daemonize

Unlike ceph-osd, crimson-osd does not daemonize itself even if the daemonize option is enabled. In order to read this option, crimson-osd needs to ready its config sharded service, but this sharded service lives in the Seastar reactor. If we fork a child process and exit the parent after starting the Seastar engine, that will leave us with a single thread which is a replica of the thread that called fork(). Tackling this problem in Crimson would unnecessarily complicate the code.

Since supported GNU/Linux distributions use systemd, which is able to daemonize processes, there is no need to daemonize ourselves. Those using sysvinit can use start-stop-daemon to daemonize crimson-osd. If this is does not work out, a helper utility may be devised.

logging

Crimson-osd currently uses the logging utility offered by Seastar. See src/common/dout.h for the mapping between Ceph logging levels to the severity levels in Seastar. For instance, messages sent to derr will be issued using logger::error(), and the messages with a debug level greater than 20 will be issued using logger::trace().

ceph

seastar

< 0

error

0

warn

[1, 6)

info

[6, 20]

debug

> 20

trace

Note that crimson-osd does not send log messages directly to a specified log_file. It writes the logging messages to stdout and/or syslog. This behavior can be changed using --log-to-stdout and --log-to-syslog command line options. By default, --log-to-stdout is enabled, and --log-to-syslog is disabled.

Profiling Crimson

Fio

crimson-store-nbd exposes configurable FuturizedStore internals as an NBD server for use with fio.

In order to use fio to test crimson-store-nbd, perform the below steps.

  1. You will need to install libnbd, and compile it into fio

    apt-get install libnbd-dev
    git clone git://git.kernel.dk/fio.git
    cd fio
    ./configure --enable-libnbd
    make
    
  2. Build crimson-store-nbd

    cd build
    ninja crimson-store-nbd
    
  3. Run the crimson-store-nbd server with a block device. Specify the path to the raw device, for example /dev/nvme1n1, in place of the created file for testing with a block device.

    export disk_img=/tmp/disk.img
    export unix_socket=/tmp/store_nbd_socket.sock
    rm -f $disk_img $unix_socket
    truncate -s 512M $disk_img
    ./bin/crimson-store-nbd \
      --device-path $disk_img \
      --smp 1 \
      --mkfs true \
      --type transaction_manager \
      --uds-path ${unix_socket} &
    

    Below are descriptions of these command line arguments:

    --smp

    The number of CPU cores to use (Symmetric MultiProcessor)

    --mkfs

    Initialize the device first.

    --type

    The back end to use. If transaction_manager is specified, SeaStore’s TransactionManager and BlockSegmentManager are used to emulate a block device. Otherwise, this option is used to choose a backend of FuturizedStore, where the whole “device” is divided into multiple fixed-size objects whose size is specified by --object-size. So, if you are only interested in testing the lower-level implementation of SeaStore like logical address translation layer and garbage collection without the object store semantics, transaction_manager would be a better choice.

  4. Create a fio job file named nbd.fio

    [global]
    ioengine=nbd
    uri=nbd+unix:///?socket=${unix_socket}
    rw=randrw
    time_based
    runtime=120
    group_reporting
    iodepth=1
    size=512M
    
    [job0]
    offset=0
    
  5. Test the Crimson object store, using the custom fio built just now

    ./fio nbd.fio
    

CBT

We can use cbt for performance tests:

$ git checkout main
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ git checkout yet-another-pr
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
19:48:23 - INFO     - cbt      - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155  => accepted
19:48:23 - INFO     - cbt      - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0  => accepted
19:48:23 - WARNING  - cbt      - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833  => rejected
19:48:23 - INFO     - cbt      - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712  => accepted
19:48:23 - INFO     - cbt      - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619  => accepted
19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0  => accepted
19:48:23 - INFO     - cbt      - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495  => accepted
19:48:23 - INFO     - cbt      - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0  => accepted
19:48:23 - INFO     - cbt      - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337  => accepted
19:48:23 - WARNING  - cbt      - 1 tests failed out of 16

Here we compile and run the same test against two branches: main and yet-another-pr. We then compare the results. Along with every test case, a set of rules is defined to check for performance regressions when comparing the sets of test results. If a possible regression is found, the rule and corresponding test results are highlighted.

Hacking Crimson

Seastar Documents

See Seastar Tutorial . Or build a browsable version and start an HTTP server:

$ cd seastar
$ ./configure.py --mode debug
$ ninja -C build/debug docs
$ python3 -m http.server -d build/debug/doc/html

Install pandoc and other dependencies beforehand.

Debugging Crimson

Debugging with GDB

The tips for debugging Scylla also apply to Crimson.

Human-readable backtraces with addr2line

When a Seastar application crashes, it leaves us with a backtrace of addresses, like:

Segmentation fault.
Backtrace:
  0x00000000108254aa
  0x00000000107f74b9
  0x00000000105366cc
  0x000000001053682c
  0x00000000105d2c2e
  0x0000000010629b96
  0x0000000010629c31
  0x00002a02ebd8272f
  0x00000000105d93ee
  0x00000000103eff59
  0x000000000d9c1d0a
  /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
  0x000000000d833ac9
Segmentation fault

The seastar-addr2line utility provided by Seastar can be used to map these addresses to functions. The script expects input on stdin, so we need to copy and paste the above addresses, then send EOF by inputting control-D in the terminal. One might use echo or cat instead:

$ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd

  0x00000000108254aa
  0x00000000107f74b9
  0x00000000105366cc
  0x000000001053682c
  0x00000000105d2c2e
  0x0000000010629b96
  0x0000000010629c31
  0x00002a02ebd8272f
  0x00000000105d93ee
  0x00000000103eff59
  0x000000000d9c1d0a
  0x00000000108254aa
[Backtrace #0]
seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
?? ??:0
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)

Note that seastar-addr2line is able to extract addresses from its input, so you can also paste the log messages as below:

2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e78dbc
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e7f0
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e8b8
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  0x0000000000e3e985
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr:  /lib64/libpthread.so.0+0x0000000000012dbf

Unlike the classic ceph-osd, Crimson does not print a human-readable backtrace when it handles fatal signals like SIGSEGV or SIGABRT. It is also more complicated with a stripped binary. So instead of planting a signal handler for those signals into Crimson, we can use script/ceph-debug-docker.sh to map addresses in the backtrace:

# assuming you are under the source tree of ceph
$ ./src/script/ceph-debug-docker.sh  --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
....
[root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
[root@3deb50a8ad51 ~]# dnf install -q -y file
[root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
# paste the backtrace here

Code Walkthroughs

Contents

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.