CpuTrace

CpuTrace is a developer tool that measures the CPU cost of execution. It is useful when deciding between algorithms for new code and for validating performance enhancements. CpuTrace measures CPU instructions, clock cycles, branch mispredictions, cache misses and thread reschedules.

Integration into Ceph

To enable CpuTrace, build with the WITH_CPUTRACE flag:

./do_cmake.sh -DWITH_CPUTRACE=1

Once built with CpuTrace support, you can annotate specific functions or code regions using the provided macros and helper classes.

To enable profiling in your code, include the CpuTrace header:

#include "common/cputrace.h"

The following sections describe these helpers, starting with the lowest-level interface.

Raw counter mode

CpuTrace is built on the Linux perf_event_open syscall. At the lowest level, you can use it as a thin helper for reading hardware performance counters directly.

// Goal: find out how many clock cycles and how many
// thread switches a piece of code takes.
HW_ctx hw = HW_ctx_empty;
HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC);
sample_t start, end;
HW_read(&hw, &start);
// my code starts
// .....
// my code ends
HW_read(&hw, &end);
uint64_t task_switches = end.swi - start.swi;
uint64_t clock_cycles  = end.cyc - start.cyc;
HW_clean(&hw);

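By inspecting task_switches and clock_cycles, a developer might learn, for example, that a region with 10 ms of wall-clock time consumed only 1M clock cycles but went through 2 task switches. That suggests most of the elapsed time was spent off the CPU rather than executing the measured code.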
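The subtraction of two snapshots shown above can be wrapped in a small helper. The sketch below is only an illustration: `sample_snapshot` is a local stand-in that mirrors the `sample_t` fields listed in the API reference, and `delta()` is a hypothetical helper, not a function provided by CpuTrace.

```cpp
#include <cstdint>

// Stand-in that mirrors the sample_t fields from the API reference.
// Defined locally so the sketch compiles without Ceph headers.
struct sample_snapshot {
  uint64_t swi = 0;    // context switches
  uint64_t cyc = 0;    // clock cycles
  uint64_t cmiss = 0;  // cache misses
  uint64_t bmiss = 0;  // branch mispredictions
  uint64_t ins = 0;    // instructions
};

// Hypothetical helper: per-counter difference between two snapshots.
// It assumes `end` was read after `start` from the same context,
// so every counter in `end` is at least as large as in `start`.
inline sample_snapshot delta(const sample_snapshot& start,
                             const sample_snapshot& end) {
  sample_snapshot d;
  d.swi = end.swi - start.swi;
  d.cyc = end.cyc - start.cyc;
  d.cmiss = end.cmiss - start.cmiss;
  d.bmiss = end.bmiss - start.bmiss;
  d.ins = end.ins - start.ins;
  return d;
}
```

Counters that were not requested at initialization would simply stay at zero in both snapshots under this model, so their difference is zero as well.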

Aggregating samples

A single readout is usually not enough. Aggregating many samples gives a more realistic measurement of the actual execution cost.

// a variable to hold my measurement
static measurement_t my_code_time;
sample_t start, end, elapsed;
// hw initialized somewhere else
HW_read(&hw, &start);
// my code starts
// .....
// my code ends
HW_read(&hw, &end);
elapsed = end - start;
// add new sample to the whole measurement
my_code_time.sample(elapsed);

`measurement_t`

The measurement_t type aggregates collected samples and counts the number of measurements performed.

It produces summary statistics that include:

count – total number of measurements
average – mean value across all samples
zero / non-zero split – how many measurements were exactly zero versus greater than zero (context switch metrics only)

These statistics provide a compact and clear view of performance measurements.
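To make the statistics above concrete, here is a minimal sketch of how such an aggregator could work. `cycle_stats` is a hypothetical stand-in loosely modeled on the `measurement_t` fields in the API reference; it tracks only cycles and context switches and is not the CpuTrace implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical aggregator, loosely modeled on the measurement_t fields.
struct cycle_stats {
  uint64_t sample_count = 0;
  uint64_t sum_cyc = 0;
  uint64_t sum_swi = 0;
  uint64_t zero_swi_count = 0;
  uint64_t non_zero_swi_count = 0;

  // Fold one measured region into the running totals.
  void add(uint64_t cyc, uint64_t swi) {
    ++sample_count;
    sum_cyc += cyc;
    sum_swi += swi;
    if (swi == 0) {
      ++zero_swi_count;
    } else {
      ++non_zero_swi_count;
    }
    // Invariant: every sample lands in exactly one of the two buckets.
    assert(zero_swi_count + non_zero_swi_count == sample_count);
  }

  // Mean cycles per sample; defined as 0 when nothing was recorded.
  double avg_cyc() const {
    return sample_count
        ? static_cast<double>(sum_cyc) / static_cast<double>(sample_count)
        : 0.0;
  }
};
```

The zero / non-zero split is useful because a single rescheduled sample can dominate an average; seeing how many samples had no context switch at all helps judge how trustworthy the mean is.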

measurement_t can also export results in two formats:

Ceph Formatter (for structured JSON/YAML/XML output):

ceph::Formatter* jf;
m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output

String stream (for plain-text logging):

std::stringstream ss;
m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
std::cout << ss.str();

This makes it easy to either integrate measurements into Ceph’s structured output pipeline or dump them as human-readable text for debugging.
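The flag argument in the calls above selects which counters appear in the output. The sketch below illustrates that idea with stand-in types: `totals`, `SHOW_CYC`, `SHOW_INS`, and `dump_selected()` are all hypothetical, and the `name=value` text layout is an assumption rather than the format CpuTrace actually emits.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Local stand-ins for two of the HW_PROFILE_* flag bits.
constexpr uint64_t SHOW_CYC = 1ULL << 1;
constexpr uint64_t SHOW_INS = 1ULL << 4;

// Hypothetical totals for one measured region.
struct totals {
  uint64_t count = 0;
  uint64_t sum_cyc = 0;
  uint64_t sum_ins = 0;
};

// Write the sample count, then only the counters selected in `mask`,
// one "name=value" pair per line (layout chosen for illustration).
inline void dump_selected(const totals& t, std::ostream& os, uint64_t mask) {
  os << "count=" << t.count << "\n";
  if (mask & SHOW_CYC) {
    os << "sum_cyc=" << t.sum_cyc << "\n";
  }
  if (mask & SHOW_INS) {
    os << "sum_ins=" << t.sum_ins << "\n";
  }
}
```

Passing only `SHOW_CYC` omits the instruction total, which keeps log output short when a single counter is of interest.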

RAII samples

It is usually most convenient to use RAII to collect samples. With RAII, measurement begins automatically when the guard object is created and ends when it goes out of scope, so no explicit start/stop calls are required.

The hardware context (HW_ctx) must be initialized once before creating guards. After initialization, the same context can be reused across multiple measurements.

HW_guard takes two arguments:

HW_ctx* ctx – Pointer to the initialized hardware context.
measurement_t* m – Pointer to the measurement object where results will be stored.

Example:

// variable to hold measurement results
static measurement_t my_code_time;
{
  HW_guard guard(&hw, &my_code_time);
  // code to be measured
  // ...
}
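The RAII pattern behind such a guard can be sketched independently of the hardware. In the example below, `scoped_counter`, `region_total`, and `fake_counter()` are hypothetical names; the fake counter stands in for the role that `HW_read()` plays, so the behaviour is deterministic and does not depend on perf counters being available.

```cpp
#include <cstdint>

// Hypothetical counter source: a function returning a monotonically
// increasing value. In CpuTrace this role is played by HW_read().
using counter_fn = uint64_t (*)();

// Hypothetical accumulator for one measured region.
struct region_total {
  uint64_t samples = 0;
  uint64_t sum = 0;
};

// Reads the counter on construction and again on destruction, then adds
// the difference to the target. Copying is disabled so that each scope
// contributes exactly one sample.
class scoped_counter {
public:
  scoped_counter(counter_fn read, region_total* out)
    : read_(read), out_(out), start_(read()) {}
  ~scoped_counter() {
    out_->sum += read_() - start_;
    ++out_->samples;
  }
  scoped_counter(const scoped_counter&) = delete;
  scoped_counter& operator=(const scoped_counter&) = delete;

private:
  counter_fn read_;
  region_total* out_;
  uint64_t start_;
};

// Deterministic fake counter for demonstration: advances by 10 per read.
inline uint64_t fake_counter() {
  static uint64_t ticks = 0;
  ticks += 10;
  return ticks;
}
```

Because the sample is recorded in the destructor, it is collected on every exit path from the scope, including early returns and exceptions.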

Named measurements

Code regions can be measured using a named guard. Each HW_named_guard automatically starts measurement at construction and stops when leaving scope.

{
  HW_named_guard guard("function", &hw);
  // my code starts
  // ...
  // my code ends
}

This example records the cost of the enclosed region under the name "function".

The guard requires a pointer to a previously initialized HW_ctx. This context must be created and set up (e.g., during program initialization) before guards can be used.

Named guards provide a simple and consistent way to track performance metrics.

To later access the collected measurements for a given name, use:

measurement_t* m = get_named_measurement("function");
if (m) {
  // inspect m->sum_cyc, m->sum_ins.
  // m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC);
}
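The lookup-by-name behaviour, including the null result for an unknown name, can be modeled with an ordinary map. This is a sketch under stated assumptions: `name_registry` and `named_total` are hypothetical, the class is not thread-safe, and nothing here reflects how CpuTrace stores its named measurements internally.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-name totals.
struct named_total {
  uint64_t samples = 0;
  uint64_t sum_cyc = 0;
};

// Hypothetical registry keyed by region name (not thread-safe).
class name_registry {
public:
  // Record one sample, creating the entry on first use.
  void record(const std::string& name, uint64_t cyc) {
    named_total& t = entries_[name];
    ++t.samples;
    t.sum_cyc += cyc;
  }

  // Return the entry for `name`, or nullptr if nothing was recorded
  // under that name, which is why callers check the pointer first.
  const named_total* find(const std::string& name) const {
    auto it = entries_.find(name);
    return it == entries_.end() ? nullptr : &it->second;
  }

private:
  std::map<std::string, named_total> entries_;
};
```

`std::map` keeps pointers to existing entries valid when new names are inserted, so a pointer returned by `find()` stays usable while other regions are being recorded.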

Admin socket integration

In addition to direct instrumentation in code, CpuTrace can also be controlled at runtime via the admin socket interface. This allows developers to start, stop, and inspect profiling in running, already instrumented Ceph daemons without rebuilding or restarting them.

To profile a function, annotate it with the provided macros:

HWProfileFunctionF(profile, __func__,
                   HW_PROFILE_CYC  | HW_PROFILE_CMISS |
                   HW_PROFILE_INS  | HW_PROFILE_BMISS |
                   HW_PROFILE_SWI);

profile is a local variable name for the profiler object and only needs to be unique within the profiling scope.
__func__ (or any string you pass as the name) is the unique anchor name for this profiling scope.

Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure.

This macro automatically attaches a profiler to the function scope and collects the specified hardware counters each time the function executes.

You can combine any of the available flags:

HW_PROFILE_CYC – CPU cycles
HW_PROFILE_CMISS – Cache misses
HW_PROFILE_BMISS – Branch mispredictions
HW_PROFILE_INS – Instructions retired
HW_PROFILE_SWI – Context switches

Available commands:

cputrace start – Start profiling with the configured groups/counters
cputrace stop – Stop profiling and freeze results
cputrace dump – Dump all collected metrics (as JSON or plain text)
cputrace reset – Reset all captured data

Profiling counters are cumulative. cputrace stop pauses profiling without resetting values. cputrace start resumes accumulation. Use cputrace reset to clear all collected metrics.

Example usage from the command line:

# Start profiling on OSD.0
ceph tell osd.0 cputrace start

# Stop profiling
ceph tell osd.0 cputrace stop

# Dump results
ceph tell osd.0 cputrace dump

# Reset counters
ceph tell osd.0 cputrace reset

These commands can be repeated multiple times: developers typically start before a workload, stop afterwards, and then dump the results to analyze them.

cputrace dump supports optional arguments to filter by logger or counter, so only a subset of metrics can be reported when needed.

cputrace reset clears all data, preparing for a fresh round of profiling.
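The cumulative start/stop/reset semantics described above can be modeled with a small state machine. `trace_state` is a hypothetical toy, not CpuTrace code; in particular, it assumes that reset clears the data without changing whether profiling is active, which the text above does not specify.

```cpp
#include <cstdint>

// Toy model of the cumulative start / stop / dump / reset behaviour.
class trace_state {
public:
  void start() { active_ = true; }   // begin or resume accumulation
  void stop() { active_ = false; }   // pause; collected values are kept
  void reset() { total_ = 0; }       // clear data (assumed to keep state)

  // Called by instrumented code; ignored while profiling is stopped.
  void record(uint64_t value) {
    if (active_) {
      total_ += value;
    }
  }

  // Report what has been collected so far.
  uint64_t dump() const { return total_; }

private:
  bool active_ = false;
  uint64_t total_ = 0;
};
```

In this model a start, stop, start sequence keeps adding to the same total, and only reset returns the total to zero.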

API Reference

Enums

enum cputrace_flags {
    HW_PROFILE_SWI   = (1ULL << 0), // Context switches
    HW_PROFILE_CYC   = (1ULL << 1), // CPU cycles
    HW_PROFILE_CMISS = (1ULL << 2), // Cache misses
    HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions
    HW_PROFILE_INS   = (1ULL << 4), // Instructions retired
};

The bitwise | operator may be used to combine these flags.
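As an illustration of combining and testing these bits, the sketch below uses a local copy of the values listed above. `profile_flag`, `has_flag()`, and `kCyclesAndInstructions` are hypothetical names introduced so the example compiles without Ceph headers.

```cpp
#include <cstdint>

// Local copy of the flag values listed above.
enum profile_flag : uint64_t {
  FLAG_SWI   = 1ULL << 0,  // context switches
  FLAG_CYC   = 1ULL << 1,  // CPU cycles
  FLAG_CMISS = 1ULL << 2,  // cache misses
  FLAG_BMISS = 1ULL << 3,  // branch mispredictions
  FLAG_INS   = 1ULL << 4,  // instructions retired
};

// True when `flag` is set in `mask`.
constexpr bool has_flag(uint64_t mask, profile_flag flag) {
  return (mask & flag) != 0;
}

// Example mask: cycles and instructions only (bits 1 and 4).
constexpr uint64_t kCyclesAndInstructions = FLAG_CYC | FLAG_INS;
static_assert(kCyclesAndInstructions == 18, "bits 1 and 4 are set");
```

Because each flag occupies its own bit, any combination can be stored in a single integer and tested with a bitwise AND.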

Data structures

sample_t – holds a single hardware counter snapshot.

struct sample_t {
  uint64_t swi;   // context switches
  uint64_t cyc;   // clock cycles
  uint64_t cmiss; // cache misses
  uint64_t bmiss; // branch mispredictions
  uint64_t ins;   // instructions
};

measurement_t – accumulates multiple samples and computes totals/averages and other useful metrics.

struct measurement_t {
  uint64_t call_count = 0;
  uint64_t sample_count = 0;
  uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0;
  uint64_t non_zero_swi_count = 0;
  uint64_t zero_swi_count = 0;
};

HW_ctx – encapsulates perf-event file descriptors for one measurement context.

extern HW_ctx HW_ctx_empty;

Low-level API

void HW_init(HW_ctx* ctx, cputrace_flags flags) – initialize perf counters.
void HW_read(HW_ctx* ctx, sample_t* out) – read current counter values.
void HW_clean(HW_ctx* ctx) – release perf counters.
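For readers unfamiliar with perf_event_open, the sketch below shows the general open / read / close pattern for a single CPU-cycle counter on Linux. It is not the CpuTrace implementation: `open_cycle_counter()`, `read_counter()`, and `close_counter()` are hypothetical helpers, and how `HW_init`, `HW_read`, and `HW_clean` use the syscall internally is not shown here. On systems where the counter is unavailable (non-Linux builds, restrictive `perf_event_paranoid` settings, virtual machines without a PMU), the helpers report failure instead of aborting.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__linux__)
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Open a user-space CPU-cycle counter for the calling thread.
// Returns a file descriptor, or -1 if the kernel refuses the request.
inline int open_cycle_counter() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  attr.exclude_kernel = 1;  // count user-space cycles only
  attr.exclude_hv = 1;      // ignore hypervisor cycles
  // pid = 0 and cpu = -1: this thread on any CPU; no group, no flags.
  return static_cast<int>(
      syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
}

// Read the current counter value; returns false if the read fails.
inline bool read_counter(int fd, uint64_t* out) {
  return fd >= 0 &&
         read(fd, out, sizeof(*out)) == static_cast<ssize_t>(sizeof(*out));
}

// Release the counter.
inline void close_counter(int fd) {
  if (fd >= 0) {
    close(fd);
  }
}
#else
// Fallback stubs for platforms without perf_event_open.
inline int open_cycle_counter() { return -1; }
inline bool read_counter(int, uint64_t*) { return false; }
inline void close_counter(int) {}
#endif
```

A caller would read the counter before and after the region of interest and subtract the two values, which is the same pattern the low-level API above exposes through `HW_read`.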
