Notice
This document is for a development version of Ceph.
CpuTrace
CpuTrace is a developer tool that measures the CPU cost of execution. It is useful when deciding between algorithms for new code and for validating performance enhancements. CpuTrace measures CPU instructions, clock cycles, branch mispredictions, cache misses and thread reschedules.
Integration into Ceph
To enable CpuTrace, build with the WITH_CPUTRACE flag:
./do_cmake.sh -DWITH_CPUTRACE=1
Once built with CpuTrace support, you can annotate specific functions or code regions using the provided macros and helper classes.
To enable profiling in your code, include the CpuTrace header:
#include "common/cputrace.h"
Then you can mark functions for profiling using the provided helpers.
Raw counter mode
CpuTrace uses the Linux perf_event_open syscall. You can use the tool
as a simple helper to get access to hardware perf counters.
// I am profiling my code and want to know
// how many clock cycles and how many thread switches it takes
HW_ctx hw = HW_ctx_empty;
HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC);
sample_t start, end;
HW_read(&hw, &start);
// my code starts
// .....
// my code ends
HW_read(&hw, &end);
uint64_t task_switches = end.swi - start.swi;
uint64_t clock_cycles = end.cyc - start.cyc;
HW_clean(&hw);
By inspecting task_switches and clock_cycles, the developer might learn, for example,
that a region taking 10 ms of wall-clock time consumed only 1M clock cycles but was interrupted by 2 task switches.
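For readers unfamiliar with perf_event_open, the following is a minimal, Linux-only sketch of how a single counter can be opened and read directly. It is not CpuTrace's implementation: the helper names open_counter and read_counter are hypothetical, and the call may fail (returning -1) where perf events are restricted, for example inside containers or under a strict kernel.perf_event_paranoid setting.

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: open one counter for the calling thread on any CPU.
// Returns a file descriptor, or -1 if the kernel refuses the request.
static int open_counter(std::uint32_t type, std::uint64_t config) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = type;         // e.g. PERF_TYPE_HARDWARE
  attr.config = config;     // e.g. PERF_COUNT_HW_CPU_CYCLES
  attr.exclude_kernel = 1;  // count user-space work only
  attr.exclude_hv = 1;      // ignore hypervisor time
  // pid = 0 (this thread), cpu = -1 (any CPU), group_fd = -1, flags = 0
  return static_cast<int>(
      syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
}

// Hypothetical helper: read the current 64-bit value of the counter.
// With the default read_format, the kernel returns exactly one uint64_t.
static bool read_counter(int fd, std::uint64_t* value) {
  return read(fd, value, sizeof(*value)) ==
         static_cast<ssize_t>(sizeof(*value));
}
```

A caller would read the counter before and after the region of interest and subtract the two values, which is the same pattern as the HW_read example above. The descriptor should be closed with close() when it is no longer needed.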
Aggregating samples
A single readout of execution time is usually not enough. We need more samples to get a more realistic measurement of actual execution cost.
// a variable to hold my measurement
static measurement_t my_code_time;
sample_t start, end, elapsed;
// hw initialized somewhere else
HW_read(&hw, &start);
// my code starts
// .....
// my code ends
HW_read(&hw, &end);
elapsed = end - start;
// add new sample to the whole measurement
my_code_time.sample(elapsed);
measurement_t
The measurement_t type aggregates collected samples and counts the number
of measurements performed.
It produces summary statistics that include:
count : total number of measurements
average : mean value across all samples
zero / non-zero split : how many measurements were exactly zero versus greater than zero (only for context switch metrics)
These statistics provide a compact and clear view of performance measurements.
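The statistics listed above can be illustrated with a small stand-alone accumulator. This is only a sketch under assumptions: demo_sample, demo_measurement and average_cyc are hypothetical names, only two of the five counters are tracked, and the real measurement_t may compute its summaries differently.

```cpp
#include <cstdint>

// Hypothetical stand-in for one sample; not the definition in cputrace.h.
struct demo_sample {
  std::uint64_t swi = 0;  // context switches during one measured run
  std::uint64_t cyc = 0;  // clock cycles during one measured run
};

// Hypothetical stand-in for the aggregate; not the real measurement_t.
struct demo_measurement {
  std::uint64_t sample_count = 0;
  std::uint64_t sum_swi = 0;
  std::uint64_t sum_cyc = 0;
  std::uint64_t zero_swi_count = 0;
  std::uint64_t non_zero_swi_count = 0;

  // Fold one sample into the running totals.
  void sample(const demo_sample& s) {
    ++sample_count;
    sum_swi += s.swi;
    sum_cyc += s.cyc;
    if (s.swi == 0) {
      ++zero_swi_count;
    } else {
      ++non_zero_swi_count;
    }
  }

  // Mean cycles per sample; 0 when nothing has been recorded yet.
  double average_cyc() const {
    return sample_count
               ? static_cast<double>(sum_cyc) / sample_count
               : 0.0;
  }
};
```

Tracking the zero / non-zero split next to the average makes it easier to notice when a few runs that were interrupted by a context switch account for most of the total.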
measurement_t can also export results in two formats:
Ceph Formatter (for structured JSON/YAML/XML output):
ceph::Formatter* jf;
m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
String stream (for plain-text logging):
std::stringstream ss;
m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
std::cout << ss.str();
This makes it easy to either integrate measurements into Ceph’s structured output pipeline or dump them as human-readable text for debugging.
RAII samples
It is usually most convenient to use RAII to collect samples. With RAII, measurement begins automatically when the guard object is created and ends when it goes out of scope, so no explicit start/stop calls are required.
The hardware context (HW_ctx) must be initialized once before creating
guards. After initialization, the same context can be reused across multiple
measurements.
HW_guard takes two arguments:
HW_ctx* ctx : Pointer to the initialized hardware context.
measurement_t* m : Pointer to the measurement object where results will be stored.
Example:
// variable to hold measurement results
static measurement_t my_code_time;
{
HW_guard guard(&hw, &my_code_time);
// code to be measured
// ...
}
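The RAII pattern itself can be shown without any Ceph headers. In the sketch below, std::chrono::steady_clock stands in for the hardware counters, and demo_timing and demo_guard are hypothetical names; HW_guard is assumed to follow the same construct-then-destruct shape but is not reproduced here.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator; real code would use measurement_t.
struct demo_timing {
  std::uint64_t sample_count = 0;
  std::uint64_t sum_ns = 0;
};

// Hypothetical guard: takes a reading at construction and records the
// difference at destruction. A steady clock replaces the perf counters.
class demo_guard {
 public:
  explicit demo_guard(demo_timing* out)
      : out_(out), start_(std::chrono::steady_clock::now()) {}

  ~demo_guard() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    out_->sum_ns += static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)
            .count());
    ++out_->sample_count;
  }

  // A guard owns one in-flight measurement, so copying is disabled.
  demo_guard(const demo_guard&) = delete;
  demo_guard& operator=(const demo_guard&) = delete;

 private:
  demo_timing* out_;
  std::chrono::steady_clock::time_point start_;
};
```

Because the sample is recorded in the destructor, every exit path from the scope, including early returns and exceptions, contributes exactly one sample.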
Named measurements
Code regions can be measured using a named guard.
Each HW_named_guard automatically starts measurement at construction and stops when leaving scope.
{
HW_named_guard guard("function", &hw); // name the object so it lives until the end of the scope
// my code starts
// ...
// my code ends
}
This example records the execution cost of the enclosed code under the name "function".
The guard requires a pointer to a previously initialized HW_ctx.
This context must be created and set up (e.g., during program initialization)
before guards can be used.
Named guards provide a simple and consistent way to track performance metrics.
To later access the collected measurements for a given name, use:
measurement_t* m = get_named_measurement("function");
if (m) {
// inspect m->sum_cyc, m->sum_ins.
// m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC);
}
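How CpuTrace stores named measurements internally is not described here. As an illustration only, a name-to-measurement lookup could be sketched as follows; demo_registry, demo_named_measurement, get_or_create and find are hypothetical names, and the locking strategy is an assumption.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical per-name record; real code would hold a measurement_t.
struct demo_named_measurement {
  std::uint64_t sample_count = 0;
};

// Hypothetical registry mapping a name to its measurement.
class demo_registry {
 public:
  // Return the entry for `name`, creating it on first use.
  demo_named_measurement* get_or_create(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    return &entries_[name];
  }

  // Return the entry for `name`, or nullptr if none was ever created.
  demo_named_measurement* find(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(name);
    return it == entries_.end() ? nullptr : &it->second;
  }

 private:
  std::mutex mutex_;
  // std::map is node-based, so pointers to existing entries stay valid
  // when other names are inserted later.
  std::map<std::string, demo_named_measurement> entries_;
};
```

The lock in this sketch protects only the map. Updates made through a returned pointer are not synchronized, so concurrent writers would need their own protection.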
Admin socket integration
In addition to direct instrumentation in code, CpuTrace can also be controlled at runtime via the admin socket interface. This allows developers to start, stop, and inspect profiling in running Ceph daemons without rebuilding or restarting them.
To profile a function, annotate it with the provided macros:
HWProfileFunctionF(profile, __func__,
HW_PROFILE_CYC | HW_PROFILE_CMISS |
HW_PROFILE_INS | HW_PROFILE_BMISS |
HW_PROFILE_SWI);
profile is a local variable name for the profiler object and only needs to be unique within the profiling scope.
__func__ (or any string you pass as the name) is the unique anchor name for this profiling scope.
Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure.
This macro automatically attaches a profiler to the function scope and collects the specified hardware counters each time the function executes.
You can combine any of the available flags:
HW_PROFILE_CYC – CPU cycles
HW_PROFILE_CMISS – Cache misses
HW_PROFILE_BMISS – Branch mispredictions
HW_PROFILE_INS – Instructions retired
HW_PROFILE_SWI – Context switches
Available commands:
cputrace start – Start profiling with the configured groups/counters
cputrace stop – Stop profiling and freeze results
cputrace dump – Dump all collected metrics (as JSON or plain text)
cputrace reset – Reset all captured data
Profiling counters are cumulative. cputrace stop pauses profiling without resetting values. cputrace start resumes accumulation. Use cputrace reset to clear all collected metrics.
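The cumulative behaviour described above can be modelled with a few lines of state. This is a sketch of the semantics only, not of the admin socket handler: demo_profiler_state and its members are hypothetical, and leaving the active flag untouched on reset is an assumption.

```cpp
#include <cstdint>

// Hypothetical model of the start / stop / reset semantics.
class demo_profiler_state {
 public:
  void start() { active_ = true; }   // resume accumulation
  void stop() { active_ = false; }   // pause, keeping current totals
  void reset() {                     // clear totals; active flag unchanged
    total_cyc_ = 0;
    sample_count_ = 0;
  }

  // Called by instrumented code; ignored while profiling is stopped.
  void record(std::uint64_t cyc) {
    if (!active_) {
      return;
    }
    total_cyc_ += cyc;
    ++sample_count_;
  }

  std::uint64_t total_cyc() const { return total_cyc_; }
  std::uint64_t sample_count() const { return sample_count_; }

 private:
  bool active_ = false;
  std::uint64_t total_cyc_ = 0;
  std::uint64_t sample_count_ = 0;
};
```

In this model, a stop followed by a start continues from the earlier totals, and only reset returns them to zero.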
Example usage from the command line:
# Start profiling on OSD.0
ceph tell osd.0 cputrace start
# Stop profiling
ceph tell osd.0 cputrace stop
# Dump results
ceph tell osd.0 cputrace dump
# Reset counters
ceph tell osd.0 cputrace reset
These commands can be repeated multiple times: developers typically
start before a workload, stop afterwards, and then dump the results
to analyze them.
cputrace dump supports optional arguments to filter by logger or counter,
so only a subset of metrics can be reported when needed.
cputrace reset clears all data, preparing for a fresh round of profiling.
API Reference
Enums
enum cputrace_flags {
HW_PROFILE_SWI = (1ULL << 0), // Context switches
HW_PROFILE_CYC = (1ULL << 1), // CPU cycles
HW_PROFILE_CMISS = (1ULL << 2), // Cache misses
HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions
HW_PROFILE_INS = (1ULL << 4), // Instructions retired
};
The bitwise | operator may be used to combine these flags.
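As a small illustration of combining and testing these flags, the sketch below copies the enum so that it compiles on its own. The helpers has_flag and count_selected are hypothetical and are not part of cputrace.h.

```cpp
#include <cstdint>

// Copy of the enum shown above, repeated so the sketch is self-contained.
enum cputrace_flags {
  HW_PROFILE_SWI = (1ULL << 0),    // Context switches
  HW_PROFILE_CYC = (1ULL << 1),    // CPU cycles
  HW_PROFILE_CMISS = (1ULL << 2),  // Cache misses
  HW_PROFILE_BMISS = (1ULL << 3),  // Branch mispredictions
  HW_PROFILE_INS = (1ULL << 4),    // Instructions retired
};

// Hypothetical helper: true if `flag` is set in `mask`.
inline bool has_flag(std::uint64_t mask, cputrace_flags flag) {
  return (mask & static_cast<std::uint64_t>(flag)) != 0;
}

// Hypothetical helper: number of counters selected by `mask`.
inline int count_selected(std::uint64_t mask) {
  int n = 0;
  for (; mask != 0; mask &= mask - 1) {  // clears the lowest set bit
    ++n;
  }
  return n;
}
```

Note that applying | to two plain enumerators yields an integer rather than a cputrace_flags value. Whether the real header overloads operator| or accepts an integer mask is not shown in this document, so a cast may be needed in some contexts.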
Data structures
sample_t – holds a single hardware counter snapshot.
struct sample_t {
uint64_t swi; //context switches
uint64_t cyc; //clock cycles
uint64_t cmiss; //cache misses
uint64_t bmiss; //branch misses
uint64_t ins; //instructions
};
measurement_t – accumulates multiple samples and computes totals/averages and other
useful metrics.
struct measurement_t {
uint64_t call_count = 0;
uint64_t sample_count = 0;
uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0;
uint64_t non_zero_swi_count = 0;
uint64_t zero_swi_count = 0;
};
HW_ctx – encapsulates perf-event file descriptors for one measurement context.
extern HW_ctx HW_ctx_empty;
Low-level API
void HW_init(HW_ctx* ctx, cputrace_flags flags) – initialize perf counters.
void HW_read(HW_ctx* ctx, sample_t* out) – read current counter values.
void HW_clean(HW_ctx* ctx) – release perf counters.