Prometheus Module
The Manager prometheus module implements a Prometheus exporter to expose
Ceph performance counters from the collection point in the Manager. The
Manager receives MMgrReport messages from all MgrClient processes
(including mons and OSDs) with performance counter schema data and counter
data, and maintains a circular buffer of the latest samples. This module
listens on an HTTP endpoint and retrieves the latest sample of every counter
when scraped. The HTTP path and query parameters are ignored. All extant
counters for all reporting entities are returned in the Prometheus exposition
format. (See the Prometheus documentation.)
Enabling Prometheus output
Enable the prometheus module by running the following command:
ceph mgr module enable prometheus
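Once enabled, you can confirm that metrics are being served. A quick check, assuming you run it on the host with the active Manager and the default port:

curl -s http://localhost:9283/metrics | head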
Configuration
Note
The prometheus Manager module must be restarted to apply configuration changes.
- server_addr
  The IPv4 or IPv6 address on which the module listens for HTTP requests
  - type: str
  - default: ::
- server_port
  The port on which the module listens for HTTP requests
  - type: int
  - default: 9283
- scrape_interval
  - type: float
  - default: 15.0
- cache
  - type: bool
  - default: true
- stale_cache_strategy
  - type: str
  - default: log
- rbd_stats_pools
  - type: str
  - default: <empty string>
- rbd_stats_pools_refresh_interval
  - type: int
  - default: 300
- standby_behaviour
  - type: str
  - default: default
- standby_error_status_code
  - type: int
  - default: 500
  - allowed range: [400, 599]
- exclude_perf_counters
  Gathering perf counters from a single Prometheus exporter can degrade ceph-mgr performance, especially in large clusters. Instead, ceph-exporter daemons are now used by default for perf-counter gathering. This should only be disabled when no ceph-exporter daemons are deployed.
  - type: bool
  - default: true
- healthcheck_history_max_entries
  Maximum number of health check history entries to keep
  - type: int
  - default: 1000
By default the module will accept HTTP requests on port 9283 on all IPv4
and IPv6 addresses on the host. The port and listen address are
configurable with ceph config set, with keys
mgr/prometheus/server_addr and mgr/prometheus/server_port. This port
is registered with Prometheus’s registry.
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
Warning
The mgr/prometheus/scrape_interval of this module should match
Prometheus’ scrape interval to work properly.
The scrape interval in the module is used for caching purposes and to determine when a cache is stale.
A scrape interval below 10 seconds is not recommended. The recommended interval is 15 seconds, though in some cases it may be useful to increase it.
To set a different scrape interval in the Prometheus module, set
scrape_interval to the desired value:
ceph config set mgr mgr/prometheus/scrape_interval 20
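If you change the module's scrape interval, adjust the matching job in your Prometheus configuration accordingly. A minimal sketch of such a job in prometheus.yml (the job name and target are illustrative):

scrape_configs:
  - job_name: 'ceph'
    scrape_interval: 20s
    honor_labels: true
    static_configs:
      - targets: ['localhost:9283']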
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph Manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured mgr/prometheus/scrape_interval.
If that is the case, a warning is logged and the module responds in one of two configurable ways: by default it returns a 503 HTTP status code (service unavailable), or it can instead return the content of the cache even though it might be stale. You can choose between these behaviors with the ceph config set command.
To configure the module to respond with possibly stale data, set
the cache strategy to return:
ceph config set mgr mgr/prometheus/stale_cache_strategy return
To configure the module to respond with “service unavailable”, set it to fail:
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
If you are confident that you don’t require the cache, you can disable it:
ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind a reverse proxy or
load balancer, you can simplify discovery of the active instance by switching
to error-mode:
ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will respond with an HTTP error when requesting /
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with:
ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes range from 400 to 599.
To switch back to the default behaviour, simply set the config key to default:
ceph config set mgr mgr/prometheus/standby_behaviour default
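With error-mode enabled, a reverse proxy or load balancer health check can probe each Manager and route traffic to whichever instance answers with HTTP 200. A quick manual check (the hostname is a placeholder for one of your Manager hosts):

curl -s -o /dev/null -w '%{http_code}\n' http://mgr-host:9283/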
Ceph Health Checks
The Manager prometheus module tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Alertmanager
rules to be configured for specific health check events.
The metrics take the following form:
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0
The module also maintains an in-memory history of health-check states. By default the history retains a maximum of 1000 entries. This limit is configurable via the following runtime option:
mgr/prometheus/healthcheck_history_max_entries: the maximum number of unique health check entries to track in memory (default: 1000).
This setting helps avoid unbounded memory growth in large or long-lived clusters.
The health check history may be retrieved and cleared by running the following commands:
ceph healthcheck history ls [--format {plain|json|json-pretty}]
ceph healthcheck history clear
The ceph healthcheck history ls command provides an overview of the health checks that the cluster has
encountered since the last clear command was issued:
ceph healthcheck history ls
Healthcheck Name    First Seen (UTC)      Last seen (UTC)       Count   Active
OSDMAP_FLAGS        2021/09/16 03:17:47   2021/09/16 22:07:40   2       No
OSD_DOWN            2021/09/17 00:11:59   2021/09/17 00:11:59   1       Yes
PG_DEGRADED         2021/09/17 00:11:59   2021/09/17 00:11:59   1       Yes
3 health check(s) listed
RBD IO statistics
The prometheus module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. Statistics are gathered for all images
in the pools that are specified by the mgr/prometheus/rbd_stats_pools
configuration parameter. The parameter is a comma or space separated list
of pool[/namespace] entries. If the RBD namespace is not specified,
statistics are collected for all namespaces in the pool.
To enable collection of stats for RBD pools named pool1, pool2 and poolN:
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
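Namespaces can be specified per pool. For example, to restrict collection to namespace ns1 in pool1 while gathering all namespaces in pool2 (pool and namespace names are illustrative):

ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1/ns1,pool2"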
A wildcard can be used to indicate all pools or namespaces:
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
The module maintains a list of all available images by scanning the specified
pools and namespaces. The refresh period is
configurable via the mgr/prometheus/rbd_stats_pools_refresh_interval
parameter, which defaults to 300 seconds (5 minutes). The module will
force refresh earlier if it detects statistics from a previously unknown
RBD image.
To set the sync interval to 10 minutes run the following command:
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
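The collected statistics are exported as per-image series labeled with the pool, namespace and image. For example, assuming the write-throughput counter is exported as ceph_rbd_write_bytes, the recent write rate of a single image (names illustrative) can be charted with:

rate(ceph_rbd_write_bytes{pool="pool1", image="image1"}[1m])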
Ceph daemon performance counters metrics
With the introduction of the ceph-exporter daemon, the prometheus module will no longer export Ceph daemon
perf counters as Prometheus metrics by default. However, one may re-enable exporting these metrics by setting
the module option exclude_perf_counters to false:
ceph config set mgr mgr/prometheus/exclude_perf_counters false
Statistic names and labels
These Prometheus stats names are the Ceph native names with
illegal characters ., - and :: translated to _,
and ceph_ prepended.
All daemon statistics have a ceph_daemon label with a value
that identifies the type and ID of the daemon they come from,
for example osd.123.
A given metric may be reported by multiple types of daemon. For example, when
querying OSD RocksDB stats, you may constrain the query with a pattern of the
form ceph_daemon=~'osd.*' so that Monitor RocksDB metrics are excluded.
Cluster statistics (i.e. those global to the Ceph cluster)
have labels appropriate to the entity for which they are reported. For example,
metrics relating to pools have a pool_id label.
Long-running averages that represent Ceph statistic histograms
are represented by paired <name>_sum and <name>_count metrics.
This is similar to how histograms are represented in Prometheus
and they are treated similarly.
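For example, an approximate recent average can be derived by dividing the rates of the paired series. A sketch using the OSD read-latency counters, assuming they are exported with the usual ceph_ prefix:

rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])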
Pool and OSD metadata series
Series are exported to facilitate displaying and querying on certain metadata fields.
Pools have a ceph_pool_metadata metric of the following form:
ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
OSDs have a ceph_osd_metadata metric of the following form:
ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
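Metadata series are handy for grouping and filtering. For example, to count OSDs by device class:

count by (device_class) (ceph_osd_metadata)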
Correlating drive statistics with node_exporter
Ceph cluster Prometheus metrics are used in conjunction
with generic host metrics from the Prometheus node_exporter.
To enable correlation of Ceph OSD statistics with node_exporter’s
drive statistics, Ceph creates series of the following form:
ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
To query drive metrics by OSD ID, use either the and operator or
the * operator in your Prometheus query. All metadata
metrics (like ceph_disk_occupation_human) have the value 1 so that they
combine in a neutral fashion with the PromQL * operator. Using *
allows the use of the group_left and group_right grouping modifiers
so that the results have additional labels from one side of the query.
See the Prometheus documentation for more information about constructing PromQL queries and exploring them interactively via the Prometheus expression browser.
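A sketch of such a query, using * with group_left to attach the ceph_daemon label from the metadata series to the drive write rate (the instance-label caveat discussed below applies here as well):

rate(node_disk_written_bytes_total[30s]) * on (device, instance) group_left (ceph_daemon) ceph_disk_occupation_human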
For example, using the and operator, we can run a query like the following:
rate(node_disk_written_bytes_total[30s]) and
on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Out of the box, the above query will not return any metrics, because the instance
labels of the two metrics do not match: the instance label of
ceph_disk_occupation_human is that of the currently active Manager.
The following sections outline two approaches to remedy this.
Note
If you need to group on the ceph_daemon label instead of device and
instance labels, using ceph_disk_occupation_human may not work reliably.
It is advised that you use ceph_disk_occupation instead.
The difference is that ceph_disk_occupation_human may group several OSDs
into the value of a single ceph_daemon label in cases where multiple OSDs
share a device.
Use label_replace
The label_replace function (see the
label_replace documentation)
can add a label to, or alter a label of, a metric within a query.
To correlate an OSD with its drives’ write rate, a query of the following form can be used:
label_replace(
rate(node_disk_written_bytes_total[30s]),
"exported_instance",
"$1",
"instance",
"(.*):.*"
) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Configuring Prometheus server
honor_labels
To enable Ceph to output properly-labeled data relating to any host,
use the honor_labels setting when adding the Manager endpoints
to your Prometheus configuration.
This allows Ceph to export proper instance labels without Prometheus
overwriting them at ingest. Without this setting, Prometheus applies an instance label
that includes the hostname and port of the endpoint from which each metric is scraped.
Because Ceph clusters have multiple Manager daemons, this results in an
instance label that changes when the active Manager daemon
changes.
If this is undesirable, a custom instance label can be set in the
Prometheus target configuration. You might wish to set it to the hostname
of your first Manager, or something arbitrary like ceph_cluster.
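For example, in a file-based service discovery target list (the hostname is a placeholder):

[
  {
    "targets": [ "mgr-host.mydomain.com:9283" ],
    "labels": { "instance": "ceph_cluster" }
  }
]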
node_exporter hostname labels
Set your instance labels to match what appears in Ceph’s OSD metadata
in the instance field. This is generally the short hostname of the node.
This is necessary only if you want to correlate Ceph stats with host stats, but doing so from the start makes it easier to correlate historical data later.
Example configuration
This example shows a deployment with a Manager and node_exporter placed
on a server named senta04. Note that this requires one
to add an appropriate and unique instance label to each node_exporter target.
This is just an example: there are other ways to configure Prometheus scrape targets and label rewrite rules.
prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'node'
file_sd_configs:
- files:
- node_targets.yml
- job_name: 'ceph'
honor_labels: true
file_sd_configs:
- files:
- ceph_targets.yml
ceph_targets.yml
[
{
"targets": [ "senta04.mydomain.com:9283" ],
"labels": {}
}
]
node_targets.yml
[
{
"targets": [ "senta04.mydomain.com:9100" ],
"labels": {
"instance": "senta04"
}
}
]
Notes
Counters and gauges are exported. Histograms and long-running
averages are currently not exported. It is possible that Ceph’s 2-D histograms
could be reduced to two separate 1-D histograms, and that long-running averages
could be exported as metrics of Prometheus’ Summary type.
Timestamps, as with many exporters, are set by Prometheus at ingest to the Prometheus server’s scrape time. Prometheus expects that it is polling the actual counter process synchronously. It is possible to supply a timestamp along with the stat report, but the Prometheus team strongly advises against this. This would mean that timestamps would be delayed by an unpredictable amount. It is not clear if this would be problematic, but it is worth knowing about.