Troubleshooting
Slow/stuck operations
Sometimes CephFS operations hang. The first step in troubleshooting is to locate the cause. A problem can be present in any of three places:
in the client
in the MDS
in the network that connects the client to the MDS
First, use the procedure in Slow requests (MDS) to determine whether the stuck operations are in the client or in the MDS.
Dump the MDS cache. The contents of the MDS cache will be used to diagnose the nature of the problem. Run the following command to dump the MDS cache:
ceph daemon mds.<name> dump cache /tmp/dump.txt
Note
MDS services that are not controlled by systemd dump the file
dump.txt to the machine that runs the MDS. MDS services that are
controlled by systemd dump the file dump.txt to a tmpfs in the MDS
container. Use nsenter(1) to locate dump.txt or specify another
system-wide path.
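For example, a minimal sketch of using nsenter(1) to read dump.txt from inside a containerized MDS's mount namespace. The daemon name mds.a and the pgrep pattern are illustrative assumptions; adjust them to match your deployment:
# Locate the containerized MDS process from the host, then read the dump
# from its mount namespace. "mds.a" is a placeholder daemon name.
MDS_PID=$(pgrep -f 'ceph-mds.*mds\.a' | head -n 1)
sudo nsenter --target "$MDS_PID" --mount cat /tmp/dump.txt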
If high logging levels have been set on the MDS, dump.txt can be expected
to hold the information needed to diagnose and solve the issue causing the
CephFS operations to hang.
Slow requests (MDS)
List current operations via the admin socket by running the following command from the MDS host:
ceph daemon mds.<name> dump_ops_in_flight
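For example, a quick way to summarize each in-flight operation, assuming jq(1) is installed. The description, age, and type_data.events field names reflect typical dump_ops_in_flight output and should be checked against your release:
ceph daemon mds.<name> dump_ops_in_flight |
  jq '.ops[] | {description, age, last_event: .type_data.events[-1].event}'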
Identify the stuck commands and examine why they are stuck. Usually the last “event” will have been an attempt to gather locks or to send the operation off to the MDS log. If an operation is waiting on the OSDs, fix them.
If operations are stuck on a specific inode, then a client is likely holding capabilities, preventing its use by other clients. This situation can be caused by a client trying to flush dirty data, but it might be caused because you have encountered a bug in the distributed file lock code (the file “capabilities” [“caps”] system) of CephFS.
If you have determined that the commands are stuck because of a bug in the capabilities code, restart the MDS. Restarting the MDS is likely to resolve the problem.
If there are no slow requests reported on the MDS, and there is no indication that clients are misbehaving, then either there is a problem with the client or the client’s requests are not reaching the MDS.
Stuck during recovery
Stuck in up:replay
If your MDS is stuck in the up:replay state, then the journal is probably
very long. The presence of MDS_HEALTH_TRIM cluster warnings can indicate
that the MDS has not yet caught up while trimming its journal. Very large
journals can take hours to process. There is no way to work around this, but
there are things you can do to speed up the process:
Temporarily disable MDS debug logs by reducing MDS debugging to 0. Even
with the default settings, the MDS logs a few messages to memory for dumping in
case a fatal error is encountered. You can turn off all logging by running the
following commands:
ceph config set mds debug_mds 0
ceph config set mds debug_ms 0
ceph config set mds debug_monc 0
Remember that when you set debug_mds, debug_ms, and debug_monc to
0, if the MDS fails then there will be no debugging information that can be
used to determine why fatal errors occurred. If you can calculate when
up:replay will complete, restore these configurations just prior to
entering the next state:
ceph config rm mds debug_mds
ceph config rm mds debug_ms
ceph config rm mds debug_monc
After replay has been expedited, calculate when the MDS will complete the replay. Examine the journal replay status:
$ ceph tell mds.<fs_name>:0 status | jq .replay_status
{
"journal_read_pos": 4195244,
"journal_write_pos": 4195244,
"journal_expire_pos": 4194304,
"num_events": 2,
"num_segments": 2
}
Replay completes when the journal_read_pos reaches the
journal_write_pos. The write position does not change during replay. Track
the progression of the read position to compute the expected time to complete.
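For example, a minimal sketch that samples the read position twice and extrapolates the time remaining. It assumes jq(1) is installed and that rank 0 of the file system is the MDS in replay:
# Sample journal_read_pos 60 seconds apart and extrapolate time remaining.
# If P2 equals P1, replay has made no progress during the sample window.
P1=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
sleep 60
P2=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
W=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_write_pos)
echo "estimated seconds remaining: $(( (W - P2) * 60 / (P2 - P1) ))"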
The MDS emits an MDS_ESTIMATED_REPLAY_TIME warning when the act of replaying
the journal takes more than 30 seconds. The warning message includes an
estimated time to the completion of journal replay:
mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s
Avoiding recovery roadblocks
Do the following when restoring your file system:
Deny all reconnection to clients. Blocklist all existing CephFS sessions, causing all mounts to hang or become unavailable:
ceph config set mds mds_deny_all_reconnect true
Remember to undo this after the MDS becomes active.
Note
This does not prevent new sessions from connecting. To prevent new sessions from connecting to the CephFS, use the refuse_client_session file-system setting.
Extend the MDS heartbeat grace period. Doing this prevents the system from replacing an MDS that becomes “stuck” during an operation. Sometimes recovery of an MDS involves operations that take longer than expected (from the programmer’s perspective). This is more likely when recovery has already taken longer than normal to complete (which, if you are reading this document, is likely the situation you find yourself in). Avoid unnecessary replacement loops by extending the heartbeat grace period with the following command:
ceph config set mds mds_heartbeat_grace 3600
Note
This causes the MDS to continue to send beacons to the monitors even when its internal “heartbeat” mechanism has not been reset (it has not beaten) in one hour. In the past, this was achieved with the mds_beacon_grace monitor setting.
Disable open-file-table prefetch. Under normal circumstances, the MDS prefetches directory contents during recovery as a way of heating up its cache. During a long recovery, the cache is probably already hot and large, which makes this prefetching unnecessary and possibly undesirable. Disable open-file-table prefetching by running the following command:
ceph config set mds mds_oft_prefetch_dirfrags false
Turn off clients. Clients that reconnect to the newly up:active MDS can create new load on the file system just as it is becoming operational. This is often undesirable: maintenance is often necessary before clients are allowed to connect to the file system and before a regular workload is resumed. For example, expediting the trimming of journals may be advisable if recovery took a long time because replay had to read a very large journal.
Client sessions can be refused manually, or by using the refuse_client_session tunable as in the following command:
ceph fs set <fs_name> refuse_client_session true
This command prevents clients from establishing new sessions with the MDS.
Do not tweak max_mds. Modifying the file-system setting max_mds may seem like a good idea during troubleshooting and recovery, but it probably is not: it might further destabilize the cluster. If max_mds must be changed in such circumstances, run the command that changes max_mds with the confirmation flag (--yes-i-really-mean-it).
Turn off async purge threads. The volumes plugin spawns threads that asynchronously purge trashed or deleted subvolumes. During troubleshooting or recovery, these purge threads can be disabled by running the following command:
ceph config set mgr mgr/volumes/pause_purging true
To resume purging, run the following command:
ceph config set mgr mgr/volumes/pause_purging false
Turn off async cloner threads. The volumes plugin spawns threads that asynchronously clone subvolume snapshots. During troubleshooting or recovery, these cloner threads can be disabled by running the following command:
ceph config set mgr mgr/volumes/pause_cloning true
To resume cloning, run the following command:
ceph config set mgr mgr/volumes/pause_cloning false
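Once the MDS is up:active again and maintenance is complete, remember to undo the temporary settings above. A consolidated sketch (<fs_name> is a placeholder; resume purging and cloning with the commands already shown):
ceph config rm mds mds_deny_all_reconnect
ceph config rm mds mds_heartbeat_grace
ceph config rm mds mds_oft_prefetch_dirfrags
ceph fs set <fs_name> refuse_client_session false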
Expediting MDS journal trim
MDS_HEALTH_TRIM warnings indicate that the MDS journal has grown too large.
When the MDS journal has grown too large, use the mds_tick_interval tunable
to shorten the “MDS tick interval”. The “tick” interval drives various upkeep
activities in the MDS, and shortening the interval decreases the size of the
MDS journal by ensuring that it is trimmed more frequently.
Make sure that there is no significant file-system load present when modifying
mds_tick_interval. See
Avoiding recovery roadblocks for ways to reduce
load on the CephFS.
This setting affects only MDSes in the up:active state. The MDS does not
trim its journal during recovery.
Run the following command to modify the mds_tick_interval tunable:
ceph config set mds mds_tick_interval 2
RADOS Health
If part of the CephFS metadata or data pools is unavailable and CephFS is not responding, it could indicate that RADOS itself is unhealthy.
Resolve problems with RADOS before attempting to locate any problems in CephFS. See the RADOS troubleshooting documentation.
The MDS
Run the ceph health command. Any operation that is hung in the MDS is
indicated by the slow requests are blocked message.
Messages that read failing to respond indicate that a client is failing to
respond.
The following list details potential causes of hung operations:
The system is overloaded. The most likely cause of system overload is an active file set that is larger than the MDS cache. If you have extra RAM, increase the mds_cache_memory_limit tunable, as shown in the example after this list. This tunable is discussed in MDS Cache Size; read the MDS Cache Configuration section in full before altering it.
There is an older (misbehaving) client.
There are underlying RADOS issues. See the RADOS troubleshooting documentation.
Otherwise, you have probably discovered a new bug and should report it to the developers!
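For example, an illustrative command that raises the MDS cache limit to 8 GiB; the value shown is an assumption, so choose one that fits the RAM available on your MDS hosts:
# Value is in bytes; 8 GiB here is illustrative.
ceph config set mds mds_cache_memory_limit 8589934592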
ceph-fuse debugging
ceph-fuse is an alternative to the CephFS kernel driver that mounts CephFS file
systems in user space. ceph-fuse supports dump_ops_in_flight: dump in-flight ceph-fuse operations for examination via the client's admin socket.
See the Mount CephFS using FUSE documentation.
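A sketch, assuming the client's admin socket lives in the default /var/run/ceph/ directory; the socket file name shown is illustrative:
sudo ls /var/run/ceph/    # find the ceph-fuse socket, e.g. ceph-client.admin.12345.asok
sudo ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok dump_ops_in_flight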
Debug output
To get more debugging information from ceph-fuse, run it in the foreground
while logging to the console (-d), with client debugging enabled
(--debug-client=20) and prints for each message sent enabled (--debug-ms=1):
ceph-fuse -d --debug-client=20 --debug-ms=1 <mountpoint>
If you suspect a potential monitor issue, enable monitor debugging as well
(--debug-monc=20) by running a command of the following form:
ceph-fuse -d --debug-client=20 --debug-ms=1 --debug-monc=20 <mountpoint>
Kernel mount debugging
The first step in diagnosing and repairing an issue with the kernel client is determining whether the problem is in the kernel client or in the MDS. If the kernel client itself is broken, evidence of its breakage will be in the kernel ring buffer, which can be examined by running the following command:
dmesg
Find the relevant kernel state.
Slow requests
Unfortunately, the kernel client does not provide an admin socket. However, if the kernel on the client has debugfs enabled, then interfaces similar to the admin socket are available.
Find a folder in /sys/kernel/debug/ceph/ with a name like
28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880.
That folder contains files that can be used to diagnose the causes of slow requests. Use cat to see their contents.
These files are described below. The files most useful for diagnosis of slow
requests are the mdsc (current requests to the MDS) and the osdc
(current operations in-flight to OSDs) files.
bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
caps: counts of file “caps” structures in-memory and used
client_options: dumps the options provided to the CephFS mount
dentry_lru: dumps the CephFS dentries currently in-memory
mdsc: dumps current requests to the MDS
mdsmap: dumps the current MDSMap epoch and MDSes
mds_sessions: dumps the current sessions to MDSes
monc: dumps the current maps from the monitor, and any “subscriptions” held
monmap: dumps the current monitor map epoch and monitors
osdc: dumps the current ops in-flight to OSDs (i.e., file data I/O)
osdmap: dumps the current OSDMap epoch, pools, and OSDs
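For example, a minimal sketch that reads the two most useful files; the directory name reuses the example above, and yours will differ:
# debugfs is usually readable only by root.
D=/sys/kernel/debug/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880
sudo cat "$D/mdsc"   # current requests to the MDS
sudo cat "$D/osdc"   # current ops in-flight to OSDs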
If the data pool is in a NEARFULL condition, then the kernel CephFS client
will switch to doing writes synchronously. Synchronous writes are quite slow.
Disconnected+Remounted FS
Because CephFS has a “consistent cache”, the MDS will forcibly evict (and
blocklist) clients from the cluster when the network connection has been
disrupted for a long time. When this happens, the kernel client cannot safely
write back dirty (buffered) data, and this results in data loss. Note,
however, that this behavior is expected and follows POSIX semantics. The
client must be remounted before it can access the file system again. This is
the default behavior, but it can be overridden with the recover_session mount
option. See the “options” section of the mount.ceph man page.
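For example, a sketch of mounting with automatic session recovery; the mount source, mount point, and cephx user name are placeholders:
# recover_session=clean lets the kernel client reconnect automatically
# after an eviction instead of requiring a manual remount.
sudo mount -t ceph :/ /mnt/cephfs -o name=myuser,recover_session=clean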
You are in this situation if the output of dmesg contains something like
the following:
[Fri Aug 15 02:38:10 2025] ceph: mds0 caps stale
[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX:6800 socket closed (con state OPEN)
[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX:6800 session reset
[Fri Aug 15 02:38:28 2025] ceph: mds0 closed our session
[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect start
[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect denied
Mounting
Mount 5 Error
A mount 5 error indicates a lagging MDS server or a crashed MDS server.
Ensure that at least one MDS is up and running and that the cluster is active
+ healthy.
Mount 12 Error
A mount 12 error with a message reading cannot allocate memory indicates a
version mismatch between the Ceph Client version and the Ceph
Storage Cluster version. Check the versions using the following command:
ceph -v
If the Ceph Client is of an older version than the Ceph cluster, upgrade the Client:
sudo apt-get update && sudo apt-get install ceph-common
If this fails to resolve the problem, uninstall, autoclean, and autoremove the
ceph-common package and then reinstall it to ensure that you have the
latest version of it.
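For Debian-based systems, the uninstall-and-reinstall cycle described above might look like this sketch:
sudo apt-get purge ceph-common
sudo apt-get autoclean && sudo apt-get autoremove
sudo apt-get update && sudo apt-get install ceph-common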
Dynamic Debugging
Dynamic debugging for the CephFS kernel driver makes it possible to enable or
disable debug logging at runtime. The kernel driver's logs are written to the
kernel ring buffer and can be examined with the dmesg(1) utility. Debug
logging is disabled by default because enabling it can slow the system and
reduce I/O throughput.
Enable dynamic debug against the CephFS module.
See: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
Note: Running the above script enables debug logging for the CephFS kernel driver, libceph, and the kernel RBD module. To enable debug logging for a specific component (for example, for the CephFS kernel driver), run a command of the following form:
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
To disable debug logging, run a command of the following form:
echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control
In-memory Log Dump
In-memory logs can be dumped by setting
mds_extraordinary_events_dump_interval when
the log level is set to less than 10.
mds_extraordinary_events_dump_interval is the interval in seconds for
dumping the recent in-memory logs when there is an extraordinary event.
Extraordinary events include the following:
Client Eviction
Missed Beacon ACK from the monitors
Missed Internal Heartbeats
In-memory log dump is disabled by default. This prevents production environments from experiencing log file bloat by default.
Run the following two commands in order to enable in-memory log dumping:
ceph config set mds debug_mds <log_level>/<gather_level>
Set log_level to a value less than 10 and gather_level to a value greater than or equal to 10. When those two values have been set, in-memory log dumping is enabled.
ceph config set mds mds_extraordinary_events_dump_interval <seconds>
When in-memory log dumping is enabled, the MDS checks for extraordinary events every mds_extraordinary_events_dump_interval seconds. If any extraordinary event occurs, the MDS dumps the in-memory logs that contain the relevant event details to the Ceph MDS log.
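For example, an illustrative configuration that satisfies both constraints; the specific values are assumptions, not recommendations:
# Log level 1 (< 10) with gather level 20 (>= 10), checked every 60 seconds.
ceph config set mds debug_mds 1/20
ceph config set mds mds_extraordinary_events_dump_interval 60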
Note
When higher log levels are set (log_level greater than or equal to 10),
there is no reason to dump the in-memory logs. A lower gather level
(gather_level less than 10) is insufficient to gather in-memory logs. This
means that a log level greater than or equal to 10 or a gather level less
than 10 in debug_mds prevents in-memory log dumping from being enabled. In
such cases, if there is a failure, reset the value of
mds_extraordinary_events_dump_interval to 0 before retrying with the
commands above.
Disable in-memory log dumping by running the following command:
ceph config set mds mds_extraordinary_events_dump_interval 0
Filesystems Become Inaccessible After an Upgrade
Note
You can avoid operation not permitted errors by running this procedure
before an upgrade. As of May 2023, it seems that operation not permitted
errors of the kind discussed here occur after upgrades after Nautilus
(inclusive).
IF
you have CephFS file systems that have data and metadata pools that were
created by a ceph fs new command (meaning that they were not created
with the defaults)
OR
you have an existing CephFS file system and are upgrading to a new post-Nautilus major version of Ceph
THEN
in order for the documented ceph fs authorize... commands to function as
documented (and to avoid ‘operation not permitted’ errors when doing file I/O
or similar security-related problems for all users except the client.admin
user), you must first run:
ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>
and
ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>
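For example, for a hypothetical file system named myfs whose pools follow the newer naming convention described below:
ceph osd pool application set myfs.meta cephfs metadata myfs
ceph osd pool application set myfs.data cephfs data myfs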
Otherwise, when the OSDs receive a request to read or write data (not the directory info, but file data) they will not know which Ceph file system name to look up. This is true also of pool names, because the ‘defaults’ themselves changed in the major releases, from:
data pool=fsname
metadata pool=fsname_metadata
to:
data pool=fsname.data and
metadata pool=fsname.meta
Any setup that used client.admin for all mounts did not run into this
problem, because the admin key gave blanket permissions.
A temporary fix is to perform mounts as the client.admin user with its
associated key. A less drastic, but only partial, fix is to change the OSD
cap for your user to just caps osd = "allow rw" and to delete tag cephfs
data=....
Disabling the Volumes Plugin
In certain scenarios, the Volumes plugin may need to be disabled to prevent compromise of the rest of the Ceph cluster. For details, see: Disabling Volumes Plugin
Reporting Issues
If you have identified a specific issue, please report it with as much information as possible. The following information is especially important:
Ceph versions installed on client and server
Whether you are using the kernel or fuse client
If you are using the kernel client, what kernel version?
How many clients are in play, doing what kind of workload?
If a system is ‘stuck’, is that affecting all clients or just one?
Any ceph health messages
Any backtraces in the ceph logs from crashes
If you are satisfied that you have found a bug, please file it on the bug tracker. For more general queries, please write to the ceph-users mailing list.