
Improving Ceph OSD start-up behaviour with vmtouch

Created by Christian Kauhaus | Blog

We have a love/hate relationship with Ceph. On one hand, it is probably the best open source distributed storage around. On the other hand, Ceph repeatedly exhibits unexpected behaviour under high load. And you are absolutely right to expect Flying Circus VMs to perform evenly; that is something we keep revisiting regularly. In the following article, I will describe an improvement we have applied to a common pain point: I/O hangs during OSD restarts.

Restarting an OSD (Object Storage Daemon) places additional load on its backing disks. Flying Circus business growth has led to increasing storage I/O demand. While this is generally a good thing, it has brought our main Ceph cluster near its throughput limit several times. Danger ahead: the storage cluster runs fine as long as nothing special happens, but once something unusual does happen, it suddenly goes over the tipping point and performance becomes shaky. We are attacking the problem of insufficient headroom from several sides at once. The first is to upgrade hardware continuously so that the storage cluster keeps up with demand. Additionally, it is a good idea to defuse situations which are likely to turn a highly loaded cluster into an overloaded one. I will focus on OSD restarts here; other conditions, like all VMs producing I/O load spikes at the same time, are worth another article.

Problem anatomy

OSD restarts are so critical because, in my opinion, there is a weakness in Ceph's design. Let me explain. A Ceph cluster has a number of Object Storage Daemons. OSDs manage access to replicated on-disk objects. Disk objects are pooled in Placement Groups (PGs) that share the same distribution and replication properties (if this is all confusing to you, Ceph's architecture overview is a good read). When an OSD goes down, it gets marked "out" by the cluster monitors. I/O requests are then serviced by other OSDs that maintain copies of the affected PGs. When the OSD restarts, it gets marked "in" again. Now it needs to check all PGs for updates it may have missed during its downtime.

Here comes the problem: a newly started OSD hits the disk hard to check PGs for staleness and recover from missed updates, while at the same time receiving client requests because it has already been marked "in". If there is little headroom in I/O throughput to begin with, the disks cannot keep up and performance starts to drop.

There have even been reports on the mailing list that, in extreme cases, this effect can bring a whole cluster down. This happens when an OSD becomes so unresponsive that it gets marked "out" by the cluster monitors despite the fact that it is still running. Once client requests are routed elsewhere, the OSD starts to respond again and gets marked "in". Note that other, formerly unaffected OSDs also kick off increased disk activity as part of the recovery process. If a cluster is near its I/O limit, this makes other OSDs unresponsive as well. The result is a cluster in which OSDs are repeatedly flapping between "in" and "out" states. A cluster in such a state is unable to service client requests and may not even recover without administrator intervention. Luckily, we have never experienced a situation like this.

But even in its weaker form, I/O hangs on OSD restarts affect VM performance and should at least be reduced, if not avoided. What can we do about this? A Ceph developer confirmed that the design should be fixed. But is there anything we can do to reduce the impact in the meantime? The obvious measure is to make sure the storage cluster always has an I/O throughput reserve. Another vector is to reduce I/O contention during the critical phase when a freshly started OSD is simultaneously checking PGs and receiving client traffic.

Reducing disk load during OSD starts

OSDs show predictable disk access patterns during start-up. We can use knowledge about these access patterns to read files into the kernel page cache in advance, so they do not have to be fetched from disk during the critical phase. This means fewer seeks and better performance. We have examined OSD behaviour with Brendan Gregg's opensnoop; the trace below shows a typical OSD start-up.
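It was captured roughly along these lines (reconstructed from the output header below; treat the exact invocation as an assumption):

# opensnoop from Brendan Gregg's perf-tools traces open() system calls;
# the trailing argument restricts the output to filenames containing the
# given string, in our case the OSD's data directory name.
./opensnoop ceph-7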

Tracing open()s issued for filenames containing "ceph-7".
osd     0x4 /srv/ceph/osd/ceph-7/magic
osd     0x4 /srv/ceph/osd/ceph-7/whoami
osd     0x4 /srv/ceph/osd/ceph-7/ceph_fsid
osd     0x4 /srv/ceph/osd/ceph-7/fsid
osd     0xb /srv/ceph/osd/ceph-7/fsid
osd     0xb /srv/ceph/osd/ceph-7/fsid
osd     0xc /srv/ceph/osd/ceph-7/store_version
osd     0xc /srv/ceph/osd/ceph-7/superblock
osd     0xc /srv/ceph/osd/ceph-7
osd     0xd /srv/ceph/osd/ceph-7/fiemap_test
osd     0xd /srv/ceph/osd/ceph-7/xattr_test
osd     0xd /srv/ceph/osd/ceph-7/current
osd     0xe /srv/ceph/osd/ceph-7/current/commit_op_seq
osd     0xf /srv/ceph/osd/ceph-7/current/omap/LOG
osd    0x10 /srv/ceph/osd/ceph-7/current/omap/LOCK
osd    0x11 /srv/ceph/osd/ceph-7/current/omap/CURRENT
osd    0x11 /srv/ceph/osd/ceph-7/current/omap/MANIFEST-040577

At first, the OSD reads various metadata files. Nothing special here. But later on, the trace becomes more interesting:

osd    0x13 /srv/ceph/osd/ceph-7/current/omap/040588.ldb
osd    0x13 /srv/ceph/osd/ceph-7/current/omap/040580.ldb
osd    0x13 /srv/ceph/osd/ceph-7/current/omap/040225.ldb
osd    0x16 /srv/ceph/osd/ceph-7/current/410.cf_head
osd    0x16 /srv/ceph/osd/ceph-7/current/482.b9_head
osd    0x16 /srv/ceph/osd/ceph-7/current/482.90_head

Our OSD first opens files belonging to the Object Map database (omap/*.ldb). The Object Map keeps track of which storage objects are located where in the cluster. Afterwards, the named objects are opened (*_head) and reconciled with other replicas. This pattern repeats for a while: first omap files are opened, then the corresponding objects:

osd    0x16 /srv/ceph/osd/ceph-7/current/omap/040228.ldb
osd    0x16 /srv/ceph/osd/ceph-7/current/410.1e8_head
osd    0x16 /srv/ceph/osd/ceph-7/current/482.122_head
osd    0x16 /srv/ceph/osd/ceph-7/current/410.1ea_head
[...]
osd    0x16 /srv/ceph/osd/ceph-7/current/omap/040229.ldb
osd    0x16 /srv/ceph/osd/ceph-7/current/410.266_head
osd    0x16 /srv/ceph/osd/ceph-7/current/410.395_head
osd    0x16 /srv/ceph/osd/ceph-7/current/610.41_head
[...]
osd    0x16 /srv/ceph/osd/ceph-7/current/omap/040230.ldb
osd    0x16 /srv/ceph/osd/ceph-7/current/410.26b_head
osd    0x16 /srv/ceph/osd/ceph-7/current/410.31c_head
osd    0x16 /srv/ceph/osd/ceph-7/current/482.e1_head
[...]

We cannot predict which objects need to be read, but we know that all omap database files need to be opened sooner or later. Enter vmtouch. This little utility reads files into the kernel cache and locks them in memory, so that subsequent I/O operations are served from the cache. This is exactly what we need here: we lock all omap database files into memory before starting an OSD (a sketch of how this can be wired up follows after the trace below). Now the access pattern looks like this:

vmtouch  0x6 /srv/ceph/osd/ceph-7/current/omap/040424.ldb
vmtouch  0x7 /srv/ceph/osd/ceph-7/current/omap/040558.ldb
vmtouch  0x9 /srv/ceph/osd/ceph-7/current/omap/040559.ldb
vmtouch  0xb /srv/ceph/osd/ceph-7/current/omap/040120.ldb
vmtouch  0xc /srv/ceph/osd/ceph-7/current/omap/040102.ldb
[...]
osd      0x16 /srv/ceph/osd/ceph-7/current/410.cf_head
osd      0x16 /srv/ceph/osd/ceph-7/current/482.b9_head
osd      0x16 /srv/ceph/osd/ceph-7/current/482.90_head
osd      0x16 /srv/ceph/osd/ceph-7/current/482.9e_head
osd      0x16 /srv/ceph/osd/ceph-7/current/410.f_head
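As promised above, the prefetch itself boils down to a small wrapper around the OSD start. A minimal sketch, assuming the paths from the traces above and a generic init command (our actual tooling differs in detail):

#!/bin/sh
# Sketch of an OSD start wrapper with omap prefetching.
OSD_DIR=/srv/ceph/osd/ceph-7
# -l locks the pages of all files below the given directory into memory
# with mlock(2), faulting them into the page cache; -d puts vmtouch into
# the background so the lock persists while the OSD starts.
vmtouch -d -l "$OSD_DIR/current/omap"
# Then start the OSD as usual; adjust to your init system and Ceph release.
/etc/init.d/ceph start osd.7

Killing the background vmtouch process later on releases the lock, so the memory can be reclaimed once the critical phase is over.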

Disk seeks are reduced once the OSD has been marked "in". But is the effect large enough to make a difference?

Measurements

A good indicator is how many PGs are in the so-called "peering" phase, and for how long. Peering is an internal Ceph state for PGs that are in the process of being reconciled and have not yet determined how much recovery is needed. Client requests are temporarily not serviced for those PGs. Ideally, the peering state should pass so quickly that it is hardly noticeable. But if a cluster is overloaded, peering states persist and requests are stalled. To verify that our vmtouch trick reduces peering time, I stressed a lab cluster with artificial load slightly below its I/O throughput limit. Then I turned an OSD off, waited for a minute so that it missed a certain amount of writes, and turned it back on. From the logs, I plotted the number of PGs in peering state over time. The impact score is the integral under that function, i.e. the sum of the number of peering PGs multiplied by the time they spend peering (a small sketch of this computation follows after the results below).

In the first experiment an OSD is restarted the normal way (without vmtouch). For more than 30 seconds, quite a number of PGs are in peering state and cause the storage cluster to appear slow. In the second experiment, I use vmtouch to prefetch the omap database files.

While peering states have not vanished completely, we see greatly improved behaviour. Peering starts a bit earlier (omap files are already loaded) and the impact score is more than an order of magnitude smaller. I think this is quite an impressive result.
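As for the impact score computation promised above: if the number of peering PGs is sampled at regular intervals, the integral reduces to a simple sum. Here is a small sketch of how such a time series could be collected and evaluated; the sampling command, the one-second interval, and the file format are assumptions, not our exact tooling:

# Sample the number of peering PGs once per second (stop with Ctrl-C).
while sleep 1; do
    echo "$(date +%s) $(ceph pg dump pgs_brief 2>/dev/null | grep -c peering)"
done > samples.txt

# Impact score = integral of the peering PG count over time.
# samples.txt holds one "unix_timestamp peering_pg_count" pair per line.
awk 'NR > 1 { score += prev * ($1 - t) } { t = $1; prev = $2 }
    END { print "impact score:", score }' samples.txt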

Conclusion

Preloading files along a known access pattern helps quite a bit. While this is not the complete solution to making our Ceph cluster more robust, it eases a common pain point for us. In the long run, we would like to see Ceph's design improved so that the underlying cause goes away.
