We have been using Ceph since the 0.7x releases back in 2013. We started when we were fed up with the open source iSCSI implementations and wanted to provide our customers with a more elastic, manageable, and scalable solution. In terms of functionality, Ceph has generally fulfilled its promises. However, if you have been following this blog or have searched Google for Ceph troubles, you have likely seen our previous posts.
Aside from early software stability issues, we had to invest a good amount of manpower (and nerves) into learning how to make Ceph perform acceptably and how all the pieces fit together: hard drives, SSDs, RAID controllers, 1 and 10 Gbit networking, CPU and RAM consumption, Ceph configuration, QEMU drivers, …
Today, I’d like to present what we have learned, from both a technical and a methodological point of view. The methodological aspects in particular should be read as a retrospective on running a production cluster for a comparatively long time, through version upgrades, hardware changes, and so on. Even if you won’t be bitten by the specific issues of the 0.7x series, the methods may prove useful for steering clear of troubled waters in the future. No promises, though. 🙂