Tag Archives: sysadmin

Ceph performance learnings (long read)

We have been using Ceph since the 0.7x releases back in 2013. We started when we were fed up with the open source iSCSI implementations and wanted to provide our customers with a more elastic, manageable, and scalable solution. From a functionality perspective, Ceph has generally fulfilled its promises. However, if you have been following this blog or have searched Google for Ceph troubles, you will likely have seen our previous posts.

Aside from early software stability issues, we had to invest a good amount of manpower (and nerves) into learning how to make Ceph perform acceptably and how all the pieces fit together: hard drives, SSDs, RAID controllers, 1 and 10 Gbit networks, CPU and RAM consumption, Ceph configuration, Qemu drivers, …

Today, I’d like to present our learnings from both a technical and a methodical point of view. The methodical aspects in particular should be read as a retrospective on running a production cluster for a comparatively long time, going through version upgrades, hardware changes, and so on. Even if you won’t be bitten by the specific issues of the 0.7x series, the methods may prove useful for staying out of troublesome waters. No promises, though. 🙂


Thoughts on systems management methods

Reading “Why Order Matters: Turing Equivalence in Automated Systems Administration” (by Steve Traugott and Lance Brown) 15 years ago was a career-changing moment for me. In this blog post, I will explore what some of the points made in that article mean for today’s data center infrastructures. I will also give a bit of background on what motivated our recent move to NixOS.


Automatically deleting things — safely and reliably

Managing “stuff” automatically is awesome. Getting rid of “stuff” automatically is even more awesome — but also a lot harder: there be dragons.

After shooting ourselves in the foot in the past, we came up with a system that we feel confident using and maintaining.

We developed a phased approach that splits risky and complex deletion workflows into separate steps, starting with tasks that can be reverted easily and then progressing towards increasing impact until reaching the point of no return.
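
To make the idea of phases a bit more concrete, here is a minimal Python sketch of such a workflow. It is not the system described in the full post: the names `Phase` and `DeletionWorkflow`, and the example phases "disable", "archive", and "purge", are hypothetical and only illustrate the ordering rule (revertible steps first, irreversible steps last).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    """One step of a deletion workflow (hypothetical structure)."""
    name: str
    apply: Callable[[], None]
    # A phase without a revert function marks the point of no return.
    revert: Optional[Callable[[], None]] = None


@dataclass
class DeletionWorkflow:
    phases: List[Phase]
    done: int = 0  # number of phases applied so far

    def __post_init__(self) -> None:
        # Enforce the ordering: every revertible phase must come
        # before the first irreversible one.
        seen_irreversible = False
        for phase in self.phases:
            if phase.revert is None:
                seen_irreversible = True
            elif seen_irreversible:
                raise ValueError(
                    f"revertible phase {phase.name!r} follows an irreversible one"
                )

    def advance(self) -> Optional[str]:
        """Apply the next phase; return its name, or None if finished."""
        if self.done >= len(self.phases):
            return None
        phase = self.phases[self.done]
        phase.apply()
        self.done += 1
        return phase.name

    def can_roll_back(self) -> bool:
        """True as long as no irreversible phase has been applied."""
        return all(p.revert is not None for p in self.phases[: self.done])

    def roll_back(self) -> None:
        """Undo all applied phases in reverse order."""
        if not self.can_roll_back():
            raise RuntimeError("point of no return already passed")
        while self.done > 0:
            self.done -= 1
            self.phases[self.done].revert()


# Example with an in-memory stand-in for a real resource (assumption).
state = {"enabled": True, "archived": False, "data": "payload"}
workflow = DeletionWorkflow([
    Phase("disable", lambda: state.update(enabled=False),
          lambda: state.update(enabled=True)),
    Phase("archive", lambda: state.update(archived=True),
          lambda: state.update(archived=False)),
    Phase("purge", lambda: state.pop("data")),  # no revert: irreversible
])
workflow.advance()    # disable
workflow.advance()    # archive
workflow.roll_back()  # still possible: nothing irreversible has run yet
```

In a real setup each phase would probably run with a waiting period in between, so that a mistake has time to surface while the workflow can still be rolled back.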


Improving periodic data import jobs in 3 steps

Unimposing, less than fashionable, often hacked together without passion: these little periodic data import jobs are nevertheless ubiquitous in any sizable datacenter. They often provide the glue that makes data flow from one system to another. If they break, important stuff may get stuck. It’s time to pay them the attention they deserve.