Effective 2018-03-01, we will change the platform's default log format for managed nginx web servers. The new format logs only truncated IP addresses, which makes it impossible to identify individual users. This change is motivated by recent developments in data protection regulations.
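To illustrate what truncating an IP address can look like, here is a minimal Python sketch. It is not the platform's actual implementation: the function name `truncate_ip` and the prefix lengths (keeping 24 bits for IPv4 and 48 bits for IPv6) are assumptions chosen for the example, since the announcement does not state how many bits are kept.

```python
import ipaddress


def truncate_ip(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero out the host part of an address, keeping only the leading bits.

    The default prefix lengths are illustrative assumptions, not the
    values used by the platform.
    """
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate_ip("203.0.113.77"))           # 203.0.113.0
print(truncate_ip("2001:db8:1234:5678::1"))  # 2001:db8:1234::
```

With these assumed prefixes, every client in the same /24 (or /48) is logged under the same address, so a single log line no longer points to one user.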
We have been using Ceph since the 0.7x releases back in 2013. We started when we were fed up with the open source iSCSI implementations and longed to provide our customers with a more elastic, manageable, and scalable solution. In terms of functionality, Ceph has generally fulfilled its promises. However, if you have been following this blog or have searched for Ceph troubles on Google, you will likely have seen our previous posts.
Aside from early software stability issues, we had to invest a good amount of manpower (and nerves) into learning how to make Ceph perform acceptably and how all the pieces fit together: hard drives, SSDs, RAID controllers, 1 and 10 Gbit networks, CPU and RAM consumption, Ceph configuration, Qemu drivers, …
Today, I’d like to present what we have learned, from both a technical and a methodological point of view. The methodological aspects in particular should be seen in the light of having run a production cluster for a comparatively long time, through version upgrades, hardware changes, and so on. Even if you won’t be bitten by the specific issues of the 0.7x series, the methods may prove useful to avoid navigating into troublesome waters in the future. No promises, though. 🙂
Reading “Why Order Matters: Turing Equivalence in Automated Systems Administration” (by Steve Traugott and Lance Brown) 15 years ago was a career-changing moment for me. In this blog post, I will explore what some of the points made in that article mean for today’s data center infrastructures. I will also give a bit of background on what motivated our recent move to NixOS.
Automatically deleting things — safely and reliably
Managing “stuff” automatically is awesome. Getting rid of “stuff” automatically is even more awesome — but also a lot harder: there be dragons.
After shooting ourselves in the foot in the past, we came up with a system that we feel confident using and maintaining.
We developed a phased approach that splits risky and complex deletion workflows into separate steps. It starts with tasks that can be reverted easily and then progresses through steps of increasing impact until it reaches the point of no return.
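The idea of ordering phases from easily reverted to irreversible can be sketched in a few lines of Python. This is only an illustration under assumptions, not the system described in the post: the `Phase` and `DeletionWorkflow` names are hypothetical, and the convention that a phase without a `revert` callback marks the point of no return is made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    apply: Callable[[], None]
    # Hypothetical convention: no revert callback means "point of no return".
    revert: Optional[Callable[[], None]] = None


@dataclass
class DeletionWorkflow:
    phases: List[Phase]
    done: List[Phase] = field(default_factory=list)

    def advance(self) -> Optional[str]:
        """Run the next phase and return its name, or None when finished."""
        if len(self.done) == len(self.phases):
            return None
        phase = self.phases[len(self.done)]
        phase.apply()
        self.done.append(phase)
        return phase.name

    def rollback(self) -> List[str]:
        """Undo completed phases in reverse order.

        Refuses to do anything once an irreversible phase has run.
        """
        if any(p.revert is None for p in self.done):
            raise RuntimeError("past the point of no return")
        reverted = []
        while self.done:
            phase = self.done.pop()
            phase.revert()
            reverted.append(phase.name)
        return reverted
```

A workflow might, for example, first stop a service (revert: start it again), then detach its storage (revert: reattach), and only in the last phase destroy the data. Calling `advance()` one phase at a time leaves room for a waiting period or a manual check between steps, and `rollback()` stays available until the final phase has run.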
Unimposing, less than fashionable, and often hacked together without passion: these little periodic data import jobs are nevertheless ubiquitous in any sizable datacenter. They often provide the glue that makes data flow from one system to another. If they break, important stuff may get stuck. It’s time to pay them the attention they deserve.