All posts by Christian Kauhaus

About Christian Kauhaus

Christian is a systems engineer working with Flying Circus Internet Operations.

Announcing fc-userscan

NixOS manages dependencies in a very strict way. Sometimes too strict? Here at Flying Circus, many users prefer to compile custom applications in their home directories and link them against libraries they have previously installed with nix-env. This works well… until something is updated! On the next change anywhere down the dependency chain, libraries get new hashes in the Nix store, the garbage collector removes the old versions, and user applications break until they are recompiled.

In this blog post, I would like to introduce fc-userscan. This little tool scans (home) directories recursively for Nix store references and registers them as per-user roots with the garbage collector. This way, dependencies will be protected even if they cease to be referenced from “official” Nix roots like the current-system profile or a user’s local Nix profile. After registering formerly unmanaged references with fc-userscan, one can fearlessly run updates and garbage collection.
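
To make the scanning idea concrete, here is a minimal Python sketch. It is not how fc-userscan itself is implemented; it only illustrates the principle of walking a directory tree and collecting anything that looks like a Nix store reference. The function name scan_store_refs and the exact pattern are assumptions made for this illustration, and registering the results as garbage collector roots is left out.

```python
import os
import re

# Assumption: a store reference looks like /nix/store/<32-char hash>-<name>,
# where the hash uses Nix's base32 alphabet (no e, o, t, u).
STORE_REF = re.compile(rb"/nix/store/[0-9a-df-np-sv-z]{32}-[0-9A-Za-z+._?=-]+")


def scan_store_refs(top):
    """Return the set of Nix store references found in regular files under top."""
    refs = set()
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are scanned.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    data = f.read()  # whole file in memory: fine for a sketch
            except OSError:
                continue  # unreadable file: ignore and move on
            refs.update(m.decode("ascii") for m in STORE_REF.findall(data))
    return refs


if __name__ == "__main__":
    for ref in sorted(scan_store_refs(os.path.expanduser("~"))):
        print(ref)
```

Reading files as bytes matters here: compiled binaries carry store paths in their RPATH and similar fields, so a text-only scan would miss exactly the references that break after garbage collection.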

How to renew Puppet CA and server certificates in place

It used to run fine for years… but now the (deprecated) Puppet infrastructure at the Flying Circus is showing signs of aging. It’s not about server hardware or anything like that (everything is fully virtualized anyway) – it’s about the certificates of Puppet’s own SSL infrastructure. Time for a face lift.

In the following, I will describe what we did to renew both the CA and the Puppet server certificates. Although this problem should occur on every Puppet server that has been running for a long time, I found remarkably few resources on the net (that did not involve completely replacing the CA), so I’m going to share our findings.

Dirty Cow: Restarting all VMs

All VMs are currently affected by the “Dirty Cow” kernel bug. The upcoming release 2016_034 contains a kernel update which upgrades Linux to the unaffected version 4.4.28. As usual, the kernel update requires rebooting all VMs.

Schedule:

  • Tue 15 through Thu 17 November 2016: reboot staging VMs
  • Thu 17 through Thu 24 November 2016: reboot production VMs

VM reboots will be scheduled within the agreed maintenance windows. We will piggy-back a Qemu binary environment update, which would otherwise require a separate reboot.

Sneak Preview: Upcoming FC Platform and Infrastructure Features

We are planning to implement some cool stuff for the Flying Circus hosting platform and its underlying infrastructure during the second half of this year. In this post, I will give a preview of the technical improvements you can look forward to.

All of these improvements are included in the platform subscription (this is what the platform subscription is actually for!) so you don’t have to pay extra for any of them.

Thoughts on systems management methods

Reading Why Order Matters: Turing Equivalence in Automated Systems Administration (by Steve Traugott and Lance Brown) 15 years ago was a career-changing moment for me. In this blog post, I will explore what some of the points made in that article mean for today’s data center infrastructures. I will also give a bit of background on what motivated our recent move to NixOS.

Improving Ceph OSD start-up behaviour with vmtouch

We have a love/hate relationship with Ceph. On one hand, it is probably the best open source distributed storage around. On the other hand, Ceph repeatedly exhibits unexpected behaviour under high load. You are absolutely right to expect Flying Circus VMs to perform evenly, and that is something we keep revisiting. In the following article, I will describe an improvement we have applied to a common pain point: I/O hangs during OSD restarts.

Restarting an OSD (Object Storage Daemon) places additional load on its backing disks. Flying Circus business growth has led to increasing storage I/O demand. While this is generally a good thing, it has brought our main Ceph cluster near its throughput limit several times. Danger ahead: the storage cluster runs fine as long as nothing special happens, but when something unusual does happen, the cluster suddenly goes past the tipping point and performance becomes shaky.
