Category Archives: Platform

Platform features, security updates, technical stuff

2018-05-29 – 2018-05-31: Major data center upgrade

TL;DR We are consolidating our hardware and racks in the data center and will perform a longer maintenance period over multiple days. We have prepared thoroughly for the migration to avoid any downtime and will use this opportunity to further improve our network.

Over the last few years our data center setup has grown from a few machines in a single rack to three racks that are completely filled with servers, plus additional customer-specific racks in our vicinity.

One of our basic tenets has always been to grow organically to avoid unnecessary waste. Now we have reached the limit of renting individual racks and our next organic step is to move to a separate row (and room!) of multiple consecutive racks. This gives us and you enough room to grow in the future while maintaining tight control over our network structure. It also gives us the chance to nicely clean up some smaller annoyances that have accumulated over the years.

As this maintenance requires us to move all of our machines, we are using the opportunity to review and improve all our technology layers:

  • We are introducing a redundant spine/core into our switching setup and are upgrading to 40G on all backbone connections.
  • We are simplifying our network infrastructure by reducing it to single-vendor components.
  • Our routers are being upgraded to 10 Gbit/s on internal and external interfaces.
  • Our DNS is now more reliable: it runs on the routers and is included in the automatic failover.
  • We improved our VM migration code to better support large migration tasks like moving whole racks around.
  • Our overall resource planning leaves around 40% or more free capacity on each of CPU, RAM and storage.
  • We are keeping a set of spare SSDs and HDDs on hand in case disks experience failures after being powered off and back on.
  • Virtualisation hosts that have not yet been upgraded to 10G storage interfaces will be upgraded at that time.

The maintenance itself will be performed during regular business and evening hours, as all involved components are fully redundant and have been tested recently. We will perform all steps slowly and carefully, leaving enough capacity and time to verify individual steps and reduce the chance of critical mishaps.

Nevertheless, our back office personnel will be monitoring the situation closely and will be able to respond to any issues immediately.

If you have any questions or feedback – let us know through your usual contact channels or by email to support@flyingcircus.io.

Cover photo by Tristan Schmurr, © 2012 CC-BY-2.0

CPU hardware security issues: Meltdown and Spectre (updated 2018-03-08)

Researchers have published serious and widespread security issues relevant for all users of Intel (and other) CPUs, affecting all products from the last decade. The bugs are known as “Meltdown” and “Spectre”. Both bugs have massive implications for the security of all applications, both within an operating system as well as on hosted virtualised platforms like Amazon AWS, Google Compute Engine or the Flying Circus.

The security issues were intended to be under an embargo for another week but a couple of news outlets have already started reporting about them and forced the security researchers to publish the issues earlier than intended.

We’re watching the in-progress security patches as they arrive and will take appropriate measures. We’ll update our customers with more specific information over time but want you to know that we are aware of the issue and its implications.

Update Monday, 2018-01-08

Progress is still being made, and the most relevant security issue (Spectre, Variant 2, CVE-2017-5715) has no patch available yet. Some vendors and distributions are providing undocumented (and not publicly tested) patches that we are refraining from rolling out into our infrastructure. We’re in contact with Qemu and Linux kernel developers who are still working on reliable patches on both levels. We’ll keep you updated.

Update Monday, 2018-01-29

The situation remains complex. We have identified a small Linux kernel change that will ensure proper KVM/Qemu guest/host isolation. However, a number of other patches keep finding their way into this part of the Linux kernel code, Intel is still communicating very unclear messages, and last week Intel retracted (some or all) microcode updates for their CPUs, as did other vendors like Ubuntu and VMware. Intel has announced another update for 2018-01-31, which we will review and consider after waiting for industry feedback on its performance and stability.

We are then planning to roll out an updated Linux kernel on VM hosts and will likely enable the additional countermeasures (like KPTI) on the hosts. To validate that this does not have drastic performance impacts, we are reviewing our baseline system and application performance using the Phoronix Test Suite.

More updates will follow here as the situation develops.

Update Thursday, 2018-03-08

A few weeks ago we reviewed the status of the vanilla kernel fixes for Spectre and Meltdown and decided to update our Gentoo-based Linux 4.9 series hardware and virtual machines with the recent 4.9.85 update. The upstream developers have implemented sufficient mechanisms at this point to selectively enable mitigations depending on hardware support and to balance performance versus security. We started to roll out those updates in yesterday’s release, and VMs and servers will perform the required reboots within regular maintenance windows over the next few days. Our host servers will enable mitigations for all variants (Meltdown and Spectre 1 and 2). Our guest systems will enable mitigations against Spectre 1 and 2 but not Meltdown: due to missing support for PCID, KPTI would have a big performance impact, so we avoid it there.

Due to the extent of mitigations available in the Linux kernel and the rough stability history of Intel’s patches, we have also decided not to apply microcode updates for the Intel CPUs at this point in time.
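
If you want to verify which mitigations are actually active on a given host or VM, recent 4.9.x kernels expose their status under /sys/devices/system/cpu/vulnerabilities. A minimal Python sketch for printing that status – assuming your running kernel already ships this sysfs interface – could look like this:

    # Print the kernel's view of the Meltdown/Spectre mitigation status.
    # Assumes a kernel that exposes /sys/devices/system/cpu/vulnerabilities
    # (present in later 4.9.x stable releases).
    from pathlib import Path

    SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

    def mitigation_status():
        if not SYSFS_DIR.is_dir():
            print("kernel does not expose vulnerability status")
            return
        for entry in sorted(SYSFS_DIR.iterdir()):
            # Each file holds a one-line status such as "Mitigation: PTI",
            # "Vulnerable" or "Not affected".
            print(f"{entry.name}: {entry.read_text().strip()}")

    if __name__ == "__main__":
        mitigation_status()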

Announcing fc-userscan

NixOS manages dependencies in a very strict way—sometimes too strict? Here at Flying Circus, many users prefer to compile custom applications in their home directories. They link them against libraries they have previously installed with nix-env. This works well… until something is updated! On the next change anywhere down the dependency chain, libraries get new hashes in the Nix store, the garbage collector removes the old versions, and user applications break until recompiled.

In this blog post, I would like to introduce fc-userscan. This little tool scans (home) directories recursively for Nix store references and registers them as per-user roots with the garbage collector. This way, dependencies will be protected even if they cease to be referenced from “official” Nix roots like the current-system profile or a user’s local Nix profile. After registering formerly unmanaged references with fc-userscan, one can fearlessly run updates and garbage collection.
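
To illustrate the idea – not the actual fc-userscan implementation, which does considerably more – here is a rough Python sketch of the two steps: scanning files for Nix store references and registering them as indirect garbage collector roots via nix-store. The root directory used below is purely hypothetical:

    # Rough sketch of the fc-userscan idea: find /nix/store references in
    # files under a directory and register them as indirect GC roots so that
    # garbage collection will not delete them. Illustrative only.
    import re
    import subprocess
    from pathlib import Path

    # Nix store paths use a 32-character base32 hash (alphabet without e, o, t, u).
    STORE_RE = re.compile(rb"/nix/store/[0-9a-df-np-sv-z]{32}-[0-9A-Za-z+._?=-]+")
    GCROOT_DIR = Path.home() / ".nix-userscan-roots"  # hypothetical location

    def scan(directory):
        refs = set()
        for path in directory.rglob("*"):
            if path.is_file() and not path.is_symlink():
                try:
                    refs.update(STORE_RE.findall(path.read_bytes()))
                except OSError:
                    pass
        return refs

    def register(refs):
        GCROOT_DIR.mkdir(parents=True, exist_ok=True)
        for i, ref in enumerate(sorted(refs)):
            link = GCROOT_DIR / "root-{}".format(i)
            # --add-root together with --indirect registers the symlink as a GC root.
            subprocess.run(
                ["nix-store", "--add-root", str(link), "--indirect",
                 "--realise", ref.decode()],
                check=False,
            )

    if __name__ == "__main__":
        register(scan(Path.home()))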

Continue reading Announcing fc-userscan

How to renew Puppet CA and server certificates in place

It used to run fine for years… but now the (deprecated) Puppet infrastructure at the Flying Circus is showing signs of aging. It’s not about server hardware or anything like that (it is fully virtualized anyway) – it’s about the SSL certificates of Puppet’s own SSL infrastructure. Time for a face lift.

In the following, I will describe what we did to renew both the CA and Puppet server certificates. Although this problem must eventually occur on every Puppet server that runs for a prolonged amount of time, I found remarkably few resources on the net (that did not involve completely replacing the CA) – so I’m going to share our findings.
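
A quick way to see how much time is left before the certificates bite you is to ask openssl directly. A small sketch – the certificate path below matches older open source Puppet masters and may well differ on your installation:

    # Check whether the Puppet CA certificate expires within the next 90 days.
    # The path is the default ssldir of older open source Puppet masters;
    # adjust it for your setup.
    import subprocess

    CA_CERT = "/var/lib/puppet/ssl/ca/ca_crt.pem"  # assumption: default ssldir
    NINETY_DAYS = 90 * 24 * 3600

    def check_expiry(cert_path):
        # "openssl x509 -checkend N" exits non-zero if the certificate
        # expires within the next N seconds.
        expiring = subprocess.run(
            ["openssl", "x509", "-checkend", str(NINETY_DAYS),
             "-noout", "-in", cert_path],
        ).returncode != 0
        enddate = subprocess.run(
            ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
            capture_output=True, text=True,
        ).stdout.strip()
        print("{}: {} ({})".format(
            cert_path, enddate, "expires soon!" if expiring else "ok"))

    if __name__ == "__main__":
        check_expiry(CA_CERT)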

Continue reading How to renew Puppet CA and server certificates in place

Release 2017_010 with many security updates

During the last weeks we have prepared a larger update for our Gentoo-based VMs. It includes updates to many basic libraries as well as added support for Python 3.5 and 3.6. A detailed list of affected packages and changes can be found as usual in our ChangeLog. Please review the list of updated packages for libraries and tools that may have been compiled into your applications. While we have tried to avoid link-level compatibility issues, a small chance remains that applications will not start afterwards due to dynamic linkage problems. Recompiling usually solves this kind of problem.
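
If you want to check in advance whether one of your self-compiled binaries is affected, running ldd against it and looking for unresolved libraries is usually enough. A small sketch (the binaries to check are passed as arguments; nothing here is specific to our platform):

    # Report shared libraries that a binary can no longer resolve.
    # "not found" entries usually mean the application has to be recompiled
    # or relinked against the updated libraries.
    import subprocess
    import sys

    def missing_libs(binary):
        out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
        return [line.strip() for line in out.splitlines() if "not found" in line]

    if __name__ == "__main__":
        for binary in sys.argv[1:]:
            for line in missing_libs(binary):
                print("{}: {}".format(binary, line))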

As the update is a bit bulky, we opted for a staged roll-out during the week. Each VM will get an individually scheduled maintenance slot according to the agreed pre-announcement period. Development and staging VMs will already receive their updates this weekend.

Please feel free to contact our support if you need assistance.