All posts by Theuni

2018-08-30 – Moving our data center laboratory

Last week we moved into our new office building as planned and are now located at Leipziger Str. 70/71 in Halle (Saale).

Next week we’ll also move our data center laboratory to the new premises. The lab consists of two racks that fully model our production data center environment for development and quality assurance. The lab also runs a couple of secondary services (e.g. CI/CD servers, our chat, a couple of selected monitoring scripts, …) whose absence will not have an impact on the production data center applications.

During the move on Thursday (2018-08-30) we’ll shut down the lab during the morning and rebuild and power everything up in the new location during the afternoon.

The projected impact on the secondary services during the move will be:

  • Deployments through our central build server (builds.flyingcircus.io) won’t be possible.
  • Configuration changes to NixOS VMs will be possible but may require substantial local package compilation.
  • Our support will be available by email and phone but not via chat.
  • Our support will not have access to central telemetry and our central telemetry will not record data for this period.
  • A couple of “canary checks” on the infrastructure level will be offline and HTTP checks defined through our customer portal will not be executed. External availability checks, regular VM checks and the status pages will be available as usual.

If you have any questions don’t hesitate to contact us through the usual channels.

2018-05-29 – 2018-05-31: Major data center upgrade

TL;DR We are consolidating our hardware and racks in the data center and will perform an extended maintenance period spanning multiple days. We have prepared thoroughly for the migration to avoid any downtime and will use this opportunity to further improve our network.

Over the last years our data center setup has grown from a few machines in a single rack to three racks completely filled with servers, plus additional customer-specific racks in our vicinity.

One of our basic tenets has always been to grow organically to avoid unnecessary waste. Now we have reached the limit of renting individual racks and our next organic step is to move to a separate row (and room!) of multiple consecutive racks. This gives us and you enough room to grow in the future while maintaining tight control over our network structure. It also gives us the chance to nicely clean up some smaller annoyances that have accumulated over the years.

As this maintenance requires us to move all of our machines, we are using the opportunity to review and improve all of our technology layers:

  • We are introducing a redundant spine/core into our switching setup and are upgrading all backbone connections to 40G.
  • We are simplifying our network infrastructure by reducing it to single-vendor components.
  • Our routers are being upgraded to 10 Gbit/s on internal and external interfaces.
  • Our DNS becomes more reliable by running on the routers and being included in their automatic failover.
  • We improved our VM migration code to better support large migration tasks like moving whole racks around.
  • Our overall resource usage leaves around 40% or more free capacity across CPU, RAM, and storage.
  • We are keeping a set of spare SSDs and HDDs on hand in case any disks fail to come back up after the servers have been powered off and on again.
  • Virtualisation hosts that have not yet been upgraded to 10G storage interfaces will be upgraded at that time.

The maintenance itself will be performed during regular business and evening hours, as all involved components are fully redundant and have been tested recently. We will perform all steps slowly and carefully, leaving enough capacity and time to verify individual steps and reduce the chance of critical mishaps.

Nevertheless, our back office personnel will be monitoring the situation closely and will be able to respond to any issues immediately.

If you have any questions or feedback – let us know through your usual contact channels or by email to support@flyingcircus.io.

Cover photo by Tristan Schmurr, © 2012 CC-BY-2.0

Retiring our Gentoo platform – Sundown until September 2018

Over the last years we have moved our managed service offerings from a Gentoo-based Linux system over to a distribution called NixOS.

For almost two years now this has been the platform of choice for new projects, and even within existing projects we started to add NixOS VMs where possible. We have also migrated some projects fully or partially to NixOS where newer components were required.

Today, it’s time to start saying goodbye to our old Gentoo platform. Of course, we won’t leave anyone behind who is still using Gentoo-based VMs. Here’s our schedule for the coming months and its impact on customers using the Gentoo platform:

Announced (immediately):
  • No further feature development
  • No major updates

Sundown period (May 2018 to August 2018):
  • No new VMs
  • Security updates only
  • Migration to NixOS VMs depending on individual agreements

Grace period (from September 2018):
  • No further security updates
  • Remaining VMs will stay online.

End of Life (September 2019):
  • Remaining Gentoo VMs will be shut down.

Note: Customers already using the NixOS platform will not be affected by this.

I’m still using Gentoo-based VMs. What do I do now?

If you’re a customer with a support contract in the “Guided” or “Managed” service classes, we’ll approach you directly and discuss how to move your remaining Gentoo VMs to NixOS.

If you’re a customer in the “Hosted” service class, we recommend contacting our support team to discuss setting up new VMs and migrating your services over. We’ll help you with any information and coordination that you might need, but you’ll be responsible for migrating your data and services to the new machines.

And lastly, rest assured that we won’t shut off any remaining Gentoo VMs for at least another 18 months. However, as the old platform will not receive further updates and as there will be a hard limit in September 2019, we advise you to take the time and move to the new platform as early as possible.

How do I know which VMs are still using Gentoo?

You can look at the VMs of your projects on my.flyingcircus.io. Select a project (“More details”) and then choose “Manage” on the box titled “Virtual Machines”. You’ll see a listing like the one in the screenshot below. The VMs carry different labels: if a VM has the label “Puppet”, it is still running on Gentoo; if it has the label “NixOS”, it is already running on the NixOS platform.

[Screenshot: VM listing on my.flyingcircus.io with “Puppet” and “NixOS” platform labels, 2018-03-15]
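
For larger projects it can be handy to check this programmatically. The following minimal Python sketch applies the same label rule; the VM list and its fields are hypothetical placeholders (my.flyingcircus.io does not expose this exact structure), so adapt it to however you export your listing:

    # Minimal sketch: tell Gentoo-based VMs apart from NixOS-based ones
    # by their platform label, as shown in the portal listing.
    # The input data is a hypothetical example, not an actual API response.

    vms = [
        {"name": "example00", "label": "Puppet"},  # "Puppet" label => still Gentoo
        {"name": "example01", "label": "NixOS"},   # "NixOS" label => already migrated
    ]

    def still_on_gentoo(vm: dict) -> bool:
        # A VM labelled "Puppet" is still running on the Gentoo platform.
        return vm["label"] == "Puppet"

    for vm in vms:
        status = "Gentoo (needs migration)" if still_on_gentoo(vm) else "NixOS"
        print(f"{vm['name']}: {status}")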

Why are we moving to NixOS?

A big part of our service is that we want to have as few “breaking” updates as possible – after all, we want to deliver small and continuous updates. When we started out with our Gentoo-based platform more than 10 years ago, we envisioned that we would profit from Gentoo’s rolling nature.

However, with rising complexity, Gentoo has shown conceptual issues that have hindered us from efficiently managing the balance between stability and progress.

NixOS has been around for a while but wasn’t ready for us until around 2015, when we started to investigate alternatives to Gentoo. Since then we’ve achieved great improvements to our service that would have been impossible on our old platform. Because of that, we decided it’s time to make the transition for everyone.

Aside from the larger motivation, there are also a number of direct benefits for you when moving to our NixOS based platform:

  • VMs now run a 64-bit kernel, which provides better performance for many languages (Python, Java, …) and allows larger RAM allocations to be used effectively.
  • Service users can install custom (Nix) packages and versions without requiring pre-defined roles from our platform and still have them monitored within our security update tools.
  • Improved logging (Graylog), monitoring (Sensu), and telemetry (Telegraf/Prometheus/Grafana) services that offer more flexibility and allow more direct interaction without needing our personal assistance. (Even though we’re always happy to help!)
  • Overall a newer set of versions for many components like nginx (HTTP 2!), MySQL, PostgreSQL, Python, PHP, …
  • A better release process that is much more robust and more flexible in providing you with early releases of customizations.
  • Faster installation of changes, updates, rollback capability, and local versioning of all configuration.

If you’d like to know more about NixOS and its benefits, we recommend talking to us or visiting the NixOS homepage. Just as Gentoo has been a comparatively “exotic” Linux distribution, we know that NixOS may look even more so. However, our documentation has been extended with a NixOS-specific area that will help you discover the parts that are relevant for you to interact with. In every other respect: it’s a Linux environment that will run your applications well, and we hope that you’ll enjoy the platform that we’ve built on it.

If you have questions …

As always: if you have any questions or comments, let us know by sending an email to support@flyingcircus.io and we will follow up quickly.

CPU hardware security issues: Meltdown and Spectre (updated 2018-03-08)

Researchers have published serious and widespread security issues relevant to all users of Intel (and other) CPUs, affecting all products from the last decade. The bugs are known as “Meltdown” and “Spectre”. Both bugs have massive implications for the security of all applications, both within an operating system and on hosted virtualised platforms like Amazon AWS, Google Compute Engine, or the Flying Circus.

The security issues were intended to be under an embargo for another week but a couple of news outlets have already started reporting about them and forced the security researchers to publish the issues earlier than intended.

We’re watching the in-progress security patches as they arrive and will take appropriate measures. We’ll update our customers with more specific information over time but want you to know that we are aware of the issue and its implications.

Update Monday, 2018-01-08

Progress is still being made, and the most relevant security issue (Spectre, Variant 2, CVE-2017-5715) has no patch available yet. Some vendors and distributions are providing undocumented (and not publicly tested) patches that we are refraining from rolling out into our infrastructure. We’re in contact with Qemu and Linux kernel developers, who are still working on reliable patches at both levels. We’ll keep you updated.

Update Monday, 2018-01-29

The situation remains complex. We have identified a small Linux kernel change that will ensure proper KVM/Qemu guest/host isolation. However, a number of other patches keep finding their way into this part of the Linux kernel code, Intel is still communicating very unclear messages, and last week it retracted (some or all) µCode updates for its CPUs, as did other vendors like Ubuntu, VMware, etc. Intel has announced another update for 2018-01-31, which we will review and consider after waiting for industry feedback on performance and stability.

We are then planning to roll out an updated Linux kernel on the VM hosts and will likely enable the additional countermeasures (like KPTI) there. To validate that this does not have drastic performance impacts, we are reviewing our baseline system and application performance using the Phoronix Test Suite.

More updates will follow here as the situation develops.

Update Thursday, 2018-03-08

A few weeks ago we reviewed the status of the vanilla kernel fixes for Spectre and Meltdown and decided to update our Gentoo-based Linux 4.9 series hardware and virtual machines to the recent 4.9.85 release. The upstream developers have by now implemented sufficient machinery to selectively enable mitigations depending on hardware support and to balance performance versus security. We started to roll out those updates in yesterday’s release, and VMs and servers will perform the required reboots within regular maintenance windows over the next days. Our host servers will enable mitigations for all variants (Meltdown and Spectre 1 and 2). Our guest systems will enable mitigations against Spectre 1 and 2 but not Meltdown, due to missing PCID support; this avoids KPTI, which would have a big performance impact.
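
If you want to verify which mitigations the running kernel reports as active on a given machine, kernels that carry these patches (to our knowledge this includes the 4.9.x updates mentioned above) expose the state under /sys/devices/system/cpu/vulnerabilities. A minimal Python sketch:

    # Minimal sketch: print the mitigation status the running kernel reports
    # for Meltdown and the two Spectre variants. Kernels without the
    # mitigation patches simply do not provide these files.
    from pathlib import Path

    SYSFS = Path("/sys/devices/system/cpu/vulnerabilities")

    for variant in ("meltdown", "spectre_v1", "spectre_v2"):
        node = SYSFS / variant
        if node.exists():
            print(f"{variant}: {node.read_text().strip()}")
        else:
            print(f"{variant}: not reported (kernel without mitigation support)")

On our guest systems you would expect mitigations to be reported for the two Spectre variants while Meltdown remains listed as vulnerable, matching the KPTI trade-off described above.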

Due to the extent of mitigations available in the Linux kernel and the rough stability history of Intel’s patches, we also decided not to apply µCode updates for Intel CPUs at this point in time.

Internal and External Interfaces From an Operations Perspective (long read)

We recently pitched for a project that explicitly asked for a team with experience dealing with a system that has many interfaces. I sat down and wrote an overview of our experience and what we learned about internal and external interfaces over time – and I’d like to share this in today’s post.

Continue reading Internal and External Interfaces From an Operations Perspective (long read)

How we are improving our infrastructure for better reliability and scalability

In 2016 we saw substantial growth in our infrastructure after we changed our pricing model to make resources more affordable and introduced NixOS-based 64-bit virtual machines. Unfortunately, under this growth we failed to adapt our infrastructure in good time and thus, ultimately, failed to provide our customers with the reliability and performance that they rightfully expect.

We have taken the time to review all the incidents we experienced this year and decided to invest further in two critical areas of our infrastructure: networking and storage.

Networking

Earlier this year, after a long period of research, we discovered Brocade’s VDX offering. Combined with their flexible subscription option, we replaced our transitional 10 Gigabit infrastructure with Brocade VDX 6740T switches that have been working fantastically for about two months now. However, we are currently still running a 1 GE HP ProCurve-based infrastructure for some of our VLANs. This situation currently has multiple drawbacks:

  • we do not yet have full active-active redundancy for all servers and all networks
  • we still have a mixed-vendor environment that has its own risks (as we have seen with our outages earlier this year)
  • we do not benefit from the reliability features that Brocade offers (like persistent logging, full meshing)
  • we require too many cables per server and have a much too complicated switch configuration

Our plan for the next weeks and months is to:

  • provide two redundant 10GE Brocade switches per rack, so that every server gets redundant access to the network
  • remove existing 1GE connections
  • move all our networks to tagged VLAN configurations

For you as a customer, this will be visible as a faster network in all VMs with a much, much lower risk of incidents (due to component failure or operator error) than before.

Storage

Our storage has not been able to provide the performance and resilience that we want to deliver to you. It specifically struggled to keep up with our increased load. We have learned multiple things that resulted in our new roadmap:

  • For Ceph to show its strength in horizontal scaling we not only need many disks, we also need more servers to reduce the impact of an individual server failure.
  • Filling our existing HDD pool based on storage capacity alone has dramatically lowered the IOPS available per customer. We need to provide a lot more IOPS, manage them more systematically, and communicate more transparently what can be expected.
  • With the advent of high-capacity, medium-endurance SSD technology we are finally at a point where HDDs are turning from the mainstream default choice into a niche solution for low-performance, high-capacity tasks like test environments, archives, backups, etc.

Our next steps for our storage cluster are:

  • Add more capacity and IOPS to our existing cluster by extending the HDD pool with SSDs.
  • Grow our cluster from 6 storage hosts to 10 to further improve available IOPS and reduce the impact of a host outage.
  • Move from n+1 redundancy to n+2 redundancy to allow for more complex failure scenarios.
  • Introduce stricter IOPS management for VMs and communicate the rules around it. We have already started to defensively add limits to counter the worst impact. At the moment we are discussing options that will allow you to choose from different storage classes with different performance characteristics. Very likely we will tie the absolute IOPS limit of each VM to the total storage size it uses, which reflects the physical reality of how IOPS are added to a cluster (see the sketch after this list).
  • Update our Ceph installation to the next long-term supported version “Jewel” which will have much more controlled IOPS behaviour to avoid negative impact from maintenance operations on customer traffic.
  • Implement more fine-grained and much stricter capacity management in our inventory system that reflects cluster status, effective usage, and booked capacities, to avoid over-subscription which results in undetected loss of redundancy and in performance penalties.
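
To illustrate what binding IOPS limits to storage size could look like, here is a minimal sketch. The constants (a base allowance, a per-GiB rate, and a cap) are made-up placeholder values for illustration, not our actual policy:

    # Illustration only: derive a per-VM IOPS limit from the total disk
    # size the VM uses. All constants are hypothetical placeholder values.

    BASE_IOPS = 100       # assumed minimum every VM would get
    IOPS_PER_GIB = 0.5    # assumed additional IOPS per GiB of storage
    MAX_IOPS = 3000       # assumed hard cap per VM

    def iops_limit(disk_gib: int) -> int:
        # Return the IOPS limit for a VM with the given total disk size.
        return int(min(MAX_IOPS, BASE_IOPS + IOPS_PER_GIB * disk_gib))

    for size_gib in (10, 100, 500, 2000):
        print(f"{size_gib:5d} GiB -> {iops_limit(size_gib):4d} IOPS")

The point of such a rule is that a VM booking more storage also implicitly books more of the spindles and SSDs that physically provide the IOPS.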

Those measures will be implemented in the next days, weeks, and months and ultimately lead to a more robust, more reliable, and more scalable infrastructure.

We will announce individual steps that require maintenance on our status page. We are also interested in hearing your feedback, specifically when it comes to your performance and capacity requirements. Let us know by sending an email to support@flyingcircus.io.

More consistent virtual disk performance through IO limits

With the next release of our Gentoo platform we will re-regulate disk speed for all virtual machines. The new rules ensure that the total performance available in our cluster is distributed more evenly and that load spikes of individual VMs do not excessively degrade the performance of other VMs.

In the past we made the full performance of our cluster available to all VMs on demand, without limits. This allowed a very large number of operations per second during load spikes. In a mixed system with many VMs these effects often balance each other out. In recent months, however, it happened more and more frequently that several machines showed extreme load spikes at the same time and no balancing could take place, so that other customers’ systems were affected.

We are therefore introducing a limit on the number of read/write operations per second (IOPS) so that we can offer all customers satisfactory performance even in high-load situations. Continue reading More consistent virtual disk performance through IO limits
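
Qemu, which we use as our virtualisation layer, supports per-drive I/O throttling, for example through the iops= option on -drive. The following sketch merely assembles such a drive argument in Python as an illustration of the mechanism; it is not necessarily the exact way our platform applies the new limits:

    # Illustration only: Qemu can throttle a drive's combined read/write
    # operations per second via the iops= option. This sketch just builds
    # the corresponding -drive argument; all values are placeholders.

    def qemu_drive_arg(image_path: str, iops_limit: int) -> str:
        # iops= caps read+write operations per second for this drive;
        # Qemu delays further requests once the budget is exhausted.
        return f"file={image_path},format=raw,if=virtio,iops={iops_limit}"

    print("-drive " + qemu_drive_arg("/srv/vm/example.img", 250))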