It smells like Christmas everywhere – the holiday season is almost here.
To ensure that all your applications in the Flying Circus keep running smoothly, we will provide regular support during business hours* and emergency support as usual. We won’t be performing non-critical work during this time and will catch up with any backlog early in January 2017.
Here’s an overview of the next days and our support availability. The highlighted days are national or local holidays and are only covered for SLA customers:
- 2016-12-19 (Monday): regular support
- 2016-12-20 (Tuesday): regular support
- 2016-12-21 (Wednesday): regular support
- 2016-12-22 (Thursday): regular support
- 2016-12-23 (Friday): regular support
- 2016-12-24 (Saturday): SLA-covered emergency support only
- 2016-12-25 (Sunday): SLA-covered emergency support only
- 2016-12-26 (Monday): SLA-covered emergency support only
- 2016-12-27 (Tuesday): regular support
- 2016-12-28 (Wednesday): regular support
- 2016-12-29 (Thursday): regular support
- 2016-12-30 (Friday): regular support
- 2016-12-31 (Saturday): SLA-covered emergency support only
- 2017-01-01 (Sunday): SLA-covered emergency support only
- 2017-01-02 (Monday): regular support
- 2017-01-03 (Tuesday): regular support
- 2017-01-04 (Wednesday): regular support
- 2017-01-05 (Thursday): regular support
- 2017-01-06 (Friday): SLA-covered emergency support only
Happy holidays to everybody and see you in 2017!
* business hours = Mo-Fr, 8-16 CE(S)T
In 2016 we saw substantial growth in our infrastructure after we changed our pricing model to make resources more affordable and introduced NixOS-based 64-bit virtual machines. Unfortunately, during this growth we failed to adapt our infrastructure in good time and thus, ultimately, failed to provide our customers with the reliability and performance that they rightfully expect.
We have taken the time and reviewed all the incidents we experienced this year and decided to further invest in two critical areas of our infrastructure: networking and storage.
Earlier this year, after a long period of research, we discovered Brocade’s VDX offering. Taking advantage of their flexible subscription option, we replaced our transitional 10 Gigabit infrastructure with Brocade VDX 6740T switches, which have been working fantastically for about two months now. However, we are still running a 1 GE HP ProCurve-based infrastructure for some of our VLANs. This situation currently has multiple drawbacks:
- we do not yet have full active-active redundancy for all servers and all networks
- we still have a mixed-vendor environment that has its own risks (as we have seen with our outages earlier this year)
- we do not benefit from the reliability features that Brocade offers (like persistent logging, full meshing)
- we require too many cables per server and have a much too complicated switch configuration
Our plan for the next weeks and months is to:
- provide two redundant 10GE Brocade switches per rack, so that every server gets redundant access to the network
- remove existing 1GE connections
- move all our networks to tagged VLAN configurations
For you as a customer, this will be visible as a faster network in all VMs with a much, much lower risk of incidents (due to component failure or operating errors) than before.
Our storage has not been able to provide the performance and resilience that we want to deliver to you. It specifically struggled to keep up with our increased load. We have learned multiple things that resulted in our new roadmap:
- For Ceph to show its strength in horizontal scaling we not only need many disks, but we also need more servers to reduce the impact of individual server failures.
- Allocating our existing HDD pool based on its storage capacity alone has lowered the available IOPS per customer dramatically. We need to provide a lot more IOPS, manage them more systematically, and communicate more transparently what can be expected.
- With the advent of high-capacity, medium-endurance SSD technology we are finally at a point where HDD technology is turning from the mainstream default into a niche solution for low-performance, high-capacity tasks such as test environments, archives, and backups.
Our next steps for our storage cluster are:
- Add more capacity and IOPS to our existing cluster by extending the HDD pool with SSDs.
- Grow our cluster from 6 storage hosts to 10 to further improve available IOPS and reduce host outage impact.
- Move from n+1 redundancy to n+2 redundancy to allow for more complex failure scenarios.
- Introduce stricter IOPS management for VMs and communicate the rules around it. We have already started to add limits defensively to counter the worst impact. At the moment we are discussing options that will allow you to choose from different storage classes with different performance characteristics. Very likely we will base the absolute IOPS limit of each VM on the total storage size that it uses. This reflects the physical reality of how IOPS can be added to a cluster.
- Update our Ceph installation to the next long-term supported version “Jewel”, which will have much more controlled IOPS behaviour to avoid negative impact from maintenance operations on customer traffic.
- Implement a more fine-grained and much more strict capacity management in our inventory system that reflects cluster status, effective usage and booked capacities to avoid over-subscription which results in undetected reduction of redundancy and performance penalties.
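To illustrate the idea of tying a VM's IOPS limit to its storage size, here is a minimal sketch. It is not our actual implementation: the storage class names, the IOPS-per-GiB ratios, and the floor and ceiling values are all hypothetical and chosen only to make the arithmetic visible.

```python
# Hypothetical storage classes mapped to assumed IOPS granted per GiB of
# booked storage. The real classes and ratios are still under discussion.
STORAGE_CLASSES = {"hdd": 2, "ssd": 20}


def iops_limit(storage_gib, storage_class="hdd", floor=100, ceiling=20000):
    """Derive an absolute IOPS limit for a VM from its total storage size.

    The limit grows linearly with storage size, but never drops below
    `floor` (so tiny VMs stay usable) and never exceeds `ceiling`.
    """
    if storage_gib < 0:
        raise ValueError("storage size must not be negative")
    per_gib = STORAGE_CLASSES[storage_class]
    return max(floor, min(ceiling, int(storage_gib * per_gib)))


if __name__ == "__main__":
    for size, cls in [(10, "hdd"), (100, "hdd"), (100, "ssd"), (5000, "ssd")]:
        print(f"{size:>5} GiB on {cls}: {iops_limit(size, cls)} IOPS")
```

Scaling the limit with booked storage mirrors how IOPS are added to the cluster in practice: more disks bring both more capacity and more IOPS.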
Those measures will be implemented in the next days, weeks, and months and ultimately lead to a more robust, more reliable, and more scalable infrastructure.
We will announce individual steps that require maintenance on our status page. We are also interested in hearing your feedback, specifically when it comes to your requirements of performance and capacity. Let us know by sending an email to firstname.lastname@example.org.
All VMs are currently affected by the “Dirty Cow” kernel bug. The upcoming release 2016_034 contains a kernel update which upgrades Linux to the unaffected version 4.4.28. As usual, the kernel update requires rebooting all VMs.
- Tue 15 through Thu 17 November 2016: reboot staging VMs
- Thu 17 through Thu 24 November 2016: reboot production VMs
VM reboots will be scheduled within the agreed maintenance windows. We will piggy-back a Qemu binary environment update, which would otherwise require a separate reboot.
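If you want to confirm after the reboot that a VM is running the updated kernel, a small check along the following lines may help. This is only a sketch: it compares the running kernel's version number against 4.4.28 and says nothing about the patch status of other kernel series. The function names are made up for this example.

```python
import platform
import re

# Kernel version shipped with release 2016_034, as announced above.
UPDATED_KERNEL = (4, 4, 28)


def kernel_version(release):
    """Extract (major, minor, patch) from a release string like '4.4.28-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def runs_updated_kernel(release=None):
    """Return True if the given (or currently running) kernel is 4.4.28 or newer."""
    if release is None:
        release = platform.release()
    return kernel_version(release) >= UPDATED_KERNEL


if __name__ == "__main__":
    state = "updated" if runs_updated_kernel() else "still pending a reboot"
    print(f"Kernel {platform.release()} is {state}.")
```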
With the next release of our Gentoo platform we will change how disk speed is regulated for all virtual machines. The new rules ensure that the performance available across our cluster is distributed more evenly and that load peaks of individual VMs do not excessively degrade the performance of other VMs.
In the past we made the full performance of our cluster available to all VMs on demand and without limits. This allowed a very large number of operations per second during load peaks. In a mixed system with many VMs these effects often balance out overall. In recent months, however, it has happened more frequently that several machines showed extreme load peaks at the same time and no balancing could take place, so that other customers' systems were affected.
We are therefore introducing a limit on the number of read/write operations per second (IOPS) so that we can offer all customers satisfactory performance even in high-load situations.
Continue reading Gleichmäßigere Performance virtueller Festplatten durch IO-Limits
With the upcoming release of our Gentoo platform we will start to regulate the disk performance of all virtual machines. The new rules will help to achieve more uniform performance in our cluster and reduce the impact of load peaks from individual VMs on others.
In the past we have provided the entire performance capacity of our cluster based on demand without any regulation. Load peaks requiring many IOPS (input/output operations per second) could thus be processed quickly. In a mixed environment this usually evens out. However, we are seeing more and more periods when too many VMs have load peaks at the same time. Instead of evening out, those periods result in performance penalties for the remaining virtual machines.
We are therefore introducing a limit on the number of operations per second, to achieve satisfactory performance for all customers even during periods of increased load.
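The effect of such a limit can be shown with a deliberately simplified model. The sketch below assumes that an overloaded cluster serves all VMs proportionally to their demand; real storage behaviour is more complex, and the capacity, demand, and limit numbers are invented for illustration.

```python
def served_iops(demands, capacity, per_vm_limit=None):
    """Return the IOPS each VM actually gets under a simplified sharing model.

    demands: IOPS each VM would like to perform.
    capacity: total IOPS the cluster can deliver.
    per_vm_limit: optional cap applied to every VM before sharing.
    """
    if per_vm_limit is not None:
        demands = [min(demand, per_vm_limit) for demand in demands]
    total = sum(demands)
    if total <= capacity:
        # Enough headroom: everyone gets what they ask for.
        return list(demands)
    # Overload: assume all VMs are slowed down by the same factor.
    scale = capacity / total
    return [demand * scale for demand in demands]


if __name__ == "__main__":
    # Two VMs with simultaneous load peaks and two modest VMs.
    demands = [9000, 9000, 500, 500]
    print("unlimited:", served_iops(demands, capacity=10000))
    print("limited:  ", served_iops(demands, capacity=10000, per_vm_limit=4000))
```

In this model the two modest VMs lose almost half of their throughput when the peaks are unlimited, while a per-VM limit keeps the total demand below the cluster capacity and leaves them unaffected.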
Continue reading Introducing IO limits to achieve more uniform virtual disk performance
Next week our Autumn 2016 Sprint starts and we are really looking forward to welcoming our guests. We are in the midst of preparations and hope the weather plays along. All details about the sprint can be found on Meetup. Interesting topics are on the agenda, such as backy, batou, NixOS, and more – there is an Etherpad to gather them.
If you want to contribute but can’t make it in person, think about joining us remotely. Just let us know in advance (send a short message to email@example.com or poke us on Twitter @flyingcircusio).
Back in May I introduced you to the development of vulnix, a tool initially built to find out whether a system is (or might be) affected by a security vulnerability. It does this by matching a derivation’s name against the product and version specified in the CPE language of the so-called CVEs (Common Vulnerabilities and Exposures). In the meantime we introduced the tool to the community at the Berlin NixOS Meetup and got some wonderful input on the directions in which we might extend its features. We sprinted for the next two days to improve the code quality and broaden the feature set.
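The matching idea can be sketched in a few lines. This is not vulnix's actual code: the function names are made up for this example, the derivation name is split with a simple heuristic (the version starts at the first hyphen followed by a digit), and only exact matches on CPE 2.2 style URIs of the form `cpe:/part:vendor:product:version` are considered.

```python
import re


def split_derivation_name(name):
    """Split 'openssl-1.0.2j' into ('openssl', '1.0.2j').

    Heuristic: the version starts after the first hyphen that is
    followed by a digit. Names without a version yield (name, None).
    """
    match = re.match(r"^(?P<product>.+?)-(?P<version>\d.*)$", name)
    if match is None:
        return name, None
    return match.group("product"), match.group("version")


def parse_cpe_uri(uri):
    """Return (product, version) from a CPE 2.2 URI such as
    'cpe:/a:openssl:openssl:1.0.2j'. Missing fields become ''."""
    prefix = "cpe:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a CPE URI: {uri!r}")
    fields = uri[len(prefix):].split(":")
    fields += [""] * (4 - len(fields))
    _part, _vendor, product, version = fields[:4]
    return product, version


def might_be_affected(derivation_name, cpe_uri):
    """True if the derivation's product and version both match the CPE entry."""
    product, version = split_derivation_name(derivation_name)
    cpe_product, cpe_version = parse_cpe_uri(cpe_uri)
    return product == cpe_product and version == cpe_version


if __name__ == "__main__":
    print(might_be_affected("openssl-1.0.2j", "cpe:/a:openssl:openssl:1.0.2j"))
```

Because the match relies on names alone, it can produce both false positives and false negatives, which is why the tool reports that a system "might" be affected.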
Continue reading Vulnix v1.0 release