Researchers have published serious and widespread security issues that affect practically all Intel CPUs (and CPUs from other vendors) produced in the last decade. The bugs are known as “Meltdown” and “Spectre”. Both bugs have massive implications for the security of all applications, both within an operating system and on hosted virtualised platforms like Amazon AWS, Google Compute Engine or the Flying Circus.
The security issues were intended to be under an embargo for another week, but a couple of news outlets have already started reporting on them and forced the security researchers to publish the issues earlier than intended.
We’re watching the in-progress security patches as they arrive and will take appropriate measures. We’ll update our customers with more specific information over time but want you to know that we are aware of the issue and its implications.
Update Monday 2018-01-08
Work is still in progress, and the most relevant security issue (Spectre, Variant 2, CVE-2017-5715) has no patch available yet. Some vendors and distributions are providing undocumented (and not publicly tested) patches that we are refraining from rolling out into our infrastructure. We’re in contact with Qemu and Linux kernel developers who are still working on reliable patches on both levels. We’ll keep you updated.
We recently pitched for a project that explicitly asked for a team with experience dealing with a system that has many interfaces. I sat down and wrote an overview of our experience and what we learned about internal and external interfaces over time – and I’d like to share this in today’s post.
Continue reading Internal and External Interfaces From an Operations Perspective (long read)
In 2016 we saw substantial growth in our infrastructure after we changed our pricing model to make resources more affordable and introduced NixOS-based 64-bit virtual machines. Unfortunately, we failed to adapt our infrastructure to this growth in good time and thus, ultimately, failed to provide our customers with the reliability and performance that they rightfully expect.
We have taken the time and reviewed all the incidents we experienced this year and decided to further invest in two critical areas of our infrastructure: networking and storage.
Earlier this year, after a long period of research, we discovered Brocade’s VDX offering. Taking advantage of their flexible subscription option, we replaced our transitional 10 Gigabit infrastructure with Brocade VDX 6740T switches, which have been working fantastically for about two months now. However, we are currently still running a 1GE HP ProCurve-based infrastructure for some of our VLANs. This situation currently has multiple drawbacks:
- we do not yet have full active-active redundancy for all servers and all networks
- we still have a mixed-vendor environment that has its own risks (as we have seen with our outages earlier this year)
- we do not benefit from the reliability features that Brocade offers (like persistent logging, full meshing)
- we require too many cables per server and have a much too complicated switch configuration
Our plan for the next weeks and months is to:
- provide two redundant 10GE Brocade switches per rack, so that every server gets redundant access to the network
- remove existing 1GE connections
- move all our networks to tagged VLAN configurations
For you as a customer, this will be visible as a faster network in all VMs with a much, much lower risk of incidents (due to component failure or operating errors) than before.
Our storage has not been able to provide the performance and resilience that we want to deliver to you. It specifically struggled to keep up with our increased load. We have learned multiple things that resulted in our new roadmap:
- For Ceph to show its strength in horizontal scaling we not only need many disks, but we also need more servers to reduce the impact of individual server failure.
- Allocating our existing HDD pool purely by storage capacity has dramatically lowered the IOPS available per customer. We need to provide a lot more IOPS, manage them more systematically, and communicate more transparently what can be expected.
- With the advent of high-capacity, medium-endurance SSD technology, HDD technology is turning from the mainstream default choice into a niche solution for low-performance, high-capacity tasks like test environments, archives, backups, etc.
Our next steps for our storage cluster are:
- Add more capacity and IOPS to our existing cluster by extending the HDD pool with SSDs.
- Grow our cluster from 6 storage hosts to 10 to further improve available IOPS and reduce host outage impact.
- Move from n+1 redundancy to n+2 redundancy to allow for more complex failure scenarios.
- Introduce stricter IOPS management for VMs and communicate the rules around it. We have already started to defensively add limits to counter the worst impact. At the moment we are discussing some options that will allow you to choose from different storage classes with different performance characteristics. Very likely we will tie the absolute IOPS limit of each VM to the total storage size that it uses. This reflects the physical reality of how IOPS can be added to a cluster.
- Update our Ceph installation to the next long-term supported version, “Jewel”, which will have much more controlled IOPS behaviour to avoid negative impact from maintenance operations on customer traffic.
- Implement more fine-grained and much stricter capacity management in our inventory system that reflects cluster status, effective usage and booked capacities. This avoids over-subscription, which results in undetected reduction of redundancy and in performance penalties.
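To illustrate the idea of tying a VM’s IOPS limit to its total storage size, here is a minimal sketch. It is not our actual rule set: the storage class names, the IOPS-per-GiB ratios, and the floor and ceiling values are all hypothetical and only serve to show the shape of such a calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    """A storage class with its own performance characteristics.

    All values are hypothetical and chosen for illustration only.
    """
    name: str
    iops_per_gib: float  # IOPS granted per GiB of storage used by the VM
    min_iops: int        # floor, so that very small disks remain usable
    max_iops: int        # ceiling per VM, regardless of disk size


def iops_limit(total_storage_gib: float, storage_class: StorageClass) -> int:
    """Return the absolute IOPS limit for a VM.

    The limit grows linearly with the storage size and is clamped
    to the range [min_iops, max_iops] of the storage class.
    """
    if total_storage_gib < 0:
        raise ValueError("total_storage_gib must not be negative")
    scaled = int(total_storage_gib * storage_class.iops_per_gib)
    return max(storage_class.min_iops, min(scaled, storage_class.max_iops))


# Hypothetical example classes, not real product definitions.
HDD = StorageClass("hdd", iops_per_gib=1.0, min_iops=100, max_iops=1000)
SSD = StorageClass("ssd", iops_per_gib=10.0, min_iops=500, max_iops=20000)

if __name__ == "__main__":
    for size in (50, 300, 5000):
        print(size, "GiB:", iops_limit(size, HDD), "IOPS on", HDD.name)
```

Scaling the limit with the storage size mirrors how IOPS are added to a cluster in practice: more disks bring both more capacity and more operations per second.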
Those measures will be implemented in the next days, weeks, and months and ultimately lead to a more robust, more reliable, and more scalable infrastructure.
We will announce individual steps that require maintenance on our status page. We are also interested in hearing your feedback, specifically when it comes to your requirements of performance and capacity. Let us know by sending an email to email@example.com.
With the next release of our Gentoo platform we will re-regulate the disk speed of all virtual machines. The new rules ensure that the total performance available in our cluster is distributed more evenly and that load peaks of individual VMs do not excessively impair the performance of other VMs.
In the past we made the entire performance of our cluster available to all VMs on demand and without limits. This allowed a very large number of operations per second during load peaks. In a mixed system with many VMs these effects often balance out overall. In recent months, however, it has happened more frequently that several machines showed extreme load peaks at the same time and no balancing could take place, so that the systems of other customers were affected.
We are therefore introducing a limit on the number of read/write operations per second (IOPS) so that we can offer all customers satisfactory performance even in situations with high load. Continue reading Gleichmäßigere Performance virtueller Festplatten durch IO-Limits
With the upcoming release of our Gentoo platform we will start to regulate the disk performance of all virtual machines. The new rules will help to achieve more uniform performance in our cluster and reduce the impact of load peaks from individual VMs on others.
In the past we have provided the entire performance capacity of our cluster based on demand without any regulation. Load peaks requiring many IOPS (input/output operations per second) could thus be processed quickly. In a mixed environment this usually evens out. However, we are seeing more and more periods when too many VMs have load peaks at the same time. Instead of evening out, those periods result in performance penalties for the remaining virtual machines.
We are therefore introducing a limit on the number of operations per second, to achieve satisfactory performance for all customers even during periods of increased load.
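A common way to think about such a limit is a token bucket: each VM may use a sustained number of operations per second plus a small burst allowance. The sketch below only illustrates that general technique and is not a description of how our platform enforces the limit. The class name `IopsLimiter` and its parameters are hypothetical.

```python
import time
from typing import Callable


class IopsLimiter:
    """Token bucket that admits at most `iops_limit` operations per second
    on average, with short bursts of up to `burst` operations."""

    def __init__(
        self,
        iops_limit: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if iops_limit <= 0 or burst <= 0:
            raise ValueError("iops_limit and burst must be positive")
        self.iops_limit = iops_limit
        self.burst = burst
        self._clock = clock
        self._tokens = burst      # start with a full bucket
        self._last = clock()

    def try_acquire(self, ops: int = 1) -> bool:
        """Return True if `ops` operations may proceed right now."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill according to elapsed time, never beyond the burst size.
        self._tokens = min(self.burst, self._tokens + elapsed * self.iops_limit)
        if self._tokens >= ops:
            self._tokens -= ops
            return True
        return False


if __name__ == "__main__":
    limiter = IopsLimiter(iops_limit=10, burst=10)
    admitted = sum(limiter.try_acquire() for _ in range(50))
    print("admitted", admitted, "of 50 immediate requests")
```

With a bucket like this, a VM that stays below its limit never notices it, while a VM with an extreme load peak is slowed down to its sustained rate instead of taking IOPS away from its neighbours.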
Continue reading Introducing IO limits to achieve more uniform virtual disk performance
We have been using Ceph since the 0.7x releases back in 2013. We started when we were fed up with the open source iSCSI implementations and longed to provide our customers with a more elastic, manageable, and scalable solution. Ceph has generally fulfilled its promises from the perspective of functionality. However, if you have been following this blog or searched for Ceph troubles on Google you will likely have seen our previous posts.
Aside from early software stability issues we had to invest a good amount of manpower (and nerves) into learning how to make Ceph perform acceptably and how all the pieces of hard drives, SSDs, RAID controllers, 1- and 10-Gbit networks, CPU and RAM consumption, Ceph configuration, Qemu drivers, … fit together.
Today, I’d like to present what we have learned from both a technical and a methodical point of view. The methodical aspects in particular should be seen as a retrospective on running a production cluster for a comparatively long time by now, going through version upgrades, hardware changes, and so on. Even if you won’t be bitten by the specific issues of the 0.7x series, the methods may prove useful in the future to avoid navigating into troublesome waters. No promises, though. 🙂
Continue reading Ceph performance learnings (long read)
Easter is a big holiday here in Germany and we hope that you and your loved ones will enjoy a few beautiful days of spring. We will also use the time after Easter for a 3-day planning session for the upcoming quarter. Continue reading Support during Easter 2016