As every year in spring, Easter is coming around. In Germany, it brings us some extra holidays.
To ensure that all your applications in the Flying Circus are running smoothly, we will provide regular support during business hours (Monday to Friday, 8-16 CE(S)T) and SLA-covered emergency support as usual.
Continue reading Support during Easter 2017
We recently pitched for a project that explicitly asked for a team with experience dealing with a system that has many interfaces. I sat down and wrote an overview of our experience and what we learned about internal and external interfaces over time – and I’d like to share this in today’s post.
Continue reading Internal and External Interfaces From an Operations Perspective (long read)
The Chemnitzer Linux Tage is an event that needs no big introduction, I guess. Connecting (mostly) German Linux and open source enthusiasts for almost two decades now is a big achievement, and it is a great occasion to catch up and meet people in #rl (real life). As in previous years, we try to keep participating in some form, be it as lecturers, as sponsors, or by holding workshops. This year we decided to introduce Vulnix, a tool for detecting potential vulnerabilities on running systems or Nix-driven projects. I wrote about it here and there.
 Vulnix v1.0 release
 Introducing vulnix – a vulnerability scanner for NixOS
The latest Amazon S3 outage showed me one thing again: more diversity is better.
Diversity is a current topic that includes social issues like women in tech. However, diversity is also important on a technical level. It's known that monocultures are more affected by diseases and other issues. So when half of the internet is using Amazon, a lot goes down if Amazon fails.
Every system will eventually fail. This is true for Amazon as well as for us. The internet is quickly moving away from independent, interconnected nodes towards an oligopoly. Nobody gets fired for using AWS nowadays, and I think that's a problem. We need to embrace independent providers for the good of the internet.
Photo by Andrew Fogg.
It smells like Christmas everywhere: the holiday season is almost here.
To ensure that all your applications in the Flying Circus are running smoothly, we will provide regular support during business hours* and emergency support as usual. We won't be performing non-critical work during this time and will catch up with any backlog early in January 2017.
Here’s an overview of the next days and our support availability. The highlighted days are national or local holidays and are only covered for SLA customers:
- 2016-12-19 (Monday): regular support
- 2016-12-20 (Tuesday): regular support
- 2016-12-21 (Wednesday): regular support
- 2016-12-22 (Thursday): regular support
- 2016-12-23 (Friday): regular support
- 2016-12-24 (Saturday): SLA-covered emergency support only
- 2016-12-25 (Sunday): SLA-covered emergency support only
- 2016-12-26 (Monday): SLA-covered emergency support only
- 2016-12-27 (Tuesday): regular support
- 2016-12-28 (Wednesday): regular support
- 2016-12-29 (Thursday): regular support
- 2016-12-30 (Friday): regular support
- 2016-12-31 (Saturday): SLA-covered emergency support only
- 2017-01-01 (Sunday): SLA-covered emergency support only
- 2017-01-02 (Monday): regular support
- 2017-01-03 (Tuesday): regular support
- 2017-01-04 (Wednesday): regular support
- 2017-01-05 (Thursday): regular support
- 2017-01-06 (Friday): SLA-covered emergency support only
Happy holidays to everybody and see you in 2017!
* business hours = Mo-Fr, 8-16 CE(S)T
In 2016 we have seen substantial growth in our infrastructure after we changed our pricing model to make resources more affordable and introduced NixOS-based 64-bit virtual machines. Unfortunately, we failed to adapt our infrastructure to this growth in good time and thus, ultimately, failed to provide our customers with the reliability and performance that they rightfully expect.
We have taken the time to review all the incidents we experienced this year and decided to invest further in two critical areas of our infrastructure: networking and storage.
Earlier this year, after a long period of research, we discovered Brocade's VDX offering. Taking advantage of their flexible subscription option, we replaced our transitional 10 Gigabit infrastructure with Brocade VDX 6740T switches, which have been working fantastically for about two months now. However, we are currently still running a 1 GE HP ProCurve-based infrastructure for some of our VLANs. At the moment, this situation has multiple drawbacks:
- we do not yet have full active-active redundancy for all servers and all networks
- we still have a mixed-vendor environment that has its own risks (as we have seen with our outages earlier this year)
- we do not benefit from the reliability features that Brocade offers (like persistent logging, full meshing)
- we require too many cables per server and have a much too complicated switch configuration
Our plan for the next weeks and months is to:
- provide two redundant 10GE Brocade switches per rack, so that every server gets redundant access to the network
- remove existing 1GE connections
- move all our networks to tagged VLAN configurations
For you as a customer, this will be visible as a faster network in all VMs with a much, much lower risk of incidents (due to component failure or operating errors) than before.
Our storage has not been able to provide the performance and resilience that we want to deliver to you. It specifically struggled to keep up with our increased load. We have learned multiple things that resulted in our new roadmap:
- For Ceph to show its strength in horizontal scaling, we not only need many disks, but we also need more servers to reduce the impact of individual server failure.
- Allocating our existing HDD pool based on storage capacity alone has lowered the available IOPS per customer dramatically. We need to provide a lot more IOPS, manage them more systematically, and communicate more transparently what can be expected.
- With the advent of high capacity and medium endurance SSD technology we are finally at a point where HDD technology is now turning from a mainstream default choice to a niche solution for low-performance high-capacity tasks, like test environments, archives, backups, etc.
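The first point above can be made concrete with a little arithmetic. The following is only a rough sketch: it assumes that data and load are spread evenly across all storage hosts, which a real Ceph cluster only approximates, and the function names are made up for illustration. The host counts 6 and 10 are the ones named in the roadmap in this post.

```python
def share_lost_on_host_failure(hosts: int) -> float:
    """Fraction of cluster capacity and IOPS that goes offline when one host fails.

    Assumption: data and load are distributed evenly across all hosts.
    """
    if hosts < 1:
        raise ValueError("a cluster needs at least one host")
    return 1 / hosts


def extra_load_on_survivors(hosts: int) -> float:
    """Relative extra load each remaining host has to absorb after one host fails.

    Assumption: the failed host's load is redistributed evenly.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to redistribute load")
    return 1 / (hosts - 1)


if __name__ == "__main__":
    for hosts in (6, 10):
        lost = share_lost_on_host_failure(hosts) * 100
        extra = extra_load_on_survivors(hosts) * 100
        print(f"{hosts} hosts: {lost:.1f}% offline, +{extra:.1f}% load per survivor")
```

Under these simplifying assumptions, a single failure in a 6-host cluster takes about a sixth of the cluster offline and pushes 20% extra load onto each survivor, while in a 10-host cluster it is a tenth and roughly 11%.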
Our next steps for our storage cluster are:
- Add more capacity and IOPS to our existing cluster by extending the HDD pool with SSDs.
- Grow our cluster from 6 storage hosts to 10 to further improve available IOPS and reduce host outage impact.
- Move from n+1 redundancy to n+2 redundancy to allow for more complex failure scenarios.
- Introduce stricter IOPS management for VMs and communicate the rules around it. We have already started to defensively add limits to counter the worst impact. At the moment we are discussing some options that will allow you to choose from different storage classes with different performance characteristics. Very likely we will tie the absolute IOPS limit of each VM to the total storage size that it uses. This reflects the physical reality of how IOPS can be added to a cluster.
- Update our Ceph installation to the next long-term supported version “Jewel”, which will have much more controlled IOPS behaviour to avoid negative impact from maintenance operations on customer traffic.
- Implement a more fine-grained and much more strict capacity management in our inventory system that reflects cluster status, effective usage and booked capacities to avoid over-subscription which results in undetected reduction of redundancy and performance penalties.
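To illustrate the idea of tying a VM's IOPS limit to its storage size, here is a minimal sketch. It is not our actual implementation: the storage class names, the IOPS-per-GiB ratios, and the minimum limits are hypothetical values chosen purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    """A storage class with its own performance characteristics (hypothetical values)."""

    name: str
    iops_per_gib: float  # assumed ratio of IOPS granted per GiB of booked storage
    min_iops: int        # assumed floor so that very small VMs stay usable


# Hypothetical example classes; the numbers are illustrative only.
HDD = StorageClass(name="hdd", iops_per_gib=1.0, min_iops=100)
SSD = StorageClass(name="ssd", iops_per_gib=10.0, min_iops=500)


def iops_limit(storage_gib: float, storage_class: StorageClass) -> int:
    """Return the absolute IOPS limit for a VM based on its total storage size."""
    if storage_gib < 0:
        raise ValueError("storage size must not be negative")
    proportional = int(storage_gib * storage_class.iops_per_gib)
    return max(storage_class.min_iops, proportional)


if __name__ == "__main__":
    for size, cls in ((50, HDD), (500, HDD), (200, SSD)):
        print(f"{size} GiB on {cls.name}: {iops_limit(size, cls)} IOPS")
```

The limit grows linearly with the booked storage because adding disks to a cluster adds both capacity and IOPS; the floor is an extra assumption of this sketch so that small VMs are not throttled to unusable levels.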
These measures will be implemented in the coming days, weeks, and months and will ultimately lead to a more robust, more reliable, and more scalable infrastructure.
We will announce individual steps that require maintenance on our status page. We are also interested in hearing your feedback, specifically when it comes to your requirements of performance and capacity. Let us know by sending an email to email@example.com.