Introducing IO limits to achieve more uniform virtual disk performance

With the upcoming release of our Gentoo platform we will start to regulate the disk performance of all virtual machines. The new rules will help to achieve more uniform performance in our cluster and reduce the impact that load peaks from individual VMs have on other VMs.

In the past we have provided the entire performance capacity of our cluster based on demand without any regulation. Load peaks requiring many IOPS (input/output operations per second) could thus be processed quickly. In a mixed environment this usually evens out. However, we are seeing more and more periods when too many VMs have load peaks at the same time. Instead of evening out, those periods result in performance penalties for the remaining virtual machines.

We are therefore introducing a limit on the number of operations per second, to achieve satisfactory performance for all customers even during periods of increased load.

The new limit will be 250 IOPS. Regular hard drives with a capacity of 1 TiB or more usually provide up to 180 IOPS, so we think that 250 IOPS is a fairly generous amount for VMs that usually have far less space (30-100 GiB).

We have monitored the IOPS requirements of all VMs over the past weeks and found that 250 is a good limit for most applications. Some applications that showed extremely high IOPS have already been reviewed and brought in line with the new limit by optimising their configuration or their code.

The new limit also provides more visibility into performance bottlenecks close to the application, so that we can recognise problematic configuration or programming more easily and then suggest optimisations. We have found two typical scenarios that can be fixed easily:

  • Systems with too little RAM that swap too much or whose VFS cache is too small (we have seen this, for example, with Varnish).
  • Systems whose database configuration does not match the access patterns of the application (we have seen this on MySQL databases with MyISAM tables processing large batch jobs with many INSERT/DELETE operations).

Those cases now become visible directly on an affected VM: the iowait figure goes up in CPU time breakdowns (for example, in top) and disk activity stays pinned at the 250 IOPS limit. This typically results in higher application response times or significantly longer processing times for batch jobs.
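If you would like to check this on a Linux VM without watching top, the two symptoms can be sampled from /proc. The following is only a minimal sketch, not a tool we ship: it assumes the standard Linux layouts of /proc/diskstats and /proc/stat, and the device name "vda", the 5-second interval and all function names are illustrative assumptions that you may need to adapt.

```python
import time

IOPS_LIMIT = 250  # final limit named in this post


def completed_ios(diskstats_text, device):
    """Return reads + writes completed for `device` from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, reads completed, ..., writes completed (index 7)
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise ValueError(f"device {device!r} not found")


def cpu_times(stat_text):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal
            values = [int(v) for v in fields[1:9]]
            return values[4], sum(values)
    raise ValueError("no aggregate cpu line found")


def iops(ios_before, ios_after, seconds):
    """Average completed I/O operations per second between two samples."""
    return (ios_after - ios_before) / seconds


def iowait_percent(cpu_before, cpu_after):
    """Share of CPU time spent in iowait between two (iowait, total) samples."""
    d_total = cpu_after[1] - cpu_before[1]
    if d_total == 0:
        return 0.0
    return 100.0 * (cpu_after[0] - cpu_before[0]) / d_total


def read_text(path):
    with open(path) as handle:
        return handle.read()


def sample(device="vda", interval=5.0):
    """Take two samples `interval` seconds apart and return (iops, iowait %)."""
    ios_0 = completed_ios(read_text("/proc/diskstats"), device)
    cpu_0 = cpu_times(read_text("/proc/stat"))
    time.sleep(interval)
    ios_1 = completed_ios(read_text("/proc/diskstats"), device)
    cpu_1 = cpu_times(read_text("/proc/stat"))
    return iops(ios_0, ios_1, interval), iowait_percent(cpu_0, cpu_1)
```

Calling sample("vda") a few times during a slow period gives a rough picture: a first value that stays close to IOPS_LIMIT together with a high second value suggests the VM is running into the limit. Note that merged requests and the point at which the limit is enforced can make the in-guest count differ somewhat from what the host measures.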

We are going to roll out the new limit in two phases to soften the impact on the remaining VMs that haven’t shown high IOPS before. In the first phase all existing VMs will be limited to 1,000 IOPS. If any problems arise from that, we will respond to them and help prepare the affected VMs for the final limit. After two weeks, we will then lower the limit to its final value of 250 IOPS. Again, if problems appear, we will work them out together with you.

If your application requires a continuously higher number of IOPS despite all optimisations, we recommend considering a switch to our SSD-based offering, which provides at least 10,000 IOPS.

Let us know if you have any questions or feedback.
