Automatically deleting things — safely and reliably
Managing “stuff” automatically is awesome. Getting rid of “stuff” automatically is even more awesome — but also a lot harder: there be dragons.
After shooting ourselves in the foot in the past, we came up with a system that we feel confident using and maintaining.
We developed a phased approach that splits risky and complex deletion workflows into separate steps starting with tasks that can be reverted easily and then progressing towards increasing impact until reaching the point of no return.
In spring 2014 our platform had an accident: many disk images of virtual machines were suddenly deleted by our automated mechanisms. You can read up on the details in our root cause analysis.
In the aftermath we discovered that our strategy towards deleting was too naive to be sustainable. We reviewed our management code and removed all automatic deletions that we deemed potentially unsafe. We also identified certain patterns around automatically deleting things.
Pattern 1: Deletions can be dangerous in two ways
You forget to delete something when it’s important that you delete it. — You might remove a server from your Ceph cluster but accidentally leave its keyring installed, so it may reconnect later unexpectedly.
You delete something that cannot be restored easily but that it is essential to keep. — The typical example here is customer data.
These two failure modes come in degrees: deleting customer data while having a backup is bad — but not as bad as deleting customer data without one.
During our incident we deleted things that should not have been deleted (customer data), but we deleted them only because we did not delete something else (Ceph keyring on a host removed from the cluster).
Pattern 2: Be wary of deleting things you don’t know anything about
What bug in the code triggered that? Well, no direct “bug”, but an unexpected combination: we migrated a host from our main Ceph cluster to a customer-specific cluster to help them out.
The host was now accessing the customer’s inventory — which was basically empty at that point.
Unfortunately, it also started to reconnect to our primary cluster. Comparing the inventory lists caused that machine to work with inconsistent data. Trying to rectify the situation, the system went ahead and removed from Ceph those VMs it deemed should not exist.
Besides various mechanical precautions we could have established in this one specific case (firewalls between the Ceph clusters, checking whether an image is still in use, …) we see a more general and important structural pattern: important deletions should be performed on the basis of an explicit inventory statement (“I want to delete VM X”) instead of being deduced from the absence of information about a VM. As Python developers know: explicit is better than implicit. There are far more imaginable bugs that fulfil implicit conditions than bugs that fulfil explicit ones.
Our idea here was that we needed to introduce explicit deletion records in our database — which, it turns out, also gives us a great tool for scheduling deletions and for using this data in other important places.
However, this pattern doesn’t mean you have to create deletion records for everything: just be wary of deleting without them.
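To make the pattern concrete, here is a minimal sketch of "delete only on explicit records". The function and the representation of deletion records as a set of names are hypothetical, for illustration only — not our actual code.

```python
# Hypothetical helper: authorize deletions only via explicit records.

def purge_images(deletion_records, stored_images):
    """Return the images that may actually be deleted.

    deletion_records: names for which an explicit "delete me"
        statement exists in the inventory.
    stored_images: image names currently present in storage.
    """
    # Explicit is better than implicit: an image merely absent from
    # the inventory is *not* deleted — only a positive deletion
    # record authorizes removal.
    return [image for image in stored_images if image in deletion_records]
```

Note that a bug which empties the inventory now causes the safe failure mode (nothing gets deleted) instead of the catastrophic one.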
Pattern 3: Reverting deletions can be hard (or impossible)
Last but not least: all of this is really important because some deletions can’t be recovered from easily. If you turn off a Qemu process and your VM stops, it’s usually easy to just turn it back on again. If you delete an IP on a VM you can add it back.
However, some things are impossible to restore, and this only becomes visible over time: data that you deleted is gone immediately if you don’t have a backup. Or it’s gone after that backup expires.
IP addresses may be easy to restore right away, but can be hard if they have been reassigned in between.
And a nasty scenario is when everything gets mixed up: VM images in backup may still contain your data but may be hard to restore from a snapshot because the IPs still configured in them may already have been re-used.
When we discovered those circumstances we switched back to a manual process. This was mostly because we were daunted by the solution that would have to be conceived to make deleting VMs safe again and did not have the time to do it right at that point, but we also did not want to continue using a broken mechanism that had bitten us this badly.
(We even signed up Markus Holtermann to write his Master’s thesis about the issue of automatic unmanagement with us — it goes into more depth than this article and will be published soon.)
Earlier this year we finally got around to implementing our new VM deletion mechanism and established a general workflow/framework for implementing more kinds of safe deletions in our platform in the future.
The general workflow during a safe automatic deletion in the Flying Circus today looks like this:
- Mark a resource to be deleted on a certain day. Show this information in various UIs.
- Preparatory stages inform customers that a deletion is due as the day approaches.
- Soft stages perform operations that effectively stop the resource from being usable but keep it automatically restorable within a very short period of time.
- Hard stages perform operations that cause increasing unavailability of resources and allow re-use of certain sub-components (e.g. IP addresses). The final result of those stages is the guaranteed removal of the primary resource.
- Recycling stages allow the primary resource to be re-used.
This staged approach and the record-keeping for deletions are implemented in a generic fashion and can be hooked up with specific code that performs deletions for virtual machines, physical machines, or any other resource that we need to implement specific deletion workflows for.
The separation between soft and hard stages serves to allow automatic reverts: as long as a resource has not reached a “hard” stage, our inventory will allow cancelling the deletion and will perform a resource-specific operation to restore it to a usable state.
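The staged record described above could look roughly like this. The stage names follow the workflow; the class and field names are made up for illustration and are not the actual Flying Circus implementation:

```python
# Illustrative sketch of a staged deletion record with an ordered
# stage progression and a revert rule tied to the "hard" boundary.

STAGES = ["prepare", "soft", "hard", "purge", "recycle"]

class Deletion:
    def __init__(self, resource):
        self.resource = resource  # e.g. a VM name
        self.reached = []         # stages performed so far, in order

    def advance(self, stage):
        # Stages must run strictly in order; skipping ahead is a bug.
        expected = STAGES[len(self.reached)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.reached.append(stage)

    def cancellable(self):
        # Reverting is only automatic as long as no hard stage ran.
        return "hard" not in self.reached
```

The key design choice is that "can this be cancelled?" is derived from the stages already performed, not stored as a separate flag that could drift out of sync.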
Inventory and API
Our inventory once again allows our users and administrators to delete VMs and to monitor deletion progress:
- Marking a VM to be deleted on a certain day.
- Showing that a VM will be deleted on a certain day and which stages have been reached.
- Showing a list of system-wide deletions, their schedule and status.
- Cancelling a deletion (if possible).
The API that our inventory system needs to provide for the various parts is simple:
- Provide a type-specific list of deletions together with their schedule and status.
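A hypothetical shape of that single API call — a type-filtered list of deletions with their schedule and the stages reached so far — might be (record fields are illustrative):

```python
# Sketch of the one query agents need: deletions for a resource type,
# each with its schedule and status.

def list_deletions(records, resource_type):
    """Return deletions for one resource type, with schedule and status."""
    return [
        {"name": r["name"], "due": r["due"], "stages": r["stages"]}
        for r in records
        if r["type"] == resource_type
    ]
```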
Based on this simple mechanical framework, we implemented the following stages for deleting VMs:
5 days before the deletion, the inventory system creates a maintenance window for the due date (with an insanely long period). This in turn causes the customer to be informed by email.
On the day of the deletion (typically a few minutes past midnight), we:
- Shut down the Qemu process.
- Stop the accounting for this VM.
- Delete the stored puppet configuration.
At this point the inventory still has the master record and the VM is visible in the customer UI as “offline” and having reached the “soft” deletion stage. The “cancel” button is still visible and pressing it will cause the VM to come back online within a few minutes.
If nobody has complained within 3 days of shutting down the VM, we perform more destructive actions:
- Delete the inventory master record, which in turn:
- stops listing the VM in any APIs,
- allows IPs and other numerical values associated with the VM to be recycled, and
- removes the VM’s IPs from DHCPs, firewalls, and name servers.
- Remove all Nagios configuration.
- Remove backup configuration.
- Destroy puppet node information and certificates.
- Delete the persistent Qemu config from hosts.
At this point the VM becomes invisible to the customer and our administrators, and only the deletion record itself remains. The VM cannot simply be turned on again, as various things are now “out of order”:
- The ID of the VM may have been reassigned.
- Puppet won’t recognize the client automatically.
- IPs may have been reassigned.
However, with some manual labour this could still be remedied: the VM name won’t be reassigned as this is protected by the existence of the deletion record. We still have the disk image live and would have to be careful to fix the IP addresses and purge the puppet certificates before starting it again. Not impossible but also not automated (on purpose to keep things simple).
8 days after the due date (i.e. 5 days after the hard stage) we reach the point of “no return” by deleting all persistent data related to the VM:
- Delete the Ceph image (and all snapshots)
- Delete all backups
- Delete historic monitoring data
We pondered whether to delete the Ceph image and the backups at the same time or not, but then decided to keep it simple. When customers want the backups of a VM to stick around, we need them to pay us for it. In turn, if they ask us to delete their data, we want to be honest with them about having deleted things reliably within a reasonable time frame. With this setup you can reach all your goals: keep 3 months of backups (just keep the VM), get a grace period to notice accidents, and be sure that your data has been deleted after you terminate a VM.
The last phase, 38 days after the deletion date, is a very simple one:
- remove the deletion record itself
This period is intended to avoid the confusion of deleting a VM, creating a new one (with the same name but a different purpose), and then wondering whether the old one was already deleted or not.
We do not have to keep this record for the sake of accountability as our inventory keeps track of all changes in a separate log.
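Putting the whole timeline together: the offsets in this article are notice at 5 days before, soft stage on the due date, hard stage 3 days after, data purge after 8 days, and record removal after 38 days. A sketch (function and key names are made up for illustration):

```python
from datetime import date, timedelta

def deletion_schedule(due):
    """Map each stage of the VM deletion workflow to its date."""
    return {
        "notify": due - timedelta(days=5),          # maintenance window, email
        "soft": due,                                # VM shut down, revertible
        "hard": due + timedelta(days=3),            # master record deleted
        "purge": due + timedelta(days=8),           # image, backups, monitoring
        "remove_record": due + timedelta(days=38),  # deletion record itself
    }
```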
After having screwed up big time last year, we took our time to find an adequate solution to this hard problem. We found that tracking deletions as a primary inventory item and using a stage-based workflow system raises our confidence that we can implement much more reliable automatic deletions.
Being aware that “there are no silver bullets”, we recognise the inherent risks and the theoretical limits of finding perfect solutions for this issue. Then again, operations isn’t about not failing, but about learning, increasing your mean time between failures, and reducing your mean time to restore.
Credits: Opening image by kevin dooley, CC-BY