Unimposing, less than fashionable, often hacked together without passion: yet these little periodic data import jobs are still ubiquitous in any sizable datacenter. They often provide the glue that makes data flow from one system to another. If they break, important stuff may get stuck. It’s time to pay them the attention they deserve.
The example we will follow throughout this article is Redmine’s mail import mechanism. The fine documentation suggests setting up a cron job like this:
*/2 * * * * mail_import
This is the script referenced from the crontab entry in its simplest incarnation:
#!/bin/bash
rake redmine:email:receive_imap RAILS_ENV=production \
    host=imap.example.org firstname.lastname@example.org password=secret
Honestly, what could go wrong with something as simple as that? Unfortunately, a lot. Some time ago, a mail import job that feeds mails into the Flying Circus support tracker got stuck. We did not notice it for nearly a day. Instead, we were wondering why none of our customers raised support issues. Everyone ultimately happy? Everyone on leave? Not at all. The support mails simply did not come through. Luckily, we figured it out before any of our customers got really angry. But it should not happen again!
Not much is required to make such a job production-grade. In the following, I will discuss three steps: state-based monitoring, thorough logging, and overrun protection.
Step 1: State-based monitoring
Sure, rake is supposed to emit some sort of error message to stderr in case of failure. Cron usually mails stdout/stderr from cron jobs to an admin address. But let us face it: who reads all cron mails? Are we even sure that cron mails are delivered at all? A better alternative to one-off error messages is a steady state that turns green if the job has succeeded lately and turns red if something goes wrong. I recommend touching a time stamp file whenever the import job finishes without error. Given that the job is supposed to run every 2 minutes, and allowing a bit of error margin, we could monitor that the time stamp file is no older than 4 minutes. File age checks are included in any decent monitoring solution. With that in mind, we can improve the script so that it looks like this:
#!/bin/bash
set -e
rake redmine:email:receive_imap RAILS_ENV=production \
    host=imap.example.org email@example.com password=secret
touch .stamp-mail-import
The -e flag instructs the shell to terminate immediately if any command exits unsuccessfully. The time stamp file will thus not be touched if rake fails. We now have to set up a monitoring check which will signal a problem if the stamp file is too old.
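As an illustration of such a check, here is a minimal sketch in bash. The function name check_stamp, the 240 second default, and the exit codes (0 for OK, 2 for CRITICAL, as many Nagios-style plugins use) are assumptions for the sake of the example, not part of any particular monitoring product. It also assumes GNU stat.

```shell
#!/bin/bash
# Hypothetical monitoring check: succeeds if the stamp file was modified
# within the last max_age seconds. Exit codes 0 (OK) and 2 (CRITICAL)
# follow a common plugin convention and are an assumption here.
check_stamp() {
    local stamp=$1 max_age=$2 now mtime age
    if [ ! -e "$stamp" ]; then
        echo "CRITICAL: $stamp does not exist"
        return 2
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$stamp")   # GNU stat: modification time in epoch seconds
    age=$(( now - mtime ))
    if [ "$age" -gt "$max_age" ]; then
        echo "CRITICAL: $stamp is ${age}s old (limit ${max_age}s)"
        return 2
    fi
    echo "OK: $stamp is ${age}s old"
    return 0
}

# Run the check only when a stamp file path is given on the command line.
# 240 seconds is the 4 minute limit discussed above.
if [ "$#" -ge 1 ]; then
    check_stamp "$1" "${2:-240}"
    exit $?
fi
```

Saved as, say, check_stamp.sh (a made-up name), it could be called by the monitoring agent with the path to .stamp-mail-import as its first argument.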
Step 2: Thorough logging
What if the monitoring check goes red? It’s debugging time. Hopefully the rake job leaves a detailed error message in its log file. But this is not enough. What, for instance, if someone messes around with our Ruby installation so that rake fails to start right away? I prefer to capture the script’s output on the outermost level to syslog. The rewritten crontab could look like this:
SHELL=/bin/bash
*/2 * * * * mail_import |& logger -t redmine/mail_import
Now we can find anything written to stdout/stderr nicely timestamped in /var/log/messages. Notice the changed shell setting: the standard POSIX shell /bin/sh often does not understand how to redirect stdout and stderr together into a pipe (“|&”).
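If changing cron's shell is not an option, the same effect can be had in plain sh: in bash, |& is shorthand for 2>&1 |, and the long form is understood by POSIX shells. The following sketch shows the long form at work. The function fake_import is a made-up stand-in for mail_import, and cat takes the place of logger so that the result is visible on the terminal.

```shell
#!/bin/sh
# Stand-in for mail_import (hypothetical): writes one line to each stream.
fake_import() {
    echo "imported 3 mails"
    echo "warning: slow IMAP server" >&2
}

# 2>&1 duplicates stderr onto stdout before the pipe, so both lines reach
# the consumer. In the crontab the consumer would be
#   logger -t redmine/mail_import
# here it is cat.
fake_import 2>&1 | cat
```

The corresponding crontab command would then read mail_import 2>&1 | logger -t redmine/mail_import, without the SHELL line.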
Step 3: Overrun protection
Now we have a reliable indication of whether the job is running successfully, and we have the means to figure out what went wrong. But we want the import job to recover on its own from minor irregularities like overloaded mail servers or short-term loss of network connectivity. To achieve this, I prefer to apply both timeouts and process interlocking. The timeout is, like logging, best enforced on the outermost level. So we rewrite the crontab entry once again:
SHELL=/bin/bash
*/2 * * * * timeout 15m mail_import |& logger -t redmine/mail_import
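To get a feeling for what timeout does when the limit is hit, here is a small sketch with the limit shortened to one second. It assumes GNU coreutils timeout; sleep 5 is a stand-in for a hung import job.

```shell
#!/bin/bash
# sleep stands in for a hung import job (hypothetical); the 1 second limit
# is shortened from the 15 minutes used in the crontab.
status=0
timeout 1 sleep 5 || status=$?

# GNU timeout reports exit status 124 when it had to stop the command.
echo "exit status: $status"
```

A killed job never reaches the touch line of the import script, so the stamp file ages and the monitoring check from step 1 eventually turns red.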
The maximum run time should not be too short. Otherwise, the import job could fail needlessly under high load or when the network is flaky. The main purpose of applying a timeout is to ensure that a half-done import job does not hang around indefinitely. Such a generous timeout means that up to 8 jobs could be running in parallel in the worst case: with a start every 2 minutes, the runs started at minutes 0, 2, ..., 14 are all still alive just before the first one is killed at minute 15. If the job is not re-entrant, we don’t want to risk overruns. Recently, overruns like the ones discussed here caused one support mail to receive two different tracking numbers at Flying Circus support. It was no fun to end up with one half of the conversation in one ticket and the other half in another. We had better add interlocking for increased peace of mind. With periodic jobs, I prefer non-blocking interlocking:
#!/bin/bash
set -e
exec 9>> .lock-mail-import
flock -n 9
rake redmine:email:receive_imap RAILS_ENV=production \
    host=imap.example.org firstname.lastname@example.org password=secret
touch .stamp-mail-import
The exec statement opens file descriptor 9 (the number has been arbitrarily picked) and keeps it open until script execution terminates. In the following line, flock tries to obtain an exclusive lock on that file descriptor, or exits with an error code immediately. This way, we make sure that there is always at most one instance running. If one invocation is still running while the next is due, the second invocation aborts. The timeout ensures that a long-running instance will not block subsequent jobs forever.
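The locking behaviour can be tried out without touching the real import. The following sketch assumes flock from util-linux and uses a throwaway lock file created with mktemp. The three "jobs" are plain subshells standing in for script invocations.

```shell
#!/bin/bash
# Sketch of non-blocking interlocking with a throwaway lock file.
lock=$(mktemp)

# First "job": takes the lock on descriptor 9 and holds it for 3 seconds.
(
    exec 9>>"$lock"
    flock -n 9
    sleep 3
) &
sleep 1   # give the background job time to take the lock

# Second "job": the lock is taken, so flock -n fails right away
# instead of waiting.
if (exec 9>>"$lock"; flock -n 9); then
    second="got lock"
else
    second="skipped"
fi
echo "second invocation: $second"

# Once the first job exits, its descriptor is closed and the lock is released.
wait
if (exec 9>>"$lock"; flock -n 9); then
    third="got lock"
else
    third="skipped"
fi
echo "third invocation: $third"
rm -f "$lock"
```

The second invocation should report "skipped" and the third "got lock". No cleanup of the lock is needed: it lives on the open file descriptor and disappears with the process, so a crashed job cannot leave a stale lock behind.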
Not much is needed to get periodic data import jobs out of the “pray that it won’t break” zone. The three steps “state-based monitoring”, “thorough logging”, and “overrun protection” are usually enough to raise reliability to production level.