Last weekend I attended the first Python Unconference Hamburg – a great event organized by a great team.
A session I chaired was called #failexchange and provided a small forum for people to discuss any kind of failure they had run into. I like these formats because they’re both entertaining and show that everyone, whether developer, operator, or user, is feeling the same pain: we all run into similar issues and nothing is perfect.
Here’s a short list of things that I used to start the session with:
- Chrooting your qemu process is great if you’d like to be reminded why copying ‘resolv.conf’ during a Gentoo installation is an important step. Forget it and you’ll be surprised that everything works, except that you get DNS resolution errors when you try to live-migrate the machine somewhere else.
- Bacula is an enterprise-level backup utility. However, if you’re running your director daemon with the wrong settings and discover, in the middle of a large-scale disaster recovery, that one of your backups is faulty, you’re in a bad spot. You’ll notice that there are no easy settings for restoring that faulty backup without interrupting your whole recovery process, because the director becomes a major bottleneck.
- The Linux IPv6 stack has some rough edges. If you’re using “ucarp” for router failover and are writing your up/down scripts to manage IPv6 addresses, you’ll notice that there is nothing as simple as “arping” around for IPv6, and your nodes will easily cache the old address for minutes or even longer. Additionally, it’s easy to hit a race condition and have the IPv6 stack decide that a duplicate address event happened. Figuring that out can be hard if you didn’t notice the little “dad” bit in the middle of the “ip -6 a” output.
- Middleboxes like Fastly help services scale. They also help services fail in interesting ways. We keep experiencing issues when a Fastly partner intermittently routes traffic incorrectly, causing GitHub, Bitbucket, and PyPI to appear flaky in our data center. It’s good to know that Fastly provides immediate support via IRC for those events. Also, having statuspage show third-party information helps customers figure out whether a problem they experience is with us or on a different level.
- If the internet goes down miraculously in various parts of the world then we might just experience a “640k is enough for everyone” moment.
- Does anyone remember 8.3 filenames? Well, depending on how you look at Unix process names, you’ll be reminded of that era.
- Similarly, quoting on the filesystem is really important. Otherwise a Perl-based part of your toolchain might decide to choke on “*.mydomain.com_accesslog” when customers start using nginx wildcard server names.
- And a last one that is actually about Python: MemoryError is a StandardError, which means that, even though it indicates a situation where your application isn’t going to perform much meaningful work any longer, it typically doesn’t exit the main loops of application servers or frameworks.
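To make the MemoryError point a bit more concrete, here is a minimal sketch of such a main loop. It assumes Python 3, where StandardError no longer exists and MemoryError derives directly from Exception, so a broad `except Exception` still swallows it. The names `serve`, `handler`, and `log` are made up for illustration and don’t come from any particular server or framework.

```python
def serve(requests, handler, log):
    """Hypothetical main loop: survive ordinary handler errors, but not MemoryError."""
    results = []
    for request in requests:
        try:
            results.append(handler(request))
        except MemoryError:
            # Running out of memory is not a per-request problem: re-raise so
            # the process can exit and be restarted by its supervisor.
            raise
        except Exception as exc:
            # Without the clause above, MemoryError would end up here too,
            # because it is a subclass of Exception, and the loop would carry on.
            log.append(f"request {request!r} failed: {exc!r}")
    return results
```

Whether re-raising is the right reaction depends on the application; the point is only that the decision has to be made explicitly, because the usual catch-all handler won’t make it for you.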
Thanks again to the organizers and the great community – it’s a pleasure to meet up with you every single time!