Internal and External Interfaces From an Operations Perspective (long read)

We recently pitched for a project that explicitly asked for a team with experience dealing with a system that has many interfaces. I sat down and wrote an overview of our experience and what we learned about internal and external interfaces over time – and I’d like to share this in today’s post.

Individual applications (be it a business application, a CMS, or something else) are often assembled from many components (which talk to each other over internal interfaces) and other services (which are reached over external interfaces). For the sake of this article, I’m mostly considering inter-process interfaces, typically where processes speak HTTP or another application-level protocol on top of IP.

Developers spend a good amount of time choosing their components wisely, and companies carefully choose the services they connect to based on their needs. (Well. Mostly.) There are many things to watch out for when building this groundwork, but the work doesn’t stop there: you also have to ensure that things keep functioning over the (potentially long, and usually longer than expected) lifetime of your project.

Issues with interfaces can be complex and can sometimes only be detected, understood, and dealt with appropriately over time. After all – your system is distributed and that is a hard problem to have. Issues with interfaces will cause issues in your system that affect its performance and stability, which in turn can result in outages or even challenge the usefulness and acceptance of your application.

Analysis and Conceptual Phase

One thing we learned in our more than 15 years of experience here is that it’s wise to start at the beginning. Even if we just inherit a system in maintenance mode, we like to start talking to the development team as early as possible. At this point it’s useful to create a (however small) inventory of internal components and external systems and visualize their relations. Ask questions like how network issues (speed, outages, packet loss) will affect both sides of the interface. Ask what happens to the overall system when a specific component crashes. What happens when you have to quickly reboot all servers in no specific order? Also, what happens when traffic spikes?

As the complexity of all applications continues to rise, it’s impossible to enumerate all potential issues and foresee all interactions. However, getting an overview of the complexity itself and a feeling for components that may require special attention (because they might be critical from a business point of view or because they form a nexus that concentrates many interfaces at one point) will allow you to focus your energy and choose which risks to mitigate now and which to accept and deal with later.
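Such an inventory does not need special tooling. The following is a minimal sketch, assuming a made-up set of components (the names in `INTERFACES` are purely illustrative and not taken from a real project). It records each interface as a caller/callee pair and counts how many interfaces touch each component, which is one simple way to spot a nexus.

```python
from collections import Counter

# Hypothetical inventory: (caller, callee, kind). All names are made up.
INTERFACES = [
    ("haproxy", "app", "internal"),
    ("app", "postgresql", "internal"),
    ("app", "cache", "internal"),
    ("app", "ldap", "internal"),
    ("app", "twitter-api", "external"),
    ("app", "mail-relay", "external"),
]


def interface_counts(interfaces):
    """Count how many interfaces touch each component (on either side)."""
    counts = Counter()
    for caller, callee, _kind in interfaces:
        counts[caller] += 1
        counts[callee] += 1
    return counts


def external_dependencies(interfaces):
    """List the components we call but do not operate ourselves."""
    return sorted({callee for _caller, callee, kind in interfaces
                   if kind == "external"})


if __name__ == "__main__":
    # Components with the most interfaces come first.
    for component, count in interface_counts(INTERFACES).most_common():
        print(component, count)
    print("external:", external_dependencies(INTERFACES))
```

Even a list this small is enough to walk through the questions above one component at a time.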

Implementation

During implementation it makes sense for us to stay available to developers for questions about anything they have a bad feeling about. As no single person will have seen every component in various states of “being on fire”, we like to invite developers to ask questions about operational issues early on in a project. This can sometimes lead to revisiting architectural choices, or just to simple advice like “make sure you add a timeout to your sockets over there” or hints about what data would be useful to log in certain parts of the application.

Internal vs. External

Why is it important for us to differentiate between internal and external interfaces? Because it influences how we can prepare for problems and how much we can influence the architecture of the system over time to make it more robust (or: antifragile).

With internal interfaces we control both endpoints of the connection. If something goes wrong we can look at all involved components and the infrastructure and decide how to fix it (and what to do to avoid it in the future). Those internal interfaces are typically between your webserver and caches, or your application server and your database. Here we also have the most room for preparing properly. The more control you want or need, the more it makes sense to run a part internally.

External interfaces, however, limit our influence to one part of the connection. The second part is run by a third party, typically in a different data center. Typical examples are social media networks, but they can also be external mail servers, authentication services, databases from partner companies, or anything else, really.

The issue with external interfaces is that many times you’ll be working with no specified service level and will have to deal with unknown availability or performance. One of the best strategies we found is to decouple those functions in strategic places in your code so that you can provide operational controls (like setting timeouts or even just disabling a certain interface) in case of an emergency. It also makes sense to have good and direct personal contact with the operators of the other side so that they can be persuaded to make reasonable changes on their end if the need should arise.
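What such operational controls could look like is sketched below. This is only an illustration under assumptions: the class name `ExternalInterface` and the environment variable names (`<NAME>_ENABLED`, `<NAME>_TIMEOUT`) are invented for this example, and a real application might read the switches from a config file or an admin interface instead.

```python
import os


class ExternalInterface:
    """Wraps calls to one external service behind operational controls."""

    def __init__(self, name, fetch, fallback, environ=os.environ):
        self.name = name
        self.fetch = fetch        # callable(timeout=...) doing the network call
        self.fallback = fallback  # value used when the call is skipped or fails
        self.environ = environ    # where operators flip the switches

    def enabled(self):
        # Hypothetical switch: set e.g. TWITTER_ENABLED=0 to disable the call.
        return self.environ.get(f"{self.name}_ENABLED", "1") != "0"

    def timeout(self):
        # Hypothetical knob: e.g. TWITTER_TIMEOUT=0.5 (seconds).
        return float(self.environ.get(f"{self.name}_TIMEOUT", "2.0"))

    def call(self):
        if not self.enabled():
            return self.fallback
        try:
            return self.fetch(timeout=self.timeout())
        except OSError:
            # Socket errors and timeouts are OSError subclasses in Python 3.
            return self.fallback
```

The point of the design is that the rest of the application only ever sees a result or the fallback, so disabling the interface in an emergency does not require a code change.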

Additionally, external interfaces may change or be retired in time frames that might be out of your control or even unknown to you. So staying in touch with the provider is doubly worthwhile. You may gain some influence and knowledge that will prepare you to either adapt to a change on the other side yourself or get a little more runway if you’re in a pinch.

And last but not least, external interfaces can even mean that we control no part of the connection. For example, if you embed Twitter through JavaScript, then one endpoint is your user’s browser and the other is Twitter. (This has benefits and drawbacks and isn’t a bad idea or a good idea per se.)

Service Level

We also like to look at interfaces – especially internal interfaces – to think about required service levels. With complex applications it makes sense to identify the parts of your system that may experience reduced availability without waking someone up in the middle of the night, whereas others may be critical. Here the overview of the interfaces comes into play again: walk through it and discuss whether a specific component needs a certain reliability or whether it’s fine when it’s offline for a while.

Monitoring

Similar to the service level discussion, an overview of your interfaces also gives you good input for what to monitor. It makes sense to monitor external interfaces as well, so that when your application isn’t behaving right you can quickly see that some external service isn’t reachable. Make sure you monitor that service from the location (or even the machines) that your application is living in. Your application might also want to log the connections it establishes to internal and external interfaces: specifically errors, but also enough to see whether any connections are still open (and may be piling up).
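A reachability check of this kind can be very small. The sketch below is an assumption about what such a probe might look like rather than a description of our actual monitoring: `check_tcp` is a made-up helper that only tests whether a TCP connection can be opened within a time limit, and it would need to be run from the application’s own machines to be meaningful.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (ok, seconds_taken, error_text)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, ""
    except OSError as exc:
        # Covers refused connections, unreachable hosts, DNS errors, timeouts.
        return False, time.monotonic() - started, str(exc)
```

A monitoring system could call this periodically and alert on failures or on connect times that creep upwards. It says nothing about whether the service answers requests correctly, so it complements application-level logging rather than replacing it.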

Anecdotes

Some of the things we learned may seem obvious or even dry. However, the way we got there and what we remember when we talk about them is quite interesting and maybe entertaining (in hindsight more than in the situation itself).

1. Accessing Twitter in “Realtime”

A customer’s web CMS showed massively spiking response times that resulted in outages. The analysis showed that requests to the main page and specific sub pages took extremely long. (haproxy’s logging is very valuable for this.) The application servers showed no activity on disk, CPU or network. Using a custom tracing utility we saw that those requests always got stuck when talking to the Twitter API. At that moment Twitter’s API was having issues, which in turn caused the customer’s application to get stuck as well. Unfortunately, Twitter did not reject the requests it couldn’t process but left them hanging. We quickly provided a fix by adding a firewall rule that blocked access to the Twitter API for the application. Fortunately, the developers had already built their code in a way that the site then continued to work without showing the Twitter integration. We advised the developers to guard their network code with reasonable socket timeouts, and from then on Twitter outages no longer caused outages in the local application.
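To illustrate the kind of guard we mean, here is a minimal sketch using Python’s standard `socket` module. It is not the customer’s code; the function name `fetch_with_timeout` and the default of two seconds are assumptions made for the example.

```python
import socket


def fetch_with_timeout(host, port, request, timeout=2.0):
    """Send request bytes and read a reply; return None if the peer stalls."""
    try:
        # The timeout applies to connecting and to every later send/recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            return sock.recv(4096)
    except OSError:
        # Timeouts and connection errors both end up here.
        return None
```

Without the timeout, `recv` would block for as long as the remote side keeps the connection open without answering, which is exactly the situation described above. With it, the caller gets `None` after a bounded wait and can render the page without the external content.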

2. LDAP access for outside vendors

A customer runs an intranet whose authentication is done with an LDAP server run as an internal interface. However, the customer allows developers to access this LDAP server for other applications that may run somewhere else. In this case we made it a point to be available for developers and provide them with a staging instance and also usually brief them about encryption aspects.

3. Operating massive mailing lists and mail servers

A customer runs an application with more than 30,000 users and a wide range of mailing lists. As email is a protocol that can break in many places, we typically encounter situations where we have to talk to corporate IT departments all over the world. We still have to explain why certain anti-spam measures are a good idea, and our whitelist for servers that don’t behave well has grown quite a bit over time. Mail is typically quite challenging as there are departments out there with rules about email that were carved in stone 20 years ago. It helps if you can be flexible on your side without sacrificing your own security. If you do mail (even if you don’t do SMTP directly but hand it over to a third party) – be prepared to spend time shaking your fist at people.

4. Scaling your database

Imagine a business application with an internal database. As the application’s popularity grew, the response times for users grew as well. It appeared that the database was spending too much time on disk IO, so we decided to increase the system’s memory. (This was some years ago and I think we increased the machine’s RAM from 8 GiB to around 24 GiB or so.) Unexpectedly, the application’s performance not only failed to improve, it became atrociously worse. We took this as a good indicator that something was absolutely wrong and spent multiple hours on diagnostics. The database in question was a PostgreSQL database and we found that the application was opening a new connection for every query. Not just for every transaction (which would still have been bad) but for every query – so this also effectively eliminated all transaction mechanisms. This meant every page load caused about 40-50 new database connections. However, PostgreSQL explicitly expects you to reuse and pool your connections (and use transactions properly). The manual even states that opening a new connection is not intended to be fast.
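The difference between opening a connection per query and reusing connections can be shown with a very small pool. The sketch below is an illustration only: `ConnectionPool` is a made-up class, and it uses Python’s built-in `sqlite3` as a stand-in so that it runs anywhere. With PostgreSQL you would more likely rely on the pooling offered by your driver or by a dedicated pooler.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed number of open connections and hands them out on demand."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost once, up front

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # wait for a free connection
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it


def demo(queries_per_page=50, pool_size=2):
    """Run many queries and report how many connections were opened."""
    opened = []

    def connect():
        conn = sqlite3.connect(":memory:")  # stand-in for a real database
        opened.append(conn)
        return conn

    pool = ConnectionPool(connect, pool_size)
    for _ in range(queries_per_page):
        with pool.connection() as conn:
            conn.execute("SELECT 1").fetchone()
    return len(opened)


if __name__ == "__main__":
    print(demo())  # 2 connections for 50 queries
```

The number of connections opened depends only on the pool size, not on how many queries a page needs.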

However, we were still stumped that the performance actually dropped when adding more RAM. In our case, the amount of available RAM in a machine is fed into our automation system, which includes rules to configure PostgreSQL with matching settings for the available memory. When we increased the RAM we also implicitly increased those values. That was known and intended – after all, we added the memory to be used. But: when PostgreSQL receives a new connection on the master process, it forks a new process to handle this connection. When that happens, the fork instructs Linux to create a copy of the page table associated with the process. Now, as the process was using more memory than before, the page table was larger and the time needed to establish a new connection increased substantially. Where a fork initially took around 15 ms or so (as far as I remember), we would now spend about 50-60 ms. With 40-50 queries per page, the time spent establishing connections when rendering a page went from previously about 750 ms (which was already slow) to about 2,500 ms. And that doesn’t include the actual time spent working on queries and rendering the page.
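The arithmetic behind those figures is simple enough to write down. The numbers below are the rough recollections from the text, not measurements, and `connection_overhead_ms` is just a helper name chosen for this example.

```python
def connection_overhead_ms(queries_per_page, connect_ms):
    """Time per page spent only on opening one connection per query."""
    return queries_per_page * connect_ms


# Figures as recalled in the text; they are estimates, not measurements.
before_ms = connection_overhead_ms(50, 15)      # about 15 ms per connection
after_low_ms = connection_overhead_ms(40, 50)   # 40 queries at 50 ms each
after_high_ms = connection_overhead_ms(50, 60)  # 50 queries at 60 ms each

print(before_ms, after_low_ms, after_high_ms)  # 750 2000 3000
```

The 2,500 ms mentioned above sits in the middle of the 2,000 to 3,000 ms range that the recalled figures allow.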

Once we discovered what had happened, we fixed the issue temporarily by reducing the configured memory in PostgreSQL again (which reverted performance to the slightly better state). After that we helped the developer team identify the functions in their application that caused the largest number of queries, and they reduced the number of connections within a few hours. With the application substantially improved, we saw much better response times and could also leverage the additional memory for further improvements.

5. Batch importing from an external database

Another customer is running an information portal. The portal is an export of the customer’s main product and runs on a simplified (static) MySQL database. The database was updated every morning with a simple batch job using a CSV import and some post-processing. At some point the customer changed the information portal code and decided to import a substantially larger amount of data than before using the established mechanism.

The import ran in a two-step process to reduce visible outages while records were deleted, updated, or added. (Proper transaction management was impossible on MySQL in this case as the performance would have been abysmal.) The first step was to create a new table in the database and use MySQL’s native CSV import. This worked fine and was fast. After that a complex query would synchronize the existing “live” table and the temporary “import” table by injecting new records, updating old records, and deleting deleted records.

With the new amount of data we noticed two things: a) the application would become unresponsive during the import for around 20 minutes and b) the machine started to consume thousands of IOPS for that time. The unresponsiveness was due to MySQL locking records and thus blocking further (read-only) queries while the import was running. The IOPS were due to the row-level granularity in which MySQL was processing inserts and deletes.

We needed a quick fix and noticed that an earlier assumption was broken: instead of directly importing the CSV we had this huge query going on to reduce the downtime. But the huge query had now become the cause of the downtime. We decided to simplify the import step: import the CSV into a temporary table, then drop the live table and turn the temporary table into the live table by renaming it. Yes, this now has a period of time where the table does not exist and queries will fail. However, queries were already failing for a period of about 20 minutes anyway. The new approach limited the downtime to less than a second.
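The drop-and-rename step can be sketched in a few lines. To keep the example runnable anywhere it uses Python’s built-in `sqlite3` as a stand-in for MySQL, and the table names `records` and `records_import` are invented for illustration. The exact statements on MySQL would differ.

```python
import sqlite3


def swap_in_import(conn, live="records", staging="records_import"):
    """Replace the live table with the freshly imported staging table."""
    # Table names are hypothetical and must come from trusted code,
    # because identifiers cannot be passed as query parameters.
    conn.execute(f"DROP TABLE IF EXISTS {live}")
    conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO records VALUES (1, 'old')")
    conn.execute("CREATE TABLE records_import (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO records_import VALUES (1, 'new')")
    conn.commit()
    swap_in_import(conn)
    print(conn.execute("SELECT name FROM records").fetchall())  # [('new',)]
```

The expensive work (loading the CSV) happens while the old table is still serving queries, and only the short swap at the end is visible to the application.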

Learning about this, the customer then also decided to go forward with a much improved version of its import strategy and built a proper API that would stream changes to the database throughout the day without any required downtimes.

Thanks for listening!

If you took the time to read this far: thanks for your time. I do not offer a specific grand unifying theory about software interfaces. Maybe you found interesting or valuable bits to take away for your own applications. Let us know in the comments if you did – and we’re also happy to hear your stories on software interfaces!
