Debugging web application performance with HAProxy – Part 2

Welcome back. In part one, we discussed the basics of reading haproxy logs – today we’ll look further.

My site is slow!

So you notice that your application is not as fast as you’re used to – but why?

The most common reason is, obviously, that your application itself takes a long time to respond to requests, but not always. To find out whether this is true (or whether something else is going on) haproxy has “timers”.

HAProxy timers

The example log excpert from part 1 showed the following sequence of numbers:  10/0/30/69/109

Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"

These are the “timers” and they’re given in milliseconds (ms). One by one, here’s what they are called and what they do:

  • Tq (10): How long did it take to receive the HTTP request. This does not include the data of a POST or PUT.
  • Tw (0): How long did the request wait in queues? This is the sum of the times from the backend queue (‘static’) and the server queue (‘srv1’).
  • Tc (30): How long did it take HAProxy to connect to the backend server on a TCP/IP level?
  • Tr (69): How long did it take the backend server to send us the HTTP response headers. (This does not include sending the response data.)
  • Tt (109): How long did the request take in total – from accepting the TCP connection to closing it?

High Tr means a problem on the side of the application

When you see high values for Tr that means the problem is (very likely) located in your application. For further investigation you need to leave HAproxy and have a look at the application server. Typical issues that increase answer time from your application are:

  • The CPU is too busy – you’ll need to add more CPU power (ideally through more servers).
    Ideally, under this condition, you limit the number of concurrent requests per application server to 1 (via maxconn) and follow the rule “1 request = 1 thread = 1 CPU core” as you want haproxy to pass requests to an application process whose CPU actually has time to work on it as fast as possible. If you have a mixed CPU/IO load you can slowly increase your threads and requests until you hit saturation.
  • If the CPU is idle, then you might be I/O bound – check tools like “dstat” or “iostat” and “top” to see whether processes are stuck in “iowait”. Increasing I/O throughput isn’t always possible and you may need to figure out smart ways to change either your applications behaviour, consider caching, or upgrade your resources.
  • If you see neither a busy CPU nor a saturated IO system then you might be calling  services that are too slow. This could be an internal service like your database (in which case, repeat the debugging procedure there) or an external service (like the Twitter API).
    If your application depends on external API then you need to consider implementing a small socket timeout for those connections (typically less than 5 seconds). Defaults may be “inifite” timeouts which will cause your requests to hang infinitely.

If your application responds slowly, you may also see secondary effects like your queue counters going up. Let’s look at these.

Queue counters

In our example log entry that is the 1/1/1/1/0. It is a snapshot of the queue state at the time of logging. In order of appearance:

  • actconn: Connections to and from the HAProxy process.
  • feconn: Connections to the HAProxy frontend, i.e. connections made to the HAProxy process.
  • beconn: Connections which are handled by the backend the logged request used.  This includes active (currently being processed) and queued (waiting for a server) connections.
  • srv_conn: Connections which are active at the server the logged request used.
  • retries: Number of retries. This should be 0. Requests get re-dispatched if a server goes down after the connection is attempted.

Escalating unresponsiveness

If your application is slow, and your service becomes more erratic over time then the following might be happening:

  • haproxy passes requests to your application
  • the application takes a long time to process the request
  • haproxy times out waiting for the request and either drops it or re-dispatches it (depending on the application TCP accept behaviour)
  • the server is marked as “free” and haproxy passes another request to your application

This can mean that a worker thread (or process) of your application is still active and haproxy keeps adding requests to it. This happens because haproxy doesn’t know what your process is actually doing – it assumes that it’s the only one passing work to your application and it makes the simplification to assume that when it closes the connection it can use the server (as long as the health check is fine).

So even if you set “maxconn” to a specific number, then your application might be dealing with many more requests and thus result in escalating response times. Also note that haproxy does not include health checks in the maxconn setting and that you may need to configure multiple threads in your application to satisfy out-of-band requests like monitoring.

Servers are fine, what else might it be?

Well, your site might also just be experiencing a lot more requests than usual. Here are some other common scenarios:

  • If your application servers are responding fine (low Tr) and requests are still waiting (high beconn),  your  problem might be that your have more visitors than usual or that your cache hit rate might not be what you’re used to.
  • If you see high srv_conn values for only one server, this one server has some problem that doesn’t affect the application structurally. Might be an infrastructure issue with your provider?
  • If you see only one server used all the time while the others are idle, review your configuration. Is the frontend/backend configuration correct? Also use the leastconn balance algorithm.
  • If you see erratic Tr figures combined with a high beconn queue this could indicate that distribution of requests to the servers can be improved. Also, separating your servers into different pools based on your request characteristics (IO, CPU, cache affinity) often helps wonders.

Improving your configuration

In the next part we’ll discuss some specific configurations to improve the performance of your web application.

If you would like to stay up to date on what’s happening at Flying Circus subscribe to our newsletter.

 

2 thoughts on “Debugging web application performance with HAProxy – Part 2

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s