HAProxy, high MySQL request rate and TCP source port exhaustion

Synopsis


At HAProxy Technologies, we provide professional services around HAProxy: this includes HAProxy itself, of course, but also tuning of the underlying OS, advice and recommendations about the architecture, and sometimes we also help customers troubleshoot application-layer issues.
We don’t fix issues for the customer, but using information provided by HAProxy, we are able to narrow down the investigation area, saving the customer time and money.
The story I’m relating today comes from one of these professional services engagements.

One of our customers is a hosting company which hosts some very busy PHP / MySQL websites. They have been using HAProxy successfully in front of their application servers.
They used to have a single MySQL server, which was a single point of failure and had to handle several thousand requests per second.
Sometimes they had issues with this DB: the clients (hence the web servers) would hang when using it.

So they decided to use MySQL replication and build an active/passive cluster. They also decided to split reads (SELECT queries) and writes (DELETE, INSERT, UPDATE queries) at the application level.
Then they were able to move the MySQL servers behind HAProxy.

Enough for the introduction 🙂 Today’s article discusses HAProxy and MySQL at a high request rate, and an error some of you may already have encountered: TCP source port exhaustion (the famous high number of sockets in TIME_WAIT).

Diagram


So basically, we have here a standard web platform where HAProxy is used to load-balance MySQL:
[Diagram: HAProxy load-balancing a MySQL master and its replication slaves]

The MySQL master server receives the WRITE requests, while the READ requests are load-balanced by weight (the slaves have a higher weight than the master) across all the MySQL servers.
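As an illustration only, here is a minimal sketch of what such a read/write split can look like in HAProxy; the section names, bind ports, IPs and weights below are assumptions, not the customer's actual configuration (the application sends its writes to one port and its reads to the other):

listen mysql_write
  bind :3307
  mode tcp
  server master 10.0.0.1:3306 check

listen mysql_read
  bind :3308
  mode tcp
  balance roundrobin
  server master 10.0.0.1:3306 check weight 10
  server slave1 10.0.0.2:3306 check weight 20
  server slave2 10.0.0.3:3306 check weight 20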

MySQL scalability

One way of scaling MySQL is to use replication: one MySQL server is designated as the master and must manage all the write operations (DELETE, INSERT, UPDATE, etc…). For each operation, it notifies all the MySQL slave servers. The slaves can then be used for reads only, offloading these requests from the master.
IMPORTANT NOTE: the replication method only scales the read part, so if your application requires many more writes, then this is not the method for you.

Of course, one MySQL slave server can be promoted to master when the master fails! This also ensures MySQL high availability.

So, where is the problem ???

This type of platform works very well for the majority of websites. The problem occurs when you start having a high request rate. By high, I mean several thousand requests per second.

TCP source port exhaustion

HAProxy works as a reverse proxy and so uses its own IP address to connect to the server.
Any system has around 64K TCP source ports available to connect to a remote IP:port. Once a combination of “source IP:port => dst IP:port” is in use, it can’t be reused.
First lesson: you can’t have more than 64K open connections from an HAProxy box to a single remote IP:port couple. I think only people load-balancing MS Exchange RPC services or SharePoint with NTLM may one day reach this limit…
(Well, it is possible to work around this limit using some tricks we’ll explain later in this article.)

Why does TCP port exhaustion occur with MySQL clients???


As I said, the MySQL request rate was a few thousand per second, so we never get anywhere near this limit of 64K simultaneous open connections to the remote service…
What’s up then???
Well, there is an issue with the MySQL client library: when a client sends its “QUIT” sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump will show it to you easily.
Note that you won’t be able to reproduce this issue on a loopback interface, because the server answers fast enough… You must use a LAN connection and 2 different servers.
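If you want to observe this yourself, a capture along the lines of the sketch below (eth0 and 10.0.0.1 are placeholders for your LAN interface and MySQL server) will show the client emitting the first FIN right after its QUIT packet:

# run on the MySQL client box; only connection teardown packets are displayed
$ sudo tcpdump -nn -i eth0 'host 10.0.0.1 and port 3306 and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'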

Basically, here is the sequence currently performed by a MySQL client:

Mysql Client ==> "QUIT" sequence ==> Mysql Server
Mysql Client ==>       FIN       ==> MySQL Server
Mysql Client <==     FIN ACK     <== MySQL Server
Mysql Client ==>       ACK       ==> MySQL Server

This leaves the client’s source port unavailable for twice the MSL (Maximum Segment Lifetime), which means 2 minutes.
Note: this type of close has no negative impact when the connection is made over a UNIX socket.

Explanation of the issue (much better than I could explain it myself):
“There is no way for the person who sent the first FIN to get an ACK back for that last ACK. You might want to reread that now. The person that initially closed the connection enters the TIME_WAIT state; in case the other person didn’t really get the ACK and thinks the connection is still open. Typically, this lasts one to two minutes.” (Source)

Since the source port is unavailable to the system for 2 minutes, this means that above 533 MySQL requests per second you’re in danger of TCP source port exhaustion: 64000 (available ports) / 120 (number of seconds in 2 minutes) = 533.333.
This TCP port exhaustion appears on the MySQL client server itself, but also on the HAProxy box, because it forwards the client traffic to the server… And since we have many web servers, it happens much faster on the HAProxy box !!!!

Remember: at traffic spikes, my customer had a few thousand requests/s….
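To check how close a box is to this limit, simply count the sockets in TIME_WAIT towards the MySQL service; a quick sketch, where 10.0.0.1:3306 is just an example destination:

# with iproute2 (the first line of output is a header)
$ ss -tn state time-wait dst 10.0.0.1:3306 | wc -l

# or with netstat on older systems
$ netstat -ant | grep 10.0.0.1:3306 | grep -c TIME_WAIT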

How to avoid TCP source port exhaustion?


Here comes THE question!!!!
First, a “clean” sequence should be:

Mysql Client ==> "QUIT" sequence ==> Mysql Server
Mysql Client <==       FIN       <== MySQL Server
Mysql Client ==>     FIN ACK     ==> MySQL Server
Mysql Client <==       ACK       <== MySQL Server

Actually, this sequence happens when both the MySQL client and server are hosted on the same box and use the loopback interface; that’s why I said earlier that if you want to reproduce the issue you must add “latency” between the client and the server, and therefore use 2 boxes over the LAN.
So, until MySQL rewrites the client code to follow the sequence above, there won’t be any improvement here!!!!

Increasing source port range


By default, on a Linux box, you have around 28K source ports available (for a single destination IP:port):

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    61000

In order to get 64K source ports, just run:

$ sudo sysctl net.ipv4.ip_local_port_range="1025 65000"

And don’t forget to update your /etc/sysctl.conf file!!!
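A minimal sketch of the corresponding /etc/sysctl.conf entry (applied at boot, or immediately with “sysctl -p”):

# /etc/sysctl.conf
net.ipv4.ip_local_port_range = 1025 65000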

Note: this should definitely also be applied on the web servers….

Allow usage of source port in TIME_WAIT


A few sysctls can be used to tell the kernel to reuse connections in TIME_WAIT faster:

net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_recycle

tw_reuse can be used safely, but be careful with tw_recycle… It can have side effects: clients sharing the same NAT IP might not be able to connect to the same device. So only use it if your HAProxy box is fully dedicated to your MySQL setup.

Anyway, these sysctls were already properly set up (value = 1) on both the HAProxy box and the web servers.

Note: tw_reuse should definitely also be applied on the web servers….
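For reference, here is how you would check and enable them at runtime (a sketch only; as said above, stick to tw_reuse and think twice before touching tw_recycle):

# check the current values
$ sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle

# enable the safe one, and remember to persist it in /etc/sysctl.conf as well
$ sudo sysctl net.ipv4.tcp_tw_reuse=1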

Using multiple IPs to get connected to a single server


In the HAProxy configuration, you can specify on the server line the source IP address to use to connect to a server, so just add more server lines with different source IPs.
In the example below, the IPs 10.0.0.100 and 10.0.0.101 are configured on the HAProxy box:

[...]
  server mysql1     10.0.0.1:3306 check source 10.0.0.100
  server mysql1_bis 10.0.0.1:3306 check source 10.0.0.101
[...]

This allows us to open up to 128K source TCP ports…
The kernel is responsible for assigning a source TCP port when HAProxy requests one. Despite improving things a bit, we still reached source port exhaustion… We could not get over 80K connections in TIME_WAIT with 4 source IPs…
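Note that the additional source IPs must of course be configured on the HAProxy box itself; a sketch with iproute2, where eth0 and the /24 prefix are assumptions matching the example addresses above:

$ sudo ip addr add 10.0.0.100/24 dev eth0
$ sudo ip addr add 10.0.0.101/24 dev eth0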

Let HAProxy manage TCP source ports


You can let HAProxy decide which source port to use when opening a new TCP connection, instead of the kernel. To address this topic, HAProxy has built-in functions which make it more efficient than a regular kernel at this job.

Let’s update the configuration above:

[...]
  server mysql1     10.0.0.1:3306 check source 10.0.0.100:1025-65000
  server mysql1_bis 10.0.0.1:3306 check source 10.0.0.101:1025-65000
[...]

We managed to get 170K+ connections in TIME_WAIT with 4 source IPs… and no source port exhaustion anymore !!!!
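Putting the pieces together, a read farm combining multiple source IPs with explicit source port ranges could look like the sketch below; again, the section name, IPs and weights are only examples:

listen mysql_read
  bind :3308
  mode tcp
  balance roundrobin
  server mysql1     10.0.0.1:3306 check source 10.0.0.100:1025-65000 weight 10
  server mysql1_bis 10.0.0.1:3306 check source 10.0.0.101:1025-65000 weight 10
  server mysql2     10.0.0.2:3306 check source 10.0.0.100:1025-65000 weight 20
  server mysql2_bis 10.0.0.2:3306 check source 10.0.0.101:1025-65000 weight 20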

Use a memcache


Fortunately, this customer’s developers are skilled and write flexible code 🙂
So they managed to move some requests from the MySQL DB to a memcache, opening far fewer connections.

Use MySQL persistent connections


This could prevent fine-grained load-balancing on the read-only farm, but it would be very efficient on the MySQL master server.

Conclusion

  • If you see some SOCKERR information messages in your HAProxy logs (mainly on health checks), you may be running out of TCP source ports (see the quick check after this list).
  • Have skilled developers who write flexible code, where moving from one DB to another is made easy.
  • This kind of issue can only happen with protocols or applications which make the client close the connection first.
  • This issue can’t happen with HAProxy in HTTP mode, since it either lets the server close the connection first or closes the server-side connection with a TCP RST.
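As a quick check for the first point, assuming your HAProxy logs end up in /var/log/haproxy.log (the exact path depends on your syslog configuration):

$ grep -c SOCKERR /var/log/haproxy.log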


5 thoughts on “HAProxy, high MySQL request rate and TCP source port exhaustion”

  1. Since HAProxy 1.4.19 it’s possible to send an RST on a TCP backend too, using “option nolinger”. It works fine…. but Aborted_clients will increase, of course!

  2. Hi Baptiste,
    Thank you for this article. TIME_WAIT sockets for MySQL connection is something that has been bugging me (actually my customers) for years.
    My analysis of this problem leads to the exact opposite conclusion: if you let the server side close the connection, the _server_ will end up with all the TIME_WAIT sockets. Put a few hundred client applications in the game and you will rapidly exhaust all the server’s available ports, eventually bringing down your MySQL box.
    All popular applications and protocols (ftp, ssh, telnet,…) work the same way: the client initiates the disconnection and gets the TIME_WAIT side of the socket. HAProxy acts as a “man in the middle”, both as a client and as a server if you will. So at some point, it will get these TIME_WAIT states, no matter what.
    Gilles.

    PS: I like this solution of multiple source IP
    PPS: I’d be very interested in getting your feedback on this. It’s a very interesting issue, whatever we put behind “interesting” 🙂

    1. Hi Gilles,

      You’re wrong on a point, when you say that “the server will exhaust all available ports”.
      I mean, this is wrong from a TIME_WAIT point of view.

      Let me copy/paste here some explanations about this issue made by Willy (Maintainer of HAProxy and networking expert):
      =================8<==================
      The purpose of the TIME_WAIT state is to ensure that if the last ACK
      is lost and the peer has to retransmit its FIN, this FIN will not be
      confused with one from a future session. Thus it prevents the 5-tuple
      (proto,src-ip,src-port,dst-ip,dst-port) from being reused during the
      TIME_WAIT delay. The official delay is 2 MSL (240 seconds) but nobody
      follows that now, except Solaris in the default install. Timers are
      more commonly 60-120 seconds. In practice, it's easier to remember
      that the TIME_WAIT state is on the side of the first one doing the
      close() or shutdown(SHUT_WR). If both sides close simultaneously, it
      is possible to have a TIME_WAIT on both sides.

      If a client in TIME_WAIT state would reuse a source port early, before
      the server would get the ACK, what could happen is that the retransmitted
      FIN from the server would be mistaken as one for the new connection.
      TCP sequence numbers ensure this has a low probability of appearing,
      but in practice it does happen a lot, especially due to large windows.
      Most of the time a late packet ends up with a RST because it's not in the
      window, but if it falls into the window, you often see an ACK storm between
      the two sides who disagree on the exact sequence+ack numbers. That's why
      the TIME_WAIT delay must not be reduced too much. Our observations are
      that anything below 25 seconds can reliably trigger ACK storms.

      On the server side, there is a specificity : when the connection
      is in TIME_WAIT, the server knows it can reliably kill an old session
      because it's not the initiator, it receives a connection from a client which
      decided it could reuse the 5-tuple. For the server this is a proof
      that the client will not resend an old FIN, so the server accepts to kill
      the TIME_WAIT session and create a new one. This is only true if the
      sequence number of the new SYN packet is higher than the end of the previous
      window though, which explains why some rare misdesigned firewalls who
      randomize sequence numbers for "security" are often preventing connections from
      correctly establishing.

      Developers who don't know this tend to prefer to put the TIME_WAIT on
      the client side than on the server side because they prefer to see a
      million clients with one TIME_WAIT socket than one server with a million
      TIME_WAIT sockets. But this is the mistake. TIME_WAIT do not carry any data
      anymore, and they're extremely cheap, around 56-84 bytes on Linux depending on
      whether you're in IPv4 or IPv6. And validating them is very cheap as
      well so the cost is almost null. My record was 5.5 million TIME_WAIT
      sockets in a benchmark, with no measurable performance impact.

      But once you realize that client-side source ports are scarce, that
      changes the view. A typical linux client uses source ports 32768-61000,
      that's 28232 ports. With a 60 second TIME_WAIT, that's at most 470 connections per
      second.
      This is very low. By moving the TIME_WAIT to the server side, there's
      no more such limit, and the connection rate can easily achieve 100000/sec if
      needed.
      =================8<==================

      So 2 options:
      – you let the server close the connection first
      – the client closes the connection with an RST instead of a FIN

      HAProxy implements the second solution for the HTTP protocol, and we can easily reach 100k conn/s per HAProxy box without hitting any resource exhaustion on HAProxy nor on the servers behind it.

      Baptiste

  3. Imagine we have 2 backend servers; can I use the “source” directive with multiple IPs to connect to these 2 backend servers?
    For example:

    server myapp-A 10.10.10.11:9999 check source 10.10.10.1
    server myapp-B 10.10.10.12:9999 check source 10.10.10.2

    Does this method work?

  4. Hi,

    question 1: why not connect the web servers and MySQL directly? Then there would be only 2 connection pools…
    question 2: what if the concurrent user count goes beyond 65535? The connections between HAProxy and the web servers easily go beyond 65535; in other words, can each HAProxy only support 65535 concurrent users online??
