Benchmarking SSL performance

Introduction

The story

Recently, there have been several attacks against websites aimed at stealing user identities. To protect their users, major website owners had to find a solution.
Unfortunately, improving security sometimes means degrading performance.

SSL/TLS is a popular way to improve data safety when data is exchanged over a network.
SSL/TLS encryption is used to protect any kind of data, from the login/password on a personal blog service to a company extranet, by way of an e-commerce shopping cart.
Recent attacks have shown that to protect users’ identities, all the traffic must be encrypted.

Note that SSL/TLS is not only used on websites: it can encrypt any TCP-based protocol such as POP, IMAP, SMTP, etc.

Why this benchmark?

At HAProxy Technologies, we build load-balancer appliances based on a Linux kernel, LVS (for layer 3/4 load-balancing), HAProxy (for layer 7 load-balancing) and stunnel (SSL encryption), for the main components.

  1. Since SSL/TLS is popular, we wanted to help people ask the right questions and make the right choice when they have to benchmark and choose SSL/TLS products.
  2. We wanted to explain how one can improve SSL/TLS performance by adding some functionality to open source SSL software.
  3. Lately, on the HAProxy mailing list, Sebastien introduced us to stud, a very good, but still young, alternative to stunnel. So we were curious to benchmark it.

SSL/TLS introduction

The theory

SSL/TLS can be a bit complicated at first sight.
Our purpose here is not to describe exactly how it works; there is useful reading available for that.

    SSL main lines

    Basically, there are two main phases in SSL/TLS:

    1. the handshake
    2. data exchange

    During the handshake, the client and the server generate three keys that are unique to this client and server, valid only for the lifetime of the session, and used by the symmetric algorithms to encrypt and decrypt data on both sides.
    Later in the article, we will use the term “symmetric key” for those keys.

    The symmetric key is never exchanged over the network. An ID, called the SSL session ID, is associated with it.

    Let’s have a look at the diagram below, which shows a basic HTTPS connection, step by step:

    SSL_handshake

    The diagram also shows the factors that might have an impact on performance.

    1. The client sends the server a Client Hello packet with some random numbers, its supported ciphers and, when resuming an SSL session, an SSL session ID.
    2. The server chooses a cipher from the client’s cipher list and sends a Server Hello packet, including a random number.
      It generates a new SSL session ID if resuming is not possible or not available.
    3. The server sends its public certificate to the client, and the client validates it against CA certificates.
      ==> this is why you may sometimes see warnings about self-signed certificates.
    4. The server sends a Server Hello Done packet to tell the client it has finished for now.
    5. The client generates the pre-master key and sends it to the server.
    6. The client and the server generate the symmetric key that will be used to encrypt data.
    7. The client and the server tell each other that the next packets will be sent encrypted.
    8. From now on, data is encrypted.
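
    The resume decision in steps 1 and 2 can be pictured with a toy model. This is only a sketch, not real TLS: the class name SessionCache, the method hello and the random placeholder bytes are assumptions made for illustration, and no actual key exchange takes place.

```python
import secrets


class SessionCache:
    """Toy server-side cache mapping an SSL session ID to key material."""

    def __init__(self):
        self._sessions = {}
        self.full_handshakes = 0
        self.resumed = 0

    def hello(self, session_id=None):
        """Handle a Client Hello and return (session_id, resumed)."""
        if session_id is not None and session_id in self._sessions:
            # Known ID: the cached key material is reused, no new key is generated.
            self.resumed += 1
            return session_id, True
        # Unknown or missing ID: full handshake, new ID and new key material.
        # Random bytes stand in for the keys a real handshake would derive.
        new_id = secrets.token_bytes(32)
        self._sessions[new_id] = secrets.token_bytes(48)
        self.full_handshakes += 1
        return new_id, False


if __name__ == "__main__":
    cache = SessionCache()
    sid, resumed = cache.hello()          # first connection: full handshake
    sid_again, resumed_again = cache.hello(sid)  # same ID: resumed
    print(resumed, resumed_again, cache.full_handshakes, cache.resumed)
```

    The expensive work happens only in the full-handshake branch, which is why the ability to resume sessions matters so much in the results later in this article.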

    SSL Performance

    As you can see on the diagram above, some factors may influence SSL/TLS performance on the server side:

    1. the server hardware, mainly the CPU
    2. the asymmetric key size
    3. the symmetric algorithm

    In this article, we’re going to study the influence of these 3 factors and observe their impact on performance.

    A few other things might have an impact on performance:

    • the ability to resume an SSL/TLS session
    • symmetric key generation frequency
    • size of the object to encrypt

    Benchmark platform

    We used the platform below to run our benchmark:

    SSL_benchmark_platform

    The SSL server purposely has much less capacity than the client, to ensure the client won’t saturate before the server.

    The client is inject + stunnel in client mode.
    The web server behind HAProxy and the SSL offloader is httpterm.

    Note: some results were checked using httperf and curl-loader, and they were similar.

    The server has 2 cores and, since hyper-threading is enabled, 4 CPUs are available from the kernel’s point of view.
    The e1000e driver of the server has been modified to be able to bind its interrupts to the first logical CPU (core 0).

    Last but not least, the SSL library used is OpenSSL 0.9.8.

    Benchmark purpose

    The purpose of this benchmark is to:

    • Compare the different operating modes of stunnel (fork, pthread, ucontext, ucontext + session cache)
    • Compare the different operating modes of stud (without and with session cache)
    • Compare stud and stunnel (without and with session cache)
    • Measure the impact of session renegotiation frequency
    • Measure the impact of asymmetric key size
    • Measure the impact of object size
    • Measure the impact of the symmetric cipher

    At the end of the document, we’re going to give some conclusions as well as some advice.

    As a standard test, we’re going to use the following:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte

    For each test, we’re going to provide the transactions per second (TPS) and the handshake capacity, which are the two most important numbers you need to know when comparing SSL accelerator products.

    • Transactions per second: the client will always re-use the same SSL session ID
    • Symmetric key generation (handshake capacity): the client will never re-use its SSL session ID, forcing the server to generate a new symmetric key for each request

    1. Influence of symmetric key generation frequency

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 1 core

    Note that the object is empty because we want to measure pure SSL performance.

    We’re going to benchmark the following software:
    - STNL/FORK: stunnel-4.39 in fork mode
    - STNL/PTHD: stunnel-4.39 in pthread mode
    - STNL/UCTX: stunnel-4.39 in ucontext mode
    - STUD/BUMP: stud github bumptech (github: 1846569)

    Symmetric key generation frequency   STNL/FORK   STNL/PTHD   STNL/UCTX   STUD/BUMP
    For each request                     131         188         190         261
    Every 100 requests                   131         487         490         261
    Never                                131         495         496         261

    Observation:

    - We can clearly see that STNL/FORK and STUD/BUMP can’t resume an SSL/TLS session: their numbers stay flat whatever the key generation frequency.
    - STUD/BUMP has better performance than STNL/* on symmetric key generation.
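
    One way to read the "Every 100 requests" row is as a blend of the two other rows. The sketch below is our own simple cost model, not something the benchmark tools compute: it assumes that one request out of N pays the cost of a full handshake and the others pay the cost of a resumed session. The function name blended_rate is hypothetical.

```python
def blended_rate(handshake_rate, resumed_rate, every):
    """Estimate requests/s when a new symmetric key is generated once every `every` requests.

    handshake_rate: requests/s measured with a new key for each request
    resumed_rate:   requests/s measured when the session is always re-used
    """
    full_share = 1.0 / every
    seconds_per_request = (
        full_share / handshake_rate + (1.0 - full_share) / resumed_rate
    )
    return 1.0 / seconds_per_request


if __name__ == "__main__":
    # STNL/PTHD figures from the table above: 188 (each request), 495 (never).
    print(round(blended_rate(188, 495, 100)))  # prints 487
```

    With the STNL/PTHD figures this estimate gives 487, the same as the measured value; for the other columns it lands close to the measurement but not exactly on it, so treat it as a rough guide only.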

    2. Advantage of caching SSL sessions

    For this test, we have developed patches for both stunnel and stud to improve a few things.
    The stunnel patches are applied on STNL/UCTX and include:
    - configurable listen queue
    - fix for a performance regression caused by logging
    - multi-process start-up management
    - session cache in shared memory

    The stud patches are applied on STUD/BUMP and include:
    - configurable listen queue
    - session cache in shared memory
    - fix to allow session resumption

    We’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 1 core

    Note that the patched versions will be called STNL/PATC and STUD/PATC respectively in the rest of this document.
    The percentage highlights the improvement of STNL/PATC over STNL/UCTX and of STUD/PATC over STUD/BUMP.

    Symmetric key generation frequency   STNL/PATC      STUD/PATC
    For each request                     246 (+29%)     261 (+0%)
    Every 100 requests                   1085 (+121%)   1366 (+423%)
    Never                                1129 (+127%)   1400 (+436%)

    Observation:

    - obviously, caching SSL sessions improves the number of transactions per second
    - the stunnel patches also improved stunnel performance
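
    The percentages in this table and the following ones can be re-derived from the raw numbers. A minimal sketch, with a hypothetical helper name; the published figures drop the decimals, so small differences in the last digit are expected.

```python
def improvement_percent(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # STNL/PATC vs STNL/UCTX, "Every 100 requests": 1085 vs 490.
    print(f"{improvement_percent(1085, 490):.1f}%")  # prints 121.4%
    # STUD/PATC vs STUD/BUMP, "Never": 1400 vs 261.
    print(f"{improvement_percent(1400, 261):.1f}%")  # prints 436.4%
```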

    3. Influence of CPU cores

    As seen in the previous test, we could improve TLS capacity by adding a symmetric key cache to both stud and stunnel.
    We still might be able to improve things :).

    For this test, we’re going to configure both stunnel and stud to use 2 CPU cores.
    The kernel will be bound to core 0, userland to core 1, and stunnel or stud to cores 2 and 3, as shown below:
    cpu_affinity_ssl_2_cores
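
    The exact commands used to set this up are not given in this article. As a minimal sketch, assuming Linux and Python 3, a process can be restricted to cores 2 and 3 with os.sched_setaffinity. The helper names pick_cpus and pin_current_process are hypothetical, and binding the network interrupts to core 0 is a separate step that this sketch does not cover.

```python
import os


def pick_cpus(wanted, available):
    """Return the wanted CPUs that actually exist, or all available CPUs as a fallback."""
    chosen = set(wanted) & set(available)
    return chosen if chosen else set(available)


def pin_current_process(wanted):
    """Restrict the current process to the wanted CPUs (Linux only).

    Returns the resulting CPU set, or None where the call is not supported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    available = os.sched_getaffinity(0)  # 0 means the calling process
    os.sched_setaffinity(0, pick_cpus(wanted, available))
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Cores 2 and 3 are the ones given to stunnel or stud in this test.
    print(pin_current_process({2, 3}))
```

    Child processes inherit the affinity mask, so pinning a launcher process before it starts the SSL offloader would have the same effect on the offloader.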

    For the rest of the tests, we’re going to benchmark only STNL/PTHD, which is the stunnel mode used by most Linux distributions, and the two patched versions, STNL/PATC and STUD/PATC.

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 2 cores

    The table below summarizes the numbers we get with 2 cores and the percentage of improvement over 1 core:

    Symmetric key generation frequency   STNL/PTHD    STNL/PATC      STUD/PATC
    For each request                     217 (+15%)   492 (+100%)    517 (+98%)
    Every 100 requests                   588 (+20%)   2015 (+85%)    2590 (+89%)
    Never                                602 (+21%)   2118 (+87%)    2670 (+90%)

    Observation:

    - now, we know the number of CPU cores has an influence ;)
    - symmetric key generation has doubled on the patched versions. STNL/PTHD takes little advantage of the second CPU core (+15% to +21%).
    - we can clearly see the benefit of SSL session caching on both STNL/PATC and STUD/PATC
    - STUD/PATC performs around 25% better than STNL/PATC

    Note that since STNL/FORK and STUD/BUMP have no SSL session cache, there is no need to test them any further.
    We’re going to concentrate on STNL/PTHD, STNL/PATC and STUD/PATC.

    4. Influence of the asymmetric key size

    The default asymmetric key size on current websites is usually 1024 bits. For security purposes, more and more engineers now recommend using 2048 bits or even 4096 bits.
    In the following test, we’re going to observe the impact of the asymmetric key size on SSL performance.

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 2048 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 2 cores

    The table below summarizes the numbers we got with a 2048-bit asymmetric key; the percentage highlights the performance impact compared to the 1024-bit asymmetric key, both tests running on 2 CPU cores:

    Symmetric key generation frequency   STNL/PTHD    STNL/PATC      STUD/PATC
    For each request                     46 (-78%)    96 (-80%)      96 (-81%)
    Every 100 requests                   541 (-8%)    1762 (-13%)    2121 (-18%)
    Never                                602 (+0%)    2118 (+0%)     2670 (+0%)

    Observation:

    - the asymmetric key size only influences symmetric key generation. The number of transactions per second does not change at all for software that can cache and re-use the SSL session ID.
    - going from 1024 to 2048 bits divides the number of symmetric keys generated per second by about 5 in our environment (-78% to -81% in the table above).
    - on average traffic with a renegotiation every 100 requests, stud is more affected than stunnel, but it still performs better.
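
    The cost of the asymmetric part of the handshake is dominated by a modular exponentiation on numbers as large as the key. The sketch below is only an illustration of how that cost grows with the key size: it times Python’s built-in pow on random numbers of the right size, which is an assumption of ours and not how OpenSSL implements RSA, so the ratio it prints will differ from the benchmark figures. The function names are hypothetical.

```python
import secrets
import time


def random_odd(bits):
    """Random odd integer with the top bit set (a stand-in, not a real RSA modulus)."""
    return secrets.randbits(bits) | (1 << (bits - 1)) | 1


def modexp_seconds(bits, repeats=5):
    """Best-of-`repeats` time for one modular exponentiation at the given size."""
    modulus = random_odd(bits)
    exponent = random_odd(bits)          # stand-in for a private exponent
    message = secrets.randbits(bits - 1)  # always smaller than the modulus
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pow(message, exponent, modulus)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    t1024 = modexp_seconds(1024)
    t2048 = modexp_seconds(2048)
    print(f"1024 bits: {t1024 * 1000:.2f} ms, 2048 bits: {t2048 * 1000:.2f} ms")
    print(f"ratio: {t2048 / t1024:.1f}x")
```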

    5. Influence of the object size

    If you have read the article carefully since the beginning, you might be thinking “their tests are nice, but their objects are empty… what happens with real objects?”
    So, I guess it’s time to study the impact of the object size!

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 1 KByte / 4 KBytes
    CPU: 2 cores

    Results for STNL/PTHD, STNL/PATC and STUD/PATC:
    The percentage number highlights the performance impact.

    Symmetric key generation frequency   STNL/PTHD          STNL/PATC            STUD/PATC
                                         1KB   4KB          1KB    4KB           1KB    4KB
    Every 100 requests                   582   554 (-5%)    1897   1668 (-13%)   2475   2042 (-21%)
    Never                                595   564 (-5%)    1997   1742 (-14%)   2520   2101 (-19%)

    Observation:

    - the bigger the object, the lower the performance…
    To be fair, we’re not surprised by this result ;)
    - STUD/PATC performs 20% better than STNL/PATC
    - STNL/PATC performs 3 times better than STNL/PTHD

    6. Influence of the cipher

    Since the beginning, we have run our benchmark only with the AES256-SHA cipher.
    It’s now time to benchmark some other ciphers:
    - first, let’s try AES128-SHA and compare it to AES256-SHA
    - second, let’s try RC4_128-SHA and compare it to AES128-SHA

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA / AES128-SHA / RC4_128-SHA
    Object size: 4 KBytes
    CPU: 2 cores

    Results for STNL/PTHD, STNL/PATC and STUD/PATC:
    The percentage highlights the performance impact of each cipher compared to the previous one:
    - AES128 compared to AES256
    - RC4_128 compared to AES128

    Symmetric key generation frequency   STNL/PTHD                        STNL/PATC                           STUD/PATC
                                         AES256   AES128      RC4_128     AES256   AES128       RC4_128       AES256   AES128       RC4_128
    Every 100 requests                   554      567 (+2%)   586 (+3%)   1668     1752 (+5%)   1891 (+8%)    2042     2132 (+4%)   2306 (+8%)
    Never                                564      572 (+1%)   600 (+5%)   1742     1816 (+4%)   1971 (+8%)    2101     2272 (+8%)   2469 (+8%)

    Observation:

    - As expected, AES128 performs better than AES256
    - RC4 128 performs better than AES128
    - stud performs better than stunnel
    - Note that RC4 will perform better on big objects, since it works on a stream while AES works on blocks

    Conclusion on SSL performance

    1.) remember to ask for these 2 numbers when comparing SSL products:
    - the number of handshakes per second
    - the number of transactions per second (aka TPS).

    2.) if the product is not able to resume SSL sessions (by caching the SSL session ID), just forget it!
    It won’t perform well and is not scalable at all.

    Note that having a load-balancer which is able to maintain affinity based on SSL session ID is really important. You can understand why now.

    3.) bear in mind that the asymmetric key size may have a huge impact on performance.
    Of course, the bigger the asymmetric key, the harder it will be for an attacker to break the generated symmetric key.

    4.) stud is young, but seems promising.
    By the way, stud has included the HAProxy Technologies patches from @emericbr, so if you use a recent stud version, you may get the same results as us.

    5.) well, let’s read the results again… If we consider that your users renegotiate every 100 requests and that the average object size you want to encrypt is 4K, you could get 2300 SSL transactions per second on a small Intel Atom @1.66GHz!
    Imagine what you could do with a dual CPU Core i7!
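
    To put that figure in perspective, it can be converted into payload bandwidth. This is a back-of-the-envelope sketch: it assumes that 4K means 4096 bytes and counts only the object payload, ignoring HTTP headers and TLS record overhead. The function name is hypothetical.

```python
def payload_mbit_per_s(transactions_per_s, object_bytes):
    """Payload bandwidth in Mbit/s for a given transaction rate and object size."""
    return transactions_per_s * object_bytes * 8 / 1_000_000


if __name__ == "__main__":
    # STUD/PATC with RC4_128, renegotiation every 100 requests: 2306 TPS.
    print(f"{payload_mbit_per_s(2306, 4096):.1f} Mbit/s")  # prints 75.6 Mbit/s
```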

    By the way, we’re glad that the stud developers have integrated our patches into main stud branch:


About Baptiste Assmann

Aloha Product Manager

21 Responses to Benchmarking SSL performance

  1. Aleksandar Lazic says:

    Have you also thought to use nginx as ssl terminator in front of haproxy?

  2. Ann E. Mouse says:

    I know this is not directly related to your benchmark but can you share how you set he kernel to one cpu and other userland to another? Thanks for all the work!

  5. Zerivael says:

    great article, exactly what I was looking for since lately I had problems with stunnel-haproxy performance and wasn’t sure how to improve it

  6. It might be useful to also include ciphersuites negotiated by the software. For example some software may be able to provide perfect forward secrecy (with DH or ECDH), while the other one may not be able to provide this level of security. They’re apples and oranges. To get a fair comparison, you may wish to force a specific ciphersuite supported by all of the tested software.

  8. Joao Pedro says:

    Excellent article. Kudos :)

    One thing that wasn’t very clear in the article is how does the “Symmetric key generation frequency” is set. How did you generate a new key every 100 requests?

    Thank you.

    Joao
