Category Archives: benchmark

Serving ECC and RSA certificates on the same IP with HAProxy

ECC and RSA certificates and HTTPS

To keep this practical, we will not go into the theory of ECC or RSA certificates. Let’s just mention that ECC certificates can provide as much security as RSA with much smaller key sizes, which means much lower computational requirements on the server side. Sadly, many clients do not support ECC-based cipher suites, so to maintain compatibility as well as provide good performance, we need to detect which type of certificate the client supports and serve the right one.

The above is usually achieved by analyzing the cipher suites sent by the client in the ClientHello message at the start of the SSL handshake, but we’ve opted for a much simpler approach that works very well with all modern browsers (clients).

Prerequisites

First you will need to obtain both RSA and ECC certificates for your web site; check the documentation of the registrar you are using. Once the certificates have been issued, make sure you download the appropriate intermediate certificates and build the bundle files for HAProxy to read.

To be able to use the required sample fetch, you will need at least HAProxy 1.6-dev3 (not yet released at the time of writing), or you can clone the latest HAProxy from the git repository. The feature was introduced in commit 5fc7d7e.

Configuration

We will use chaining in order to achieve the desired functionality. You can use abstract sockets on Linux to get even more performance, but note the drawbacks described in the HAProxy documentation.

frontend ssl-relay
    mode tcp
    bind 0.0.0.0:443
    use_backend ssl-ecc if { req.ssl_ec_ext 1 }
    default_backend ssl-rsa

backend ssl-ecc
    mode tcp
    server ecc unix@/var/run/haproxy_ssl_ecc.sock send-proxy-v2

backend ssl-rsa
    mode tcp
    server rsa unix@/var/run/haproxy_ssl_rsa.sock send-proxy-v2

listen all-ssl
    bind unix@/var/run/haproxy_ssl_ecc.sock accept-proxy ssl crt /usr/local/haproxy/ecc.www.foo.com.pem user nobody
    bind unix@/var/run/haproxy_ssl_rsa.sock accept-proxy ssl crt /usr/local/haproxy/www.foo.com.pem user nobody
    mode http
    server backend_1 192.168.1.1:8000 check

The whole configuration revolves around the newly implemented sample fetch req.ssl_ec_ext, which detects the presence of the Supported Elliptic Curves extension inside the ClientHello message. This extension is defined in RFC 4492, and according to it, it SHOULD be sent in every ClientHello message by a client supporting ECC. We have observed that all modern clients send it correctly.

If the extension is detected, the connection is sent through a unix socket to the frontend that serves the ECC certificate. If not, the regular RSA certificate is served.
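To check which certificate a client actually receives, here is a minimal sketch using the openssl CLI (the host name is a placeholder; with OpenSSL, offering only an RSA cipher suite should also omit the elliptic curves extension, so each command should get a different certificate):

# ECDSA cipher suite offered: the ECC certificate should be returned
openssl s_client -connect www.foo.com:443 -cipher ECDHE-ECDSA-AES256-SHA </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -text | grep 'Public Key Algorithm'

# RSA-only cipher suite offered: the RSA certificate should be returned
openssl s_client -connect www.foo.com:443 -cipher AES256-SHA </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -text | grep 'Public Key Algorithm'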

Benchmark

We will publish full HAProxy benchmarks in the near future, but for the sake of comparison, let us look at the difference measured on an E5-2680 v3 CPU with OpenSSL 1.0.2.

256-bit ECDSA:
              sign     verify     sign/s   verify/s
           0.0000s    0.0001s    24453.3     9866.9

2048-bit RSA:
              sign     verify     sign/s   verify/s
         0.000682s  0.000028s     1466.4    35225.1
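Figures like these come from OpenSSL’s built-in benchmark; presumably something along these lines was run:

openssl speed ecdsap256
openssl speed rsa2048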

As you can see from the sign/s column, we are getting over 15 times the performance with ECDSA256 compared to RSA2048.

When marketing spends money that does not go into innovation

Load balancing is not easy!

I usually have great respect for our competitors, whom I would rather call peers, and I have even privately helped at least two of them. Being at the origin of the most widespread open source load balancer, which gave birth to the Aloha, I am very well placed to know how difficult this task is. So I feel a certain admiration for anyone who successfully takes this path, and especially for those who manage to enrich their products over the long term, something even harder than building one from scratch.

That is why I sometimes remind my colleagues that we must never mock our peers when very unpleasant things happen to them, such as leaving a root SSH key lying around on their appliances, because this happens even to the biggest players, because we are not immune to a similar blunder despite all the care put into each new release, and because faced with such a situation we would be just as embarrassed.

However, one of them does not seem to know these rules of good conduct, probably because he does not know the value of the research and development work that goes into his competitors’ products. I will not name him, as that would amount to giving him free advertising, which is his only specialty.

Indeed, ever since this competitor received $16M in funding, he has kept disparaging our products, as well as those of several of our peers, to our partners, completely gratuitously and without any basis (what English speakers call “FUD”), just to try to win the deal.

I think their primary motivation is probably bitterness at having seen their product systematically eliminated by our customers in comparative tests on the criteria of ease of integration, performance and stability. This is the most likely cause, given that this competitor attacks several vendors at once, and that in the field we ourselves encounter peers offering quality products, such as Cisco and F5 for hardware appliances, or load-balancer.org for software. And of course there is also that aggressive competitor whom I am not naming.

It certainly cannot be pleasant for him to lose the tests every time, but when we lose a test against a peer, which happens to us as it does to everyone, we strive to improve our product on the criterion that failed us, with the goal of winning next time, instead of investing heavily in disinformation campaigns against the winner.

In my opinion, this competitor does not know where to start improving his solution, which explains why he attacks several competitors at the same time. I think it would raise the level of the debate a bit to give him a few basics for improving his solutions, to give him a slightly better chance of winning a deal.

For a start, he would save time by looking at how our product works and drawing inspiration from it. I am really not in the habit of copying the competition, and I much prefer innovation. But for them it should not be a problem, since they already picked the same hardware as our mid-range, except that they changed its color, preferring that of cuckolds. The poor souls did not know that good hardware is not everything, and that the software and the quality of the integration count enormously (otherwise all VMs would be identical). That is how they ended up with their product line shifted compared to ours: they systematically need the bigger box to reach a comparable level of performance (I am talking about real performance measured in the field with the customers’ applications and the customers’ test methodologies, not the datasheet numbers found on their website, which customers do not care about).

And yes, gentlemen, you should also take a look at the software side, the real know-how of a vendor. Using a generic, desktop-oriented Linux distribution to build a network-optimized appliance was not very clever, but skipping system tuning altogether is sheer laziness. Instead of wasting your time looking for optimization guides on the Internet, just take the settings straight from our Aloha: you know they are good, and you will save time. No doubt some of the settings will not exist on your side, since we constantly add features to better meet customer needs, but it will be a good start. Do not count on us to wait for you, though: while you copy us, we innovate, and we will always keep that head start :-). But at least you will look less ridiculous in pre-sales and will avoid embarrassing your partners in front of the customer with a product that still does not work after 6 hours spent on a simple test.

I am deliberately publishing this article in French. It will give them a little translation exercise, which will be useful for establishing themselves in the French market, where customers are very demanding about the use of the French language, which, by the way, their support still does not speak.

Ah, one last point: I invite all readers of this article to search for “Exceliance” on Google, for example by clicking on this link: http://www.google.com/search?q=exceliance

You will notice that our favorite competitor has even gone as far as paying for Google AdWords to display his ads when people search for our name; he must really have it in for us! He is the only one putting so much effort into trying to steal our limelight, as if it were absolutely strategic for him. You will not see that from A10, Brocade, F5 or Cisco (nor from Exceliance, of course): each of these products has real strengths in the field and does not need to resort to such methods to exist. Do click on their ad link, it will make them happy, it costs them a little with each click, and it will give you the chance to admire their beautiful products :-).


Hypervisor virtual network performance comparison from a virtualized load-balancer point of view

Introduction

At HAProxy Technologies, we develop and sell a load-balancer appliance called ALOHA (which stands for Application Layer Optimisation and High-Availability).
A few months ago, we managed to make it run on the most common hypervisors available:

  • VMware (ESX, vSphere)
  • Citrix XenServer
  • Microsoft Hyper-V
  • open source Xen
  • KVM

<ADVERTISEMENT> So whatever your hypervisor is, you can run an Aloha on top of it 🙂 </ADVERTISEMENT>

Since a load-balancer appliance is network-IO intensive, we thought this was a good opportunity to bench each hypervisor from a virtual network performance point of view.

Well, more and more companies use virtualization in their infrastructures, so we guessed that a lot of people would be interested in the results of this bench. That’s why we decided to publish them on our blog.

Things to bear in mind about virtualization

One of the interesting features of virtualization is the ability to consolidate several servers onto a single piece of hardware.
As a consequence, the resources (CPU, memory, disk and network IOs) are shared between several virtual machines.
Another issue to take into account is that the hypervisor is like a new “layer” between the hardware and the OS inside the VM, which means that it may have an impact on performance.

Purpose of benchmarking Hypervisors

First of all: WE ARE TOTALLY NEUTRAL AND HAVE NO INTEREST IN SAYING GOOD OR BAD THINGS ABOUT ANY HYPERVISOR.

Our main goal here is to check whether each hypervisor performs well enough to allow us to sell our virtual appliance on top of it.
From the tests we run, we want to be able to measure the impact of each hypervisor on the virtual machine’s performance.

Benchmark platform and procedure

To run these tests, we use the same server for all hypervisors, just swapping the hard drive, in order to run each hypervisor independently.

The hypervisor hardware is summarized below:

  • CPU quad core i7 @3.4GHz
  • 16G of memory
  • Network card 1G copper e1000e

NOTE: we benched some other network cards and got UGLY results (cf. the conclusion).
NOTE: there is a single VM running on the hypervisor: the Aloha.

The Aloha Virtual Appliance used is the Aloha VA 4.2.5 with 1G of memory and 2 vCPUs.
The client and WWW servers are physical machines plugged into the same LAN as the hypervisor.
The client tool is inject and the web server behind the Aloha VA is httpterm.
So basically, the only thing that changes during these tests is the hypervisor.

The Aloha is configured in reverse-proxy mode (using HAProxy) between the client and the server, load-balancing and analyzing HTTP requests.
We focused mainly on virtual networking performance: the number of HTTP connections per second and the associated bandwidth.
We ran the benchmark with different object sizes: 0, 1K, 2K, 4K, 8K, 16K, 32K, 48K, 64K.
NOTE: by “HTTP connection”, we mean a single HTTP request with its response over a single TCP connection, as in HTTP/1.0.

Basically, the 0K object test is used to get the number of connections per second the VA can handle, and the 64K object test is used to measure the maximum bandwidth.
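As an aside, httpterm lets the client choose the response size through a query-string parameter, which is how such object-size sweeps are usually driven. A quick hedged sanity check (the VIP address is a placeholder, and the “s” parameter assumes httpterm’s usual convention):

# Fetch the 0K object (connection rate test) and the 64K object (bandwidth test)
curl -o /dev/null -s -w "%{size_download} bytes in %{time_total}s\n" "http://192.168.1.20/?s=0k"
curl -o /dev/null -s -w "%{size_download} bytes in %{time_total}s\n" "http://192.168.1.20/?s=64k"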

NOTE: the maximum bandwidth will be 1G anyway, since we are limited by the physical NIC.

We are going to bench network IO only, since this is the intensive usage a load-balancer makes of a machine.
We won’t bench disk IOs.

Tested Hypervisors


We benched a native Aloha against the Aloha VA running on each of the hypervisors listed below:

  • Hyper-V
  • RHEV (KVM based)
  • vSphere 5.0
  • Xen 4.1 on Ubuntu 11.10
  • XenServer 6.0

Benchmark results


Raw server performance (native tests, without any hypervisor)

For the first test, we ran the Aloha on the server itself, without any hypervisor.
That way, we have some figures on the capacity of the server itself. We’ll use those numbers later in the article to measure the impact of each hypervisor on performance.

[Figure: native performance]

Microsoft Hyper-V


We tested Hyper-V on Windows Server 2008 R2.
Two network cards are available for this hypervisor:

  1. Legacy network adapter: emulates the network layer through the tulip driver.
    ==> With this driver, we got around 1.5K requests per second, which is really poor…
  2. Network adapter: requires the hv_netvsc driver, released as open source by Microsoft and included since Linux kernel 2.6.32.
    ==> This is the driver we used for the tests.

[Figure: Hyper-V performance]

RHEV 3.0 Beta (KVM based)

RHEV is Red Hat’s hypervisor, based on KVM.
For the virtualization of the network layer, RHEV uses the virtio drivers.
Note that RHEV was still in beta when we ran this test.

VMware vSphere 5

There are 3 types of network cards available for vSphere 5.0:
1. Intel e1000: e1000 driver, emulates the network layer inside the VM.
2. VMXNET 2: virtualized network layer.
3. VMXNET 3: virtualized network layer.
The best results were obtained with the vmxnet2 driver.

Note: we tested neither vSphere 4 nor ESX 3.5.

[Figure: vSphere 5.0 performance]

Xen OpenSource 4.1 on Ubuntu 11.10

Since CentOS 6.0 does not provide open source Xen in its official repositories, we decided to use the latest Ubuntu server distribution (Oneiric Ocelot, 11.10), with Xen 4.1 on top of it.
Xen provides two network interfaces:

  1. an emulated one, based on the 8139too driver
  2. a virtualized network layer, xen-vnif

Of course, the results are much better with xen-vnif, so this is the driver we used for the tests.

[Figure: Xen 4.1 performance]

Citrix XenServer 6.0


The network driver used for XenServer is the same as for open source Xen: xen-vnif.

[Figure: XenServer 6.0 performance]

Hypervisors comparison


HTTP connections per second


The graph below summarizes the HTTP connections per second each hypervisor can achieve.
It shows the hypervisor overhead: compare the light blue line, which represents the server capacity without any hypervisor, to each hypervisor’s line.

[Figure: HTTP connections per second comparison]

Bandwidth usage


The graph below summarizes the bandwidth each hypervisor can achieve.
It shows the hypervisor overhead: compare the light blue line, which represents the server capacity without any hypervisor, to each hypervisor’s line.

[Figure: bandwidth comparison]

Performance loss


Well, comparing hypervisors to each other is nice, but remember, we wanted to know how much performance is lost in the hypervisor layer.
The graph below shows, as a percentage, the loss generated by each hypervisor compared to the native Aloha.
The higher the percentage, the worse for the hypervisor…

[Figure: performance loss comparison]

Conclusion

  • The hypervisor layer has a non-negligible impact on the networking performance of a virtualized load-balancer running in reverse-proxy mode.
    We guess it would be the same for any VM doing intensive network IO.
  • The shorter the connections, the bigger the impact.
    For very long connections (TSE, IMAP, etc.), virtualization might make sense.
  • vSphere seems ahead of its competitors from a performance point of view.
  • Hyper-V and Citrix XenServer deliver interesting performance.
  • RHEV (KVM) and open source Xen can still improve their performance.
    Unless this is related to our procedure.
  • Even if the VM no longer accesses the hardware layer directly, hardware still has a huge impact on performance.
    E.g., on vSphere, we could not go higher than 20K connections per second with a Realtek NIC in the server,
    while with an Intel e1000e NIC we got up to 55K connections per second.
    So even when you use virtualization, hardware counts!


Scaling out SSL

Synopsis

We’ve seen recently how to scale up SSL performance.
But what about scaling out SSL performance?
Well, thanks to the Aloha and HAProxy, it’s easy to smartly manage a farm of SSL accelerator servers, using persistence based on the SSL session ID.
This way of load-balancing is smart, but if an SSL accelerator fails, the other servers in the farm take a CPU overhead to generate SSL session IDs for the sessions the Aloha re-balances to them.

After a talk with (the famous) emericbr, HAProxy Technologies’ dev team leader, he decided to write a patch for stud adding a new feature: sharing SSL sessions between different stud processes.
That way, if an SSL accelerator fails, the servers receiving the re-balanced sessions do not have to generate new SSL sessions.

Emericbr’s patch is available here: https://github.com/bumptech/stud/pull/50
At the end of this article, you’ll learn how to use it.

Stud SSL Session shared caching

Description

As we’ve seen in our article on SSL performance, a good way to improve SSL performance is to use an SSL session ID cache.

The idea here is to keep this local cache, while also pushing updates to a shared cache that any process can query to retrieve an SSL session ID and the data associated with it.

As a consequence, there are 2 levels of cache:

      * Level 1: the local process cache, holding the currently used SSL sessions
      * Level 2: the shared cache, holding the SSL sessions from all the local caches

Way of working

The protocol understands 3 types of requests:

      * New: when a process generates a new session, it updates its local cache, then the shared cache
      * Get: when a client tries to resume a session that the receiving process is not aware of, the process tries to fetch it from the shared cache
      * Del: when a session has expired, or a client sent a bad SSL session ID, the process deletes the session from the shared cache

Who does what?

Stud has a father/son architecture.
The father starts up, then spawns the sons. The sons bind the external TCP ports, load the certificate and process the SSL requests.
Each son manages its local cache and sends its updates to the shared cache. The father manages the shared cache, receiving the changes and keeping it up to date.

How are the updates exchanged?

Updates are sent in unicast or multicast on a specified UDP port, and work over both IPv4 and IPv6.
Each packet is signed with an encrypted signature using the SSL certificate, to prevent cache poisoning.

What does a packet look like?


SSL Session ID   ASN-1 encoded SSL session structure   Timestamp   Signature
[32 bytes]       [max 512 bytes]                       [4 bytes]   [20 bytes]

Note: the SSL Session ID field is padded with 0 if required.

Diagram

Let’s show this on a nice picture, where each potato represents a process’s memory area.
[Figure: stud shared cache diagram]
Here, the son on host 1 received a new SSL connection to process. Since it could not find the session in its local cache or in the shared cache, it performed the full handshake and generated the session’s symmetric key, then pushed the session to its shared cache and to the father on host 2, which updated the shared cache on its host.
That way, if this user is later routed to any stud son process, the key will not have to be computed again.

Let’s try Stud shared cache

Installation:

git clone https://github.com/EmericBr/stud.git
cd stud
wget http://1wt.eu/tools/ebtree/ebtree-6.0.6.tar.gz
tar xvzf ebtree-6.0.6.tar.gz
ln -s ebtree-6.0.6 ebtree
make USE_SHARED_CACHE=1

Generate a key and a certificate, and put them in a single PEM file.
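For a quick test, a self-signed certificate is enough. A minimal sketch (the subject is a placeholder):

# Create a self-signed key + certificate, then bundle them into one PEM file
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=test.example.com" -keyout key.pem -out crt.pem
cat key.pem crt.pem > cert.pem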

Now you can run stud:

sudo ./stud -n 2 -C 10000 -U 10.0.3.20,8888 -P 10.0.0.17 -f 10.0.3.20,443 -b 10.0.0.3,80 cert.pem
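A hedged reading of the options above: -n sets the number of son processes, -C the session cache size, -U the local address and UDP port for cache updates, -P the peer to exchange updates with, -f the frontend (listening) address and -b the backend to forward decrypted traffic to; check the patched stud’s usage output for the authoritative meanings.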

and run a test:

curl --noproxy '*' --insecure -D - https://10.0.3.20:443/

And you can watch the synchronization packets:

$ sudo tcpdump -n -i any port 8888
[sudo] password for bassmann: 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes

17:47:10.557362 IP 10.0.3.20.8888 > 10.0.0.17.8888: UDP, length 176
17:49:04.592522 IP 10.0.3.20.8888 > 10.0.0.17.8888: UDP, length 176
17:49:05.476032 IP 10.0.3.20.8888 > 10.0.0.17.8888: UDP, length 176


Benchmarking SSL performance

Introduction

The story

Recently, there have been attacks against websites aiming to steal user identities. In order to protect their users, major website owners had to find a solution.
Unfortunately, we know that improving security sometimes means downgrading performance.

SSL/TLS is a fashionable way to improve data safety when data is exchanged over a network.
SSL/TLS encryption can protect any kind of data, from the login/password of a personal blog to a company extranet or an e-commerce shopping cart.
Recent attacks have shown that to protect users’ identities, all the traffic must be encrypted.

Note that SSL/TLS is not only used on websites; it can encrypt any TCP-based protocol, such as POP, IMAP, SMTP, etc.

Why this benchmark?

At HAProxy Technologies, we build load-balancer appliances whose main components are a Linux kernel, LVS (for layer 3/4 load-balancing), HAProxy (for layer 7 load-balancing) and stunnel (for SSL encryption).

  1. Since SSL/TLS is fashionable, we wanted to help people ask the right questions and make the right choices when they have to bench and choose SSL/TLS products.
  2. We wanted to explain how one can improve SSL/TLS performance by adding some functionality to open source SSL software.
  3. Lately, on the HAProxy mailing list, Sebastien introduced us to stud, a very good, but still young, alternative to stunnel. So we were curious to bench it.

SSL/TLS introduction

The theory

SSL/TLS can look a bit complicated at first sight.
Our purpose here is not to describe exactly how it works; there are useful readings for that.

    SSL main lines

    Basically, there are two main phases in SSL/TLS:

    1. the handshake
    2. data exchange

    During the handshake, the client and the server generate keys which are unique to this client/server pair, are available for the lifetime of the session only, and will be used by the symmetric algorithms to encrypt and decrypt data on both sides.
    Later in the article, we will use the term “symmetric key” for these keys.

    The symmetric key is never exchanged over the network. An ID, called the SSL session ID, is associated with it.
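    As a quick illustration, session resumption can be observed with the openssl CLI by saving the session on a first connection and presenting it again on a second one (a sketch; the address is a placeholder):

    # First connection: full handshake, session saved to a file
    openssl s_client -connect 10.0.0.1:443 -sess_out sess.pem < /dev/null | grep -E '^(New|Reused)'
    # Second connection: a "Reused" line means the server resumed the session
    openssl s_client -connect 10.0.0.1:443 -sess_in sess.pem < /dev/null | grep -E '^(New|Reused)'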

    Let’s have a look at the diagram below, which shows a basic HTTPS connection, step by step:

    [Figure: SSL handshake diagram]

    We have also represented on the diagram the factors which might have an impact on performance.

    1. The client sends the server a ClientHello packet with some random numbers, its supported ciphers, and an SSL session ID when resuming an SSL session.
    2. The server chooses a cipher from the client’s cipher list and sends back a ServerHello packet, including a random number.
      It generates a new SSL session ID if resuming is not possible or not requested.
    3. The server sends its public certificate to the client, and the client validates it against its CA certificates.
      ==> this is where you may get warnings about self-signed certificates.
    4. The server sends a ServerHelloDone packet to tell the client it has finished for now.
    5. The client generates the pre-master key and sends it to the server.
    6. The client and the server generate the symmetric key that will be used to encrypt data.
    7. The client and the server tell each other that the next packets will be sent encrypted.
    8. From now on, data is encrypted.

    SSL Performance

    As you can see on the diagram above, several factors may influence SSL/TLS performance on the server side:

    1. the server hardware, mainly the CPU
    2. the asymmetric key size
    3. the symmetric algorithm

    In this article, we’re going to study the influence of these factors and observe their impact on performance.

    A few other things might have an impact on performance:

    • the ability to resume an SSL/TLS session
    • the symmetric key generation frequency
    • the size of the objects to encrypt

    Benchmark platform

    We used the platform below to run our benchmark:

    [Figure: SSL benchmark platform]

    The SSL server purposely has much less capacity than the client, in order to ensure the client won’t saturate before the server.

    The client is inject + stunnel in client mode.
    The web server behind HAProxy and the SSL offloader is httpterm.

    Note: some results were double-checked with httperf and curl-loader, and were similar.

    On the server, we have 2 cores, and since hyper-threading is enabled, the kernel sees 4 CPUs.
    The server’s e1000e driver has been modified to bind its interrupts to the first logical CPU, core 0.
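    As a side note, the stock way to pin a NIC’s interrupts to a given core, without patching the driver, is the kernel’s smp_affinity interface; a hedged sketch (the IRQ number 42 is a placeholder to look up in /proc/interrupts):

    # Find the NIC's IRQ, then allow it only on CPU0 (bitmask 0x1)
    grep eth /proc/interrupts
    echo 1 > /proc/irq/42/smp_affinity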

    Last but not least, the SSL library used is OpenSSL 0.9.8.

    Benchmark purpose

    The purpose of this benchmark is to:

    • compare the different modes of operation of stunnel (fork, pthread, ucontext, ucontext + session cache)
    • compare the different modes of operation of stud (without and with a session cache)
    • compare stud and stunnel (without and with a session cache)
    • measure the impact of the session renegotiation frequency
    • measure the impact of the asymmetric key size
    • measure the impact of the object size
    • measure the impact of the symmetric cipher

    At the end of the document, we’re going to draw some conclusions and give some advice.

    As a standard test, we’re going to use the following:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte

    For each test, we’re going to provide the transactions per second (TPS) and the handshake capacity, which are the two most important numbers you need to know when comparing SSL accelerator products.

    • Transactions per second: the client always re-uses the same SSL session ID
    • Symmetric key generation: the client never re-uses its SSL session ID, forcing the server to generate a new symmetric key for each request
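    As an aside, OpenSSL’s built-in s_time tool gives a rough feel for these two numbers on any SSL endpoint (a sketch, not the methodology used in this benchmark; the address is a placeholder):

    # Handshake capacity: force a full handshake on every connection
    openssl s_time -connect 10.0.0.1:443 -new -time 30
    # Transactions per second: re-use the SSL session ID
    openssl s_time -connect 10.0.0.1:443 -reuse -time 30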

    1. From the influence of symmetric key generation frequency

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 1 core

    Note that the object is empty because we want to measure pure SSL performance.

    We’re going to bench the following software:
    STNL/FORK: stunnel-4.39 mode fork
    STNL/PTHD: stunnel-4.39 mode pthread
    STNL/UCTX: stunnel-4.39 mode ucontext
    STUD/BUMP: stud github bumptech (github: 1846569)

    Symmetric key generation frequency   STNL/FORK   STNL/PTHD   STNL/UCTX   STUD/BUMP
    For each request                         131         188         190         261
    Every 100 requests                       131         487         490         261
    Never                                    131         495         496         261

    Observation:

    – We can clearly see that STNL/FORK and STUD/BUMP can’t resume an SSL/TLS session.
    – STUD/BUMP performs better than STNL/* on symmetric key generation.

    2. From the advantage of caching SSL session

    For this test, we developed patches for both stunnel and stud to improve a few things.
    The stunnel patches, applied on STNL/UCTX, include:
    – a settable listen queue
    – a fix for a performance regression due to logging
    – multiprocess start-up management
    – a session cache in shared memory

    The stud patches, applied on STUD/BUMP, include:
    – a settable listen queue
    – a session cache in shared memory
    – a fix to allow session resumption

    We’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 1 core

    Note that the patched versions will be respectively called STNL/PATC and STUD/PATC in the rest of this document.
    The percentages highlight the improvement of STNL/PATC and STUD/PATC over STNL/UCTX and STUD/BUMP respectively.

    Symmetric key generation frequency   STNL/PATC      STUD/PATC
    For each request                      246 (+29%)     261 (+0%)
    Every 100 requests                   1085 (+121%)   1366 (+423%)
    Never                                1129 (+127%)   1400 (+436%)

    Observation:

    – Obviously, caching SSL sessions improves the number of transactions per second.
    – The stunnel patches also improved stunnel’s overall performance.

    3. From the influence of CPU cores

    As seen in the previous test, we could improve TLS capacity by adding a symmetric key cache to both stud and stunnel.
    We still might be able to improve things :).

    For this test, we’re going to configure both stunnel and stud to use 2 CPU cores.
    The kernel is bound to core 0, userland to core 1, and stunnel or stud to cores 2 and 3, as shown below:
    [Figure: CPU affinity with SSL on 2 cores]
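    One hedged way to achieve this kind of pinning from userland is taskset; the actual benchmark may have used a different mechanism (addresses as in the stud example earlier):

    # Pin the SSL processes to cores 2 and 3 (kernel IRQs stay on core 0)
    taskset -c 2,3 ./stud -n 2 -f 10.0.3.20,443 -b 10.0.0.3,80 cert.pem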

    For the rest of the tests, we’re going to bench only STNL/PTHD, which is the stunnel mode used by most Linux distributions, and the two patched versions, STNL/PATC and STUD/PATC.

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 2 cores

    The table below summarizes the numbers we get with 2 cores, with the percentage of improvement over 1 core:

    Symmetric key generation frequency   STNL/PTHD     STNL/PATC      STUD/PATC
    For each request                      217 (+15%)    492 (+100%)    517 (+98%)
    Every 100 requests                    588 (+20%)   2015 (+85%)    2590 (+89%)
    Never                                 602 (+21%)   2118 (+87%)    2670 (+90%)

    Observation:

    – Now we know that the number of CPU cores has an influence 😉
    – The symmetric key generation rate has doubled on the patched versions, while STNL/PTHD barely takes advantage of the second CPU core.
    – We can clearly see the benefit of SSL session caching on both STNL/PATC and STUD/PATC.
    – STUD/PATC performs around 25% better than STNL/PATC.

    Note that since STNL/FORK and STUD/BUMP have no SSL session cache, there is no need to test them anymore.
    We’re going to concentrate on STNL/PTHD, STNL/PATC and STUD/PATC.

    4. From the influence of the asymmetric key size

    The default asymmetric key size on current websites is usually 1024 bits. For security purposes, more and more engineers now recommend using 2048-bit or even 4096-bit keys.
    In the following test, we’re going to observe the impact of the asymmetric key size on SSL performance.

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 2048 bits
    Cipher: AES256-SHA
    Object size: 0 byte
    CPU: 2 cores

    The table below summarizes the numbers we got with a 2048-bit asymmetric key; the percentages highlight the performance impact compared to the 1024-bit key, both tests running on 2 CPU cores:

    Symmetric key generation frequency   STNL/PTHD    STNL/PATC     STUD/PATC
    For each request                       46 (-78%)     96 (-80%)     96 (-81%)
    Every 100 requests                    541 (-8%)    1762 (-13%)   2121 (-18%)
    Never                                 602 (+0%)    2118 (+0%)    2670 (+0%)

    Observation:

    – The asymmetric key size only influences symmetric key generation. The number of transactions per second does not change at all for the software that can cache and re-use SSL session IDs.
    – Moving from 1024 to 2048 bits divides the number of symmetric keys generated per second by more than 4 in our environment.
    – On an average traffic profile renegotiating every 100 requests, stud is more impacted than stunnel, but it still performs better.

    5. From the influence of the object size

    If you have been reading carefully since the beginning, you might be thinking: “their tests are nice, but their objects are empty… what happens with real objects?”
    So, I guess it’s time to study the impact of the object size!

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA
    Object size: 1 KByte / 4 KBytes
    CPU: 2 cores

    Results for STNL/PTHD, STNL/PATC and STUD/PATC.
    The percentages highlight the performance impact of the 4KB object compared to the 1KB one:

    Symmetric key generation frequency   STNL/PTHD         STNL/PATC           STUD/PATC
                                         1KB   4KB         1KB    4KB          1KB    4KB
    Every 100 requests                   582   554 (-5%)   1897   1668 (-13%)  2475   2042 (-21%)
    Never                                595   564 (-5%)   1997   1742 (-14%)  2520   2101 (-19%)

    Observation

    – The bigger the object, the lower the performance…
      To be fair, we’re not surprised by this result 😉
    – STUD/PATC performs 20% better than STNL/PATC.
    – STNL/PATC performs 3 times better than STNL/PTHD.

    6. From the influence of the cipher

    Since the beginning, we have run our bench only with the AES256-SHA cipher.
    It’s now time to bench some other ciphers:
    – first, let’s give AES128-SHA a try, and compare it to AES256-SHA
    – second, let’s try RC4_128-SHA, and compare it to AES128-SHA

    For this test, we’re going to use the following parameters:
    Protocol: TLSv1
    Asymmetric key size: 1024 bits
    Cipher: AES256-SHA / AES128-SHA / RC4_128-SHA
    Object size: 4 KBytes
    CPU: 2 cores

    Results for STNL/PTHD, STNL/PATC and STUD/PATC.
    The percentages highlight the performance gain of each cipher change:
    – AES128 compared to AES256
    – RC4_128 compared to AES128

    Symmetric key generation frequency   STNL/PTHD                       STNL/PATC                         STUD/PATC
                                         AES256  AES128     RC4_128      AES256  AES128      RC4_128       AES256  AES128      RC4_128
    Every 100 requests                   554     567 (+2%)  586 (+3%)    1668    1752 (+5%)  1891 (+8%)    2042    2132 (+4%)  2306 (+8%)
    Never                                564     572 (+1%)  600 (+5%)    1742    1816 (+4%)  1971 (+8%)    2101    2272 (+8%)  2469 (+8%)

    Observation:

    – As expected, AES128 performs better than AES256.
    – RC4_128 performs better than AES128.
    – stud performs better than stunnel.
    – Note that RC4 will do even better on big objects, since it works on a stream while AES works on blocks.
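    To see the raw throughput of these symmetric ciphers outside of TLS, OpenSSL’s built-in benchmark can be used again; a hedged sketch:

    # Raw cipher throughput across several block sizes, one run per cipher
    openssl speed -evp aes-256-cbc
    openssl speed -evp aes-128-cbc
    openssl speed -evp rc4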

    Conclusion on SSL performance

    1.) Bear in mind to ask for both numbers when comparing SSL products:
    – the number of handshakes per second
    – the number of transactions per second (aka TPS)

    2.) If the product is not able to resume SSL sessions (by caching SSL session IDs), just forget it!
    It won’t perform well and is not scalable at all.

    Note that having a load-balancer able to maintain affinity based on the SSL session ID is really important; you can now understand why.

    3.) Bear in mind that the asymmetric key size may have a huge impact on performance.
    Of course, the bigger the asymmetric key, the harder it is for an attacker to break the generated symmetric key.

    4.) stud is young, but seems promising.
    By the way, stud has since included HAProxy Technologies’ patches from @emericbr, so if you use a recent stud version, you may get the same results as we did.

    5.) Er, let’s read the results again… If we consider that your users renegotiate every 100 requests and that the average object size you want to encrypt is 4K, you can get 2300 SSL transactions per second on a small Intel Atom @1.66GHz!
    Imagine what you could do with a dual-CPU Core i7!

    By the way, we’re glad that the stud developers have integrated our patches into the main stud branch.
