Friday, May 19. 2017
Update (2020-09-16): While three years old, people still find this blog post when looking for information about Stapling problems. For Apache the situation has improved considerably in the meantime: mod_md, which is part of recent apache releases, comes with a new stapling implementation which you can enable with the setting MDStapling on.
Today the OCSP servers from Let’s Encrypt were offline for a while. This has caused far more trouble than it should have, because in theory we have all the technologies available to handle such an incident. However due to failures in how they are implemented they don’t really work.
We have to understand some background. Encrypted connections using the TLS protocol like HTTPS use certificates. These are essentially cryptographic public keys together with a signed statement from a certificate authority that they belong to a certain host name.
CRL and OCSP – two technologies that don’t work
Certificates can be revoked. That means that for some reason the certificate should no longer be used. A typical scenario is when a certificate owner learns that his servers have been hacked and his private keys stolen. In this case it’s good to avoid that the stolen keys and their corresponding certificates can still be used. Therefore a TLS client like a browser should check that a certificate provided by a server is not revoked.
That’s the theory at least. However the history of certificate revocation is a history of two technologies that don’t really work.
One method are certificate revocation lists (CRLs). It’s quite simple: A certificate authority provides a list of certificates that are revoked. This has an obvious limitation: These lists can grow. Given that a revocation check needs to happen during a connection it’s obvious that this is non-workable in any realistic scenario.
The second method is called OCSP (Online Certificate Status Protocol). Here a client can query a server about the status of a single certificate and will get a signed answer. This avoids the size problem of CRLs, but it still has a number of problems. Given that connections should be fast it’s quite a high cost for a client to make a connection to an OCSP server during each handshake. It’s also concerning for privacy, as it gives the operator of an OCSP server a lot of information.
However there’s a more severe problem: What happens if an OCSP server is not available? From a security point of view one could say that a certificate that can’t be OCSP-checked should be considered invalid. However OCSP servers are far too unreliable. So practically all clients implement OCSP in soft fail mode (or not at all). Soft fail means that if the OCSP server is not available the certificate is considered valid.
That makes the whole OCSP concept pointless: If an attacker tries to abuse a stolen, revoked certificate he can just block the connection to the OCSP server – and thus a client can’t learn that it’s revoked. Due to this inherent security failure Chrome decided to disable OCSP checking altogether. As a workaround they have something called CRLsets and Mozilla has something similar called OneCRL, which is essentially a big revocation list for important revocations managed by the browser vendor. However this is a weak workaround that doesn’t cover most certificates.
OCSP Stapling and Must Staple to the rescue?
There are two technologies that could fix this: OCSP Stapling and Must-Staple.
OCSP Stapling moves the querying of the OCSP server from the client to the server. The server gets OCSP replies and then sends them within the TLS handshake. This has several advantages: It avoids the latency and privacy implications of OCSP. It also allows surviving short downtimes of OCSP servers, because a TLS server can cache OCSP replies (they’re usually valid for several days).
However it still does not solve the security issue: If an attacker has a stolen, revoked certificate it can be used without Stapling. The browser won’t know about it and will query the OCSP server, this request can again be blocked by the attacker and the browser will accept the certificate.
Therefore an extension for certificates has been introduced that allows us to require Stapling. It’s usually called OCSP Must-Staple and is defined in RFC 7633 (although the RFC doesn’t mention the name Must-Staple, which can cause some confusion). If a browser sees a certificate with this extension that is used without OCSP Stapling it shouldn’t accept it.
So we should be fine. With OCSP Stapling we can avoid the latency and privacy issues of OCSP and we can avoid failing when OCSP servers have short downtimes. With OCSP Must-Staple we fix the security problems. No more soft fail. All good, right?
The OCSP Stapling implementations of Apache and Nginx are broken
Well, here come the implementations. While a lot of protocols use TLS, the most common use case is the web and HTTPS. According to Netcraft statistics by far the biggest share of active sites on the Internet run on Apache (about 46%), followed by Nginx (about 20 %). It’s reasonable to say that if these technologies should provide a solution for revocation they should be usable with the major products in that area. On the server side this is only OCSP Stapling, as OCSP Must Staple only needs to be checked by the client.
What would you expect from a working OCSP Stapling implementation? It should try to avoid a situation where it’s unable to send out a valid OCSP response. Thus roughly what it should do is to fetch a valid OCSP response as soon as possible and cache it until it gets a new one or it expires. It should furthermore try to fetch a new OCSP response long before the old one expires (ideally several days). And it should never throw away a valid response unless it has a newer one. Google developer Ryan Sleevi wrote a detailed description of what a proper OCSP Stapling implementation could look like.
Apache does none of this.
If Apache tries to renew the OCSP response and gets an error from the OCSP server – e. g. because it’s currently malfunctioning – it will throw away the existing, still valid OCSP response and replace it with the error. It will then send out stapled OCSP errors. Which makes zero sense. Firefox will show an error if it sees this. This has been reported in 2014 and is still unfixed.
Now there’s an option in Apache to avoid this behavior: SSLStaplingReturnResponderErrors. It’s defaulting to on. If you switch it off you won’t get sane behavior (that is – use the still valid, cached response), instead Apache will disable Stapling for the time it gets errors from the OCSP server. That’s better than sending out errors, but it obviously makes using Must Staple a no go.
It gets even crazier. I have set this option, but this morning I still got complaints that Firefox users were seeing errors. That’s because in this case the OCSP server wasn’t sending out errors, it was completely unavailable. For that situation Apache has a feature that will fake a tryLater error to send out to the client. If you’re wondering how that makes any sense: It doesn’t. The “tryLater” error of OCSP isn’t useful at all in TLS, because you can’t try later during a handshake which only lasts seconds.
This is controlled by another option: SSLStaplingFakeTryLater. However if we read the documentation it says “Only effective if SSLStaplingReturnResponderErrors is also enabled.” So if we disabled SSLStapingReturnResponderErrors this shouldn’t matter, right? Well: The documentation is wrong.
There are more problems: Apache doesn’t get the OCSP responses on startup, it only fetches them during the handshake. This causes extra latency on the first connection and increases the risk of hitting a situation where you don’t have a valid OCSP response. Also cached OCSP responses don’t survive server restarts, they’re kept in an in-memory cache.
There’s currently no way to configure Apache to handle OCSP stapling in a reasonable way. Here’s the configuration I use, which will at least make sure that it won’t send out errors and cache the responses a bit longer than it does by default:
I’m less familiar with Nginx, but from what I hear it isn’t much better either. According to this blogpost it doesn’t fetch OCSP responses on startup and will send out the first TLS connections without stapling even if it’s enabled. Here’s a blog post that recommends to work around this by connecting to all configured hosts after the server has started.
To summarize: This is all a big mess. Both Apache and Nginx have OCSP Stapling implementations that are essentially broken. As long as you’re using either of those then enabling Must-Staple is a reliable way to shoot yourself in the foot and get into trouble. Don’t enable it if you plan to use Apache or Nginx.
Certificate revocation is broken. It has been broken since the invention of SSL and it’s still broken. OCSP Stapling and OCSP Must-Staple could fix it in theory. But that would require working and stable implementations in the most widely used server products.
Posted by Hanno Böck in Cryptography, English, Linux, Security at 23:25 | Comments (5) | Trackback (1)
Defined tags for this entry: apache, certificates, cryptography, encryption, https, letsencrypt, nginx, ocsp, ocspstapling, revocation, ssl, tls
Related entries by tags:
How I tricked Symantec with a Fake Private Key
Lately, some attention was drawn to a widespread problem with TLS certificates. Many people are accidentaly publishing their private keys. Sometimes they are released as part of applications, in Github repositories or with common filenames on web servers.
Weblog: Hanno's blog
Tracked: Jul 20, 16:58
Display comments as (Linear | Threaded)
The default SSLStaplingCache from the Mozilla Config Generator (with Apache 2.4.18) is:
Which I assume is 128000 seconds, or 1.48 days.
Assuming that an OCSP response can be longer than that (LetsEncrypt has given me one for 7 days), I can see why that number should be increased.
But in your example you have 128000000 (1,482 days), which seems to be a little excessive.
Is that correct, or would 1280000 (14.8 days) be more appropriate?
PS: Thanks for the article, I didn't realise that Apache would ReturnResponderErrors by default, and is why this option should be disabled for Firefox users (says he hoping that the whole process will be improved soon).
So I may be wrong here, but my naive thinking was that this number is actually the size of the cache. However I tried to confirm this and I couldn't find anything in the docs explaining what this number means (which again confirms my impression that the apache docs on this are really poor).
This was a while ago, but I remember that I once had a problem where I had the impression not all stapling replies fitted into the cache. So I increased it to a much larger value.
Good point, I misread the Apache docs, it is the size (in bytes) of the cache:
So I assume the size should be based on the number of websites your hosting... where the Mozilla recommendation of 128000 bytes would probably be enough for most (assuming they are configuring their own server, which will probably host less than 10 websites).
When OCSP stapling was enabled in Nginx I immediately tried it. It was kind of confusing (because you not only had to enable it, you also had to provide an additional certificate of the OCSP server and set a DNS resolver just for OCSP), and quite hard to test.
That was several years ago - I thought the implementation would improve at some point, but it never did. Having like 6 different configuration options which all need to be set correctly for OCSP to work and making it difficult to test made me deactivate it at some point again. Because every time I looked at the OCSP configuration I was confused what the options meant and if it actually worked, and every additional domain/certificate started the whole process again, so that it never became easy. It always was confusing and hard.
I really hope somebody invests in this at some point - enabling OCSP should be one option set to "enable", and then work reliably on every request / between server restarts. That would improve so much!
It seems the simplest solution is to make "SSLStaplingReturnResponderErrors off" pass on previous valid responses when an error is recieved from the root ocsp server (but still pass on revocations of course), so I will post a bug report on that with apache.
There is of course a second issue which is that apache should not wait until the cache expiry before fetching a new response, it needs some form of intelligence where it will refresh its cache before it expires, but the first issue is by far the more urgent.
Your suggested config is actually very similar to what I have been running for a couple of years, I ave just added must-staple to a couple of domains but after reading your article I think I am likely to reverse the changes until apache implements a proper solution.
You can find my web page with links to my work as a journalist at https://hboeck.de/.
You may also find my newsletter about climate change and decarbonization technologies interesting.
Hanno on Mastodon
Show tagged entries