GRC's | SSL/TLS Certificate Revocation Awareness – The case for OCSP Must-Staple

Executive Summary

The original public key infrastructure (PKI) certificate revocation list (CRL) scheme didn't scale as the number of certificates and inevitable revocations exploded.
The online certificate status protocol (OCSP), which was intended to replace the collapsing CRL system, always suffered from design and deployment problems which compromised privacy, reliability and security.
For many years nothing much happened. Neither system was usefully enforced, so stolen and revoked certificates would typically still be honored by web browsers. 44% of certificate revocations are due to key compromise, so this failing is important. But end users had no way of detecting what was going on, so the problem never got the attention it deserved. “Revocation” was given lip service, but the insiders knew it didn't really work.
Several years ago the idea of “OCSP Stapling” was adopted and implemented in web servers and then, very recently, in a few web browsers and client operating systems.
But the whole system was still set to “fail soft.” This allowed for an effective bypass of the authentication it was attempting to provide. Because it could be argued that revocation still didn't work, some web browsers continued to ignore it and didn't even bother trying.
Finally, the concept of “Must-Staple” was born and took root. When this is implemented it will allow web browsers to safely and reliably implement a “fail hard” policy. This will finally deliver the certificate revocation security users always thought they were getting, but never really did.

What's the problem?

The fundamental problem is that the Internet has outgrown the original system
we are still trying to use. And rather than fixing it, the Internet industry has
largely stopped using it . . . and hoped no one would notice or care.

The original Certificate Revocation List (CRL) solution doesn't “scale” well. The simple idea of a certificate revocation list is that the collected unique serial numbers of any not yet expired but known invalid certificates are maintained in a master list.

Each certificate authority publishes its own continuously changing list of the certificates it has issued that have been revoked and are not yet expired. Each issued certificate contains a URL link pointing to this list so that the operating system or web browser relying upon the certificate's veracity knows where to obtain the list for inspection.

This creates problems: Since certificates are being continually revoked, replaced or invalidated for various reasons, the “true list” is a continually moving target, and any copy of the list is out of date from nearly the moment it is published. Certificate authorities update their published revocation lists weekly. Since the certificate authority (CA) model is “trust unless proven invalid”, this means that an aging CRL will not contain information of recently revoked certificates, resulting in those certificates being erroneously trusted when they should not be.
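The staleness problem can be shown with a minimal sketch. It is an illustration only, not any real browser's implementation; the `CRL` class, its fields and the serial numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CRL:
    """A hypothetical, simplified revocation list for one certificate authority."""
    published: datetime
    revoked_serials: set = field(default_factory=set)


def is_trusted(serial: int, crl: CRL) -> bool:
    # "Trust unless proven invalid": anything not on the list is accepted.
    return serial not in crl.revoked_serials


def crl_age(crl: CRL, now: datetime) -> timedelta:
    # The older the copy, the more revocations it can be missing.
    return now - crl.published


# A copy of the list fetched on the day it was published.
cached = CRL(published=datetime(2014, 4, 1), revoked_serials={1001, 1002})

# Suppose the CA revokes serial 1003 two days later. The cached copy knows
# nothing about it, so the client goes on trusting that certificate.
print(is_trusted(1002, cached))   # on the cached list
print(is_trusted(1003, cached))   # revoked after publication
print(crl_age(cached, datetime(2014, 4, 3)).days)
```

Any certificate revoked after the cached copy was published passes the check until the client fetches a newer list.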

The obvious solution is to update the CRLs as often as possible. So already we face a tradeoff in a system that, to be secure, should not have tradeoffs. The question is how often do we update? The answer is complicated by the fact that many certificate revocation lists are quite large:

[Chart: CRL file sizes. Credit: Websense blog, July 10th, 2013]

Note that the vertical axis is logarithmic. Thus, this chart tends to obscure the true magnitudes.

The smallest of these CRLs is 236 bytes, while the largest is a stunning 28,198,759 bytes! In other words, it is completely infeasible to download the larger CRLs on-the-fly . . . yet the CRL system only works if all web browsing clients have copies of the very latest up-to-date revocation lists.

Fortunately, only certificates which are both non-expired and invalid need to be listed in the CRLs. Domain validation (DV) certificates that are used to authenticate the identity of remote web servers have, at most, a three-year life and the life of Extended validation (EV) certificates is limited to two years. So the CRLs only need to maintain lists of invalid certificates for, at most, three years. But, as shown below, the rapid escalation in certificate revocation rates has outpaced even the expiration mechanism:

[Chart: certificate revocation trend by year. Credit: Websense blog, July 10th, 2013]

Note that the apparent decline in 2013 is only caused by the fact that only the first five months of 2013 were available in this data. During only those first five months, 766,451 revocations occurred. Thus, 2013 was well on its way to hugely exceeding the previous year's total.
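To get a feel for where 2013 was heading, the five-month figure can be projected over a full year. This is a rough sketch that assumes the monthly rate stays constant, which the data does not promise; the variable names are ours.

```python
# Figure quoted in the text: revocations seen in the first five months of 2013.
revocations_jan_to_may = 766_451
months_observed = 5

# Assumption for illustration only: the monthly rate holds for the whole year.
monthly_rate = revocations_jan_to_may / months_observed
projected_full_year = round(monthly_rate * 12)

print(f"average per month: {monthly_rate:,.0f}")
print(f"projected 2013 total: {projected_full_year:,}")
```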

The Heartbleed Effect

None of this takes into account the “revocation tsunami,” as it's being
called, resulting from the many tens of thousands of certificates revoked
in the wake of the “heartbleed” vulnerability. Heartbleed has massively
increased the size of CRLs for the next three years. [Reference: A page
at SANS tracking heartbleed-related revocations with lots of great detail.]

The (pre-heartbleed) explosion in certificate revocation depicted in the chart above raises the question: Why are all of these certificates being revoked? CRLs provide a facility for allowing their publisher to specify the reason for each revocation. Although not all CAs take advantage of this feature, most do. The following chart summarizes the available data:

[Chart: reasons given for certificate revocation. Credit: Websense blog, July 10th, 2013]

This data demonstrates that while the most prevalent reason for revocation is web server key compromise (or believed compromise), the strong runner-up is “Cessation of Operation.” In these cases, certificate authorities are absolutely required to revoke any certificates that they had issued to domains whose ownership has changed.

Many years before the CRL crisis became this acute, the Internet's
engineers were working on a replacement . . . known as OCSP.

OCSP – Online Certificate Status Protocol.

The concept behind OCSP was simple: Allow web browsers and other clients to query the status of an individual certificate in real time. As stated in the original OCSP RFC document, RFC 2560 in June 1999:

This document specifies a protocol useful in determining the current
status of a digital certificate without requiring CRLs. Additional
mechanisms addressing PKIX operational requirements are specified in
separate documents.

With OCSP, web browsers and other clients are no longer required to obtain and cache possibly huge and somewhat obsolete certificate revocation lists. That was always inefficient. Remember that the list contains all of an authority's revoked certificates, whereas the client only wants to check one.

The OCSP protocol allows anyone relying upon the validity of any apparently-valid certificate, to directly query the issuing certificate authority, on-the-fly, to determine whether the certificate is still valid.

At the time of this writing, April of 2014, the OCSP protocol is nearly fifteen years old. The primary incentive behind OCSP was to deal with the inherent “always out of date” nature of any static revocation list. So the impetus was to move to a dynamic real time revocation system.

There were always two problems with this system:

Browsing privacy didn't generate as much concern fifteen years ago as it does today. The trouble with having everyone's web browser checking-in with every certificate's issuer to verify its validity, is that the certificate issuer then obtains both the querying user's IP address, and the website they wish to securely visit. Thus, OCSP creates a significant and, for many applications, unacceptable disclosure of every user's private use of the Internet when visiting secured websites.
The tremendous load on OCSP responders also wasn't of nearly as much concern fifteen years ago as it is today. With OCSP, every time any user in the world connects to any secured website, that user's web browser must query the certificate authority's OCSP server. The typical certificate authority will issue certificates for hundreds of thousands of individual web sites. So every visitor of every web site authenticated by that authority will be querying for one of those sites' OCSP certificate status.

In other words, OCSP leaks browsing behavior
and, like CRLs, it does not scale very well.

There was also a HUGE problem for the browser vendors: What to do for no reply?

If an OCSP “responder” quickly affirms that a certificate is valid, everybody is happy. And if an OCSP responder affirms that a certificate which a site is attempting to use has previously been revoked, then the user is protected from what is almost certainly a site with malicious intent. So that's good too.

But what does, or should, the web browser do when it receives no response from the OCSP responder? How long should it wait for a response? Remember that the browser's goal is to protect its user from possibly malicious websites. So the web browser must suspend the new connection until it can determine whether it is safe to proceed. Due to the way the current certificate authority system operates, the only reliable way to know about the true up-to-the-second status of any certificate is to use OCSP to ask the certificate issuer directly whether the possibly-still-valid certificate is truly still valid. But this imposes a massive and ultimately undeliverable burden upon the OCSP servers. They are too often overwhelmed and unable to respond.

To make matters worse, there are situations where an OCSP query and/or response might be administratively blocked, such as when “captive portals” are used. A captive portal is one, such as where free WiFi is available, where the user's access to the wider Internet is blocked until they have logged on with their credentials, agreed to the portal's terms of service, watched an advertisement, or jumped through whatever hoops the bandwidth provider requires. The point is that such pages might require a user logon and might therefore be secured. But if the portal disallows any access to the wider Internet, there's no way for the user's OCSP query to be answered to confirm the portal's own security certificate. If web browsers had always enforced strong revocation checks, captive portals would have been designed to permit those checks. But that's not the way history has been written.

The “Fail Soft” capitulation

Needing to find a practical solution, web browsers do not block their user in the event of no reply to an OCSP request. If web browsers bother to perform OCSP checks at all (and many do not under their default security settings), they adopt a “fail soft” policy, meaning that they treat no reply as a good reply. Only the Firefox browser offers an option to treat no reply as invalid. It's alone in the pack, and even that option is disabled unless the user turns it on.

Today, only one browser offers
you the choice to be totally safe:

Firefox today offers what could be described as “Must OCSP.” If you enable the second
option, you will be completely protected at the theoretical cost of false positive blocks.
We have always had it enabled and have never encountered a single false positive.

The nearly universal adoption of OCSP fail-soft policy opens the revocation system to malicious interference. If an attacker can simply arrange to block a web browser's access to the OCSP system, the browser will fail soft and treat a revoked certificate as valid.
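The difference between the two policies comes down to a single branch. The following is a minimal sketch of that decision, not the code of any actual browser; `OcspStatus` and `allow_connection` are names we made up for illustration.

```python
from enum import Enum
from typing import Optional


class OcspStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"


def allow_connection(response: Optional[OcspStatus], fail_hard: bool) -> bool:
    """Decide whether to proceed; None means the responder never answered."""
    if response is OcspStatus.GOOD:
        return True
    if response is OcspStatus.REVOKED:
        return False
    # No reply: fail-soft treats silence as "good", fail-hard as "invalid".
    return not fail_hard


# An attacker holding a revoked certificate blocks the victim's OCSP traffic,
# so the browser never sees the "revoked" answer.
print(allow_connection(None, fail_hard=False))  # fail-soft browser
print(allow_connection(None, fail_hard=True))   # fail-hard browser
```

With fail-soft, the attacker only has to make the responder unreachable; the revoked answer never needs to be forged.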

Unfortunately, this original design of the OCSP system results in a single point of failure for the certificate verification system. If an OCSP server is offline, overloaded, under attack or unable to reply for any reason, certificate validity cannot be confirmed. And even when validity can be confirmed, the user's privacy is threatened. We still need a better solution.

As the very real problems of OCSP began to surface, clever Internet
engineers invented another solution known as “OCSP Stapling.”

OCSP Stapling

With OCSP stapling, rather than requiring the web browser to independently obtain an assurance of a certificate's current validity, the same web server that offers the certificate also provides a “fresh assertion” of the validity of the security certificate it is offering to the user. The web server periodically queries the certificate's designated OCSP responder to receive a refreshed and updated assertion. Since, like the original certificate, the OCSP response is also cryptographically signed by the certificate authority's key, it cannot be forged.

Then, as defined by RFC 6066 in January, 2011, the TLS protocol was extended to allow a web browser to request and a web server to supply this OCSP information in its initial connection handshake. If the connection-initiating web browser indicates that it is aware of this TLS extension, and the web server offers the feature, the OCSP assertion can be provided at the time of the connection.

Aside from the fact that it is considerably more elegant to have the web site that's offering a certificate also able to reassert that certificate's continued and current validity, “stapling” the current OCSP status into the initial TLS handshake solves most of OCSP's longstanding problems:

Privacy: Since the site's web server is making OCSP queries on behalf of ALL of its site's visitors, the OCSP privacy concern disappears. The certificate authority's OCSP responder receives periodic queries for updated certificate status from websites using the various certificates it has issued. The OCSP responder no longer receives direct queries from all of the website's visiting web browsers. Thus, browsing information leakage is eliminated.
Bandwidth & OCSP server load: Using traditional non-stapled OCSP on a busy website, tens of thousands of individual connections would result in tens of thousands of individual OCSP queries to the certificate authority. With stapling, a single periodic query from a server on that site can replace ALL of those individual queries.
Reliability: With the OCSP server load dramatically reduced, overload of the server and its bandwidth will be eliminated. Furthermore, the website's server can begin querying for the next OCSP update well before its last update has expired, which typically happens once daily. If the first update attempt fails, many others can be performed so that the web server is always able to immediately supply a valid OCSP stapled response with every connection.
Captive portals: The problem presented by captive portals disappears since the portal's secure connection can simultaneously validate its own certificate without the web browser needing to obtain additional Internet-located certificate assurances. This allows the web browser to operate with much tighter and more secure controls –requiring OCSP– without inconveniencing its user with validation failures.
Performance: Browser vendors are always working to enhance performance. Moving to stapled OCSP is a huge relief since they no longer need to make separate DNS lookups and a separate time-consuming OCSP query. The browser receives everything it needs all at once within the initial connection establishment.
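The reliability point above, where the server refreshes early and keeps serving its last good response while retrying, can be sketched as a small cache. This is an assumption-laden illustration, not how Apache, nginx or any other server actually implements stapling; `StapleCache`, the half-lifetime refresh rule and the simulated responder are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class StapledResponse:
    this_update: datetime   # when the CA signed the response
    next_update: datetime   # when the response expires


class StapleCache:
    """Hypothetical server-side cache of the latest signed OCSP response."""

    def __init__(self, fetch: Callable[[], Optional[StapledResponse]]):
        self._fetch = fetch   # queries the CA's responder; returns None on failure
        self._current: Optional[StapledResponse] = None

    def _needs_refresh(self, now: datetime) -> bool:
        if self._current is None:
            return True
        lifetime = self._current.next_update - self._current.this_update
        # Assumed policy: start refreshing once half the lifetime has passed.
        return now >= self._current.this_update + lifetime / 2

    def staple_for_handshake(self, now: datetime) -> Optional[StapledResponse]:
        if self._needs_refresh(now):
            fresh = self._fetch()
            if fresh is not None:
                self._current = fresh
        # Serve the cached response only while it is still valid.
        if self._current is not None and now < self._current.next_update:
            return self._current
        return None


# Simulated responder: answers once, then becomes unreachable.
start = datetime(2014, 4, 1)
answers = [StapledResponse(start, start + timedelta(days=1))]
cache = StapleCache(lambda: answers.pop(0) if answers else None)

first = cache.staple_for_handshake(start)
later = cache.staple_for_handshake(start + timedelta(hours=18))
print(first is later)
```

Even with the responder down at the 18-hour mark, visitors still receive the response obtained earlier, because it has not yet expired.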

Note that we didn't say that OCSP stapling solved every OCSP
problem . . . only MOST of them. What's the remaining problem?

Why is OCSP Stapling NOT the total answer?

To function, the web browser and/or the client operating system MUST first indicate that it is aware of the TLS OCSP stapling extension. That indication makes it safe for an equally aware web server to then provide the OCSP information in the connection handshake. If the client doesn't first ask, the server cannot safely reply.

A July 2013 survey by Netcraft showed that only 22% of all certificates were served with a stapled response:

[Chart: share of certificates served with a stapled OCSP response. Credit: Netcraft Archives, July 19th, 2013]

And of those 22%, more than 95% of them were served by Microsoft servers. Microsoft's servers have had OCSP stapling built in and enabled by default for many years. So all recently deployed and updated Microsoft-based websites will be offering stapling:

[Chart: server operating systems among sites that staple. Credit: Netcraft Archives, July 19th, 2013]

Since there is no known reason for the majority of the Internet's non-Microsoft web servers to have stapling disabled, there is no excuse. All of the major servers offer it, though, as can be seen above, few websites are bothering to use it:

The Apache web server has supported OCSP stapling since v2.3.3 (ref).
The nginx web server has supported OCSP stapling since v1.3.7 (ref).
The LiteSpeed web server has supported OCSP stapling since v4.2.4 (ref).

We have a bit of a chicken and egg problem here. It's probably going to take Internet users to show that they care enough about security to complain to their favorite websites and browser vendors about the lack of good certificate revocation and for those websites and browser vendors to get the attention of the operating system developers.

OCSP stapling has had many birthdays. It's growing up. There's
NO excuse for both web browsers and servers not to be using it.

How to check any website for OCSP stapling

The fastest service is offered by the DigiCert certificate authority at: https://www.digicert.com/help/. Enter the domain name of any website and it will quickly display the site's certificate revocation status using both CRL and OCSP queries... and whether the server offers OCSP stapled connections.

A fabulous site for digging deeper into any site's security is https://www.ssllabs.com/ssltest/. You'll need to give it a minute to dig out all of a web server's security details, but you'll be amazed (and probably overwhelmed) by everything that it determines.

If important websites you visit (that brag about their security) are not yet offering stapled OCSP replies, tell them to get with it! ALL websites should have been offering it for years.
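For those who prefer a command line, the same check can be made locally with the OpenSSL `s_client` tool and its `-status` option, which asks the server for a stapled response. The sketch below wraps that call in Python. It assumes the `openssl` binary is installed, and the two marker strings it looks for are those printed by common OpenSSL releases, which may differ in other versions. The function names are ours.

```python
import subprocess


def stapling_offered(s_client_output: str) -> bool:
    """Return True if `openssl s_client -status` output shows a stapled response."""
    if "OCSP response: no response sent" in s_client_output:
        return False
    return "OCSP Response Status: successful" in s_client_output


def fetch_status(host: str, port: int = 443) -> str:
    # Requires the openssl binary and network access to the host.
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-status"],
        input=b"",              # empty stdin so s_client exits after the handshake
        capture_output=True,
        timeout=30,
    )
    return result.stdout.decode(errors="replace")


# Example use (not run here because it needs the network):
#   print(stapling_offered(fetch_status("www.example.com")))
```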

Overcoming the final excuse

What's the final excuse?

Some Internet engineers correctly argue that since we have never been able to absolutely count on obtaining an OCSP response, the system has always been “soft-fail,” making it useless. These engineers argue that all the bad guys have to do is block any OCSP responses that would indicate that the certificate had been stolen, and the browser will permit the unverified connection anyway.

But now we know that a stapled OCSP response which is bundled right into the connection's opening handshake doesn't require access to another OCSP server. So it cannot be blocked.

So then the engineers explain, again correctly, that if bad guys steal a certificate and set up a malicious duplicate website, they will turn off their own server's OCSP stapling. Assuming that they can also block the web browser's direct access to the OCSP server, and the client is set to fail soft, the web browser will assume everything's fine and proceed with the malicious connection.

So what's the final solution?

It's known as “Must Staple” and it's coming soon
to your web browser . . . or at least to Firefox!

OCSP “Must Staple”

Think about this in the context of the section above:

If a web browser could be absolutely certain that an authentic website did provide OCSP stapling, it could safely and reliably hard fail (preventing the connection) instead of soft fail (shrugging and permitting the connection) if, for any reason, that site failed to staple a recent assertion that its certificate was still valid.

This is known as OCSP “must staple.” It's the solution to the certificate revocation problem, and it's coming soon... at least to the Firefox browser.

How can a browser know that a site “must staple”?

There are two proposals, both of which are likely to be adopted:

1: Add a “must staple” assertion to the site's security certificate:

Just as the TLS protocol has “extensions”, so, too, do security certificates. Here is the formal IETF proposal for adding a new extension to security certificates for this purpose:

http://tools.ietf.org/html/draft-hallambaker-muststaple-00

Once this feature is implemented, any website wishing to protect its visitors from the possibility of revoked certificate abuse can include that assertion in its certificate. The web browser that receives this certificate can verify that an OCSP stapled reply was provided by the server and flatly refuse to proceed (hard fail) with the connection otherwise. It's the best possible solution to the revocation problem.

Any attacker who attempts to abuse such a certificate will be out of luck, because the certificate itself asserts that OCSP stapling must be provided. But the attacker cannot provide OCSP stapling because that would identify their use of the certificate as fraudulent. This renders stolen and revoked certificates completely useless, as they were always meant to be . . . but never really have been.
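The browser-side check this enables can be sketched as follows. It is a simplified model under our own assumptions: the `must_staple` flag stands in for the proposed certificate extension, the `Certificate` and `Staple` classes are hypothetical, and signature verification of the staple is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Certificate:
    subject: str
    must_staple: bool   # hypothetical flag standing in for the proposed extension


@dataclass
class Staple:
    status: str         # "good" or "revoked", as signed by the CA
    expired: bool


def accept_handshake(cert: Certificate, staple: Optional[Staple]) -> bool:
    if staple is not None and not staple.expired:
        # A current stapled answer is authoritative.
        return staple.status == "good"
    # No usable staple: hard fail if the certificate demands one.
    # Otherwise fall back to legacy behaviour, modelled here as fail-soft.
    return not cert.must_staple


stolen = Certificate("www.example.com", must_staple=True)
legacy = Certificate("old.example.com", must_staple=False)

print(accept_handshake(stolen, None))                      # attacker omits the staple
print(accept_handshake(stolen, Staple("revoked", False)))  # attacker staples the truth
print(accept_handshake(legacy, None))                      # no assertion in certificate
```

Either way the attacker loses: omitting the staple triggers the hard fail, and supplying one reveals the revocation.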

No changes are required to the web's servers other than updating Apache, nginx, and LiteSpeed to the latest versions that support OCSP stapling and enabling it. But it will take some time to get this certificate extension added to the Internet standards. In the meantime, we have the second solution . . .

2: Create a new HTTP response header similar to HSTS:

An alternative and immediately available solution has been proposed and is being worked on by members of the Mozilla Firefox team. Any web server that wants to protect its users from certificate fraud, and thus offers OCSP stapling as the first step, will soon be able to add a “Must-Staple:” response header to their server replies. The header includes a “max-age” specification which is usually set to a large value. Once this header is received – over a stapled connection to prevent abuse – any aware web browsers will note and retain that information for many months or years. This prevents bad guys from excluding that header and not using OCSP stapling. All web servers support the addition of “static response headers” through simple configuration options, so adding this assertion to the server takes just minutes.

As you will have already guessed, if the web browser has flagged a site as Must-Staple, because it once saw that header and the age hasn't yet expired, it will hard fail if it cannot obtain OCSP from stapling and, as a backup, also cannot obtain it from the OCSP provider designated in the certificate. And once again we have high-reliability robust enforcement of certificate revocation.
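Here is a minimal sketch of how a browser might remember and enforce that assertion. The header was only a proposal at the time of writing, so the `max-age=<seconds>` syntax parsed here is an assumption modelled on HSTS, and `MustStapleStore` and `allow` are hypothetical names.

```python
import re
from datetime import datetime, timedelta
from typing import Dict


class MustStapleStore:
    """Hypothetical per-browser memory of hosts that asked for Must-Staple."""

    def __init__(self) -> None:
        self._expires: Dict[str, datetime] = {}

    def note_header(self, host: str, header_value: str,
                    was_stapled: bool, now: datetime) -> None:
        # Only honour the header when it arrived over a stapled connection.
        if not was_stapled:
            return
        match = re.search(r"max-age\s*=\s*(\d+)", header_value)
        if match:
            self._expires[host] = now + timedelta(seconds=int(match.group(1)))

    def requires_staple(self, host: str, now: datetime) -> bool:
        expiry = self._expires.get(host)
        return expiry is not None and now < expiry


def allow(store: MustStapleStore, host: str, got_staple: bool,
          got_direct_ocsp: bool, now: datetime) -> bool:
    if store.requires_staple(host, now):
        # Hard fail only when both the staple and the direct query are missing.
        return got_staple or got_direct_ocsp
    return True  # unknown host: legacy fail-soft behaviour


store = MustStapleStore()
first_visit = datetime(2014, 4, 1)
store.note_header("example.com", "max-age=31536000", was_stapled=True, now=first_visit)

a_month_later = first_visit + timedelta(days=30)
print(allow(store, "example.com", False, False, a_month_later))
print(allow(store, "never-visited.example", False, False, a_month_later))
```

The second call shows why a host the browser has never seen gets no protection: there is nothing in the store to enforce.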

The “first visit” problem: As with the HSTS (HTTP Strict Transport Security) header, this solution does suffer from the “first visit” problem: Without having at least once previously visited the authentic site to receive and retain the Must-Staple assertion, a web browser would not know to always insist upon OCSP for that site. This creates an opening for an attacker who can interfere with a web browser's first-ever visit to a website. While this is indeed possible, it's an instance where we should not allow the quest for a perfect solution to prevent us from using something that's very good in the meantime. Binding the Must-Staple assertion into a site's security certificate is the ultimate solution. But until then, we'll be able to have nearly-as-good enforcement almost immediately. And from a practical standpoint, the simple response header solution fully protects the websites we visit routinely where we are most likely to have a relationship requiring strong security, rather than sites we've never visited before.

Here is the late October, 2013, IETF mailing list archive posting by Brian Smith of the must-staple proposal: http://www.ietf.org/mail-archive/web/tls/current/msg10351.html

Do we ‑THEN‑ have a perfect system?

If by “perfect” we mean that the instant a certificate is revoked it can no longer be used anywhere, then no, we do not have a perfect solution by that definition. However, OCSP Must Staple gets us as close as we can get:

The best imperfect system possible

The only way to achieve that “instant global revocation” level of perfection, would be for the security of every TLS connection being made everywhere on the Internet to be individually verified, in real time, by the issuing certificate authority. Doing this securely would require not only absolutely reliable real time access to the issuing certificate authority, but also unique queries containing random nonces, each individually cryptographically signed to prevent both spoofing and reuse. There is no known practical way to achieve those goals today. None.

If it's not possible to do it perfectly, how then ‑exactly‑ does the OCSP Must Staple system perform when faced with a supremely powerful and capable attacker?

Here is the worst-case attack scenario for a site protected by the OCSP Must-Staple system:

The private key for the certificate of a site using OCSP Must-Staple somehow becomes compromised.
The attacker sets up a clone server using the stolen certificate and runs it for as long as they can. Since the attacker must also provide must-staple revocation checking (because that assertion is either bound into the stolen certificate or was asserted by the original server's response headers), the attacker also continually obtains fresh OCSP status from the certificate authority.
At some point (hopefully sooner rather than later) the victim site realizes that its key has been compromised. So it re-keys and revokes the compromised certificate.
The revocation of the stolen certificate prevents the attacker from obtaining any further fresh OCSP assertions from the revoked certificate's CA. So the attacker is stuck with reusing the last and most recent valid OCSP response obtained . . . until it finally expires.
When that response expires (typically daily, but possibly every few hours) the attacker is no longer able to impersonate the server because no valid OCSP response will then exist anywhere in the universe. Yet every client of the website will demand one, either because the certificate includes the requirement for an OCSP response, or the user's browser will have visited that site in the past and will remember that the site is “OCSP Must-Staple.”

So . . . the system works.

No, it's not instantaneous, but we've already established that “instantaneous” is not possible, and this system cannot be bypassed. That's huge. It is enforceable, and it provides the best protection our current technology can provide. To minimize the post-revocation vulnerability window, sites requiring heightened security may negotiate to obtain shorter OCSP response lifetimes from the issuing authority.
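The size of that post-revocation window follows directly from the OCSP response lifetime. The sketch below works through the worst case under our own assumed dates and lifetimes; `exposure_after_revocation` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime, timedelta


def exposure_after_revocation(last_staple_issued: datetime,
                              staple_lifetime: timedelta,
                              revoked_at: datetime) -> timedelta:
    """How long a stolen certificate stays usable once it has been revoked."""
    last_staple_expires = last_staple_issued + staple_lifetime
    return max(last_staple_expires - revoked_at, timedelta(0))


revoked = datetime(2014, 4, 10, 12, 0)
# Worst case: the attacker fetched a fresh response just before revocation.
just_before = revoked - timedelta(minutes=1)

for lifetime in (timedelta(days=1), timedelta(hours=4)):
    window = exposure_after_revocation(just_before, lifetime, revoked)
    print(f"lifetime {lifetime} -> usable for {window} after revocation")
```

In the worst case the window is essentially one response lifetime, which is why a shorter lifetime directly shrinks the attacker's opportunity.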

When you consider that traditional certificate revocation lists, when they worked at all, were published weekly, this represents a considerable improvement.

It's where we need to go.

In conclusion

Problems with certificate revocation have increasingly plagued the Internet's public key infrastructure (PKI) as that infrastructure has grown in size through the past decades. The original certificate revocation list (CRL) design didn't scale well, and the online certificate status protocol (OCSP) which attempted to replace it was inherently fraught with privacy problems and reliability issues making it both undesirable and impossible to depend upon.

Web server engineers have generally stayed on the leading edge of improvements, although when software has not enabled those features by default, they have often remained disabled.

Web browsers have faced the daunting task of “just working” for users who are focused upon the page's content and have no interest whatsoever in what's going on behind the scenes . . . and never want to. But at the same time, those users demand that their web browsers detect and protect them from any possible malicious activity.

With the advent of “Must-Staple” OCSP stapling, we finally appear to be nearing the end of a decades-long struggle to find a solution to the certificate revocation problem which is both reliable and secure.

All major web servers can offer OCSP stapling today—Microsoft's have for years, by default.

The Mozilla team appears to be leading the effort to introduce near-term support for response header based must-staple, and the IETF is on the way to moving this into security certificates where it ultimately belongs.

All of the required technology is in place. Technically astute end users should ask their browser and operating system vendors to give priority attention to adding support for OCSP must-staple.

The Internet's engineers keep saying there's no demand for it. We believe that's because everyone assumes the current revocation systems work. Now we know they don't.

Only by asking for change will any change happen.

There is still a bit more coming: The surprising
truth about how the various operating systems
and web browsers operate today. Stay tuned!

Source references for additional reading, research and background

How it's being fixed

Mozilla's “Plan for Improving Revocation Checking in Firefox”
This is a terrific outline and history of the thoughtful work the Mozilla project has been doing as they've been struggling to find a compromise between revocation security, convenience and reliability. Scroll all the way down to the bottom and see the entry under “OCSP Must-Staple”.
Proposal for Better Revocation Model of SSL Certificates (9-page PDF)
This nine page document was created toward the end of 2013. It is described on the previous page as “Results of a research project commissioned by Mozilla to investigate the state of SSL Certificate Revocation.” It provides a very good overview of the situation, with recent numbers that clearly show what's going on.
This is the Mozilla team's SecurityEngineering/Roadmap
Along the roadmap, you'll find that OCSP Must-Staple is underway and in the “HTTPS can be used as default” section. When we first encountered this page, we were relieved to see that it was not filed under “Ideas Not Yet Awesome Enough.” Clearly, it is!
The Mozilla Security Blog posting announcing “OCSP Stapling in Firefox”
July 19th, 2013 - This Mozilla Security Blog post takes its reader through the utility and workings of OCSP Stapling.
“Fixing Revocation for Web Browsers on the Internet” (8-page PDF)
This eight page report by Tom Ritter of iSEC Partners nicely lays out the terrain of the revocation problem and reaches prospective conclusions similar to those discussed here. It's a worthwhile additional ‘take’ on the problem.
Phillip Hallam-Baker's OCSP Stapling Required IETF draft
Phillip is one of the prime movers behind the work to add an extension to certificates to assert that just as the connection must be secure, it must also be validly OCSP stapled. This is the ultimate goal for the must-staple efforts. Once we have this, with web servers stapling and web browsers insisting, we finally have robust certificate revocation.
Brian Smith's IETF mailing list posting about his work on a Must-Staple: HTTP header
Until all CAs are supporting the new Must-Staple certificate extension, a long-term cached assertion, provided by websites, is a very useful and immediately applicable interim solution.

How and why it's broken

Apple's official response about OCSP policy on iOS
This is believed to be an official response from Apple Engineering about the certificate revocation mechanisms on iOS and iPhone. It begins with: “Currently there is no way to configure the OCSP policy on iOS. This would make a good enhancement request IMO.” (Yeah, no kidding.)
Netcraft: How certificate revocation (doesn't) work in practice.
This Netcraft page, posted on May 13th, 2013, walks us through a specific case history where revocation failed for the www.mcafeestore.com eCommerce site.
Netcraft: Microsoft's Achievement of World OCSP Stapling domination
This is the Netcraft page from which several of the OCSP pie charts were sourced. The page contains additional details and information of interest, including a chart breaking down browser support for OCSP stapling.