Oh ho, lookit what the EFF went and did!
The EFF SSL Observatory is a project to investigate the certificates used to secure all of the sites encrypted with HTTPS on the Web. We have downloaded a dataset of all of the publicly-visible SSL certificates, and will be making that data available to the research community in the near future.
This is exciting. I knocked together a less ambitious version of this last year, but the EFF guys are doing it like grown-ups, and are getting some interesting data.
Numbers-wise, they’re in the right ballpark, as far as I can tell. Their numbers (1-2m CA-signed certs) coarsely match ones I’ve seen from private sources. I’ve heard from a few CAs that public-crawl estimates tend to err 50-80% low since they miss intranet dark matter, but at least the EFF is tracking other public-crawls. Given that their collection tools and data are going to be made public, that’s a really big deal. Previously, I haven’t been able to get this kind of data without paying for it or collecting it myself. If the database is actively maintained and updated, this will be a great resource for research.
Their analysis of CA certificate usage is also interesting. I’d like to see more work done, here, and in particular I’d like to see how CA usage breaks down between the Mozilla root store and others. We spend considerable effort managing our root store, and recently removed a whole pile of CA certificates that were idle. In some places, the paper seems to make the claim that fully half of trusted CAs are never used, but in other places, the number of active roots they count outnumbers our entire root program. I understand why they blurred the line for the initial analysis, but it would be swell to see it broken out.
As they mention, there are legit reasons for root certs to be idle, particularly for future-proofing. We have several elliptic curve roots, and some large-modulus RSA roots, which are waiting for technology to catch up before they become active issuers while giving CAs a panic switch in the case of an Interesting Mathematical Result — that feels okay to me. On the other hand, if there are certs which are just redundant, it would be great to know, so that we can have that conversation with the relevant CAs, and understand the need to keep the cert active.
This is exactly what I hoped would come of my crawler last year, but they’ve done a much more thorough job. We’ve seen an uptick in research interest in SSL over the last few years. Having a high quality data source to poke when testing a hunch is going to make it easier to spot trends, positive or otherwise. Interesting work, folks; keep it going!