Recent events have really thrown light onto something I’ve been feeling for a while now: we need better public information about the state of the secure internet. We need to be able to answer questions like:
- What proportion of CA-signed certs are using MD5 signatures?
- What key lengths are being used, with which algorithms?
- Who is issuing which kinds of certificates?
So I decided to go get some of that information, so that I could give it to all of you wonderful people.
What I Did
I put together some Python to crawl a list of sites and record the details of their response to an SSL handshake into an sqlite3 database. I don’t know Python, and my SQL is extremely rusty, but it works. The code is here, take it and make it better! Pay particular attention to the TODO list!
As for a list, I used Alexa’s list of the top 1,000,000 sites, which they quite helpfully make available for free download. Of course, we can have all kinds of fun debating whether this is the right list to use; it’s obviously going to have some skews. Anyone with a similar list can either hook me up and I’ll give it a whirl, or download the code and run it themselves. This list did get me 382,860 certificates from the public internet, though, so that’s pretty okay.
What I Can Do With It
For each host in the list, I currently record:
- Whether the connection succeeded or not, and if not, why
- The verification result, using the Mozilla list of CAs
- Various and sundry connection/certificate details (subject, issuer, cipher, keylength)
- The PEM-encoded end-entity cert, for post-analysis
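To make that concrete, here’s a minimal sketch of what a single probe could look like in Python. It is not the crawler’s actual code: the `certs` table, its columns, and the `probe` and `store` helpers are names made up for illustration, and it verifies against the system trust store rather than the Mozilla list of CAs.

```python
import socket
import sqlite3
import ssl

# Hypothetical schema; the real crawl database may be laid out differently.
SCHEMA = """
CREATE TABLE IF NOT EXISTS certs (
    host      TEXT PRIMARY KEY,
    connected INTEGER,  -- 1 if the handshake completed, else 0
    error     TEXT,     -- why the connection or verification failed
    cipher    TEXT,     -- negotiated cipher suite name
    key_bits  INTEGER,  -- secret bits of the negotiated cipher
    pem       TEXT      -- PEM-encoded end-entity certificate
)
"""


def probe(host, port=443, timeout=10.0):
    """Attempt one TLS handshake with host and return a dict of details."""
    record = {"host": host, "connected": 0, "error": None,
              "cipher": None, "key_bits": None, "pem": None}
    # Assumption: the default context trusts the system CA store.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                name, _protocol, bits = tls.cipher()
                der = tls.getpeercert(binary_form=True)
                record.update(connected=1, cipher=name, key_bits=bits,
                              pem=ssl.DER_cert_to_PEM_cert(der))
    except OSError as exc:
        # ssl.SSLError, timeouts, and DNS failures are all OSError subclasses.
        record["error"] = f"{type(exc).__name__}: {exc}"
    return record


def store(conn, record):
    """Insert or overwrite the row for one host."""
    conn.execute(
        "INSERT OR REPLACE INTO certs VALUES "
        "(:host, :connected, :error, :cipher, :key_bits, :pem)", record)
    conn.commit()
```

With this sketch, a failed verification aborts the handshake, so the row keeps only the error text. Usage would be something like `conn = sqlite3.connect("crawl.db")`, `conn.execute(SCHEMA)`, then `store(conn, probe(host))` for each host in the list, where `crawl.db` is a placeholder filename.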
What that means is that I can answer some reasonably relevant questions. For instance, during the recent excitement over MD5 weaknesses, we have been having conversations about retiring MD5 as a supported signature algorithm as, I’m certain, have the other browsers. Making that decision in an informed way requires us to understand how much of the internet still relies on MD5, though, and when those certificates will expire.
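As a rough illustration of the MD5 question, here’s the kind of query I have in mind. It’s a sketch under assumptions: it supposes a `certs` table with a `sig_alg` column (values like `md5WithRSAEncryption`) and a `not_after` column stored as ISO 8601 text. Those names are hypothetical, and the real database may need the PEM data parsed first to get at those fields.

```python
import sqlite3


def md5_by_expiry_year(conn):
    """Return (total_certs, {expiry_year: count_of_md5_signed_certs}).

    Assumes hypothetical columns: sig_alg (text) and not_after
    (ISO 8601 text such as '2010-06-30T23:59:59').
    """
    total = conn.execute("SELECT COUNT(*) FROM certs").fetchone()[0]
    rows = conn.execute(
        "SELECT substr(not_after, 1, 4) AS year, COUNT(*) "
        "FROM certs "
        "WHERE lower(sig_alg) LIKE 'md5%' "
        "GROUP BY year ORDER BY year"
    ).fetchall()
    return total, dict(rows)
```

Dividing the sum of the per-year counts by the total gives the share of certs still relying on MD5, and the per-year counts show how quickly that share would shrink through expiry alone.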
What Can You Do With It?
There are almost 400,000 certificates, all told, in a big SQL-queryable database. Want to see a total breakdown of certs by issuer? By verification code? Want to see the distribution of key lengths? Or cipher suites?
Maybe you write software that processes certificates – want 400,000 real world examples to test against? As far as I know, this kind of data hasn’t been available before without paying for it, so I’m actually really interested to know what you can do with it.
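Each of those breakdowns is a one-line `GROUP BY`. Here’s a small sketch of a helper that does it from Python. The table name `certs` and the column names in `ALLOWED` are assumptions for illustration, so check the real schema first (for example with `.schema` in the sqlite3 shell).

```python
import sqlite3

# Hypothetical column names; adjust to match the actual database.
ALLOWED = {"issuer", "verify_code", "key_bits", "cipher"}


def breakdown(conn, column, limit=10):
    """Return up to `limit` (value, count) pairs, most common first."""
    # Column names can't be bound as SQL parameters, so whitelist them.
    if column not in ALLOWED:
        raise ValueError(f"unknown column: {column!r}")
    query = (f"SELECT {column}, COUNT(*) AS n FROM certs "
             f"GROUP BY {column} ORDER BY n DESC, {column} LIMIT ?")
    return conn.execute(query, (limit,)).fetchall()
```

So `breakdown(conn, "issuer")` would give the ten most common issuers, and `breakdown(conn, "key_bits", limit=50)` the distribution of key lengths.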
How Much of the Internet is this?
Good question! Other estimates I’ve seen put the total population of servers responding to SSL hails out there at about 1-4M. Based on that, I’d say this is probably about 10-20% of the secure internet, but I wouldn’t try to use this data to make magnitude assessments; it’s better suited to proportions and comparative work, really. If my estimates are close, then this is a big enough chunk of the total population to produce pretty good data. Remember, it will exhibit the skews you’d expect from sampling the more popular sites, but that will also serve to weight it towards the certs people are more likely to see. If that’s not the bias you want, use a different list!
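The arithmetic behind that coverage guess is easy to check. A quick sketch, taking the 1-4M population range at face value (the `coverage` helper is just for illustration):

```python
def coverage(sampled, total):
    """Fraction of the estimated population that the sample covers."""
    return sampled / total


CERTS = 382_860  # certificates collected in this crawl

for total in (1_000_000, 2_000_000, 4_000_000):
    print(f"{total:>9,} servers -> {coverage(CERTS, total):.1%}")
```

A 10-20% share corresponds to a total population of roughly 2-4M servers; if the true number is closer to 1M, the crawl covers proportionally more.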
- The database (gzip’d, sqlite3 format, 367MB) for my Jan 15 crawl.
- The code I used to gather it. (It needs love, love it.)
- If you want to play with something a little less gargantuan, I’ve also put up a trimmed version with just the top 10,000 (gzip’d, 3MB).
If you find good stuff in here, I hope you’ll leave a comment letting me know. If you can’t access the database or download the file and want me to run a query against it, let me know. I only connect to each host once, and all I do is open an SSL connection, so the load on the servers is negligible. The load on my machine while running it, though, can be heavy. To keep the interruption to a minimum, I use conservative settings which make a crawl take about 40 hours, so until I have it running in parallel on a rack of excitingly fast machines, please understand that requests for re-crawls will take a while.