Have you ever had an SSL certificate expire before you provisioned a new one? Had a production outage caused by some glitch in your renewal process that caught you off guard? I’ve had my share of those, and decided to resolve these issues once and for all with a side-project app. In this post, I’ll share my motivation and early findings as I try to make SSL certificate expirations a thing of the past. I may even surprise some of you with the existence of Certificate Transparency, which affects you even if you aren’t using it.
I’ve held Engineering Manager positions for a few years now, and find myself drawn to the operational aspects. I enjoy seeing a piece of software run smoothly and generate value for its users, consistently and predictably. I am also big on security, where a lot of operational aspects come to light. It’s not enough to set up a proper environment that passes some security tests. You need to have processes in place to keep it that way, especially as time passes and changes are introduced. SSL certificates provide various security benefits, but they come at an operational cost: they need to be kept up to date, or they will stop functioning. I will focus on one simple aspect that is part of every SSL certificate out there: its expiration date.
The expiration date of an SSL certificate is set when the certificate is issued. To remain available, a service that uses SSL certificates to prove its identity needs to repeat the issuing process at a known time in the future. Sounds simple, right? That’s exactly what I thought until I stumbled upon an unpleasant surprise. An SSL certificate issued for one of my company’s user-facing services had expired before we got the chance to renew it. All the rocket science that went into inventing these protocols and standards couldn’t stand up to a simple case of forgetfulness. Problem between keyboard and chair, as they say.
The world of operational automation, SSL certificate management included, has seen a lot of progress in the past years. Driven by security awareness and by embracing more ‘dev’ in ‘ops’, major cloud providers and actors such as Let’s Encrypt are taking the manual ‘sting’ out of SSL certificate provisioning. You don’t even need to copy-paste a newly issued certificate and restart your server. While this automation is bliss to work with, I argue that it only emphasizes the need to monitor its correctness. The more you expect of your service provider, the more you need to ‘trust but verify’.
There must be a solution to this, right?
I was determined to make sure the production outage induced by my forgetful mind wouldn’t repeat itself. There are plenty of monitoring solutions that alert on the upcoming expiration of an SSL certificate. We already had basic uptime monitoring in place, which could also check the SSL certificate every time it checked on an app. But we were constantly spinning up new microservices and issuing new SSL certificates, and I didn’t want to manually update my monitoring service every time we brought up a new microservice.
I also didn’t want to conflate uptime monitoring with SSL certificate monitoring. Uptime monitoring is very sensitive: I want to be alerted the moment my app isn’t available, and every check is as important as the previous one. Business-wise, these services usually price themselves by check frequency and number of checks. SSL certificate monitoring, on the other hand, is not sensitive at all. Every check basically re-confirms the already known time remaining until the certificate needs to be renewed. If a check fails due to some communication error, I don’t want to be alerted in the context of the underlying SSL certificate.
Instead, I basically want to set a timer for every SSL certificate I have and have it ‘ring’ on time: early enough for me to renew the certificate without hassle, and late enough that I don’t spend too much time renewing certificates. Most importantly, I want this to be a ‘fire and forget’ service, so that I don’t have to constantly check and wonder whether a live certificate is somehow not being monitored yet.
This is when I found out about Certificate Transparency. In short, every SSL certificate that is issued leaves a publicly available audit trail. This is an amazing source of information, even though it scared me quite a bit at first. We weren’t keeping any secrets in the SSL certificates we were issuing. But realizing that all the internal-but-publicly-accessible hostnames used by the team are known to anyone on the Internet was new to me. It is as if anyone could traverse our domain’s DNS entries, learning about all the hostnames we have A and CNAME records for.
Certificate Transparency can be very valuable for the monitoring use case I was facing. It would allow us to automatically detect all the SSL certificates we were provisioning in our domain and feed our monitoring service with new hosts to be monitored. But something was still off about how the uptime monitoring service operated. Not only were we billed by the number of checked URLs, but the monitoring service would also alert us if a newly added hostname did not respond properly to an HTTP request. This is what made it so great for application uptime monitoring. At the same time, it diverged from how we saw our need for SSL certificate monitoring. I did not want some automated process to increase my spend with the service just because 100+ new SSL certificates were issued. I also did not want the service to alert me about some host being down when the SSL certificate was issued in advance and the DNS entry for the new host wasn’t set up yet. All I wanted was to never forget an upcoming expiration of a live SSL certificate.
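
To make the host-discovery idea concrete, here is a minimal sketch in Python. It assumes the public crt.sh search frontend as the Certificate Transparency source and its JSON output with a newline-separated `name_value` field. Both are my assumptions for illustration, as are the function names; this is not necessarily how any particular monitoring service does it.

```python
import json
import urllib.parse
import urllib.request


def fetch_ct_records(domain, timeout=30.0):
    """Ask crt.sh (assumed CT source) for certificates issued under `domain`."""
    # '%' is a wildcard in crt.sh queries; urlencode escapes it as '%25'.
    query = urllib.parse.urlencode({"q": "%." + domain, "output": "json"})
    with urllib.request.urlopen("https://crt.sh/?" + query, timeout=timeout) as resp:
        return json.load(resp)


def hosts_from_ct_records(records, domain):
    """Collect the distinct hostnames under `domain` found in CT records."""
    domain = domain.lower()
    suffix = "." + domain
    hosts = set()
    for record in records:
        # Assumed format: all names of one certificate, separated by newlines.
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                continue  # a wildcard entry is not a host we can connect to
            if name == domain or name.endswith(suffix):
                hosts.add(name)
    return sorted(hosts)
```

Calling `hosts_from_ct_records(fetch_ct_records("example.com"), "example.com")` would yield the list of hostnames to hand over to an expiration monitor, with no manual bookkeeping for each new microservice.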
Building my own solution
These learnings led me to create a service dedicated to this exact problem. haveibeenexpired.com detects and monitors hosts for their SSL certificate expiration. It uses Certificate Transparency to keep itself up to date with new hosts. It notifies users with a variety of team collaboration tools (Slack, Discord and more), and doesn't rely on email. Production outages induced by expiring SSL certificates can finally be eliminated for good! Oh, and the name also pays tribute to Troy Hunt’s amazing haveibeenpwned service.
I launched this service as a side-project on April 1st, 2021. At first, the app only allowed testing a single website for certificate expiration. With some paid traffic (my first AdWords campaign), I gained some confidence that I was looking at a problem others are experiencing as well. I saw strong user engagement, with about 50% of visitors testing a website’s SSL certificate.
I then leaned in and developed an app that would monitor relevant websites for its users. Once signed in, a user provides one or more domains that are of interest to them. The service then periodically checks for public SSL certificates that were issued within these domains. It also extracts additional context from every certificate, monitoring more and more hosts. Another periodic process performs an SSL handshake with every detected host and checks whether the presented SSL certificate is about to expire. The service sends out a notification if the remaining time to expiration is below a configured threshold. This alerts the user to act and renew the certificate on time.
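
The handshake-and-threshold check described above can be sketched with nothing but Python’s standard library. This is an illustration under my own assumptions, not the service’s actual code: the function names and the 14-day default threshold are hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after):
    """Convert a certificate 'notAfter' string into an aware UTC datetime."""
    # The ssl module reports dates like 'Jan  5 09:34:43 2018 GMT'.
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def fetch_not_after(host, port=443, timeout=5.0):
    """Perform a TLS handshake with `host` and return its certificate expiry."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])


def days_remaining(not_after, now):
    """Days (possibly fractional or negative) until the certificate expires."""
    return (not_after - now).total_seconds() / 86400


def should_notify(not_after, now, threshold_days=14):
    """True when the remaining time is below the configured threshold."""
    return days_remaining(not_after, now) < threshold_days
```

A single check would then be `should_notify(fetch_not_after("example.com"), datetime.now(timezone.utc))`. Note that with the default verifying context, a certificate that has already expired makes the handshake itself fail with an `ssl.SSLCertVerificationError`, so a real monitor would need to treat that error as an expiration signal rather than as a communication failure.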
Remaining focused solely on the expiration aspect is hard, but very rewarding. There are many potential needs around SSL certificate validation, such as matching the host where the certificate is served to the Subject Name or one of the Subject Alternative Names of the certificate. There are also other services that offer expiration monitoring as one feature in a much wider monitoring suite. Staying simple and focused is my way of standing out in the crowd. Amazing products I’ve used in the past got diluted as features and capabilities crept in, and became more mediocre as a result. I don’t have the capacity to change the way the SSL certificate ecosystem operates. I just want to make one tiny thing better: reduce and eventually eliminate all outages caused by the untimely expiration of SSL certificates.
Right now, my focus is growing the app in terms of users and fine-tuning the product offering to really solve this problem at scale. I’m just getting started with 63 registered users so far. The service is monitoring ~500 domains and ~5,000 hosts, sending out tens of notifications on average per day. I’m excited to see how far this can go!
Please feel free to sign up at haveibeenexpired.com, or follow me on Twitter for more updates.