If you are a regular reader of this blog, you know that outages of the site are very, very rare, with the past year showing almost 99.9% uptime. Well, if you were around here this AM, you know that the uptime record was killed, with the sites being down for almost 10 hours today. In this post I'll talk a little about my experience, as I know from conversations this morning that a number of other people across the world have experienced similar issues.
First of all, I want to give a bit of information on exactly how the problem started to manifest itself, as my server was actually impacted by two separate issues, both of which appear to trace back to Windows Updates.
I followed a link this morning after seeing a Pingdom alert that my site had been down, keyword "had", so I didn't think much of it. When I pulled up the site I got the well-known "DotNetNuke Under Construction" page. That is when a bit of panic started to set in. Yes, I have good backups of everything, and I test them once a month, but what was going on? I started pulling up other sites on the box, and all of them were having the same issue, even my non-DotNetNuke sites.
Problem 1 - SQL Server
As I started to dig into this, I knew that the most common cause of the Under Construction page is the database server not being available. So after remoting into the server I launched SQL Server Management Studio (SSMS) and tried to connect. Failure! Nothing was responding; I couldn't connect at all. I opened the "Services" window and found that SQL Server wasn't started: it was set to Automatic, but it was not running. I started the service, and SSMS started to work just fine.
Hopeful that this was my only issue, I tried the sites again. Still no luck, even with applications that I hadn't tried yet this AM.
Problem 2 - IIS
For whatever reason my websites had given up trying to communicate with the database. I decided to try an iisreset to get things going. Amazingly, after doing this, and with SQL Server running, the sites were finally back up. There were a few lessons learned from this, though, that I thought would be worth sharing.
Overall, from a website perspective, uptime and performance are always key considerations. If your site isn't available or isn't performing well you will lose traffic, and typically you will lose it quickly. As part of today's activities I've implemented a few additional changes to my processes that I will share here. I had monitoring in place, but that monitoring failed and told me my sites were back up.
Lesson 1 - Use Smart Uptime Monitoring
As I mentioned before, I had monitoring in place that actually pinged the sites on the server to monitor the uptime. However, these were what are known as "simple" checks: they request a URL, and if the response is HTTP 200 "OK" the site is shown as up. This caught the reboot of the server last night. However, when the site was showing the DotNetNuke Install Configuration page, the check thought the site was up, as that page also returned an HTTP 200.
As a preventative measure, to ensure that I know about an outage where the desired content isn't being shown, I switched to a check that also looks for a specific string, in my case the title of the page being checked. This will ensure that even if my site is responding, if it isn't showing my homepage I'm going to get an outage notification.
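For anyone who wants to roll their own version of this kind of check, here is a minimal sketch in Python. It is only an illustration of the idea, not how my monitoring service does it; the function names, the URL, and the expected text are all made up for the example.

```python
import urllib.error
import urllib.request


def is_healthy(status, body, expected_text):
    """Return True only if the status is 200 AND the expected text is in the body."""
    return status == 200 and expected_text in body


def check_site(url, expected_text, timeout=10):
    """Fetch url and apply the content check; any network or HTTP error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses (4xx/5xx).
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with your own URL and a string
    # that only appears on the real homepage (for example its title).
    print(check_site("http://www.example.com/", "Example Domain"))
```

The important part is the `is_healthy` function: a placeholder page that returns HTTP 200 still fails the check because the expected string is missing.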
Lesson 2 - Monitor SQL Server
I don't have a valid solution for this one yet. I believe my hosting provider should have one, but I have not gotten anything figured out so far. However, SQL Server is a critical piece of this puzzle, and if it is down I want to know about it as well. Yes, I'll know that it is down because all of the sites will be down, but if I can quickly see "SQL's down" I can fix that problem right away and hopefully restore service sooner than if I have to diagnose things step by step.
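As a rough stopgap, and not something I have put in place, one option would be a small script that simply checks whether anything is listening on the SQL Server port. The sketch below assumes a default instance on TCP port 1433 that is reachable from wherever the script runs; the function name and host are invented for the example. Note that an open port only tells you something is listening, not that SQL Server is actually accepting logins.

```python
import socket


def sql_port_open(host, port=1433, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connection, timeout, or unreachable host all count as down.
        return False


if __name__ == "__main__":
    # Hypothetical host: replace with your own database server.
    if not sql_port_open("127.0.0.1"):
        print("SQL's down")
```

Run on a schedule, something like this would at least separate "the database is down" from "the web server is down" without having to remote into the box first.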
I hope that this information has been helpful. For the rest of you that were impacted by this today, I hope you were able to get everything back up and running quickly. I know we have adjusted our auditing practices on servers to help prevent this type of issue in the future. The only other note that I have on this is that I manage 3 servers, all with the same configuration, and this issue only impacted 1 of the 3. I'm not sure of the exact reasoning behind it all, but I wanted to share what I know!