So to give you some backstory: when we re-launched the site with stats functionality in April, we moved our entire platform to Heroku, which runs on AWS. AWS is generally a rock solid Infrastructure as a Service, and Heroku is a rock solid Platform as a Service, so we have been extremely confident and proud of our site's uptime and reliability since April. We even boasted in our public support forums that we hadn't been down for more than 60 consecutive seconds since moving to Heroku.
Then Friday morning happened.
Everything that could go wrong, did. And then some. And this happened just 4 hours before Windmill Windup was supposed to start.
At midnight Eastern time (6am in Amsterdam), my phone started going nuts. I was sleeping, so I missed the first few voicemails and texts, but at about 12:45 I woke up to my phone buzzing: it was the guys from Windmill Windup in Amsterdam calling me in a panic, saying Leaguevine was down. I bolted up and went through all the regular fixes - checked the logs, restarted the servers, re-deployed. The site came back online for about a minute, but then it was gone again. Completely inaccessible. Crap. Definitely no sleep for me tonight.
Chris Schaffner, a rockstar developer from Amsterdam, quickly pointed out to me that both Heroku and AWS were completely down. This is very rare for them, as they boast 99.95% uptime, but apparently a huge power failure had taken down a bunch of other high profile sites as well.
There was no telling when Heroku would be back up, and Windmill was relying on Leaguevine for score reporting and swiss scheduling. The first effort was just to get the site on Heroku back up. This turned out to be impossible, as the command line interface had been completely disabled by Heroku.
So the first real effort was to move all our code, assets, and database onto a completely different server. No problem, right? I'll just use my existing Webfaction account and put everything up on there. I set everything up, cloned my git repo, pointed the DNS records to this new site, and configured my settings to work on their servers.
Now we just need to get the database up and running. First we need to push it onto the servers. But wait, why is our database backup so big? It's 9x bigger than it was at noon on Thursday. What the heck? Oh well, let's just push it and load it and deal with fixing whatever happened later. Our database is pretty big because of stats so this took a good half hour, but it succeeded and we were ready to load it and get up and running.
But it didn't load, because the version of Postgres on Webfaction was different from the one on Heroku. It turns out there are special ways to do a Postgres DB dump if you want to make it backwards compatible, so we quickly did this and re-uploaded the dump to Webfaction. If only it were that simple. Doing the dump that way made the SQL file enormous, and the upload was going to take over 2 full hours. I started the upload anyway, knowing it probably wouldn't finish in time for the swiss captains' meeting, and started fiddling with deploying our site onto some other services. It turns out that doing a full deployment from scratch from a huge backup in an hour or two is hard. And I failed.
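For anyone curious what a version-portable dump looks like: the custom archive format pg_dump uses by default isn't portable across major Postgres versions, but plain SQL output (minus owner/ACL statements for roles that may not exist on the target) loads almost anywhere. This is just a sketch of the general technique, not our exact commands; the function and file names are illustrative.

```python
def portable_dump_cmd(db_url, out_file):
    """Build a pg_dump invocation whose output an older Postgres can load.

    Plain-format SQL is restorable via psql on any version, at the cost
    of a much larger file than the custom archive format -- which is
    exactly why our upload ballooned to a 2-hour transfer.
    """
    return [
        "pg_dump",
        "--format=plain",   # plain SQL, loadable with psql anywhere
        "--no-owner",       # skip ALTER OWNER for roles the target lacks
        "--no-acl",         # likewise skip GRANT/REVOKE statements
        f"--file={out_file}",
        db_url,
    ]
```

Piping that output through gzip before uploading would have cut the transfer time dramatically; plain SQL compresses very well.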
Then there was a glimmer of hope. Heroku's site said it was back up! I quickly pointed my DNS records back to Heroku, and started trying to do all the regular stuff on Leaguevine. I could access all my CLI tools and read logs and stuff. But the site was still completely down. Apparently our database was one of the unlucky few that weren't brought back online right away, and with just 30 minutes until game time, it looked unlikely that we'd be back up and running in time.
The Windmill Windup guys stepped up big time at this point and decided to run the first few rounds using excel spreadsheets with matchups calculated by hand. These guys are amazing, and were super supportive during this outage when they should have been outraged.
Eventually I was able to get through to Heroku support, and they worked one-on-one with me to bring our database back online. This took about 30 minutes. We created a follower, synced it to the master database, and then promoted the follower to be the new master.
Phew, back online. People were submitting scores, and we had an instant rush of users who were probably trying to get online all morning. But we were getting complaints that people weren't receiving their registration emails. And worse yet, the site was so slow that the Windup guys couldn't create new swiss rounds without timing out. Oh, and no stats were being calculated.
Roger tackled the email problem, and I worked on the slowness. We decided we'd get to the stats after the other two were fixed.
The email problem appeared to be completely unrelated to the Heroku outage, as we had been using the Webfaction email servers to send email. It just wasn't working for some reason. Most emails were not being sent (but some were... weird). And by noon Eastern, 48 people had already tried creating accounts without receiving the necessary activation email. We gave up trying to fix this after 20 minutes and decided to just move all our email over to SendGrid. Roger managed to do this pretty quickly and got our registration system back online. We then manually activated those 48 accounts and sent out apology emails to the folks who experienced these difficulties.
The slowness, however, wasn't so easy to fix. We revamped some code to speed lots of things up, and spent the entire day heads down at our computers doing whatever we could. We had to manually add and fix the swiss rounds for the Windmill tournament because the web interface was often timing out for them. Everything was just so slow. By the end of Friday we still hadn't fixed the slowness problem. The site was just barely fast enough to be usable, but with Wisconsin Swiss coming up on Saturday, this would be unacceptable: they only had 15 minutes between rounds instead of the 45+ minutes that Windmill Windup had.
So we stayed up another night coding and working on this issue. We did absolutely everything in our power to make database writes as fast as possible. The frustrating thing was that on our staging server the code was working beautifully and plenty fast, but our production server was slow even when traffic was low.
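One of the most common write-speed fixes (and one we leaned on) is batching many inserts into a single transaction instead of committing row by row, since each commit forces a disk sync. This sketch uses sqlite3 just to keep it self-contained; the table and function names are made up for illustration, not our actual schema.

```python
import sqlite3

def write_rows(conn, rows, batched=True):
    """Insert rows either in one transaction or one commit per row.

    Batched mode pays for a single fsync instead of one per row, which
    is where most of the per-write cost goes on any SQL database.
    """
    cur = conn.cursor()
    if batched:
        with conn:  # one transaction, committed on exit
            cur.executemany("INSERT INTO scores(points) VALUES (?)", rows)
    else:
        for row in rows:
            cur.execute("INSERT INTO scores(points) VALUES (?)", row)
            conn.commit()  # one fsync per row -- painfully slow at scale
```

The same idea applies to Postgres on Heroku: wrap bulk stat updates in one transaction, or use COPY for very large loads.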
One thing that improved performance noticeably was solving the issue with our enormous database size. Turns out on Thursday a user had been goofing around with our API, testing things out and deleting the stuff he created, and this triggered one of our background processes to execute in an endless loop. This process did some data bookkeeping and ended up creating a couple million extra records. Worse, it created a backlog of 70,000 worker processes in our task queue, so the background workers that process stats never made it to the front of the line. This turned out to be an easy fix, and helped with performance. But only slightly. At least it fixed stats.
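The general defense against this kind of runaway loop is an idempotency guard on the queue: refuse to enqueue a job whose key is already pending, so a task that keeps re-triggering itself can't pile up 70,000 deep. Here's a toy sketch of the idea; the class and key names are hypothetical, not our actual task system.

```python
import collections

class GuardedQueue:
    """Toy task queue that refuses duplicate pending jobs."""

    def __init__(self):
        self._pending = collections.deque()
        self._pending_keys = set()

    def enqueue(self, key, fn):
        # A runaway process that re-enqueues the same bookkeeping job
        # (e.g. "bookkeep:team-42") gets silently dropped here instead
        # of building a massive backlog.
        if key in self._pending_keys:
            return False
        self._pending_keys.add(key)
        self._pending.append((key, fn))
        return True

    def run_all(self):
        results = []
        while self._pending:
            key, fn = self._pending.popleft()
            self._pending_keys.discard(key)
            results.append(fn())
        return results
```

Most real queue libraries offer something equivalent (unique job keys or locks), which is the first thing worth enabling for self-triggering background work.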
This whole time we had to be at our computers communicating with the TDs of both tournaments, because they needed our system to create the next swiss round. And game play lasted from 3am to 5pm Eastern time on Saturday. So I finally got to sleep at 4 or 5 on Saturday.
And then I woke up on Sunday and everything magically worked. Our site was back to its normal speed, and apparently Heroku was fine.
In hindsight, we should have been more prepared. The AWS outage might have been beyond our control, but there were other things we could have done. First, we should have noticed that our database was growing unreasonably fast before we went to sleep on Thursday. Had we caught it then, our re-deploy would have been faster and we might have been back up before games started on Friday. To fix this, we're building better real-time monitoring and notification tools into our app.
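The core of such a check is simple: sample the database size periodically (on Postgres, `pg_database_size()` gives the number in bytes) and alert when recent growth blows past a threshold. This is a minimal sketch of the alert logic only; the ratio and function names are illustrative, not what we actually shipped.

```python
def growth_alert(samples, max_ratio=2.0):
    """Return True if the database grew suspiciously fast.

    samples: chronological list of (timestamp, size_in_bytes) readings.
    max_ratio: how much growth between the first and last sample is
    tolerated before alerting (2.0 = doubled). Our Thursday incident
    was a 9x blowup, which a check like this would have flagged early.
    """
    if len(samples) < 2:
        return False
    (_, first), (_, last) = samples[0], samples[-1]
    return first > 0 and last / first > max_ratio
```

Hooked up to a cron job and an SMS/email notifier, a few lines like this would have woken us up Thursday night instead of Friday morning.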
Second, our site shouldn't have been so slow in the first place. When Heroku was struggling, page loads were taking about 5x longer than usual. If page loads and data writes normally took 1-2 seconds, this wouldn't have been an issue. But because creating a round already took a good 5-8 seconds, we were hitting Heroku's 30-second request timeout, and rounds weren't always being created successfully. We're actively improving the speed of every aspect of our site, and it's more apparent than ever how important speed is.
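Beyond raw speed, the standard way to stop a slow operation from ever hitting a platform request timeout is to move it out of the request entirely: kick off the work in the background, return a job id immediately, and let the client poll for the result. A bare-bones sketch of that pattern (the names and in-memory job store are hypothetical; a real app would use a task queue and database):

```python
import threading
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}

def start_round_creation(create_fn):
    """Run slow round creation in the background; return a pollable id.

    The HTTP handler calling this returns in milliseconds, so a 5-8
    second (or longer) round computation can never trip a 30-second
    request timeout.
    """
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}

    def run():
        jobs[job_id]["result"] = create_fn()
        jobs[job_id]["status"] = "done"

    threading.Thread(target=run).start()
    return job_id
```

The client then hits a status endpoint with the job id until `status` is `"done"`, which also gives you a natural place to show a progress spinner instead of a hung page.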
Anyway, we are sincerely sorry for all of the stress we caused you over this weekend with your swiss tournaments. Thankfully the TDs this past weekend were incredibly competent and resilient and made it seem like there weren't any issues with scheduling rounds. For all you other players out there, thanks for bearing with the slow speeds and sticking with Leaguevine Mobile to enter your scores.
We truly do have the best community of users that we could ever ask for. We love all you Ultimate players, and are crossing our fingers that when we venture into other sports we end up with users who are as passionate, kind, and understanding as you all have been.