Monday, April 06, 2009

 

Oi - What a weekend!

So as I'd mentioned, this was the big weekend for work...really, the culmination of almost 16 months' worth of effort. Going into it, we certainly expected to have a busy / challenging night, but I had no idea just how busy it would be...

There are / were a few "background" challenges:
- We had a limited window of time to do the work...stores in Vancouver close at 9pm their time, so midnight here. Stores in Newfoundland open at 11am, so 8:30 here. To make the changes, the whole website had to be turned off to the public, which again means the less downtime the better.

- We were making massive changes from a "data" standpoint...we literally moved and merged over 25 million rows of data representing > 4 million customers. It took more than 4 hours with some HARDCORE servers to actually just move the data.

- The sheer number of different systems and servers involved. We modified / updated roughly 10 totally independent applications, ranging from the core systems for creating accounts, logging on, buying items and managing your account, to the systems that customer support uses (to look up orders, handle refunds, detect fraud), to corporate orders, to the applications that send emails (4 totally separate apps). To make things more interesting, because it's such a huge site, most of the applications are load balanced across many different machines...we probably touched 20-25 machines total. (The site can handle lots of concurrent customer requests quickly by balancing the requests out to the 8 different sets of servers.)

But we were ready...we'd tested the snot out of this puppy and were ready to roll. Super detailed plan, and probably about 30 people total involved with the deployment (although most were working the 5am - ... shift).

A few highlights:
- By 7:30am things mostly looked good. There'd been 1 or 2 complaints about error pages, but we couldn't track them down.

- At 8:30am we turned the site back on for the public....and started to get complaints about error pages, as well as 4 or 5 other major issues (about 10% of people couldn't view their order history, etc). I was one of ~3 tech-oriented people trying to track down, isolate and fix the issue....except I had no idea what was causing the problem!

- By 10am traffic started to really pick up as people woke up and came online....started getting ~10 customers per minute seeing error screens, and roughly 5-10 customer complaints to support every 15 minutes, mostly about not being able to place orders. We figured out that one of the problems was a configuration issue on all 8 main servers, so fixed that problem.

- On the conference calls we were having with the VPs every 2 hours, the highlight was the "Looks like in the past hour our sales are down 25% vs last weekend at this time...", then the inevitable "When will this be fixed" question directed over my way ; )

- Finally figured out that the main problem was 1 of the 8 servers....still don't really know WHY it was having problems, but we disabled the server and the errors stopped more or less completely. We were able to test and push out a fix for 4 of the other issues (order history, etc).

- Finally wrapped up and headed home....at 3:45pm. Slept straight from 6pm -> 7:15 this morning. Expect today to be another drama-filled day.

Comments:
Have you checked your code to be sure you didn't accidentally put in an open bracket when it should have been a close bracket? I hear that can really F-up a program!

;)

Go Greg Go!!!
 
Congrats dude!

The site looks good, but I'm still going to use the library. :)
 
I as well am a fan of the library. Working in TO gives me access to the entire TO library system!
 