The perils of no pre-production testing

FASTER! FASTER! FASTER!

This is a war cry we all know too well these days with the advent of “Web 2.0”. How quickly a service can be brought to market can make or break a startup. The benefits of first-mover advantage can’t be discounted, especially when buzz and eyeballs are so important for building a brand and growing a userbase.

However, with the desire to move to shorter development cycles and get to the “production” environment more quickly, companies are finding ways to take shortcuts as compared to traditional methodologies. Most services at Microsoft have at least 2 staging environments prior to deploying to the “live site”.

image

As our development team reaches code complete, we deploy our bits to the test cluster, where our QA team goes through a test pass. After meeting our test pass exit criteria, we roll the bits into an intermediary staging environment, “Pre-production”. This staging environment matches our production hardware as closely as possible, and we use it to do final acceptance testing. This model, while seemingly heavy-weight, helps ensure the highest quality release when we finally deploy to our live site (aka “Production”). It’s particularly important when your product integrates with several services that you don’t own. In our case, an example would be Windows Marketplace’s integration with Windows Live ID.

I have often been frustrated when I want to go to production faster with a feature that is seemingly “small”, but I have to remind myself that the rigor in staging our releases is worth it if we’re ensuring a higher quality product, even at the expense of a slower time to market.

However, I do think many teams (my current one included) could work on a model that is better at identifying the lower-risk features and perhaps rolling those directly to the live site in a throttled manner. For example, roll out a feature so 5% of our users see it, and if it breaks, roll it back. Otherwise, increase the rollout slowly over a period of time until 100% of the users are using the new feature. The entire time, make sure we monitor the snot out of things.
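To make the throttling idea concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not how Microsoft or Facebook actually does it: the function names, the 1% error threshold, and the 5-point ramp step are all hypothetical. Each user is hashed into a stable bucket from 0 to 99, so the same users stay in the rollout as the percentage grows, and the rollout drops back to 0% if the monitored error rate gets too high.

```python
import hashlib


def bucket_for(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hashing the feature name together with the user id means different
    features expose different slices of the userbase.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user should see the feature at the current rollout."""
    return bucket_for(user_id, feature) < rollout_percent


def next_rollout_percent(current: int, error_rate: float,
                         max_error_rate: float = 0.01, step: int = 5) -> int:
    """Decide the next rollout percentage from the monitored error rate.

    The threshold and step are hypothetical defaults: roll back to 0% if
    errors exceed the threshold, otherwise ramp up by `step`, capped at 100%.
    """
    if error_rate > max_error_rate:
        return 0  # roll the feature back
    return min(100, current + step)
```

Because a user's bucket never changes, anyone who saw the feature at 5% still sees it at 50%, which keeps the experience consistent while the rollout ramps. Where `error_rate` comes from is the monitoring part, and it is left out here on purpose.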

Facebook has been given accolades for how quickly they get to market with their new features. When I first started using it last year, I was impressed with how much functionality they rolled out in a given week. From Facebook’s job site:

Our development cycle is extremely fast, and we’ve built tools to keep it that way. It’s common to write code and have it running on the live site a few days later. This comes as a pleasant surprise to engineers who have worked at other companies where code takes months or years to see the light of day. If you work for us, you will be able to make an immediate impact.

In speaking with several people who know engineers at Facebook, I gather they don’t have the same QA process as Microsoft and instead often go straight to production and use throttling to control a feature’s exposure. Sounds great!

However, based on my experiences over the past 6 weeks with Facebook, the pitfalls of going “Faster! Faster! Faster!” are showing in a string of very visible problems. Some juicy examples:

No profile anyone?

I’m signed in, but am invisible and profile-less.

image

No news feed items?

My biggest beef is usually that there are so many news items that I can’t even begin to sift through them. But this is ridiculous 🙂

image

Site maintenance in the middle of the day anyone?

Seems like an odd time to have a planned outage, given Facebook is in the same timezone as I am (PST) and it’s smack dab in the middle of the day (11:35am).

image

30 mins later, same maintenance message, but surprise! Looks like they’ve authenticated me and can tell me how many messages I have in my inbox. Something is amiss.

image

Awkward error messages

I got this juicy one earlier today when I tried to confirm a new friend request. Some debugging message that is getting piped through to the FE users?

image

I’m definitely not saying that Facebook’s process is bad and Microsoft’s is better. Just that there has to be a happy medium: doing the appropriate amount of testing and monitoring to ensure we’re striking the right balance between quality and time-to-market. In our quest to ship more features faster, we shouldn’t lose sight of the fact that we can’t screw our end users; otherwise there is no reason to ship our products at all.