Shakespeare might not approve of my taking liberties with his play Hamlet, though Prince Hamlet was essentially saying the same thing I was feeling last year:
To be, or not to be, that is the question—
Whether ’tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles…
Would Hamlet go on or cease from all? Yes, I may have felt just as Hamlet did in the Nunnery Scene when I thought about my “sea of troubles” just one year ago. Well, maybe I’m waxing a bit too dramatic, but there were real concerns on my part regarding last year’s events (oh how “smart a lash” that memory doth make!). What was that worrisome memory? Allow me now to retrace my steps to that challenging day and give you context for my soliloquy.
This story begins last fall and there are many actors on the stage of events. Yes, my story begins with our mobile app and our external API and reaches its climax when seemingly all users download their family trees all at once! Oh the misery. Let us begin our tale of woe.
Ancestry.com (the bastion of family history) has an external API that is used by our mobile apps and other strategic initiatives to share and update a person’s family tree, events, stories and photos. Our external API has been most important in our mobile efforts (11 million downloads) and in working with the popular TV series, “Who Do You Think You Are?” Our mobile team had grown our mobile usage so successfully that I began to worry it might actually tax our systems. That concern was beginning to bubble up from my subconscious in the fall of last year, and this leads us to that disquieting day. Last year, our iOS mobile app was promoted in the Apple App Store because of the updates we had made for the release (along with other Ancestry promotions). Those promotions led to large numbers of simultaneous family-tree downloads, which weakened the mighty knees of our backend services. We endured a week of extreme traffic and were forced to throttle (limit usage of) our mobile users; the API team saved the day by throttling usage and thus preserving the backend services. After experiencing that calamitous week, we might well have cried, “Get thee to a nunnery!” or “O, woe is me” but we repressed such impulses.
OK, it wasn’t actually a “calamitous week”; I was just getting into my Hamlet character. The impact to our website was quite minimal, and most of our users had a good experience. However, it was frustrating for many of our mobile users: it took too long for many of them to download their family tree to their mobile device. Still, it really is great news that our mobile traffic has been growing. We realized that we must architect a plan to take us through the next round of application and user growth. Here’s how it happened:
That experience caused us to reconsider how we deploy our mobile apps, how our mobile apps interact with the API, how we call our backend services, how we deliver a tree download and whether we should continue to aggregate our services at the API layer. Each of these areas underwent a review to see how we might optimize our systems. After several months of periodic meetings, discussions and code reviews, a plan began to gel. Below is a list of some of the changes made to our systems and applications:
- Pass Through: Rather than aggregate our services at the API layer, we took the strategy of creating a “pass-through” layer back to our backend services. This put the responsibility directly on our services to further optimize their code, and in some cases, create new endpoints specifically with mobile usage in mind. This methodology also enabled our mobile teams to more effectively cache data according to their needs and Service team recommendations. More on that below.
- Mobile Usage: As our users became more mobile, traffic through our APIs from mobile devices increased. Last fall our mobile usage at Ancestry.com reached critical mass and put serious pressure on our services (especially during big promotions and app updates). Because mobile usage differs from website usage in important ways, it was time to address this in our backend services. After several meetings involving cross-functional teams, a few service calls were designed with mobile usage in mind. One of the results was that downloading your entire family tree became much faster. Downloading a tree with 10 thousand persons (and all their associated events etc.) decreased in time from several minutes to under 1 minute.
- Caching: Because we changed our API model to a pass-through, our mobile app could now cache data from each call at appropriate intervals, thus taking pressure off of our backend services and network. This meant fewer calls (in the long run) to the external API.
- Mobile App Optimization: One area of review was our mobile application. After the code review we theorized that our app might have put undue pressure on our services. What was the root cause? Apple has two new, interesting features:
- Apple can automatically download and install new applications on your iPhone or iPad
- Apple can wake up apps in the background and do tasks
When we released our app last year, we believe Apple automatically downloaded it (onto most Apple devices) and that the app then, in the background, automatically downloaded each user’s main tree. To be sure, this download would have happened anyway once the end user opened the app manually (it was required for that app update), but manual opens would have spread the traffic over several days rather than concentrating it all at once. Of course this is just a theory, but we wanted to ensure it was not happening and would not happen next time.
- User Queuing: As you know, queue is just another word for getting in a line. People get in line to buy a new iPhone or to buy tickets for a concert. That’s what we do when there are too many requests at a given moment. Anticipating high traffic from our new 6.0 mobile app (plus other site promotions at that time of year), we created a new way of throttling excessive traffic. Rather than throttling a percentage of calls to our API (which made it hard for any one user to successfully download their family tree to their mobile device), we created a system called User Queuing, which allowed a certain number of users into our system at one time. Allowing X users into our systems for 10 minutes of uninterrupted usage ensured each would have a pristine experience. It would also protect our backend services from being overloaded. We could adjust on the fly how many users were allowed through our API at any one moment. Thus more individuals would have a better experience, and the others would be invited to return in a few minutes. We would only turn on User Queuing if too many users made requests at the same moment.
- Load Tests: To ensure our systems and new service calls could handle more than the expected peak traffic, we ran them through a gauntlet of load tests. This series of tests ensured we had proper capacity.
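
To make the User Queuing idea above concrete, here is a minimal sketch of how such an admission gate might look. It is not our production code: the class name `UserQueue`, its fields and the in-memory dictionary are all hypothetical, and a real deployment would need shared state across API servers. It only illustrates the three behaviors described above: a cap on admitted users, a 10-minute window per admitted user, and a cap that can be changed on the fly.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class UserQueue:
    """Hypothetical admission gate: lets up to max_users in for a fixed window."""
    max_users: int                               # the "X" from the text; adjustable at runtime
    window_seconds: float = 600.0                # 10 minutes of uninterrupted usage
    enabled: bool = False                        # only switched on under heavy load
    clock: Callable[[], float] = time.monotonic  # injectable so the sketch can be tested
    _admitted: Dict[str, float] = field(default_factory=dict)  # user_id -> expiry time

    def _evict_expired(self, now: float) -> None:
        # Free the slots of users whose window has run out.
        expired = [u for u, expiry in self._admitted.items() if expiry <= now]
        for user_id in expired:
            del self._admitted[user_id]

    def allow(self, user_id: str) -> bool:
        """Return True if this user's request may pass through to the backend."""
        if not self.enabled:
            return True                          # queuing is off: everyone passes
        now = self.clock()
        self._evict_expired(now)
        if user_id in self._admitted:
            return True                          # already inside their window
        if len(self._admitted) < self.max_users:
            self._admitted[user_id] = now + self.window_seconds
            return True                          # a slot was free: admit for a full window
        return False                             # full: ask this user to return in a few minutes
```

In this sketch, a caller would check `allow(user_id)` before forwarding a request and respond with a "please come back shortly" message when it returns `False`. Raising `max_users` while the gate is running admits more users immediately, which mirrors the on-the-fly adjustment described above.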
Now, once our app was approved by Apple, we could have immediately released our app but there were things to consider. Here is how we timed the successful release:
- We received permission from Apple to release our app in the App Store the day before the Apple promotion, which helped take some of the pressure off the release.
- We decided to release at a time of day when we anticipated traffic would be somewhat low.
- We decided to release when our engineers and database administrators were all available in case we needed to react quickly and also to monitor traffic.
Finally, the day arrived and we were ready. All hands on deck. User Queuing ready to trigger. There was great excitement, and there were nerves. How would our systems hold up? Which internal system might buckle under pressure or reveal a previously undiscovered bug? How long after the launch would we need to kick in User Queuing, and how many users would be temporarily turned away by the queue? Did we have enough servers, memory or database throughput? On the other hand, we had tested our code so well, how could it fail? There was much excitement in the air.
All engineers were readied…and…the button was pushed to release our new mobile app!
Did it all collapse? Were there cascading failures? Was the load too much to bear? Did everything explode?
Nothing happened. OK, it seemed like nothing. The load gradually increased over the next few hours, but our systems held up wonderfully. No strain, no collapse, no running low on memory, no bottlenecks. Nothing. Yes, there were a few minor bugs to fix, but most customers had a great experience and it went very smoothly. The time, effort, and planning paid off. It worked!
We were so happy – and relieved. We had done our job. In the coming days several teams went to lunch to celebrate the successful execution of months of planning and work. Some of the engineers actually smiled on the day that nothing happened. Even Hamlet dropped by and asked me a question: “Didst thou not explode with a sea of troubles?” And I said, “not on your life!”