DIY Hack: Optimizing AWS Scalability
After focusing strictly on B2B SaaS software for the past few years, we recently catapulted ourselves into the B2B2C space with a new feature called Optitrack. Optitrack can record hundreds of thousands of individual consumer actions on our clients’ websites and mobile apps, simultaneously and in realtime. This gives marketers the ability to execute realtime, trigger-based marketing campaigns keyed to specific customer behaviors and actions.
Unlike our core retention automation platform, which processes very large data sets once per day, Optitrack is a realtime system that needs to handle thousands of concurrent connections with our clients’ customers – millions of events per day.
Moving into the Cloud
Like any good SaaS business, we have a thorough understanding of how our compute infrastructure costs will scale with customer growth. As a B2B shop, we serve hundreds of customers a day and run intensive database batch operations on massive amounts of data. Most of this computing infrastructure is running on dedicated servers hosted by a third-party data center.
However, the intensive server-side resources required by Optitrack can spike very quickly during certain times. For example, usage jumps in the evening when our clients’ customers return home from work and start playing their favorite games, or during high retail shopping seasons.
We knew that we needed to build a system that could react to these spikes in demand by scaling rapidly, yet inexpensively. In other words, it was clear to us that we needed to deploy Optitrack in the cloud. After surveying the market, we decided to develop Optitrack using PHP and MySQL running on Amazon Web Services (AWS) EC2 and RDS.
Once we decided to move to AWS EC2 and RDS, we quickly realized that relying on Amazon’s auto-scaling capabilities as-is was going to get very expensive, very quickly.
Contrary to what many people think, cloud-based infrastructure is not always cheaper than leased dedicated servers in a data center. The reality is that when moving to the cloud, IT usually wants to improve their infrastructure by taking advantage of the effortless new capabilities on offer, such as high availability across multiple geographic regions/availability zones and rapid elasticity to handle peak times better. But these on-demand cloud resources don’t come cheap!
As anyone using AWS knows, there are three different types of compute servers available:
- On-Demand Instances – pay by the hour (the most expensive type)
- Reserved Instances – require medium-to-long-term prepaid leases (which are 30-40% cheaper than On-Demand Instances)
- Spot Instances – auction-based pricing (the least expensive option, typically 60-90% cheaper than the price of On-Demand Instances, but with the downsides of unpredictable availability and sudden terminations when the current price exceeds our bid)
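To make the pricing gap concrete, here is a minimal Python sketch of the blended hourly cost of a mixed fleet. The prices are hypothetical placeholders (real AWS prices vary by region, instance family and over time); they simply assume a Spot discount in the typical 60-90% range described above.

```python
# Hypothetical hourly prices for a single instance type; real AWS prices
# vary by region, instance family, and over time.
ON_DEMAND_PRICE = 0.10  # $/hour for an On-Demand Instance
SPOT_PRICE = 0.03       # $/hour, assuming Spot runs ~70% below On-Demand

def blended_hourly_cost(spot_count: int, on_demand_count: int) -> float:
    """Total hourly cost of a fleet mixing Spot and On-Demand Instances."""
    return spot_count * SPOT_PRICE + on_demand_count * ON_DEMAND_PRICE

# A 10-instance fleet: 9 Spot + 1 always-on On-Demand vs. 10 On-Demand.
mixed = blended_hourly_cost(spot_count=9, on_demand_count=1)
all_on_demand = blended_hourly_cost(spot_count=0, on_demand_count=10)
savings = 1 - mixed / all_on_demand
print(f"mixed: ${mixed:.2f}/h, all On-Demand: ${all_on_demand:.2f}/h, savings: {savings:.0%}")
```

Even with one On-Demand Instance always running, the mixed fleet in this example costs roughly a third of the all-On-Demand fleet.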
*Comparing the price levels of the three AWS server options*
While we needed the benefits of the cloud – and were willing to pay the premium to deliver the service levels that our clients need – we wanted to see if we could keep the expense levels to a minimum. Logically, this would lead us to using Spot Instances.
The challenge I wanted to address was how to get the low-cost benefit of Spot Instances while providing the stability and reliability required to deliver no-compromise service to our clients.
I had an idea, which took a couple of months to explore, test and refine. My idea was to use a mix of Spot and On-Demand Instances, along with a way to balance between the two in terms of cost and reliability. In other words, I wanted to blend the cost-efficiency of Spot Instances with the stability and predictability of On-Demand Instances. (I did not consider using Reserved Instances as an option due to the long-term commitments they require.)
The theory is simple: As long as the current price of Spot Instances is lower than that of On-Demand Instances, we will use mostly Spot machines. However, we will always keep at least one On-Demand Instance running, to ensure that there will be no downtime in the event that all the Spot Instances suddenly terminate (which can theoretically occur if all current Spot pricing exceeds our bids). So, in the event that the Spot Instance prices spike (across all availability zones) and those servers are suddenly taken offline, the On-Demand group will immediately scale up to provide the required capacity.
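The failover math behind this theory can be sketched as a small Python function. This is illustrative only (in practice the balancing is done by auto-scaling configuration, not custom code), but it captures the invariant: at least one On-Demand Instance is always running, and the On-Demand group absorbs whatever capacity the Spot group loses.

```python
def on_demand_needed(required_capacity: int, live_spot_instances: int) -> int:
    """How many On-Demand Instances should be running, given how many
    Spot Instances are currently alive.

    We always keep at least one On-Demand Instance running, so the
    service stays up even if every Spot Instance terminates at once.
    """
    return max(1, required_capacity - live_spot_instances)

# Normal operation: Spot covers almost everything.
print(on_demand_needed(required_capacity=10, live_spot_instances=9))  # 1
# Spot price spike terminates all Spot Instances: On-Demand scales up.
print(on_demand_needed(required_capacity=10, live_spot_instances=0))  # 10
```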
The typical way one would accomplish something like this (and more) would be to develop code that continuously monitors our software’s load requirements, evaluates Spot Instance pricing, and reacts in realtime by rebalancing the available Spot and On-Demand Instances (using Amazon’s EC2 API). However, developing, testing and maintaining this kind of code is itself expensive and time-consuming (and a distraction from developing our primary applications), so I came up with a “hack” type of solution that accomplishes my purposes without writing a single line of code.
How the Hack Works
The solution I came up with leverages the many available configuration settings in the AWS Management Console to ensure that we are always delivering the resources and performance that our clients expect, while taking advantage of the best-available Spot Instance prices at every point in time.
Before I delve into the details, there are three important Amazon EC2 concepts one needs to understand:
- Auto-scaling Groups – These are policy-based configurations that allow us to maintain application availability, along with the ability to automatically scale resources up or down, according to conditions that we define in advance. (Learn more.)
- Elastic Load Balancing – This Amazon service automatically distributes application traffic across multiple Amazon EC2 instances (of all three types), providing us with the capacity and fault tolerance our applications require. (Learn more.)
- CloudWatch – Amazon CloudWatch is a monitoring service for AWS cloud resources and applications. (Learn more.) For this particular hack, we use CloudWatch alarms to trigger auto-scaling policies.
I use the Management Console to define two auto-scaling groups, one for Spot Instances – which provides price optimization – and one for On-Demand Instances – which provides predictable reliability and performance. I place this mix of Spot and On-Demand Instances behind a single Elastic Load Balancer.
The trick is knowing how to balance between the two groups to ensure that we have the necessary reliability and performance, at the lowest price. I came up with the following rules to accomplish this goal:
- I set the maximum bid for the Spot Instances to be equal to the On-Demand Instance price – so that we will never pay more for a Spot Instance than for a more reliable On-Demand Instance.
- I set the threshold for scaling up the Spot Instance group below the threshold for scaling up the On-Demand Instance group – so that the system will always first add capacity using the lower-cost Spot Instances. For example, I set the Spot Instance group to automatically scale up when CPU utilization exceeds 65%, whereas the On-Demand Instance group will only scale up when CPU utilization exceeds 75%.
- I set the threshold for terminating instances in the opposite manner, such that On-Demand Instances will be terminated before the cheaper Spot Instances.
- I create an auto-scaling policy for the On-Demand Instance group based on the overall response latency reported by the Elastic Load Balancer. This ensures that our first priority is achieved, namely that the quality of service always remains above a pre-determined threshold.
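Taken together, the rules above amount to a simple decision table. The Python sketch below encodes the staggered CPU thresholds from the second rule – it is not actual AWS configuration (the real implementation is nothing but Management Console settings), just an illustration of why the cheaper Spot group always grows first.

```python
SPOT_SCALE_UP_CPU = 65.0       # Spot group scales up first (cheaper capacity)
ON_DEMAND_SCALE_UP_CPU = 75.0  # On-Demand group scales up only under heavier load

def scale_up_actions(cpu_utilization: float) -> list:
    """Which auto-scaling groups add capacity at a given CPU utilization.

    Because the Spot threshold is lower than the On-Demand threshold,
    the Spot group always grows before the On-Demand group does.
    """
    actions = []
    if cpu_utilization > SPOT_SCALE_UP_CPU:
        actions.append("scale up Spot group")
    if cpu_utilization > ON_DEMAND_SCALE_UP_CPU:
        actions.append("scale up On-Demand group")
    return actions

print(scale_up_actions(60.0))  # [] - no scaling needed
print(scale_up_actions(70.0))  # ['scale up Spot group']
print(scale_up_actions(80.0))  # both groups scale up
```

Scale-down works in mirror image: the termination thresholds are ordered so On-Demand Instances are removed before Spot Instances.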
With this approach, the On-Demand group serves as a failsafe backbone – there will always be at least one On-Demand Instance operating. This guarantees that, no matter what happens to Spot Instance pricing, our service will always be operational.
The above rules are implemented using a combination of auto-scaling settings and CloudWatch alarms – without requiring a single line of code.
AWS provides the dynamic scalability and flexible configuration that we needed to deploy our realtime Optitrack system in the cloud, at a reasonable price point, and without compromising on performance or reliability. The trick was using the options available in the AWS Management Console to optimally configure the system for the ideal mix of performance, reliability and price.