Scaling Microservices on AWS: Our Way to 5 Million Requests per Minute
Have you ever thought about the challenges in the world of cloud computing? As apps grow and more of us use them, a big question pops up: Can microservices handle more work without getting slow? Why should this matter? Because if they’re slow, users get annoyed, and running them costs more.
But here’s another thing: We know about AWS and its many tools and services. Yet, how do we know which ones best fit our app? Do we truly understand our app’s setup, the number of people using it, and the points where it might lag?
In our discussion, we’ll look at our approaches to scale our AWS cloud solution to handle 5 million requests per minute. Our team summarized key insights from a practical R&D use case we investigated during one of our client’s IoT projects. We’ll discuss the significance of load testing and the stages required in its implementation. We’ll look at the microservice architecture, focusing on the individual AWS services that make it possible to build such a framework. We’ll also go over ways to optimize and scale the database effectively.
Why Should You Care?
The goal we have set for ourselves is not unusual in a world where technology evolves daily. Our client had a cloud solution that was constantly communicating with several connected consumer electronics devices and eventually needed to scale up to millions of devices. Scaling was needed in this scenario to manage additional work without failure or delays.
This is a significant issue since it reflects a prevalent reality. Large and small businesses must expand their systems to meet rising needs. The task at hand is not simply one of expansion, but also of maintaining efficiency and dependability in a more extensive network. It is a serious worry for anyone who works with cloud solutions and anticipates future expansion.
AWS is a guiding light for companies looking to grow their apps in the cloud. Its many tools and services provide great flexibility and growth options, making it a top pick for putting microservices to work. But this same flexibility can also complicate things as apps grow and change. In our research under increasing load, we spotted a few tricky problems. However, we found an approach that helped us face these issues, so let’s take a closer look at the entire path we traveled.
Testing Performance with CloudWatch
To understand where to start, we had to use load testing to determine how many devices we could handle. This is important for understanding the capacity of a cloud solution and, after each optimization, for identifying the next weak spots. We realized that this process should be automated to save time, especially because we needed to do it frequently. And generating a heavy load requires many servers.
We created a testing architecture with a commander that uses Ansible to automate our testing tasks and many bombers that generate a load similar to real clients on our cloud. We can install benchmarking tools like Apache Bench or WRK on the bombers. WRK is the more powerful of the two, as it can generate more load than Apache Bench.
It’s important to send the load gradually when conducting load testing to avoid the risk of getting a denial of service from the cloud. We distributed requests to bombers from the commander’s system and slowly increased the load we sent to the cloud, allowing it to scale up to handle more requests.
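A toy sketch of how a commander might compute such a gradual ramp-up schedule before distributing it to the bombers; the function and numbers are illustrative, not our actual tooling.

```javascript
// Hypothetical helper for a commander node: compute a gradual ramp-up
// schedule so bombers increase load step by step instead of all at once.
function rampSchedule(targetRps, steps) {
  const schedule = [];
  for (let i = 1; i <= steps; i++) {
    // Linear ramp: each step adds an equal share of the target load.
    schedule.push(Math.round((targetRps * i) / steps));
  }
  return schedule;
}

// Example: ramp to 80,000 req/sec in 8 steps of 10,000 each.
console.log(rampSchedule(80000, 8));
// → [ 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000 ]
```

Each step’s value would then be divided among the bombers, for example via an Ansible playbook run from the commander.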
It’s important to understand that intense tests can make this process expensive. They might cost a company thousands of dollars, if not more. So, getting as much valuable data as possible from these tests is essential. Also, we need to know which measurements are the most important for us and make sure we’re keeping track of them all.
For our needs on AWS, we employed CloudWatch, a comprehensive tool capable of aggregating all necessary data from our entire infrastructure, eliminating the need for additional solutions.
For these tests, we utilized an RDS database, specifically MySQL, hosted on AWS, and tracked various server metrics, including CPU load, memory usage, and query counts, as these details were key to pinpointing system vulnerabilities later on.
Through the first load testing, we assessed the capacity of our cloud solution to handle device connections. Initially, it could support 4,000 devices per minute, far short of our goal of 5 million. Clearly, substantial improvements were necessary.
Exploring Microservice Architecture
Essentially, there are different structures in software design. For comparison, one is called monolithic architecture, where everything is built as one big unit, typically in a single programming language. This setup can take a lot of work to grow and change.
Adopting another one – a microservice architecture – provides significant advantages, particularly for larger projects. This strategy breaks down an application into smaller, self-contained units, permitting targeted and efficient modifications to each segment, thereby conserving time and resources. It accommodates the utilization of diverse programming languages that meet specific demands, with languages such as Go and Rust gaining popularity due to their robust performance in highly demanding tasks.
The adaptability of this framework promotes more rapid development and launch, leading to a swifter entry into the market – an essential competitive edge. It allows for simultaneous workflows, prompt adjustments, and immediate upgrades, all contributing directly to enhanced product cycles and customer contentment. Furthermore, the resilience inherent in this system safeguards against the failure of a single service affecting the whole operation, preserving ongoing reliability and consumer confidence.
Using microservice architecture definitely gives more freedom, but it also comes with its own problems. The main issue is during deployment, which is when we make the software ready to use. This is mainly because microservices work separately, so they must communicate well and work together without issues. This situation can make both setting up the software and managing it later more difficult. However, for expanding a product, using microservices is a much better approach. Also, if we use AWS, which offers many different tools for creating software, we can use many helpful services to establish a microservice setup.
Choosing the Right Tools
In the beginning, we relied on solutions like the ELB load balancer, EC2 servers, and auto-scaling groups (ASG). The ASG service was vital in consolidating all servers into a single group and predicting the potential load. If the load was excessive, the ASG could automatically introduce an additional server, and so forth. However, this method is now considerably outdated.
As you can see, we’re presented with a variety of solutions to choose from. We can utilize more current services, like ECS (Elastic Container Service). It can be used in conjunction with either EC2 servers or Fargate’s serverless containers, representing the most up-to-date solution available for a microservice architecture on AWS, aside from architectures employing Lambda functions.
Using Kubernetes on AWS is an option, but it’s not right for every situation because it can make things more complex. This is because it comes with many features and settings that need to be managed, which can be overwhelming if you don’t need them. That’s why it’s important to consider whether we need Kubernetes, or something like ECS, which is easier to use, and as a result, would work better for our needs.
On a different note, people seldom use Docker Swarm these days. This is mainly due to Kubernetes, which, though complex, offers more features and has greater support from big tech companies. Docker Swarm, simpler and once popular, can’t match Kubernetes’ capabilities or wide adoption.
Initial Architecture: Orchestrating Microservices
In the first version of our architecture, we dealt with numerous microservices, but for simplicity, we’ll focus on just three.
When an API request was made, it first went to one microservice. This microservice then sent the request to another one that checked whether the request was coming from a valid source. If everything checked out, the request was then passed to a different microservice to get the final answer. The first microservice acted like a traffic director, making sure everything went where it needed to go. It’s worth mentioning that each microservice works with its own database and talks to the others through inter-service communication.
We utilized a tool called Route 53 and set up a load balancer. Our microservices were running on EC2 servers, each within an Autoscale Group. This setup meant that if the load on a microservice increased, the Autoscale Group would automatically introduce an additional server to handle the extra demand.
System Monitoring and Database Management
We employed RDS Aurora, assigning a separate database to each microservice. Within each microservice, Nginx was positioned in front of our Node.js process. We were running Node.js v8 and used PM2 to replicate the process across every processor core, which was necessary because Node.js runs on a single thread, so scaling was imperative, especially on a server with multiple cores.
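The PM2 setup just described can be sketched as an ecosystem file. This is a minimal illustration only: the app name and script path are placeholders, not the project’s real ones.

```javascript
// Sketch of a PM2 ecosystem config (ecosystem.config.js) that replicates
// a Node.js process across all CPU cores via PM2's cluster mode.
const ecosystem = {
  apps: [
    {
      name: 'device-api',     // hypothetical service name
      script: './server.js',  // placeholder entry point
      exec_mode: 'cluster',   // cluster mode: PM2 forks workers behind one port
      instances: 'max',       // spawn as many workers as there are cores
    },
  ],
};

module.exports = ecosystem;
```

With `instances: 'max'`, PM2 inspects the machine at start time, so the same config works on 2-core and 16-core servers alike.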
Each server was equipped with an AWS agent. This agent was responsible for monitoring server operations and gathering data, such as memory usage. We also relied on a MySQL database. This setup constituted the first rendition of our architecture, setting the stage for its subsequent evolution; we’ll show later what it became.
Challenges We Bumped Into & How We Dealt With Them
The first problem we saw after the load tests was in the code. Numerous challenges were identified, not initially related to the infrastructure but to the coding itself. The effectiveness of the code fundamentally determines whether we can maximize our server capacity or if certain obstacles will arise.
All metrics look good, but something is stopping us from reaching the target of the Auto-Scale policy (CPU > 60%).
We improved our system by abandoning Babel in favor of native ESM and updating to a newer version of Node.js. This change was easy because our service was originally built with ES6 modules and used Babel only for transpiling and running files. The updated Node.js version did everything we needed, so Babel became unnecessary. Using ESM instead, we kept working with ES6 modules effectively.
AWS “Stolen CPU”
The next issue we encountered was a tricky one involving our servers, which, although not common, definitely exists. We were using T2 servers. These are now considered old-fashioned, but you can still come across them. We only figured out this problem by closely monitoring the CPU load.
On constant load, CloudWatch shows that CPU load drops from 90% to 40%, but htop shows 100% CPU load. The Auto-Scale policy can’t handle it because of the incorrect metric.
The T2 servers operate based on a “burstable performance” mechanism. This mechanism allows the accumulation of “CPU credits” during idle periods. When the CPU becomes active, it utilizes these credits to enhance performance. However, a challenge arises when these accumulated credits are exhausted. The CPU experiences throttling, leading to what is termed a “stolen CPU.”
In simple words, here’s the thing about these kinds of servers: when there’s a lot of work to do, they can borrow extra capacity. If our server is completely busy, AWS can lend it a bit more power from the underlying hardware without charging us extra. However, it’s a loan that AWS will eventually take back from our server’s resources.
We want to clarify that the “stolen CPU” isn’t a system error or malfunction. It’s an intentional feature of the T2 design, providing users with performance they’ve paid for. However, this fluctuating performance can be problematic for applications with high demands.
For applications that require consistent and high computational power, transitioning to other server types, such as C4 or C5, is a viable solution. These servers are designed explicitly for compute-intensive tasks. Unlike the T type, they don’t operate on a CPU credit mechanism. Instead, they provide dedicated CPU performance.
That’s why it’s wise to avoid using server types like T2 or T3 for tasks that require high performance. Instead, it’s better to choose servers labeled with a C, which stands for compute optimized. Types like C4, C5, C6, and others don’t come with this borrowing issue. However, overcoming this obstacle led us to a new one.
This hiccup came when we began using c4.large-type servers.
At 82,000 req/sec, we reached the limit of our subnetwork: it had run out of available IP addresses.
We hit a ceiling of 657 servers. That was as many as we could launch because we maxed out our network’s capacity. Within the AWS Virtual Private Cloud (VPC), a logically isolated section of the AWS Cloud, resources are organized into subnets. Each subnet resides within one Availability Zone and has its IP address range. As resources like EC2 instances, RDS databases, or Lambda functions are launched into a subnet, they consume IP addresses from the subnet’s IP address range.
Even with c4.large-type specs, at 82,000 requests per second (about 4.9 million per minute) we were just shy of our goal and needed to keep scaling up. In other words, when scaling to accommodate many requests, there’s a risk of exhausting the available IP addresses in a subnet. This can lead to resource provisioning failures and disrupt the system’s ability to scale further.
Since our network was maxed out on IP addresses, meaning we couldn’t just add more servers, we had two choices: set up a new network to host more servers or switch to more productive servers. We opted for the latter and upgraded our server power.
By choosing more powerful EC2 instance types, fewer instances are required to handle the same workload. Fewer IP addresses are consumed from the subnet, mitigating the risk of exceeding IP addresses. This strategy prevents the potential downtime associated with restructuring the VPC, like adding new subnets or creating a new VPC. Again, upgrading our servers led us to another obstacle.
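For a sense of the numbers: AWS reserves 5 IP addresses in every subnet, so usable capacity is 2^(32 − prefix) − 5. The CIDR sizes below are illustrative, not our actual network layout.

```javascript
// Illustrative subnet math. AWS reserves 5 IP addresses per subnet
// (network, broadcast, router, DNS, and one future-use address), so a
// subnet's usable capacity is 2^(32 - prefix) - 5.
function usableIps(prefixLength) {
  return 2 ** (32 - prefixLength) - 5;
}

console.log(usableIps(24)); // 251 usable addresses in a /24
console.log(usableIps(22)); // 1019 usable addresses in a /22
```

Every EC2 instance, load balancer node, and RDS instance in the subnet consumes one of those addresses, which is how a fleet of hundreds of servers can quietly exhaust a small subnet.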
After transitioning to using servers equipped with 16 cores, we observed that our software, operating within these containers, could not fully utilize the capacity of these 16 cores.
On c4.4xlarge it runs Node.js with PM2 cluster mode and 16 processes. But it’s unable to reach even 40% CPU load.
The root of the issue was identified as the communication process between servers. In a microservices architecture, individual services are decoupled and communicate over the network. This communication is typically done using HTTP/1.1 over TCP. Every new HTTP/1.1 request traditionally opens a new TCP connection. This process involves a three-way handshake (SYN, SYN-ACK, ACK) introducing latency. Furthermore, TCP’s slow start mechanism means that initially, only a few packets are sent until it’s confirmed that the network can handle more, leading to additional latency.
Here’s what happens: when servers interact, they initiate numerous new connections. The operating system oversees these connections and requires a specific duration to be established and terminated. Therefore, we aimed to optimize these connections by maintaining a single connection over time rather than continually creating new ones. To address this challenge, we considered several solutions.
One approach was to migrate to gRPC, given its compatibility with the HTTP/2 protocol, which provides connection reuse. The second option was the straightforward adoption of HTTP/2 itself.
The third strategy involved a modification within our existing code. Specifically, we adjusted the client component responsible for dispatching requests to other microservices. Within this client, we integrated a “keep-alive” function while transmitting requests, instructing the system to persist with the current connection without closing it. This adjustment effectively resolved the loading complications with our high-powered servers, enabling us to process more requests.
By enabling the KeepAlive flag in TCP, connections can be reused for multiple requests. This means that once a connection is established, subsequent requests don’t need to go through the handshake process again, reducing latency.
For perspective, we compared the number of requests we could accommodate with and without the “keep-alive” function. Remarkably, the disparity was substantial, often more than doubling the capacity.
As we said earlier, there is an alternative – the possibility of using Rest over protocols like HTTP/2 and gRPC (HTTP/2 under the hood). This allows multiplexing, which means sending multiple requests for data in parallel over a single TCP connection, reducing the overhead of establishing multiple connections and significantly reducing latency. gRPC, in particular, uses Protocol Buffers (a method developed by Google), which are both lightweight and efficient for serializing structured data.
After overcoming all these obstacles, we reviewed our architecture to optimize it further.
As our infrastructure evolved, we transitioned to ECS, which primarily allowed us to do away with Autoscale Group and EC2 servers. We adopted Fargate, a serverless container service for running our code, streamlining our operations significantly. With ECS and Fargate, we simply create our container, a Docker image, and then upload it to ECS. Using autoscaling policies, ECS can increment the number of containers as required.
However, it’s important to define the deployment configuration with ECS accurately. When we update our software and deploy new versions, ECS changes the containers. It might take away some of the current containers to make space for new ones or even double the number of containers we have running. This change can affect our services, so we must consider it carefully. Thus, our architecture underwent significant changes.
We transitioned from the classic load balancer to the Application Load Balancer, optimized for containers. Upon adopting ECS and Fargate, we also modified our database clusters.
Initially, our clusters only contained masters, handling both read and write functions. Our updated architecture introduced replicas to the clusters, dedicated solely to reading tasks. This change means heavy read demands can be directed to a replica, reducing the load on our primary node, the master. Additionally, we eliminated Nginx from our microservices, as it became redundant. The ALB load balancer now handles communication with microservices, efficiently directing requests to the appropriate ports.
Fargate, while slightly slower than EC2, spares us the need to manage servers or worry about their security and configuration, freeing up valuable time for our team. Nonetheless, Fargate comes with its limitations. A single Fargate container is restricted to no more than 4 virtual CPU cores and 30 gigabytes of memory. It also imposes operational constraints: for instance, if we require over 2,000 containers, we must request an increase from AWS support, which can take up to a week.
| Pros | Cons |
| --- | --- |
| Easy to start with a simple configuration. | One task (container) is limited to 4 vCPU and 30 GB RAM. |
| You don’t care about the OS, security patches, support, or AWS configuration. | Lower performance compared to EC2. |
| Container image repository. | Increasing Fargate limits (up to 2000 tasks) in our case took a couple of weeks. |
We also conducted performance evaluations, comparing EC2 and Fargate, and discovered that EC2 servers can handle more requests per second for a similar cost, proving more efficient.
EC2 vs Fargate

| EC2 | Fargate |
| --- | --- |
| 657 c4.large (2 vCPU) | 724 tasks (4 vCPU) |
| 2 vCPU, 7.5 GB (c4.xlarge) | 4 vCPU, 8 GB |
| $0.199 per Hour | $0.197 per Hour |
With EC2 servers, we can select the desired number of cores, a critical consideration on occasion. Yet, while we pay a premium for Fargate, its ease of use and time-saving advantages justify the expense for our team.
Database: Scaling and Optimization
As we’ve covered the challenges and optimizations of the architecture, the database became our next focus.
Firstly, we needed to identify and address slow queries within our database. This involved using indexes to expedite our requests. However, excessive indexing can ironically hinder performance, making it important to strike a balance. We could monitor and optimize query performance efficiently by enabling slow query tracking.
It’s noteworthy that while indexes enhance data retrieval speed, they can impede data writing actions. This happens because each new piece of data requires an update to the index, demanding additional time. Also, the manner in which we structure database requests is key. For instance, instead of sequentially adding data, we can merge multiple entries into one comprehensive request, a bulk insert. This method significantly accelerates database interactions.
In environments utilizing a Microsoft database, employing stored procedures might be advantageous. These are sets of precompiled instructions stored in the database, allowing intricate operations to be executed efficiently, optimizing performance directly at the database level. Here are some small notes based on the experience gained:
- Indexes speed up read queries and slow down inserts.
- Update by PK where possible.
- Replace multiple inserts with a bulk insert – one transaction.
- Replace multiple selects with a single one and filter the data on the server side.
- Remove as many write queries as possible. If you can’t, delay them.
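The bulk-insert note above can be sketched as follows. The table and column names are hypothetical, and a production version should use the driver’s parameterized queries rather than string building.

```javascript
// Sketch: collapse N single-row INSERTs into one bulk INSERT statement,
// sent as a single transaction. For illustration only; use parameterized
// queries (e.g. the driver's placeholder syntax) in real code.
function buildBulkInsert(table, columns, rows) {
  const cols = columns.join(', ');
  const values = rows
    .map(row => `(${row.map(v => JSON.stringify(v)).join(', ')})`)
    .join(', ');
  return `INSERT INTO ${table} (${cols}) VALUES ${values};`;
}

const sql = buildBulkInsert(
  'device_events', // hypothetical table
  ['device_id', 'status'],
  [[1, 'online'], [2, 'offline'], [3, 'online']]
);
console.log(sql);
// → INSERT INTO device_events (device_id, status) VALUES (1, "online"), (2, "offline"), (3, "online");
```

One round trip and one transaction instead of three is where the speedup comes from; the gain grows with the batch size.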
For database testing, we recommend Sysbench. It’s a reliable tool that evaluates database performance under various loads. Sysbench is instrumental in mimicking high-traffic scenarios to test how well the databases cope, identifying areas that need enhancement.
While a monolithic architecture often uses a single database, a microservice architecture uses many different databases – one for each microservice. These databases can also be of different types.
For example, we can use transactional databases like MySQL if we’re processing payments. If we need to quickly compile logs or store our users’ data, we can use NoSQL databases, like MongoDB and Cassandra. If, in another microservice, we need to build a tree, we can use graph databases, and so on.
We can choose the appropriate database depending on the microservice and its logic. However, while applying these approaches, we also encountered some obstacles.
The Aurora Upgrade Challenge
Amazon Aurora is a cloud-optimized relational database supporting MySQL and PostgreSQL. As apps grow and more people use them, the current RDS instance’s processing power and memory can quickly become insufficient.
At 1,426 req/sec, our RDS CPU was loaded up to 99%, so we needed to change our db.t2.medium RDS instance to a more powerful one. The problem was that the R4 instance type (db.r4.xlarge) required a newer version of Aurora (v2.03), while our Aurora version was 1.14.1.
We encountered this problem when we needed a more powerful server for the database. The challenge was further compounded by the fact that the desired RDS instance type required an upgrade to a newer version of Aurora, which wasn’t directly compatible with the current version.
For RDS Aurora, we could only do this after upgrading the cluster’s engine to version 2.x. Once we did, we could move to a more powerful server. Remember, however, that such operations can take up to 15 minutes of downtime. This is why it’s important to build clusters correctly, so there is never just one server in an RDS cluster. If we are performing updates on the master and have a second or third server, the RDS cluster can promote a replica to master, and we avoid downtime.
Connection Pool Size Limits
Establishing a new database connection is a resource-intensive process. However, once established, maintaining that connection requires minimal resources. This is how it should be, although practice shows some variations.
CPU load for RDS and applications just stopped growing. Response time increased with the number of requests.
This database problem actually originated in the software that communicated with the database. The issue was the connection pool – the number of connections our container could open for requests to the database. When an application component needs to communicate with the database, it borrows a connection from this pool. Once the operation is complete, the connection is returned to the pool, ready to be reused.
In our case, the system was experiencing latency spikes, and upon investigation, it was found that the connection pool size was limited, causing bottlenecks.
Our first version had a default limit of 5 connections to the database. We increased it to 200, thereby widening the entire pipeline – the number of requests we could send to the database simultaneously. This solved our problem.
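A sketch of what the raised pool limit might look like with a Node.js MySQL driver such as mysql2 (the `createPool` options shown are the common shape; verify against your driver’s docs). The endpoint and credentials are placeholders.

```javascript
// Connection pool configuration with a raised limit, replacing the
// default of 5 connections that bottlenecked us.
const poolConfig = {
  host: 'my-cluster.cluster-xxxx.rds.amazonaws.com', // placeholder endpoint
  user: 'app',                                       // placeholder user
  database: 'devices',                               // placeholder schema
  connectionLimit: 200,     // raised from our initial default of 5
  waitForConnections: true, // queue requests instead of failing fast
  queueLimit: 0,            // 0 = no cap on queued requests
};

// const pool = require('mysql2').createPool(poolConfig);
// pool.query('SELECT 1', (err, rows) => { /* connection auto-returned */ });
```

The right limit depends on the instance type: 200 connections per container across hundreds of containers is exactly how we later hit the per-instance connection ceiling described below, so the pool size and the fleet size have to be tuned together.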
To achieve the main goal, we still needed to increase the capacity of our database, because we were already using the most powerful instances available.
The DatabaseConnections or CPU metric of the CloudWatch reached the instance limit.
Initially, we identified a significant issue: our containers, specifically our APIs, were improperly configured. They directed all requests, irrespective of their nature, exclusively to our master server within the cluster. This resulted in an imbalanced system; the master was overwhelmed, while the replica remained largely unutilized. Subsequently, we encountered another hurdle. Despite employing the most robust servers available for both the master and the replica, we realized a need for even greater capacity.
We created an autoscale policy for RDS that adds new replicas when the load on the processor increases or when the number of simultaneous connections to the database reaches 3,500.
- Avg CPU = 40%
- Avg connections = 3500 (4000 max for our instance type)
The second policy measures connections ONLY across replicas, so you need to have at least one replica.
This is close to the peak number of connections that can be opened with one server. When we reach this point, we may be sending simple requests, but there are many of them. At that moment we need to add another replica: very simple requests don’t load the processor, yet the server can no longer open new connections.
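As a toy illustration of the scaling decision described above (the real policies were configured in RDS, not in application code):

```javascript
// Toy model of the two scale-out triggers: average CPU, or average
// connections per replica, crossing the thresholds we used.
function shouldAddReplica({ avgCpu, avgConnections }) {
  const CPU_THRESHOLD = 40;          // percent, from the first policy
  const CONNECTION_THRESHOLD = 3500; // per replica (our instance max was 4000)
  return avgCpu >= CPU_THRESHOLD || avgConnections >= CONNECTION_THRESHOLD;
}

// Many cheap queries: CPU is idle but connections are exhausted.
console.log(shouldAddReplica({ avgCpu: 25, avgConnections: 3600 })); // true
// Normal load: neither threshold crossed, no scale-out.
console.log(shouldAddReplica({ avgCpu: 25, avgConnections: 1000 })); // false
```

The two triggers cover two different failure modes: heavy queries saturate the CPU, while swarms of trivial queries saturate the connection table long before the CPU notices.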
We also added configuration to our code so that all requests for updating, adding, or deleting data went to the master, while all read requests went only to read-only replicas. In other words, we relieved our master and distributed our requests. After this, our database requests became very fast.
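A minimal sketch of that read/write split. The endpoints below are placeholders; Aurora exposes a writer (cluster) endpoint and a separate reader endpoint, which makes this routing straightforward.

```javascript
// Route reads to the reader endpoint and everything else to the writer.
const WRITER = 'my-cluster.cluster-xxxx.rds.amazonaws.com';    // placeholder
const READER = 'my-cluster.cluster-ro-xxxx.rds.amazonaws.com'; // placeholder

function pickEndpoint(sql) {
  const verb = sql.trim().split(/\s+/)[0].toUpperCase();
  // Only plain SELECTs go to replicas; everything else mutates state.
  return verb === 'SELECT' ? READER : WRITER;
}

console.log(pickEndpoint('SELECT * FROM devices')); // reader endpoint
console.log(pickEndpoint('UPDATE devices SET status = 1')); // writer endpoint
```

A real client would hold two connection pools, one per endpoint, and pick the pool with this kind of check; ORMs and drivers with built-in read/write splitting do the same thing under the hood.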
Tech Stack to Highlight
Setting up multiple services on AWS feels like directing a large team where everyone has a unique role. This isn’t just about launching services, but blending numerous tools and techniques to get everything running smoothly. Our team has gained insights into the valuable tools and methods required for this task through detailed study and hands-on testing.
Apache Bench offers a detailed look into how HTTP servers perform. By simulating various user activities, we can spot potential issues, analyze delay points, and test the system’s response under many simultaneous tasks. With tailored settings, we can replicate anything from a single user’s activity to times of maximum traffic.
WRK is crafted for the current web landscape. It’s remarkable in handling a heavy workload, even on a single computer with many processing cores. The Lua scripting feature lets us design specific user patterns, helping us mimic intricate user interactions and offering a complete perspective on our system’s behavior.
When dealing with many services, uniformity is very useful. Ansible guidelines ensure that every part, service, and setting follows a consistent pattern. This uniform setup guarantees that our performance results remain consistent as we repeatedly compare similar scenarios.
EC2 (Elastic Compute Cloud)
EC2 is central to AWS. It offers adjustable computer resources, allowing you to pick the computing power, memory, and storage that suits your project. It also adjusts independently to traffic, ensuring readiness at all times.
ECS & Fargate
ECS streamlines the management of containers, providing the deployment, adjustment, and control of containerized applications. Partnered with Fargate, it eliminates the need to worry about the background tech, allowing the focus to be on the application’s core functions.
RDS and DynamoDB
Databases are central to most applications. Employing strategies like fine-tuning queries, organizing storage, and using caching ensures quick responses. AWS provides RDS for structured databases and DynamoDB for flexible NoSQL options. Features like using multiple data reading points enhance data retrieval speeds, while other techniques ensure efficient data storage for high-write scenarios.
Wrapping Up and What’s Next
Setting up many services on AWS is a learning experience with its ups and downs. AWS gives us a rich set of tools and services perfect for managing these setups. We learned a lot about AWS, how to make databases run better, and how to get services to talk to each other. Yet, we know that tech keeps changing all the time.
Main Points to Remember
- Network Capacity: Ensure your network has a sufficient reserve of IP addresses.
- Service Limits: For large projects, verify the limits on services like Fargate and Lambda. Remember, increasing these limits takes time.
- KeepAlive Attribute: Always consider using the KeepAlive attribute for interservice communications, especially when frequent requests are made between services. It enhances request speed significantly.
- Server Selection: Don’t use T2 or T3 servers for high-load tasks. They’re not designed for high performance. Opt for C-series servers instead.
- Avoid Nginx in Microservices: Nginx is unnecessary and may introduce complications when developing microservices with ECS and load balancers.
- Authorization Method: Prefer using JWT tokens for authorization when feasible.
- Database Connection Pool: Proper configuration of the connection pool in your software is crucial. Reuse connections, especially with lambda functions, to prevent performance issues.
- Database Cluster Configuration: Set up your database cluster with one master and replica, and implement an autoscale policy for adding replicas under load or upon reaching connection limits.
- Software Configuration for Databases: Ensure your software correctly directs update requests to the master and read requests to the replica. Keep requests in balance.
Looking back, two trends seem to be growing:
- Microservices. A lot of companies are starting to build apps this way. It’s flexible and adjusts easily, which makes it a top pick for modern apps.
- Going serverless. This idea is becoming very popular. It means developers focus on their work and don’t worry about the tech running it. This makes putting out new things quicker and simplifies the process.
To end, setting up many services on AWS taught us a lot about the problems and the good possibilities in this area. As tech keeps changing, keeping up and trying new ways will help us make the most of AWS and these services.
Stay Tuned with Sirin Software
At Sirin Software, we’re not just about solutions but about transforming challenges into success stories. Our journey in elevating this IoT cloud system underscores our commitment to innovation and excellence. Whether it’s database optimization, server management, or operational reliability, our team is equipped to tackle the complexities of modern digital landscapes. Our recent triumph in optimizing a cloud-based system for IoT products is a testament to this. We redefined scalability with our AWS solution, significantly improving performance and efficiency for a global leader in IoT connectivity.
Stay connected with Sirin Software as we continue to pave the way in network solutions and cloud computing, transforming challenges into opportunities for growth and advancement. Our journey is ongoing, and we invite you to join this exciting path towards a more connected and efficient digital future.
Ready to turn your concept into reality? Contact the experts at Sirin Software and schedule a consultation to bring your project to the forefront of innovation.