on Jul 5, 2019
EC2 tuning for +1M TCP connections using Linux

You can save on your EC2 server costs by editing two system files: limits.conf and sysctl.conf. These minimal changes increase the number of users each EC2 server can service concurrently, so you need fewer EC2 servers to support your user load.


You can support +1M TCP Connections servicing APIs for mobile and web devices


Alpine, Ubuntu, CentOS, RedHat and Debian

The system files you need to edit live in standardized locations across each of these Linux distributions. This is nice to see. You can find the two system files here:

/etc/sysctl.conf
/etc/security/limits.conf

Alpine is often used with Docker. The Alpine Linux base image is about 5MB, making it appealing for microservices and containers.

Add lines to your limits.conf file.

## /etc/security/limits.conf
## System Limits for FDs
## "nofile" is "Number of Open Files" 
## This is the cap on number of FDs in use concurrently.
## Set nofile to the max value of 1,048,576.
#<user>     <type>    <item>     <value>
*           soft      nofile     1048576
*           hard      nofile     1048576
root        soft      nofile     1048576
root        hard      nofile     1048576

Add lines to your sysctl.conf file.

## /etc/sysctl.conf
## Increase Outbound Connections
## Good for a service mesh and proxies like 
## Nginx/Envoy/HAProxy/Varnish and applications that
## need long-lived connections.
## Careful not to set the range wider as you will impact
## running application ports in heavy usage situations.
net.ipv4.ip_local_port_range = 12000 65535
## Increase Inbound Connections
## Allows for +1M more FDs
## An FD is an integer value used as a traffic I/O pointer 
## on a connection with a Client.  
## The FD Int value is used to traffic packets between 
## User and Kernel Space.
fs.file-max = 1048576
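To get a feel for what the port range setting buys you, here is a minimal Python sketch that counts the ports in a range and optionally reads the live values back from /proc/sys. The helper names `ephemeral_port_count` and `read_sysctl` are made up for illustration, and the "32768 60999" range is the usual default on recent Linux kernels, which may differ on your system. Keep in mind the range limits outbound connections per destination (remote IP and port), not the total across all destinations.

```python
from pathlib import Path


def ephemeral_port_count(port_range: str) -> int:
    """Count the ports in an inclusive 'low high' range string."""
    low, high = (int(part) for part in port_range.split())
    return high - low + 1


def read_sysctl(name: str):
    """Read a sysctl value from /proc/sys, or return None if unavailable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("tuned range:  ", ephemeral_port_count("12000 65535"))  # 53536
    print("usual default:", ephemeral_port_count("32768 60999"))  # 28232
    # Live values; None means the key could not be read on this system.
    print("fs.file-max:", read_sysctl("fs.file-max"))
    print("port range: ", read_sysctl("net.ipv4.ip_local_port_range"))
```

After editing /etc/sysctl.conf and reloading it (for example with sysctl -p), the last two lines should report the new values.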

Dockerfile

Here is a Dockerfile with the lines needed to make the OS file modifications. You can copy the RUN lines into your own Dockerfile as-is; there is no need to copy the FROM or CMD commands. However, you need the full file as shown to run a standalone test.

# Dockerfile
FROM alpine:3.8
RUN echo 'net.ipv4.ip_local_port_range = 12000 65535' >> /etc/sysctl.conf
RUN echo 'fs.file-max = 1048576' >> /etc/sysctl.conf
RUN mkdir -p /etc/security/
RUN echo '*                soft    nofile          1048576' >> /etc/security/limits.conf
RUN echo '*                hard    nofile          1048576' >> /etc/security/limits.conf
RUN echo 'root             soft    nofile          1048576' >> /etc/security/limits.conf
RUN echo 'root             hard    nofile          1048576' >> /etc/security/limits.conf
CMD echo '+1M Connections' # your application here

You need to build and run the image, and also pass a --ulimit parameter as described in the Docker docs, like this:

docker build -t one-million ./
docker run --ulimit nofile=1048576:1048576 one-million

Docker Compose

Here is the configuration for supporting over a million TCP connections in a Docker Compose configuration. Note that we are referencing the above Dockerfile in the build section.

# docker-compose.yml
version: "3.7"
services:
  myserver:
    command: echo '+1M TCP Connections' # your application here
    build:
      context: .
      dockerfile: Dockerfile
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    sysctls:
      net.ipv4.ip_local_port_range: 12000 65535

This compose file builds the Dockerfile from above. You can run the docker-compose.yml file with the following shell command. Make sure both the Dockerfile and the docker-compose.yml file are in the same directory.

docker-compose up --build

How Much Money We Saved

Tuning has made our data API business viable over the years. We would not have been able to compete in the market if not for Linux and the ability to tune it.

The default nofile value is shown by the ulimit command. A typical default is 1024, which caps a process at 1,024 open file descriptors and therefore at roughly that many concurrent connections. That’s usually enough for the majority of app servers. However, as your app server grows in usage, you will need to decide whether to add more servers or change the system configs.

ulimit -a
# ...
# Maximum number of open file descriptors (-n) 1024
# ...
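The same limit can be read from inside an application. Below is a minimal Python sketch, assuming a Linux host, that reads the soft and hard nofile limits with the standard `resource` module and raises the soft limit to the hard limit, which an unprivileged process is allowed to do. The function names are illustrative only.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_to_hard():
    """Raise the soft nofile limit up to the hard limit and return the new pair."""
    soft, hard = nofile_limits()
    if soft != hard:
        # Raising the soft limit up to the hard limit needs no privileges.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return nofile_limits()


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("after: ", raise_soft_to_hard())
```

Going above the hard limit still requires the limits.conf or --ulimit changes described earlier.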

At PubNub we run into network I/O bandwidth limits before hitting the max TCP connection level. This is because most of our customers are actively sending and receiving messages on our TCP sockets. We see +250K TCP connections per EC2 server. For us, the operating system’s default TCP connection limit was originally 4,096. Not all of our systems need a high TCP connection count. The systems in the direct path of a mobile/web device making an API call can take advantage of an increased TCP connection limit. These are the servers where we see a cost savings benefit!

Globally we are running thousands of EC2 servers. Of those, 564 (m4.2xlarge) take advantage of an increased TCP connection limit. If we had not modified the system files to allow for an increased TCP connection limit, we would need more EC2 servers to offer our API.

Without tuning TCP connection limits, the 564 servers currently at +250K connections each would be held to the original limit of 4,096 TCP connections. We are running +141 million TCP connections concurrently (564 servers × 250K connections). To satisfy this with EC2 servers at a default limit of 4,096 TCP connections, we would need 34,424 EC2 servers. Ouch, that’s a lot of servers. That is notably different from our currently running 564 EC2 servers. Essentially, our business would not be viable if not for the system file tuning capabilities of Linux. There’s no need to list the dollar value because it just gets silly expensive without tuning.
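The server counts above follow from simple ceiling division. Here is a short Python sketch of the arithmetic, using only the figures quoted in this section; the function name `servers_needed` is made up for illustration.

```python
def servers_needed(total_connections: int, per_server_limit: int) -> int:
    """Ceiling division: servers required to hold all connections."""
    return (total_connections + per_server_limit - 1) // per_server_limit


# 564 servers at roughly 250K connections each.
TOTAL_CONNECTIONS = 564 * 250_000  # 141,000,000

if __name__ == "__main__":
    print(servers_needed(TOTAL_CONNECTIONS, 4_096))    # 34424 at the default limit
    print(servers_needed(TOTAL_CONNECTIONS, 250_000))  # 564 with tuning
```

That is about 61 times as many servers at the default limit.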

Application Servers That Support Beyond 10K Connections

Not all applications are capable of leveraging one million TCP connections. If your application must handle this level of concurrency, you may find this is yet another challenge to face. With a traditional design, your CPU/RAM resources will be exhausted before approaching 10K connections. This is because traditional applications use thread pools to handle concurrency, dedicating a thread to each accepted connection. You can’t spin up enough threads to manage over a million connections this way.


Use Kernel IO Interrupts APIs like epoll and io_submit to break the 10K barrier


A recent Cloudflare article on epoll and io_submit describes how you might use io_submit in place of epoll. In what scenarios should you use io_submit vs. epoll? Below we offer some ideas on when and where io_submit can deliver CPU benefits. Since Linux kernel 4.18, io_submit and io_getevents can be used to wait for events on network sockets. This is great, and it can be used as a replacement for, or in tandem with, the epoll event loop.

Reading up on Linux AIO we find some clarity into the io_submit API.

Network applications, like PubNub, that need I/O concurrency must leverage the I/O event APIs offered by the OS. There are two compatible options available, epoll and io_submit, each useful depending on the data patterns needed.

Today PubNub uses epoll for async I/O and has not yet implemented io_submit. We send and receive copies of data to/from web devices using read() & write() on non-blocking FDs managed in an epoll event loop. Reviewing the bulk operation capability of io_submit gave us ideas for batching I/O operations to connected devices. With io_submit batching we could broadcast/replicate data at a lower CPU price. Databases that offer fault tolerance by spreading data between distributed systems, like Cassandra, could use io_submit to bulk copy to multiple replicas.
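For readers who have not written an event loop before, here is a minimal Python sketch of the pattern: one thread, non-blocking sockets, and a readiness loop. It is not PubNub's implementation. It uses the standard `selectors` module, which picks epoll on Linux, and the function name `echo_once` and the loopback echo behavior are assumptions made for the demo.

```python
import selectors
import socket


def echo_once(payload: bytes) -> bytes:
    """Send payload to a local echo server driven by one event loop; return the reply."""
    sel = selectors.DefaultSelector()  # picks epoll on Linux

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
    listener.listen()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, data="accept")

    # Hypothetical client standing in for a mobile/web device.
    client = socket.create_connection(listener.getsockname())
    client.sendall(payload)
    client.setblocking(False)
    sel.register(client, selectors.EVENT_READ, data="client")

    reply = b""
    for _ in range(100):  # bounded so the sketch cannot spin forever
        if len(reply) >= len(payload):
            break
        for key, _events in sel.select(timeout=1.0):
            sock = key.fileobj
            if key.data == "accept":
                conn, _addr = sock.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="echo")
            elif key.data == "echo":
                chunk = sock.recv(4096)
                if chunk:
                    sock.send(chunk)  # small payload: assumes the send buffer has room
            else:
                reply += sock.recv(4096)

    # Unregister and close every socket the loop was watching.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return reply


if __name__ == "__main__":
    print(echo_once(b"ping"))
```

A single loop like this handles every connection without a thread per client, which is what lets connection counts grow past what thread pools allow.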

We replicate data between servers and clients using a for loop on write(fd, buf), where the broadcast is performed in user-space. When broadcasting data to many client devices, the CPU gain from io_submit batching may be trivial, as looping on O_NONBLOCK FDs already saturates the network bandwidth before burning the CPUs. However, you will find that io_submit reduces CPU usage if you identify areas where batched writes/reads are possible. Epoll and io_submit can be used in tandem, so don’t feel like you need to pick one or the other. You can start with one and decide later to add the other and mix them together.
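The user-space broadcast loop can be sketched in a few lines of Python. This is an illustration of the write-per-FD pattern, not PubNub's code. The names `broadcast` and `demo` are hypothetical, and local socket pairs stand in for connected client devices.

```python
import socket


def broadcast(connections, payload: bytes) -> int:
    """Write payload to every connection; return how many accepted it in full."""
    delivered = 0
    for conn in connections:
        try:
            sent = conn.send(payload)  # one write syscall per FD
        except BlockingIOError:
            # Send buffer is full; a real server would queue the data and
            # retry when the event loop reports the FD as writable.
            continue
        if sent == len(payload):
            delivered += 1
    return delivered


def demo(client_count: int = 3):
    """Broadcast to a few socket pairs standing in for client devices."""
    pairs = [socket.socketpair() for _ in range(client_count)]
    for server_side, _client_side in pairs:
        server_side.setblocking(False)  # same effect as O_NONBLOCK
    count = broadcast([server_side for server_side, _ in pairs], b"hello")
    received = [client_side.recv(16) for _, client_side in pairs]
    for server_side, client_side in pairs:
        server_side.close()
        client_side.close()
    return count, received


if __name__ == "__main__":
    print(demo())
```

Each iteration costs one syscall, and that per-FD overhead is what io_submit batching would aim to reduce.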

Frameworks for Scale

You do not need to use the Linux AIO kernel APIs directly. Instead, you can start with pre-built asynchronous frameworks. Modern languages offer various frameworks. Here is a list of languages with popular Async IO frameworks:

  • Rust Tokio
  • C libev and libevent
  • GoLang goroutines
  • Java NIO
  • JavaScript Node.js (mostly async by default)
  • Ruby EventMachine
  • Python gevent

If you are simply operating free open source software applications, you may be in luck. Many FOSS apps support +1M connections as they were built on async I/O APIs such as epoll. Application servers that work out-of-the-box for +1M TCP connections:

  • Nginx
  • Envoy
  • HAProxy

Safe for General Purpose Operations

These modifications are in the safe range for any style of computing. If you make these changes on a system that doesn’t need them, there is not much to worry about. Editing these files will not affect applications that can get by without more connections.

Google Cloud, Digital Ocean and Azure

These modifications are supported by most cloud providers. The industry trusted clouds will allow you to modify these values and they will support the workloads you need. Amazon Web Services (EC2), Google Cloud, Digital Ocean and Azure will support these settings.

10 Million TCP Connections and Beyond (Bonus)

Yes, you can go well beyond a million TCP connections. You probably won’t need this many TCP connections per server, as you’ll hit other bottlenecks before needing +1M TCP connections. However, it is possible to achieve more than 10 million connections by moving TCP into user space using raw sockets.

TCP / IP Packet Frame

You get full control when managing TCP by using raw sockets. A raw socket is used to receive full unparsed packets. This means packets received at the Ethernet layer pass directly to the raw socket unaltered. Raw sockets bypass the kernel-space TCP/IP processing and send the packets to your user-space application directly. By default, Linux caps the FD count per process at 1,048,576 (2^20). When managing TCP/IP in user-space, connections are no longer tied to FDs, so you can track them with a 32-bit integer or larger. This removes the FD cap on the number of TCP connections per EC2 server, leaving memory, CPU and bandwidth as the limits. Wow, that sounds amazing!

Most commonly, TCP is run in kernel space and managed by your operating system. Much performance and security tuning has been set as defaults to provide the best tradeoffs for your applications. With user-space management, you typically give up these advantages in exchange for far more TCP connections. There are more reasons to just keep using Linux’s TCP stack, and you may not want to make this tradeoff. We use the kernel’s TCP stack.
