Experience with WebFlux (Netty) high-load optimization

For the last three weeks our team had been wrestling with a strange bug. We have a Spring Boot-based microservice deployed on an OpenShift cluster that exposes a REST endpoint retrieving user data from a NoSQL database. For some unknown reason we could not squeeze out more requests per second: some mystical force was stopping us at 250 req/sec. The strange thing was that the microservice was consuming only half of the CPU resources given to it, so we had no idea what was actually causing this behaviour.

Under the hood

Here I want to tell you a little about the frameworks used there. It is a Spring Boot microservice deployed on an embedded Netty server, which relies on reactor-netty and, of course, Java Reactor itself. The microservice is fully non-blocking, since our database fortunately has an asynchronous driver. The database is quite fast, with an average latency of 15 milliseconds; it is distributed across many shards and should serve about 20k req/sec, so it was probably not the cause of our problem.

The microservice was given two CPUs and 1 GB of heap, but without any luck – we expected something like 500-600 requests per second per core, yet we could not reach even 200, despite the fact that a huge reserve of both CPU and memory remained available.

A thorny way to first results

The microservice was redesigned at least 30 times. We changed our algorithms, swapped to embedded Undertow and even Tomcat, implemented the native Linux epoll transport (which, by the way, was a very good idea), switched from Java Reactor to plain thread pools and back, but nothing helped us get past the limit of 250 req/sec.

We sampled CPU utilization many times, profiled and tuned the JVM, and even moved from JDK 8 to JDK 11, since early HotSpot 8 versions had a bug with Runtime.getRuntime().availableProcessors() (see my article in Russian here).
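As a side note, a trivial way to see what the JVM thinks it has inside a container is shown below; on the affected early HotSpot 8 builds this call could report the host's core count instead of the container's CPU limit (a minimal check for illustration, not part of our service):

public class AvailableProcessorsCheck {

    public static void main(String[] args) {
        // On early HotSpot 8 builds running in a container this could print
        // the host's core count instead of the cgroup CPU limit.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}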

Our first thought was that we were suffering from some kind of network issue, and I believed the reason was hidden in TCP flow control (a fair backpressure mechanism built into the TCP stack – see my article in Russian here). So we did some TCP tuning and tried to increase net.ipv4.tcp_rmem, net.ipv4.tcp_wmem and some other Linux parameters. I am sure that in the end it bore some fruit, but we were unable to prove it, since at that moment we still could not get past 250 req/sec.

At last we implemented a dummy “echo” REST endpoint, which simply returns the request as the response, and ran a load test against it on the same environment. It showed about 1200 req/sec, so we formed a hypothesis: since there is no noticeable latency in the “echo” endpoint, the limit probably sits in the connection queue. The “echo” service handles requests roughly ten times faster, so free slots keep appearing in the queue, while the slower endpoint holds each request longer and the queue never has any free positions.
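The dummy endpoint was nothing more than something along these lines (a minimal sketch; the path and class name here are illustrative, not our real ones):

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class EchoController {

    // Returns the request body as the response without touching the database,
    // so the only latency left is framework and network overhead.
    @PostMapping("/echo")
    public Mono<String> echo(@RequestBody String body) {
        return Mono.just(body);
    }
}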

Netty tuning

There is a NettyServerCustomizer functional interface which can be used to customize Netty’s event loop:

/**
 * Mapping function that can be used to customize a Reactor Netty server instance.
 *
 * @author Brian Clozel
 * @see NettyReactiveWebServerFactory
 * @since 2.1.0
 */
@FunctionalInterface
public interface NettyServerCustomizer extends Function<HttpServer, HttpServer> {

}

We had already implemented it in our project during the migration to the native epoll transport, but it can also be used to override other parameters:

import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollServerSocketChannel;
import org.springframework.boot.web.embedded.netty.NettyServerCustomizer;
import reactor.netty.http.server.HttpServer;

public class EventLoopNettyCustomizer implements NettyServerCustomizer {

    private static final boolean KEEP_ALIVE = true;

    @Override
    public HttpServer apply(HttpServer httpServer) {
        // Native epoll transport: one event loop group accepts connections,
        // the other handles the accepted channels.
        EventLoopGroup parentGroup = new EpollEventLoopGroup();
        EventLoopGroup childGroup = new EpollEventLoopGroup();

        return httpServer
                .tcpConfiguration(tcpServer ->
                        tcpServer
                                .bootstrap(serverBootstrap -> serverBootstrap
                                        .group(parentGroup, childGroup)
                                        .childOption(ChannelOption.SO_KEEPALIVE, KEEP_ALIVE)
                                        // The accept queue length requested from the OS
                                        .option(ChannelOption.SO_BACKLOG, 2048)
                                        .channel(EpollServerSocketChannel.class)));
    }
}
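For completeness, the customizer still has to be registered with Spring Boot's Netty web server factory. A minimal sketch of how this can be wired (the configuration class and bean method names here are ours):

import org.springframework.boot.web.embedded.netty.NettyReactiveWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class NettyWebServerConfiguration {

    // Attach our EventLoopNettyCustomizer to the factory that builds the embedded Netty server.
    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> eventLoopNettyCustomizer() {
        return factory -> factory.addServerCustomizers(new EventLoopNettyCustomizer());
    }
}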

Pay attention to the ChannelOption.SO_BACKLOG option. It is passed through to the socket option that determines the length of the queue of pending TCP connections. Here is a quote from the Oracle Javadoc for java.net.ServerSocket:

The maximum queue length for incoming connection indications (a request to connect) is set to 50. If a connection indication arrives when the queue is full, the connection is refused.
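In plain java.net terms this is the backlog argument of the ServerSocket constructor; a tiny illustration (the port and value here are arbitrary):

import java.io.IOException;
import java.net.ServerSocket;

public class BacklogExample {

    public static void main(String[] args) throws IOException {
        // The second argument asks the OS for an accept queue of 2048 pending connections;
        // the kernel may silently cap it at net.core.somaxconn.
        try (ServerSocket server = new ServerSocket(8080, 2048)) {
            System.out.println("Listening on port " + server.getLocalPort());
        }
    }
}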

If Netty cannot obtain the net.core.somaxconn parameter from the underlying OS and you have not overridden SO_BACKLOG, it falls back to a default of 128 on Linux and 200 on Windows. Here is a screenshot from Netty's NetUtil class:

[Screenshot: the SOMAXCONN initialization in io.netty.util.NetUtil]
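In essence, the logic boils down to roughly the following (a simplified paraphrase of io.netty.util.NetUtil, not its literal source):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class SomaxconnDefault {

    // Simplified paraphrase of how Netty picks its default SOMAXCONN value.
    public static int somaxconn() {
        boolean windows = System.getProperty("os.name", "").toLowerCase().contains("win");
        int somaxconn = windows ? 200 : 128;  // fallback defaults
        File file = new File("/proc/sys/net/core/somaxconn");
        if (file.exists()) {
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                somaxconn = Integer.parseInt(in.readLine().trim());  // the kernel value wins
            } catch (Exception e) {
                // on any read/parse failure keep the fallback
            }
        }
        return somaxconn;
    }
}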

Note that both net.core.somaxconn and ChannelOption.SO_BACKLOG are taken into account and the smaller of the two values is used, so you have to tune both.

We set both of them to 2048 and ran our standard load tests once again. The results were exciting: we gained 1200 req/sec by editing only two parameters, with no other enhancements. We had finally found the “invisible wall” that had been keeping us on alert for the last three weeks.

Conclusion

There are a lot of parameters that have a huge influence on application performance, and one should inspect them thoroughly before going to production. Here we looked at net.core.somaxconn, which caps the number of established connections queued while waiting to be accepted by the application. It is also worth looking at net.ipv4.tcp_max_syn_backlog, which limits the number of remembered connection requests that have not yet received an ACK from the connecting client.