How, Where and When Apache Kafka Writes Data (Page Cache included)

I recently heard that Apache Kafka does not store data on disks. Yeah, I’m not kidding: a mate of mine told me that Kafka handles most of its operations in memory and doesn’t rely heavily on disks.

That seemed odd to me, and today I will try to investigate Kafka’s persistence process in detail.

I’m Alex Sergeenko and this is “Alex Tried to Understand” series, let’s go.

“Don’t fear the filesystem!” – that’s how the persistence section starts in the official Kafka documentation.

I attempted to distill a synopsis of that section, and here is what I got.

The most efficient way to use spinning disks is linear writing: it’s pretty snappy and delivers high throughput, because the access pattern is predictable and heavily optimized by the OS (read-ahead and write-behind).

It sounds weird, but sequential disk access can be faster than random memory access!

Solid-state drives, in turn, have much higher random read and write rates than HDDs and can be utilized even more efficiently.
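To get a feel for the difference, here is a minimal Java sketch (the file names and block count are arbitrary) that writes the same amount of data sequentially and then at random offsets. One caveat: on a modern OS both runs initially land in the page cache, so the gap only becomes really dramatic with data sets larger than RAM or with explicit flushing.

```java
import java.io.RandomAccessFile;
import java.util.Random;

public class SeqVsRandomWrite {
    public static void main(String[] args) throws Exception {
        final int blocks = 10_000;
        final byte[] block = new byte[4096];

        // Sequential: append 4 KiB blocks one after another.
        long t0 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile("seq.dat", "rw")) {
            for (int i = 0; i < blocks; i++) {
                f.write(block);
            }
        }
        long seqMs = (System.nanoTime() - t0) / 1_000_000;

        // Random: seek to a random 4 KiB offset before each write.
        Random rnd = new Random(42);
        t0 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile("rnd.dat", "rw")) {
            f.setLength((long) blocks * block.length);
            for (int i = 0; i < blocks; i++) {
                f.seek((long) rnd.nextInt(blocks) * block.length);
                f.write(block);
            }
        }
        long rndMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("sequential: %d ms, random: %d ms%n", seqMs, rndMs);
    }
}
```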

But the hardware itself is not the breakthrough here.

Most modern operating systems make extensive use of main memory for disk caching.

For example, the page cache (also called the disk cache) in Linux is a mechanism that accelerates access to files on non-volatile storage.

Linux stores file data in free areas of memory, using them as a cache, and virtually all disk reads and writes go through this layer.

The page cache in Linux works as a write-back cache: when data is written, it is first written to the page cache and labeled as dirty.

Dirty means that the data is saved in the page cache but still needs to be persisted to the underlying storage device.

Dirty pages are periodically flushed to storage. So clean pages have identical copies on the underlying storage, while dirty pages do not.
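From an application’s point of view, write-back behavior looks like this: a write returns as soon as the data lands in the page cache, and the program has to ask explicitly (via fsync) if it wants durability right now. A minimal Java sketch, with an arbitrary file name:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteThenFsync {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap("hello, page cache\n".getBytes());
            ch.write(buf);   // lands in the page cache; the page is now "dirty"
            // At this point the data could still be lost on a power failure.
            ch.force(true);  // fsync: block until the dirty pages reach the disk
        }
    }
}
```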

File blocks are written to the page cache not only during writes but also when reading files.

If you read a file twice, one read right after the other, the second read should be much faster, since it is served directly from the page cache and no disk I/O is involved.
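Here is a quick way to observe this: a small Java sketch that reads the same file twice and times both passes. Pass any reasonably large file as the argument; note that the first read is only truly “cold” if the file isn’t cached yet, which on Linux you can force as root with `echo 3 > /proc/sys/vm/drop_caches`.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ColdVsWarmRead {
    public static void main(String[] args) throws Exception {
        Path path = Path.of(args[0]); // any large file

        long t0 = System.nanoTime();
        Files.readAllBytes(path);     // first read: may hit the disk
        long cold = System.nanoTime() - t0;

        t0 = System.nanoTime();
        Files.readAllBytes(path);     // second read: served from the page cache
        long warm = System.nanoTime() - t0;

        System.out.printf("cold: %d ms, warm: %d ms%n",
            cold / 1_000_000, warm / 1_000_000);
    }
}
```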

Apache Kafka benefits from the page cache because it can be treated as a “free-to-use” in-memory cache provided by the operating system, giving you all the advantages of caching at zero cost.

But it would be incorrect to say that Kafka doesn’t rely on a filesystem at all.

This may seem confusing at the moment, since we cannot say precisely when Kafka actually stores data to disk.

If you remember the producer.acks setting, then you are probably familiar with acknowledgment policies in Kafka: the broker returns an acknowledgment to the producer after the required number of replicas have saved the record.
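For reference, here is a minimal producer sketch with acks=all; the broker address and the topic name are placeholders I made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // acks=all: the leader replies only after all in-sync replicas have the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"),
                (metadata, exception) -> {
                    if (exception == null) {
                        System.out.printf("acked at offset %d%n", metadata.offset());
                    } else {
                        exception.printStackTrace();
                    }
                });
        }
    }
}
```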

But since we are already familiar with the page cache, does that mean the change is written only to memory and may be at risk of loss?

These concerns are at least partially fair: if all replicas fail at the same time, then even with acks=all you may still lose the update, because the page caches won’t have had time to persist the changes to the underlying storage.
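If that risk is unacceptable for a particular topic, Kafka can be told to fsync much more aggressively via the topic-level flush.messages setting, at an obvious throughput cost. A sketch using the AdminClient API; the broker address and topic name are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ForceFsyncPerMessage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // flush.messages=1 forces an fsync after every message on this topic.
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("flush.messages", "1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```

That said, the Kafka documentation generally recommends leaving flushing to the OS and relying on replication for durability instead.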

A Kafka cluster with a replication factor of N can tolerate N-1 node failures. That means at least one replica from the in-sync replicas (ISR) list must remain alive.

The scenario where all replicas go down at the same moment, before they have had time to do their I/O, is quite unlikely.
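To tie these knobs together, here is a sketch that creates a topic with a replication factor of 3 (so it tolerates up to two broker failures) and min.insync.replicas=2, so that acks=all writes require confirmation from at least two replicas; again, the broker address and topic name are placeholders:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: tolerates up to 2 broker failures.
            NewTopic topic = new NewTopic("events", 3, (short) 3)
                // With acks=all, at least 2 in-sync replicas must confirm each write.
                .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```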

It is time to draw some conclusions.

Kafka relies heavily on the persistence mechanisms provided by operating systems.

Most of them offer a so-called page cache acting as a write-back cache. This optimization gives Kafka handy, fast in-memory caching for free.

When an update gets written to a Kafka node, in most cases it means the change is written to the page cache but possibly not yet to the underlying storage.

This makes Kafka even faster, since no synchronous I/O happens on writes, but on the other hand it slightly affects fault tolerance.