15 November 2024
07 Min. Read
All you need to know about Apache Kafka: A Comprehensive Guide
In the early 2010s, LinkedIn was experiencing explosive growth, both in terms of user base and data volume. As the platform expanded, it became increasingly clear that the company's existing messaging and data processing infrastructure was not equipped to handle the scale and complexity of the data being generated.
LinkedIn's engineers were facing challenges like:
data loss and inconsistency
limitations in scaling
loss of messages during real-time processing
frequent downtime and operational complexity
Even though the team had implemented messaging systems like ActiveMQ and RabbitMQ, they were not able to scale them to meet LinkedIn's growing demands, and the engineering team found itself taking the blame.
Led by Jay Kreps, Neha Narkhede, and Jun Rao, the team began to conceptualize a new kind of message broker that could handle massive real-time data streams more effectively than anything that was currently available. The goal was to create a system that could:
Store streams of data safely and reliably on disk and replicate data within the cluster to prevent data loss.
Scale horizontally to handle more data by simply adding more machines to the Kafka cluster.
Process and reprocess stored data as needed, unlike traditional systems where once data was consumed, it was typically gone.
And that's how Kafka was born. It was built as a distributed system from the ground up, which meant it could handle failures gracefully and ensure high availability and data consistency across large clusters. Soon after it went into production, it became the single source of truth for all data flowing through LinkedIn.
And ever since then, Kafka has only grown in popularity, so much so that it has started to overshadow its namesake, the novelist Franz Kafka. Its popularity is evident from the fact that a large share of Fortune 500 companies use Kafka, including the top seven banks, nine of the top ten telecom companies, the top ten travel companies, and eight of the top ten insurance companies. Netflix, LinkedIn, and Microsoft are a few of the names that process on the order of a trillion messages (1,000,000,000,000) per day with Kafka.
Now that we've learned what led to the development of Kafka, let's dig into its technical side to understand what goes on behind the producer-consumer interaction, and how you can use it to make your app's data processing fast and streamlined too.
What is Apache Kafka?
We've touched on this above, but to put it more technically: Apache Kafka is an open-source distributed event streaming platform optimized for high-volume, real-time data.
Designed to handle streams of data at scale, Kafka works as a publish-subscribe messaging system where messages (events) are passed between producers and consumers, enabling data to be ingested, stored, and processed in real-time.
Why is Kafka a better message queue?
Kafka is more than a messaging queue: it's a distributed event streaming platform. It is massively scalable because it allows data to be distributed across multiple servers, and it's extremely fast because it decouples data streams, which results in low latency. Unlike RabbitMQ and ActiveMQ, it distributes and replicates partitions across many servers, which protects it against server failure.
| Feature | Apache Kafka | RabbitMQ | ActiveMQ |
| --- | --- | --- | --- |
| Architecture | Distributed, scalable | Centralized, easy to scale | Centralized |
| Message order | Yes (per partition) | FIFO with limitations | FIFO with limitations |
| Throughput | Very high | Moderate | Moderate |
| Data retention | Yes | Limited | Limited |
| Use cases | Real-time analytics, ETL | Task queues, job scheduling | Application integration |
Key Concepts in Kafka
Kafka has a few key terms associated with it, such as producers and consumers, topics, brokers, and clusters. Let's get a quick sense of each before looking at how these components work together to process data:
Producer and Consumer
Producer: Sends records to Kafka topics.
Consumer: Reads records from Kafka topics.
In an e-commerce platform, a producer may be a system generating user behavior data, while the consumer could be a recommendation engine processing these events in real-time.
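To make these roles concrete, here is a minimal sketch of a producer written in Java against the official kafka-clients library. The topic name `user-clicks`, the event payload, and the localhost broker address are illustrative assumptions, not fixed Kafka names.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed local broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = user id, value = event payload; "user-clicks" is a hypothetical topic name.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "user-clicks", "user-42", "{\"page\":\"/product/123\",\"action\":\"view\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Stored in partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
                } else {
                    exception.printStackTrace();
                }
            });
        }
    }
}
```

A consumer, such as the recommendation engine above, would subscribe to the same topic and read these records at its own pace; we'll see a consumer sketch further down.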
Topics and Partitions
Topic: A category or feed name to which records are sent.
Partition: Each topic is split into partitions to increase scalability, where each partition can be processed independently.
Netflix processes 2 petabytes of data daily using thousands of Kafka topics and partitions.
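As a sketch of how topics and partitions are declared, the snippet below creates a topic programmatically with the Admin API. The topic name, partition count, and replication factor are assumptions you would tune for your own cluster (a replication factor of 3 requires at least three brokers).

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            // 6 partitions let up to 6 consumers in one group read in parallel;
            // replication factor 3 keeps a copy of each partition on three brokers.
            NewTopic topic = new NewTopic("user-clicks", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```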
Broker and Cluster
Broker: A Kafka server responsible for storing and serving data.
Cluster: A group of brokers working together, providing redundancy and fault tolerance.
Zookeeper
Zookeeper coordinates Kafka brokers and maintains cluster metadata. Apache Kafka has historically relied on Zookeeper for leader election, managing configurations, and maintaining state; newer Kafka releases can also run without Zookeeper by using KRaft mode, which moves this coordination into Kafka itself.
Core Features of Apache Kafka
High Throughput and Low Latency
Kafka’s architecture enables it to process millions of messages per second, with low latency in the milliseconds range, making it ideal for real-time analytics.
Kafka processes 1 trillion messages per day at LinkedIn.
Durability and Fault Tolerance
Kafka provides durability by persisting data across multiple brokers. Data replication and leader-follower roles within partitions ensure fault tolerance.
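A minimal sketch of producer settings that lean on this replication for durability; the broker address is assumed and the values are illustrative, not mandatory defaults.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurableProducerConfig {
    static Properties durableProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for all in-sync replicas to acknowledge
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // avoid duplicates when retrying
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // keep retrying transient failures
        return props;
    }
}
```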
Scalability
Kafka’s distributed architecture allows it to scale horizontally by adding more brokers to the cluster.
Data Retention
Kafka can retain data for a specified duration, allowing data replay and analysis. Retention policies can be based on time or size.
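As an illustration, retention can be adjusted per topic through the Admin API; the topic name and the seven-day / 1 GiB limits below are assumptions, not recommended values.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-clicks");
            // Keep records for 7 days OR until a partition reaches 1 GiB, whichever comes first.
            AlterConfigOp byTime = new AlterConfigOp(
                new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            AlterConfigOp bySize = new AlterConfigOp(
                new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Arrays.asList(byTime, bySize))).all().get();
        }
    }
}
```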
Stream Processing Capabilities
Kafka Streams, Kafka’s processing API, provides tools to build real-time applications that process data within Kafka topics.
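Here is a minimal Kafka Streams sketch that reads one topic, filters events, and writes the result to another topic. The application id, topic names, and the purchase filter are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ClickFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-filter-app");   // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from "user-clicks", keep only purchase events, write them to "purchases".
        KStream<String, String> clicks = builder.stream("user-clicks");
        clicks.filter((user, event) -> event.contains("\"action\":\"purchase\""))
              .to("purchases");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```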
How Does Apache Kafka Work?
Data processing in Kafka looks fairly simple on the surface, but the deeper you go, the more intricate it gets. It broadly follows four steps:
➡️ Publishing Data
➡️ Consuming Data
➡️ Fault Tolerance
➡️ Stream Processing
When a producer sends data to a Kafka topic, it isn't directly delivered to consumers. Instead, the data is stored in topic partitions and remains there until deleted based on a set retention period. Consumers fetch data from the topics they are subscribed to, and each partition is accessed by only one consumer in a group at a time, ensuring load balancing. Consumers monitor which records they have read by tracking their offsets, allowing them to revisit or skip records as needed. Kafka also ensures reliability by replicating each partition across multiple brokers, so if one broker fails, others can take over without data loss.
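A minimal sketch of that consuming side in Java: the consumer joins a group, polls for records, and commits its offsets explicitly so it can resume where it left off after a restart. The group id and topic name are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClickEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker
        props.put("group.id", "recommendation-engine");          // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");                 // commit offsets explicitly

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("user-clicks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // record progress so a restart resumes from the committed offset
            }
        }
    }
}
```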
Additionally, Kafka supports real-time data processing through Kafka Streams, enabling the building of applications where both inputs and outputs are managed within Kafka.
Setting Up Apache Kafka: A Step-by-Step Guide
Prerequisites
Java 8 or higher
Apache Zookeeper
Apache Kafka binary package
Steps:
Install Zookeeper and Kafka
Download and install Zookeeper. Start the Zookeeper server.
Download Kafka and start the Kafka server, specifying the broker configuration.
Create Topics
kafka-topics.sh --create --topic sample-topic --bootstrap-server localhost:9092
Produce and Consume Messages
Start a producer to send messages and a consumer to read messages in real-time.
Scaling Kafka
Add more brokers to the cluster and use partitions to improve throughput.
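For example, a topic's partition count can be increased through the Admin API (new brokers themselves are added by starting additional Kafka server processes with unique broker ids pointing at the same cluster); the topic name and target count below are assumptions. Note that partition counts can only grow, never shrink.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // Grow "user-clicks" to 12 partitions so more consumers in a group can read in parallel.
            admin.createPartitions(Map.of("user-clicks", NewPartitions.increaseTo(12))).all().get();
        }
    }
}
```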
Conclusion
Apache Kafka has recently undergone significant advancements, notably the release of version 3.9 in early November 2024. This update marks the final major release in the 3.x series and introduces dynamic KRaft quorums, enabling seamless controller node changes without downtime. Additionally, the Tiered Storage feature, which has been in development since Kafka 3.6, is now considered production-ready, offering new tools for managing storage loads.
These developments highlight Kafka's commitment to enhancing scalability, reliability, and ease of management, solidifying its position as a leading event streaming platform. As organizations increasingly rely on real-time data processing, understanding Kafka's evolving capabilities is essential for building robust, future-proof data architectures.