What is Kafka?

What is Apache Kafka?

Apache Kafka is an open-source distributed streaming platform designed to handle high volumes of real-time data. LinkedIn originally developed it and later donated it to the Apache Software Foundation, where it became an Apache top-level project.

At its core, Kafka is a publish-subscribe messaging system that allows producers to publish messages to a topic and consumers to subscribe to one or more topics to consume those messages. Unlike traditional message brokers, Kafka is designed to handle high-throughput data streams, making it well suited for building real-time chat and messaging applications.

Kafka is known for its fault-tolerant, scalable data processing. Its distributed architecture can be deployed across multiple servers or clusters, allowing it to absorb large volumes of incoming data and spread the workload across the available resources.

Another key feature of Kafka is durability. Kafka stores every message it receives in a distributed commit log, providing reliable data retention and replayability. Even if a consumer goes offline or fails, it can resume processing from where it left off, preserving data consistency and reliability.

How does Apache Kafka work?

Kafka operates as a distributed publish-subscribe system, and it uses a distributed commit log to store streams of records, allowing multiple producers and consumers to read and write data concurrently.

Kafka follows a client-server architecture, where the server is called a Kafka broker. Producers are responsible for writing data to Kafka topics, essentially categories or feeds to which records can be published. Each record consists of a key, a value, and optional metadata. Producers can choose to publish records synchronously or asynchronously.
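
As an illustration, here is a minimal producer sketch using the official Kafka Java client, showing both a synchronous and an asynchronous publish. The broker address localhost:9092, the topic name chat-messages, and the record key room-42 are placeholder assumptions, not part of any real deployment:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ChatProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Synchronous publish: block until the broker acknowledges the write.
            RecordMetadata meta = producer
                .send(new ProducerRecord<>("chat-messages", "room-42", "hello"))
                .get();
            System.out.printf("Wrote to partition %d at offset %d%n",
                meta.partition(), meta.offset());

            // Asynchronous publish: the callback runs once the send completes.
            producer.send(
                new ProducerRecord<>("chat-messages", "room-42", "hello again"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        } // close() flushes any records still buffered
    }
}
```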

Consumers, on the other hand, read data from Kafka topics. They subscribe to one or more topics and consume records in the order they were written. Kafka allows multiple consumers to form consumer groups, where each consumer within a group processes a subset of the partitions in a topic. This enables parallel processing and load balancing.
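
The consumer side can be sketched similarly. In this hypothetical example, the group id chat-readers is an assumed name; two copies of this program started with the same group id would split the topic's partitions between them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChatConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "chat-readers"); // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest"); // start from the log's beginning if no offset exists

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("chat-messages"));
            while (true) {
                // Poll the broker for the next batch of records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```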

Kafka stores the published records in a distributed commit log divided into multiple partitions. Each partition is an ordered, immutable sequence of records. The partitions allow Kafka to scale horizontally by distributing the load across multiple servers or brokers.

To ensure fault tolerance and durability, Kafka replicates partitions across multiple brokers. Each partition has a leader and one or more followers. The leader handles all read and write requests for the partition, while the followers replicate the data and serve as backups. If the leader fails, one of the followers is elected as the new leader to ensure continuous operation.
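
Both the partition count and the replication factor are fixed when a topic is created. The following sketch uses Kafka's AdminClient to create an assumed chat-messages topic with six partitions, each replicated to three brokers (one leader plus two followers); the numbers are illustrative, not recommendations:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread the load across brokers; replication factor 3
            // keeps a leader plus two follower copies of each partition.
            NewTopic topic = new NewTopic("chat-messages", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```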

Kafka also provides strong durability guarantees by storing data on disk and allowing configurable replication factors to ensure data redundancy. This means that even if a broker or disk fails, the data is still available and can be recovered.

To achieve high throughput and low latency, Kafka uses a combination of batching, compression, and zero-copy techniques. Producers can batch multiple records together before sending them to Kafka, reducing network overhead, and Kafka supports compressing those batches to further minimize bandwidth usage. Additionally, Kafka serves reads from the operating system's page cache and uses the sendfile system call for zero-copy transfers, avoiding unnecessary data copies and improving performance.
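
Batching, compression, and durability are all exposed as producer configuration. The values in this sketch are illustrative starting points; the right settings depend entirely on the workload:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Batching: wait up to 10 ms to fill batches of up to 64 KB,
        // trading a little latency for fewer, larger network requests.
        props.put("linger.ms", "10");
        props.put("batch.size", Integer.toString(64 * 1024));

        // Compression: compress whole batches to cut network and disk usage.
        props.put("compression.type", "lz4");

        // Durability: wait for the leader and all in-sync replicas to acknowledge.
        props.put("acks", "all");
        return props;
    }
}
```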

Kafka's architecture also supports fault tolerance and scalability through Kafka Connect and Kafka Streams. Kafka Connect allows easy integration with external systems, enabling data ingestion into and egress from Kafka. Kafka Streams provides a high-level API for building stream processing applications on top of Kafka, allowing developers to process and transform data in real time.
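
As a taste of the Streams API, here is a minimal sketch that reads an assumed input topic, uppercases each message value in flight, and writes the result to an output topic. The topic names and application id are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("chat-messages");
        // Transform each record as it flows through and write to an output topic.
        input.mapValues(value -> value.toUpperCase())
             .to("chat-messages-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```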

What are some of the features of Apache Kafka?

Apache Kafka is a distributed streaming platform that offers a wide range of features, making it highly suitable for building real-time chat and messaging applications. Some of the notable features of Apache Kafka are:

  • High Throughput: Kafka is designed to handle large data streams efficiently, sustaining very high message rates by appending records sequentially to disk and batching network transfers.

  • Scalability: Kafka scales horizontally, allowing you to add more brokers to the cluster to handle increasing data loads.

  • Fault Tolerance: Kafka provides fault tolerance by replicating data across multiple brokers in a cluster. If a broker fails, the replicas on the remaining brokers continue to serve the data without interruption.

  • Durability: Kafka persists data on disk, ensuring data integrity and durability. It allows you to configure the retention period for data, meaning you can store data for as long as you need.

  • Message Retention: Kafka provides configurable retention policies, allowing you to decide how long to retain messages in the system (see the sketch after this list). This feature is crucial for applications that require historical data analysis.

  • Flexibility: Kafka supports various data formats and protocols, making it flexible for different use cases. It can handle both structured and unstructured data and supports various integration patterns.

  • Monitoring and Management: Kafka provides a robust set of tools and APIs for monitoring and managing clusters, including metrics, logs, and administrative APIs. This makes it easier to monitor the health and performance of Kafka clusters.

  • Cloud-Native Deployment: Kafka is well-suited for cloud-native deployments, as it can be easily deployed and managed in cloud environments such as AWS, Azure, and Google Cloud Platform. It can also integrate with cloud-native services for data processing and analytics.
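
As mentioned under Message Retention above, retention is a per-topic configuration. This sketch uses the AdminClient to set an assumed chat-messages topic to keep messages for seven days; both the topic name and the duration are illustrative:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "chat-messages");
            // Keep messages for 7 days; retention.ms is in milliseconds.
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("retention.ms", Long.toString(7L * 24 * 60 * 60 * 1000)),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                Collections.singletonMap(topic, Collections.singletonList(op)))
                .all().get();
        }
    }
}
```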

Disadvantages of using Apache Kafka

While Apache Kafka offers many advantages, there are also notable disadvantages to consider:

  • Complexity of Configuration and Management: Setting up and managing a Kafka cluster can be complex and require expertise in distributed systems. Developers need to configure various parameters, such as replication factor, partition count, and retention policies, to optimize the performance and reliability of their Kafka deployments. Monitoring and troubleshooting Kafka can also be challenging, especially in large-scale production environments.

  • Potentially High Latency: While Kafka is known for its high throughput, it may introduce latency in certain scenarios. This can be due to network congestion, message serialization and deserialization, or the processing time of consumers. Developers must carefully design and optimize their applications to minimize latency and ensure real-time performance.

  • Lack of Real-time Messaging Features: While Kafka excels at handling large volumes of data and providing fault-tolerant data pipelines, it may not offer the same real-time messaging features as specialized messaging platforms like PubNub. PubNub, for example, provides additional features like presence detection and mobile push notifications.

  • Limited Support for Non-Java Languages: While Kafka provides client libraries for several programming languages, its core functionality and ecosystem primarily focus on Java. Developers working with other languages may face limited support and documentation, making integrating Kafka into their existing tech stack harder.

  • Resource Intensive: Kafka can be resource-intensive, especially when dealing with high message throughput or large data volumes. It requires a dedicated infrastructure to handle the storage and processing requirements.

  • Operational Overhead: Running a Kafka cluster requires ongoing maintenance and monitoring. This includes managing partitions, handling replication, and monitoring performance. This can add operational overhead and require dedicated resources.

  • Learning Curve: Apache Kafka has a steep learning curve, particularly for developers unfamiliar with distributed systems or event-driven architectures. Understanding its concepts and best practices may take time and effort.

  • High Initial Setup Cost: Implementing Kafka can require significant upfront costs, especially for organizations that must invest in dedicated hardware or cloud infrastructure to support the Kafka cluster. This can be a barrier for smaller companies or startups with limited resources.

  • Complex Monitoring and Troubleshooting: Monitoring and troubleshooting Kafka can be challenging due to its distributed nature. Identifying and resolving partitioning, replication, or performance issues can require deep technical expertise and specialized tools.

  • Dependency on ZooKeeper: Kafka has traditionally relied on ZooKeeper for cluster coordination and metadata management, introducing an additional layer of complexity and potential points of failure; any issues with ZooKeeper can impact the overall stability and availability of the Kafka cluster. Newer Kafka releases can run in KRaft mode, which removes the ZooKeeper dependency, but many existing deployments still depend on it.

  • Inflexible Schema Evolution: Kafka's schema evolution capabilities are relatively limited compared to other data streaming platforms. Modifying the schema of existing topics can be challenging and may require complex migration strategies, which can be time-consuming and prone to errors.

  • Lack of Native Analytics and Querying Capabilities: Kafka is designed as a distributed messaging system and does not provide native analytics or querying capabilities. Developers need to integrate Kafka with other tools or platforms, such as Apache Spark or Elasticsearch, to perform complex data analysis or search operations on the stream of messages.

  • Limited Support for Message Ordering: Kafka guarantees message ordering within a single partition but not across multiple partitions. This can be challenging for applications that rely on strict message ordering, such as financial systems or event-driven workflows. Developers must carefully design their partitioning strategy, typically by keying related records to the same partition, to get the ordering semantics they need (see the sketch after this list).

  • Potential Data Duplication: In some scenarios, Kafka may introduce data duplication. This can happen when a producer retries sending a message after a failure, resulting in the same message being written to the log more than once. Kafka's idempotent producer mode mitigates this on the broker side, but developers often still need to handle duplicate messages on the consumer side to ensure data consistency and avoid processing the same data multiple times.

  • Limited Backward Compatibility: Kafka's backward compatibility is limited to a certain extent. Upgrading to a newer version of Kafka may require changes to the client code to accommodate any breaking changes or new features. This can be time-consuming and may introduce compatibility issues if not handled properly. Developers should carefully plan and test the upgrade process to ensure a smooth transition without impacting the stability of their applications.

  • Limited Support for Complex Routing and Transformation: Kafka's routing and transformation capabilities are relatively limited compared to other message queuing systems. Developers may need to implement custom logic or integrate with external tools to perform complex routing, filtering, or data transformation operations on the stream of messages. This can add complexity to the application architecture and require additional development effort.

  • Lack of Built-in Stream Processing: The Kafka broker itself focuses on message storage and delivery and does not process streams. Stream processing requires the separate Kafka Streams library or external frameworks such as Apache Flink or Spark, which adds overhead and complexity to the application architecture, as these tools must be integrated and operated alongside the cluster.
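
To illustrate the ordering and duplication points above: giving related records the same key routes them all to the same partition, preserving their order, and enabling the idempotent producer makes broker-side retries safe. The topic payments and key account-17 in this sketch are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedDedupedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Idempotence makes retries safe: the broker de-duplicates re-sent
        // batches by sequence number, so each record is written exactly once.
        props.put("enable.idempotence", "true");
        props.put("acks", "all"); // required when idempotence is enabled

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records sharing a key land in the same partition, so all events
            // for account-17 are consumed in the order they were sent.
            producer.send(new ProducerRecord<>("payments", "account-17", "debit:10"));
            producer.send(new ProducerRecord<>("payments", "account-17", "credit:10"));
        }
    }
}
```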

What are some Kafka use cases?

Apache Kafka serves a wide range of use cases, such as:

  • Real-time stream processing: Kafka allows applications to process and analyze real-time data streams, making it suitable for use cases such as fraud detection, real-time analytics, and online machine learning.

  • Log aggregation: Kafka's ability to handle high-throughput data ingestion makes it an ideal choice for log aggregation use cases. It can collect logs from multiple sources and centralize them for further analysis, monitoring, and debugging.

  • Commit log for distributed systems: Kafka's durability and fault-tolerance features make it well suited as a commit log for distributed systems. It stores events and replicates them across multiple nodes, preserving data integrity even when individual nodes fail.

  • Change data capture (CDC): Kafka's ability to capture and stream real-time data changes from databases allows applications to react to data modifications in near real time. CDC use cases include data synchronization, data warehousing, and building materialized views.

  • Event sourcing: Kafka's log-based architecture and ability to store and replay events make it a good fit for event sourcing patterns. It can capture and store events in a system, enabling audit trails, temporal queries, and state reconstruction.

  • Metrics and monitoring: Kafka can be a reliable and scalable data pipeline for collecting and processing metrics and monitoring data. It can ingest data from various sources, perform real-time processing, and forward it to monitoring systems for analysis and visualization.

  • Microservices communication: Kafka's publish-subscribe model and support for message partitioning enable efficient communication between microservices. It can be used as a communication channel for asynchronous and event-driven architectures, facilitating decoupling and scaling of microservices.

What is Kafka’s architecture?

The architecture of Apache Kafka is designed to handle large-scale, real-time data streams with high throughput. It follows a distributed and scalable design, allowing it to handle large amounts of data and support high data ingestion rates.

At the core of Kafka's architecture are the following components:

  • Topics: Topics are the primary data organization unit in Kafka. They represent a category or feed name to which messages are published. Messages published to a topic are stored in an append-only log structure.

  • Producers: Kafka Producers are the entities responsible for publishing messages to Kafka topics. They write data to Kafka as records consisting of a key, value, and optional metadata. Producers can choose which topic to publish to and specify a partition key to control how records are distributed across partitions.

  • Brokers: Brokers form the cluster of servers in Kafka. They are responsible for storing published records and serving them to consumers. Each broker manages one or more partitions of each topic. Brokers do not track consumer positions (consumers manage their own offsets), and the cluster can scale horizontally, allowing for high availability and fault tolerance.

  • Partitions: Topics are divided into multiple partitions, which are ordered, immutable sequences of records. Each partition's leader is hosted on a single broker within the cluster, with replicas on others, and multiple partitions allow for parallel processing and increased throughput.

  • Consumers: Kafka Consumers read messages from Kafka topics. They can subscribe to one or more topics and consume data at their own pace. Consumer groups enable parallel processing, with each consumer in a group reading from an exclusive subset of the topic's partitions. This allows for scalable and efficient processing of data.

  • Connectors: Kafka Connect is a framework for building and running connectors that enable the integration of Kafka with external systems. Connectors allow for easily ingesting data from and outputting data to various sources and sinks.

  • Streams: Kafka Streams is a client library that allows for building real-time streaming applications that process data in Kafka. It provides an API for consuming, processing, and producing data streams, enabling the creation of applications such as event-driven architectures and real-time analytics.

Integrating PubNub with Kafka

Enterprise technologies such as Kafka are generating more insights than ever, but how do you turn these into actions?

The PubNub Bridge for Kafka runs side-by-side with your other on-premises enterprise systems to provide a secure, scalable, and highly available mechanism to integrate with PubNub. By connecting PubNub with your Kafka instance, you can:

  • Integrate mobile app event notifications without writing code or opening firewalls, allowing you to interface with mobile workers and enable Bring Your Own Device (BYOD) use cases for your enterprise employees.

  • Grant access to your shared event stream across teams, without additional business routing logic or segmentation, enabling collaboration across your organization with data access audit trails.

PubNub Kafka Bridge Sink Connector and Events and Actions

For more information on our PubNub bridge for Kafka, see our developers page.