Kafka Producer on Spot Instances with SQS Failover

Running Kafka on AWS Spot Instances can be a cost-effective solution, but it introduces a critical challenge: spot instance interruptions. These interruptions can disrupt message publishing, potentially causing data loss or delayed processing.

At adjoe, we implemented a dual-path messaging strategy that uses AWS SQS as a fallback for failed Kafka messages, built on top of gosoline, a framework for creating Go applications developed by our sister company, JustTrack.

At the moment, the feature lives in our downstream repository. In this article, we’ll walk you through how we architected this mechanism, focusing on our producer logic and retry handling.

Why AWS Spot Instances and why SQS?

Our Programmatic Supply tech team handles tens of millions of requests daily, a scale that is inherent to mobile advertising. To meet this demand efficiently while keeping costs down, we rely on spot instances.

This setup gives us the flexibility to scale our infrastructure dynamically and to support adjoe’s core products: Playtime rewarded ads and programmatic bidding solutions. To do so, our architecture needs to adapt quickly to the mobile advertising ecosystem we operate in.

Challenge

Running Apache Kafka on spot instances provides cost savings compared to On-Demand Instances, but spot instances can be terminated with minimal warning. This introduces reliability challenges for Kafka, which powers our entire event streaming platform and parts of our data pipeline.

Kafka is designed for high throughput and availability, but it is sensitive to instance terminations. If a producer is running on a spot instance that gets interrupted, messages may fail to publish, leading to potential data loss that undermines our data pipelines and our ability to extract value from them.

Additionally, when a new node joins the cluster, it needs to sync data from the existing brokers. This operation causes high network bandwidth usage, which in turn leads to higher response latencies between our adjoe SDK and the backend.

Such latency spikes have triggered alarms and resulted in some data loss in user event streaming. As a consequence, these seemingly minor pod interruptions turned out to be anything but insignificant for our operations.

Solution

To mitigate this, we integrated AWS SQS as a fallback mechanism. When the Kafka producer fails to send a message due to instance interruptions, it writes the message to an SQS queue. 

This SQS queue acts as a temporary buffer, holding failed messages until the Kafka cluster is back online and ready to process them. Let’s dive into the implementation steps. 

Implementation Overview 

Our implementation follows a modular design, structured around the following components: 

[Diagram] Apache Kafka producer on AWS Spot Instances, with AWS SQS as a fallback for failed Kafka messages.

  • Primary Output – Kafka: The producer initially attempts to publish messages directly to Kafka.
  • Fallback Output – SQS: If the primary output fails, the producer redirects the message to an SQS queue.
  • Retry Daemon: The producer configuration enables a retry daemon that periodically polls the SQS queue and attempts to republish messages to Kafka once it becomes available again.
  • Configuration-Driven: All behavior, including retry logic, output targets, and encoding settings, is configurable, allowing for easy adaptation to different environments. A sample configuration looks like this:

stream:
  output:
    example_output: # primary Kafka output
      connection: default
      topic: example-kafka-topic
      type: kafka
      batch_size: 10
      idle_timeout: 1s
      write_timeout: 1s
      balancer: default
  producer:
    example_producer:
      output: example_output
      encoding: kafka/proto-wire-format
      retry: # SQS-backed fallback and retry settings
        enabled: true
        after: 2m
        max_attempts: 5
        queue_id: example-sqs-queue-name
        max_number_of_messages: 10
        wait_time: 20
        runner_count: 1
        client_name: default
        msg_body_encoding: base64
        unmarshaller: msg_base64

Configuration Details

Our Kafka Producer uses configurable settings to define its behavior across various scenarios:

Direct Publishing to Kafka 

Under normal conditions, the producer’s primary objective is to publish messages directly to Kafka. This is our default and most efficient path, optimized for low-latency processing and high throughput. Messages are sent to the configured topic, and successful delivery is acknowledged by the Kafka broker.
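
To make the primary path concrete, here is a minimal sketch in Go, assuming the segmentio/kafka-go client; in our setup the writer is created from the configuration shown earlier, so the broker addresses and topic below are placeholders, while the batching values mirror the example_output settings:

package main

import (
	"context"
	"time"

	"github.com/segmentio/kafka-go"
)

// newKafkaWriter mirrors the example_output settings: batches of up to
// 10 messages and a 1s write timeout. The Balancer is left nil to use the
// client's default, matching "balancer: default" in the configuration.
func newKafkaWriter(brokers []string, topic string) *kafka.Writer {
	return &kafka.Writer{
		Addr:         kafka.TCP(brokers...),
		Topic:        topic,
		BatchSize:    10,
		BatchTimeout: time.Second,
		WriteTimeout: time.Second,
	}
}

// publish attempts the direct path; the caller decides what happens on error.
func publish(ctx context.Context, w *kafka.Writer, key, value []byte) error {
	return w.WriteMessages(ctx, kafka.Message{Key: key, Value: value})
}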

Fallback to SQS 

This is where our resilience mechanism kicks in. If the direct publishing attempt to Kafka fails, whether due to a spot instance interruption, network partition, or any issue that makes Kafka unresponsive, the producer doesn’t drop the message. 

Instead, it intelligently redirects the message to a designated AWS SQS queue. This SQS queue acts as a durable, temporary buffer. It reliably holds these “failed” messages, ensuring they are not lost during the Kafka cluster’s unavailability.

Each message sent to SQS includes its original Kafka topic and any other necessary metadata to facilitate reprocessing.
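
As a rough illustration of this fallback path (the envelope fields and helper name below are ours for the example, not gosoline’s actual wire format), the producer can wrap the payload with its original topic, base64-encode the body to match the msg_body_encoding setting, and push it to SQS with the AWS SDK for Go v2:

package main

import (
	"context"
	"encoding/base64"
	"encoding/json"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// retryEnvelope carries the metadata needed to republish the message later.
// The field names are illustrative, not gosoline's actual format.
type retryEnvelope struct {
	Topic string `json:"topic"` // original Kafka topic
	Key   string `json:"key"`
	Body  string `json:"body"` // base64-encoded payload, per msg_body_encoding
}

// sendToFallbackQueue buffers a message that could not be written to Kafka.
func sendToFallbackQueue(ctx context.Context, client *sqs.Client, queueURL, topic string, key, payload []byte) error {
	env := retryEnvelope{
		Topic: topic,
		Key:   string(key),
		Body:  base64.StdEncoding.EncodeToString(payload),
	}
	raw, err := json.Marshal(env)
	if err != nil {
		return err
	}
	_, err = client.SendMessage(ctx, &sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(string(raw)),
	})
	return err
}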

Retry Daemon

Central to our failover strategy is the Retry Daemon, a dedicated component that operates independently. This daemon continuously polls the SQS queue for messages that were redirected during Kafka outages.

The daemon picks up messages from SQS and attempts to re-publish them to their original Kafka topics, based on the retry parameters defined in the configuration. 

The mechanism effectively separates the retry logic from the primary producer flow. It ensures the producer can continue processing new incoming messages without blocking. 

The daemon contains robust retry logic, including configurable delays and maximum attempts. This prevents infinite retry loops: if delivery keeps failing, messages are eventually routed to a dead letter queue, as governed by the max_attempts setting in our configuration.
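
A heavily simplified version of that polling loop is sketched below; gosoline’s actual retry runner also handles the configured delay (after), attempt counting (max_attempts), and DLQ routing, which this sketch leaves to the queue’s redrive policy. The republish callback stands in for decoding the envelope and writing it back to its original topic:

package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// runRetryDaemon long-polls the fallback queue and hands each body to the
// republish callback. On success the SQS message is deleted; on failure it
// becomes visible again and is retried on a later receive.
func runRetryDaemon(ctx context.Context, client *sqs.Client, queueURL string,
	republish func(ctx context.Context, body string) error) error {
	for {
		out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: 10, // mirrors max_number_of_messages
			WaitTimeSeconds:     20, // mirrors wait_time (long polling)
		})
		if err != nil {
			return err
		}
		for _, msg := range out.Messages {
			if err := republish(ctx, aws.ToString(msg.Body)); err != nil {
				continue // Kafka is still unavailable; keep the message queued
			}
			_, _ = client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			})
		}
	}
}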

Benefits of Kafka in Spot Instances with SQS Failover

This architecture provides several key benefits:

  • Cost Efficiency: Spot instances reduce infrastructure costs, while SQS provides durable message storage during instance interruptions. As mentioned in our earlier blogs, spot instance pricing can be up to 90% lower than the On-Demand price, leading to substantial AWS cost savings.
  • Scalability: The retry daemon can handle bursts of messages when Kafka becomes available again, preventing message overload. When we onboarded new customers, we saw the potential for a threefold increase in traffic due to our expanded programmatic advertising solutions.
  • Resilience: Messages are never lost; they are safely stored in SQS if Kafka is unreachable.
  • Flexibility: The architecture is fully configurable, allowing for quick adaptation to changing requirements or new output mechanisms.

Challenges & Considerations

While the implementation is robust, it introduces some challenges:

Message Ordering

With messages potentially flowing through two distinct paths, direct to Kafka or via SQS as a fallback, maintaining strict global message order becomes a challenge. 

It’s not feasible to perfectly synchronize two asynchronous messaging systems to guarantee exact ordering. For our current use cases, eventual consistency was acceptable.

That said, if you’re processing financial transactions or state changes, out-of-order messages could lead to incorrect data. This trade-off between strict ordering and a simplified fallback was a key decision point in our design.

Idempotency

Ensuring that messages are not processed multiple times during retries helps maintain data consistency. We address this via unique message IDs processed idempotently by consumers. 

Since a message might be successfully published to Kafka after being sent to SQS, and then reprocessed from SQS, consumers could receive duplicates. 

Our approach to idempotency involves embedding a unique message ID within the message payload. Downstream consumers are designed to check this ID and discard any messages they have already processed. 

In practice, this involves using a dedicated key-value store (such as Redis) to track processed message IDs. This mechanism prevents side effects like double-counting metrics or creating duplicate records, further enhancing data consistency.
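
A minimal sketch of that consumer-side check, assuming the go-redis client (the key prefix, TTL, and handleMessage callback are illustrative, not our production values):

package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// processOnce records the message ID in Redis before running the handler.
// SetNX succeeds only for the first writer, so duplicates that arrive via
// the SQS retry path are skipped once the ID has been seen.
func processOnce(ctx context.Context, rdb *redis.Client, messageID string,
	handleMessage func(ctx context.Context) error) error {
	// The key prefix and 24h retention window are illustrative choices.
	firstSeen, err := rdb.SetNX(ctx, "processed:"+messageID, 1, 24*time.Hour).Result()
	if err != nil {
		return err
	}
	if !firstSeen {
		return nil // duplicate: already processed, skip side effects
	}
	return handleMessage(ctx)
}

In a real consumer, you would also decide what happens if the handler fails after the ID has been marked, for example by removing the key or by marking the ID only after successful processing.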

Latency

SQS introduces additional latency, particularly when Kafka is temporarily unavailable. The added delay is within acceptable bounds for our current SLOs, and we actively monitor queue depths.

This is an inherent characteristic of using an asynchronous queue for failover. While Kafka offers near real-time processing, the SQS fallback introduces a delay for messages that go through this path. 

This latency needs to be accounted for in message processing. We manage this by monitoring the SQS queue depth and the time messages spend in the queue. 

That is why we also configure alerts on the dead letter queues. For critical real-time data, we might consider alternative failover strategies or prioritize restoring Kafka quickly. But for many of our use cases, the brief increase in latency during an outage is a small price to pay for guaranteed delivery.
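
As a rough sketch of how that queue-depth check can be done with the AWS SDK for Go v2 (the returned value would then be exported to our monitoring stack; the function and parameter names are ours for the example):

package main

import (
	"context"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

// fallbackQueueDepth reads the approximate number of messages waiting in the
// fallback queue, which is the signal we watch during Kafka outages.
func fallbackQueueDepth(ctx context.Context, client *sqs.Client, queueURL string) (int, error) {
	attr := types.QueueAttributeNameApproximateNumberOfMessages
	out, err := client.GetQueueAttributes(ctx, &sqs.GetQueueAttributesInput{
		QueueUrl:       aws.String(queueURL),
		AttributeNames: []types.QueueAttributeName{attr},
	})
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(out.Attributes[string(attr)])
}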

Error Handling

Comprehensive error handling is essential to prevent infinite retry loops or data loss. We plan to implement a circuit breaker to prevent overwhelming SQS during prolonged Kafka outages or severe backlogs.
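
Since the circuit breaker is still on our roadmap, the sketch below is purely illustrative of one possible shape: a breaker that counts fallback writes in a fixed time window and, once a threshold is crossed, returns an error so callers back off instead of flooding SQS.

package main

import (
	"errors"
	"sync"
	"time"
)

// ErrFallbackThrottled signals callers to back off instead of flooding SQS.
var ErrFallbackThrottled = errors.New("fallback circuit breaker open")

// fallbackBreaker is an illustrative sketch, not our final design: it counts
// fallback writes per window and opens once a threshold is crossed, so that
// prolonged Kafka outages apply backpressure upstream.
type fallbackBreaker struct {
	mu          sync.Mutex
	windowStart time.Time
	window      time.Duration
	count       int
	threshold   int
}

func newFallbackBreaker(threshold int, window time.Duration) *fallbackBreaker {
	return &fallbackBreaker{threshold: threshold, window: window, windowStart: time.Now()}
}

// Allow reports whether another fallback write may proceed right now.
func (b *fallbackBreaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Since(b.windowStart) > b.window {
		b.windowStart, b.count = time.Now(), 0 // start a new window
	}
	if b.count >= b.threshold {
		return ErrFallbackThrottled
	}
	b.count++
	return nil
}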

Reliable Kafka, Even on Spot Instances

Using Apache Kafka on spot instances can be risky due to potential interruptions, but by integrating SQS as a fallback mechanism, we can mitigate data loss and maintain consistent message delivery. This solution has significantly enhanced the consistency and reliability of adjoe Ads, our programmatic advertising solution, while keeping it cost-effective, and it allows us to serve millions of customers across the globe.

Our implementation uses a configurable, modular design that abstracts retry logic into a dedicated daemon, allowing for seamless fallback and recovery. This approach enables us to continue benefiting from cost-effective spot instances without sacrificing data reliability.

If your infrastructure faces similar resilience challenges, we strongly encourage you to consider such a failover strategy. It can prove invaluable for safeguarding your data pipeline and maintaining operational continuity.

For those building Go applications, our gosoline framework provides a robust foundation to implement these resilient patterns. Feel free to explore our public repository for more insights.
