adjoe Engineers’ Blog

Running Apache Kafka® on Spot Instances

Apache Kafka is an open-source distributed event-streaming platform. At adjoe we deploy our Kafka cluster on Kubernetes and use it for event streaming – but also in some cases as an event bus. 

Some of our applications that process requests publish messages to Kafka topics, so the Kafka brokers need to be reliable. Running a reliable Kafka deployment can be costly. The high costs come from the way Kafka achieves resiliency: to avoid unplanned downtime, the data has to be replicated across brokers.

Here at adjoe we always consider the financial impact of our solutions without sacrificing the reliability of our product. Our solutions need to be scalable, reliable, and cost-effective. An easy way of decreasing costs is to use AWS spot instances instead of on-demand instances. Spot instances can be up to 90 percent cheaper than on-demand instances.

In this article, I will showcase how we managed to run self-managed Apache Kafka on AWS spot instances to cut costs by around 60 percent.

The Setup before the Switch

  • We usually use a replication factor of 3 for our topics with minimum in-sync replicas set to 2. 
  • We use segmentio/kafka-go as our Go Kafka client, indirectly through justtrackio/gosoline, a framework for creating Go applications developed by our sister company justtrack.
  • We publish the messages in async mode.
  • Our Kafka and Zookeeper run on Kubernetes.
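
To make the first setting concrete, here is a small worked example of what a replication factor of 3 with 2 minimum in-sync replicas tolerates. It assumes the producer waits for acknowledgement from all in-sync replicas (acks=all), which is not stated above, and that a replica on a lost broker simply drops out of the in-sync set. The function name canAcceptWrites is made up for this illustration.

```go
package main

// canAcceptWrites reports whether a partition still accepts writes from a
// producer that requires acknowledgement from all in-sync replicas.
// Assumption: every replica that is not down is in sync.
func canAcceptWrites(replicationFactor, minInSyncReplicas, replicasDown int) bool {
	inSync := replicationFactor - replicasDown
	return inSync >= minInSyncReplicas
}
```

Under these assumptions, canAcceptWrites(3, 2, 1) is true and canAcceptWrites(3, 2, 2) is false: losing one broker to a spot interruption is survivable for writes, losing two at the same time is not.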

Which Problems Did We Try to Solve?

When switching from an on-demand deployment to a spot instance deployment, you should expect nodes to go down at any time. When a node that runs a Kafka broker goes down, all the partitions for which this broker was the leader become unavailable until a new leader is elected, and this process can sometimes be slow. Some in-flight requests may also exceed their timeout, and some of the resulting error responses are not retryable.

There are use cases where it is acceptable for the request to return an error and then be retried. But in some cases, we don’t want to propagate the error back to the user, so we have to guarantee that the messages will eventually be produced. In theory, that would mean keeping the messages in memory until we can write them to Kafka, but if the leader election takes too long, we risk losing those messages to an OOM kill.

The Idea

When a broker goes down, all the partitions for which the broker is a leader will become unavailable until a new leader is elected. Kafka uses a key to partition the messages. There can be multiple strategies, but usually the default partitioner is used. The default partitioner guarantees that all the messages with the same partition key will be assigned to the same partition. 
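
To make this guarantee concrete, here is a minimal sketch of key-based partitioning. It is not the code of any particular client: the function name partitionForKey is made up, and FNV-1a is used only for illustration, since each client picks its own hash function.

```go
package main

import "hash/fnv"

// partitionForKey hashes the key and maps the hash onto the available
// partitions. The same key and the same partition list always yield the
// same partition. partitions must not be empty.
func partitionForKey(key []byte, partitions ...int) int {
	h := fnv.New32a()
	h.Write(key) // Write on a hash.Hash never returns an error
	return partitions[int(h.Sum32()%uint32(len(partitions)))]
}
```

The flip side is visible in the same line of code: if the chosen partition has no leader, every message with that key is stuck until the leader election finishes.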

In our use case, this guarantee is not important, so we asked ourselves: “What would happen if we were to change that behavior, so that when we detect a partition is offline, we try to send the message to a different partition?” And that is what we implemented as an experiment.

Without Active Partition Balancer

diagram showing adjoe running Apache Kafka cluster without active partition balancer

With Active Partition Balancer

diagram showing adjoe running Apache Kafka cluster with active partition balancer

How Does It Work?

First, we had to get rid of async writing, because we want to be able to detect whether a write succeeded or failed. The async writing functionality was provided by the kafka-go client.

Next, we had to implement our own partitioner, which would be aware of errors when we publish a message. kafka-go calls a partitioner a Balancer and provides an interface for it:

type Balancer interface {
   Balance(msg Message, partitions ...int) (partition int)
}

As you can see, the Balance method takes the message to be produced and a variadic list of partitions. For example, if your topic has five partitions, the call would look like this:

p := balancer.Balance(msg, 0, 1, 2, 3, 4)
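
As an illustration of how small an implementation of this interface can be, here is a sketch of a round-robin Balancer that ignores the message key. The Message type is a stripped-down stand-in for kafka.Message so that the example compiles on its own, and roundRobin is a made-up name, not the type shipped with kafka-go.

```go
package main

import "sync/atomic"

// Message is a minimal stand-in for kafka.Message.
type Message struct {
	Topic string
	Key   []byte
	Value []byte
}

// Balancer mirrors the interface shown above.
type Balancer interface {
	Balance(msg Message, partitions ...int) (partition int)
}

// roundRobin is a hypothetical Balancer that cycles through the partitions
// in order, regardless of the message key.
type roundRobin struct {
	counter uint32
}

// Balance returns the next partition in the cycle. It is safe for
// concurrent use because the counter is updated atomically.
func (r *roundRobin) Balance(_ Message, partitions ...int) int {
	n := atomic.AddUint32(&r.counter, 1) - 1
	return partitions[int(n%uint32(len(partitions)))]
}

// Compile-time check that roundRobin satisfies Balancer.
var _ Balancer = (*roundRobin)(nil)
```

Six consecutive calls with the partitions 0 to 4 return 0, 1, 2, 3, 4 and then 0 again.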

If we want to introduce a mechanism that can react when a write request fails, the Balancer should be aware of that. We created a new interface to do this.

type KafkaBalancer interface {
   kafka.Balancer

   OnSuccess(kafka.Message)
   OnError(kafka.Message, error)
}

Now we can notify the Balancer when an error happens. 
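
The notification itself could be wired up roughly as follows. This is a sketch under assumptions, not our production code: writeFunc stands in for whatever performs the synchronous write, and writeAndNotify, countingBalancer, and errBrokerUnavailable are names invented for the example. The types are local stand-ins for their kafka-go counterparts.

```go
package main

import "errors"

// Message is a minimal stand-in for kafka.Message.
type Message struct {
	Topic     string
	Partition int
	Value     []byte
}

// KafkaBalancer mirrors the interface shown above, with the embedded
// kafka.Balancer written out so the example is self-contained.
type KafkaBalancer interface {
	Balance(msg Message, partitions ...int) (partition int)
	OnSuccess(Message)
	OnError(Message, error)
}

// writeFunc is a hypothetical stand-in for a synchronous write to Kafka.
type writeFunc func(Message) error

// errBrokerUnavailable is a made-up error used to simulate a failed write.
var errBrokerUnavailable = errors.New("broker unavailable")

// writeAndNotify performs the write and tells the balancer how it went.
func writeAndNotify(write writeFunc, b KafkaBalancer, msg Message) error {
	if err := write(msg); err != nil {
		b.OnError(msg, err)
		return err
	}
	b.OnSuccess(msg)
	return nil
}

// countingBalancer is a toy KafkaBalancer that only counts notifications.
type countingBalancer struct{ successes, failures int }

func (c *countingBalancer) Balance(_ Message, partitions ...int) int { return partitions[0] }
func (c *countingBalancer) OnSuccess(Message)                        { c.successes++ }
func (c *countingBalancer) OnError(Message, error)                   { c.failures++ }
```

This only works with synchronous writes: the caller has to know the outcome of each write before it can report it to the balancer.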

Next, we created a new Balancer called activePartitionBalancer, which implements the KafkaBalancer interface. This new Balancer maintains a list of circuit breakers per topic and partition.

How Does activePartitionBalancer Work?

When a new message is about to be balanced, this is how it works:

diagram showing how activePartitionBalancer works

  • When the write message operation fails, the error is passed to the OnError function of the Balancer, which registers the failed attempt.
  • When the write message operation succeeds, the message is passed to the OnSuccess function of the Balancer, which resets the partition circuit breaker.
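
The idea can be sketched in a few lines of Go. This is a simplified illustration, not the actual activePartitionBalancer: the selection order (round robin, skipping partitions whose breaker is open), the failure threshold, the cooldown, and all names such as sketchBalancer are assumptions made for the example. It also assumes that Message.Partition holds the partition the write was sent to.

```go
package main

import (
	"sync"
	"time"
)

// Message is a minimal stand-in for kafka.Message.
type Message struct {
	Topic     string
	Partition int // assumed: the partition the write targeted
}

// defaultCooldown is an arbitrary value chosen for this sketch.
const defaultCooldown = 30 * time.Second

type topicPartition struct {
	topic     string
	partition int
}

// breaker counts failures and blocks its partition until openUntil.
type breaker struct {
	failures  int
	openUntil time.Time
}

// sketchBalancer keeps one circuit breaker per topic and partition.
type sketchBalancer struct {
	mu        sync.Mutex
	breakers  map[topicPartition]*breaker
	threshold int           // failures needed to open a breaker
	cooldown  time.Duration // how long an open breaker blocks a partition
	next      int           // round-robin cursor
}

func newSketchBalancer(threshold int, cooldown time.Duration) *sketchBalancer {
	return &sketchBalancer{
		breakers:  make(map[topicPartition]*breaker),
		threshold: threshold,
		cooldown:  cooldown,
	}
}

// available reports whether the partition's breaker is closed or its
// cooldown has expired. The caller must hold b.mu.
func (b *sketchBalancer) available(topic string, partition int) bool {
	br, ok := b.breakers[topicPartition{topic, partition}]
	return !ok || br.failures < b.threshold || !time.Now().Before(br.openUntil)
}

// Balance returns the first available partition, starting at the cursor.
func (b *sketchBalancer) Balance(msg Message, partitions ...int) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	start := b.next % len(partitions)
	b.next = (start + 1) % len(partitions)
	for i := range partitions {
		p := partitions[(start+i)%len(partitions)]
		if b.available(msg.Topic, p) {
			return p
		}
	}
	return partitions[start] // every breaker is open: fall back to the cursor
}

// OnError registers a failed attempt and opens the breaker at the threshold.
func (b *sketchBalancer) OnError(msg Message, _ error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	key := topicPartition{msg.Topic, msg.Partition}
	br := b.breakers[key]
	if br == nil {
		br = &breaker{}
		b.breakers[key] = br
	}
	br.failures++
	if br.failures >= b.threshold {
		br.openUntil = time.Now().Add(b.cooldown)
	}
}

// OnSuccess resets the breaker of the partition the message was written to.
func (b *sketchBalancer) OnSuccess(msg Message) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.breakers, topicPartition{msg.Topic, msg.Partition})
}
```

With newSketchBalancer(2, defaultCooldown), two OnError calls for partition 0 of a topic take that partition out of rotation, and a single OnSuccess for it puts it back. One side effect of this simple selection order is that the partition following an open one receives its share of traffic as well.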

You can find all the implementation details here.

Things We Consider When We Write Cost-Effective Code

  • Try to take advantage of spot instances whenever possible.
  • If you doubt that a service can run on spot instances, you can always perform an experiment and evaluate your ideas.
  • Do not settle: re-evaluate your solutions.
  • Design your code so that it can withstand unexpected disruptions (see chaos engineering).
