
Building a Resilient Backend: Lessons from an Event-Driven Architecture Approach

When designing a backend to handle high load, choosing the right architecture approach isn’t just important – it’s mission-critical. At adjoe, our systems manage half a billion requests every single day. At this scale, relying on synchronous communication between services simply isn’t an option. We needed to build a backend that rose to the challenge and could:

  • Communicate asynchronously, ensuring smooth operations even during traffic surges.
  • Handle traffic spikes seamlessly, processing every request.
  • Scale effortlessly to meet growing demand.
  • Maintain loosely coupled services, enabling independent scaling.

The response to these demands? An event-driven backend architecture. Instead of calling services directly, events are published to record what has happened.

Well-designed events act as facts within the service domain – for example, a user clicking on an ad or installing an app. The service that publishes the event doesn’t need to know – or care – what happens when another service consumes it. In this model, the traditional HTTP call to a service is replaced by the simple act of publishing an event.
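
As a minimal sketch of this shift (the names here – EventPublisher, AdService, OnInstall – are illustrative, not adjoe’s actual code), the producer depends only on a publisher abstraction instead of on the consuming service’s HTTP API:

import "context"

// EventPublisher is a hypothetical abstraction over the message broker.
type EventPublisher interface {
   Publish(ctx context.Context, eventName string, payload any) error
}

// AdService is an illustrative producer of events.
type AdService struct {
   publisher EventPublisher
}

// OnInstall records the fact that an app install happened.
// There is no HTTP call to downstream services – they subscribe to the event instead.
func (s *AdService) OnInstall(ctx context.Context, install any) error {
   return s.publisher.Publish(ctx, "user_install_received", install)
}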

Event-driven patterns might seem complex at first, especially compared to a simple backend with an asynchronous API and a database. However, once you understand the concept, event-driven architecture actually makes designing distributed systems easier.

Choosing the Right Message Broker

The first step when working with events is selecting a message broker (the infrastructure component, also known as a message queue or pub/sub mechanism) that will handle message delivery. Events are published to the event bus (i.e. the message broker), which allows multiple consumers to subscribe and receive those messages.

Since our infrastructure relies on AWS, we initially chose SNS/SQS as our message broker for its high availability and practically unlimited out-of-the-box scaling. Events are simple value objects that represent the system’s state. With the backend built as a pure Go monorepo, event definitions are shared across services as Go structs, which are then marshalled into JSON.

type InstallData struct {
   AppID     string
   DeviceID  string
   UserUUID  string
   CreatedAt t.AdjoeTime
   Platform  p.Platform `json:",omitempty"`
}
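
For illustration only (the values below are placeholders, and CreatedAt/Platform are omitted because they use internal types), producing the payload amounts to filling the shared struct and marshalling it with encoding/json into the JSON body that is handed to the broker client:

// hypothetical values; CreatedAt and Platform omitted for brevity
install := InstallData{
   AppID:    "com.example.app",
   DeviceID: "device-123",
   UserUUID: "1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed",
}

// the shared struct becomes the JSON body of the published message
payload, err := json.Marshal(install)
if err != nil {
   return err
}
_ = payload // ends up as the message body on the SNS topic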

Event Handling with AWS SNS/SQS

The producing service publishes events to an SNS topic using the AWS SDK for Go. The implementation includes optional features such as compression, message deduplication, and batching to optimize performance and reduce costs – with SNS/SQS you also pay for the amount of data transferred.

Events published to an SNS topic have a limited lifespan – they are either consumed or automatically discarded after expiration.

func (s *SNSClient) PublishMessage(
   topicName string,
   subject string,
   message *Message,
   options ...PublishOptions,
) (*string, error) {
   for _, option := range options {
      option(topicName, message)
   }

   // add the .fifo suffix if the topic uses FIFO features
   isFIFO := message.GroupID != "" || message.DeduplicationID != ""
   topicName = GetTopicNameFIFO(topicName, isFIFO)

   // get the topic ARN
   hasDeduplicationID := message.DeduplicationID != ""
   arn, err := s.GetTopicArn(topicName, hasDeduplicationID)
   if err != nil {
      return nil, err
   }

   // marshal the message to JSON
   data, err := helper.MarshalJSONWithOptions(message, isFIFO)
   if err != nil {
      return nil, err
   }

Each service that subscribes to a particular event has an SQS queue linked to the SNS topic for that event. SQS provides at-least-once delivery: a message is redelivered until the consuming service acknowledges it by deleting it from the queue. Additionally, queues can be configured to process events in the exact order they arrive (FIFO queues), and subscriptions can filter messages based on specific criteria.
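
As a rough sketch of what at-least-once consumption looks like with the AWS SDK for Go (the queue URL and handler below are placeholders; our real consumers are wired through an internal event framework), the consumer only deletes a message after processing it successfully – otherwise SQS simply redelivers it:

// simplified receive loop; svc is an *sqs.SQS client from github.com/aws/aws-sdk-go
func consume(svc *sqs.SQS, queueURL string, handle func(body []byte) error) {
   for {
      out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
         QueueUrl:            aws.String(queueURL),
         MaxNumberOfMessages: aws.Int64(10),
         WaitTimeSeconds:     aws.Int64(20), // long polling
      })
      if err != nil {
         continue // transient receive error: just poll again
      }

      for _, m := range out.Messages {
         if err := handle([]byte(aws.StringValue(m.Body))); err != nil {
            // no delete: the message becomes visible again and is redelivered
            continue
         }
         // acknowledge by deleting the message from the queue
         _, _ = svc.DeleteMessage(&sqs.DeleteMessageInput{
            QueueUrl:      aws.String(queueURL),
            ReceiptHandle: m.ReceiptHandle,
         })
      }
   }
}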

On the consumer side, the HTTP handler is replaced by an event handler, which processes the event payload and typically stores the data in a DynamoDB table for further use; the install handler shown further below is an example. The rest of the publishing code handles message attributes, compression, and the actual publish call:

   // (continuation of PublishMessage: set message attributes and publish)
   messageAttributes := map[string]*sns.MessageAttributeValue{}
   var messageString string

   // flag batched messages so consumers know how to unpack them
   if message.IsBatch {
      messageAttributes[constant.IsBatch] = &sns.MessageAttributeValue{
         DataType:    aws.String("String"),
         StringValue: aws.String("true"),
      }
   }

   // gzip-compress and base64-encode the message, if required
   if message.IsCompressed {
      messageAttributes[constant.IsCompressed] = &sns.MessageAttributeValue{
         DataType:    aws.String("String"),
         StringValue: aws.String("true"),
      }
      data, err = helper.Gzip(data)
      if err != nil {
         return nil, err
      }
      messageString = base64.StdEncoding.EncodeToString(data)
   } else {
      messageString = string(data)
   }

   // prepare parameters for publishing
   params := &sns.PublishInput{
      Message:           aws.String(messageString),
      MessageAttributes: messageAttributes,
      Subject:           aws.String(subject),
      TopicArn:          arn,
   }

   // FIFO deduplication
   if message.DeduplicationID != "" {
      params.MessageDeduplicationId = &message.DeduplicationID
   }

   // publish the message to the SNS topic
   result, err := s.svc.Publish(params)
   if err != nil {
      return nil, err
   }
   return result.MessageId, nil
}

Important to note: in an event-driven design, the system operates with eventual consistency. This means that, at any given moment, different services might be in different states depending on which events they’ve processed so far. While eventual consistency is a common challenge in distributed systems – especially in event-driven architectures – it’s an acceptable trade-off for adjoe’s business model. 

One challenge of event-driven architecture is ensuring events are consumed and processed successfully. If processing fails, the event will be redelivered until handled, allowing the system to recover over time. However, unprocessed messages lingering too long often indicate a deeper issue. 

For example, we display statistical data based on user clicks and installs; if there are issues with publishing, consuming, or aggregating these events, dashboards will temporarily show incorrect numbers, but they will eventually catch up. Another example is notifying another service that a user install has been received: if the event notification fails, the user’s settings might not be updated in real time, but that won’t break the system.

This makes observability crucial, with the number of unprocessed messages being a key metric to monitor. At adjoe, we address this by configuring AWS SQS queues to send unprocessed messages to dead-letter queues (DLQs), where they can be retried later. We also use CloudWatch metrics and alarms to monitor unprocessed messages in real time, ensuring issues are quickly identified and resolved.
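
For reference, attaching a dead-letter queue is just a redrive policy on the source queue. Here is a hedged sketch with the AWS SDK for Go (the ARN and maxReceiveCount are placeholders, and in practice this is usually set through infrastructure-as-code rather than at runtime):

// after maxReceiveCount failed receives, SQS moves the message to the DLQ
redrivePolicy := `{"deadLetterTargetArn":"arn:aws:sqs:eu-central-1:123456789012:install-events-dlq","maxReceiveCount":"5"}`

_, err := svc.SetQueueAttributes(&sqs.SetQueueAttributesInput{
   QueueUrl: aws.String(queueURL),
   Attributes: map[string]*string{
      "RedrivePolicy": aws.String(redrivePolicy),
   },
})
if err != nil {
   return err
}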

Migrating from AWS Queues to Self-Hosted Kafka

As our event volume grew, AWS pricing for SNS/SQS pub/sub – despite its stability and convenience – became a concern. To give a rough idea, the backend manages around 130 different events published to a similar number of SNS topics, with over 300 SQS queues subscribed – around 500 if we also count the dead-letter queues. The busiest queues process around half a million messages daily. To address rising costs, we began migrating to a self-hosted Kafka setup as our new pub/sub mechanism.

This migration required a few changes. First, we now define events using protobuf schemas, which are published to a schema registry to maintain consistency and compatibility across services. Regardless of the broker, the consumer side keeps the same shape: an event handler receives the typed payload and persists it – for example, to a DynamoDB table, as in the install handler below.

type InstallHandler struct {
   repository djoemo.RepositoryInterface
}

// init the handler
func NewInstallHandler(repository djoemo.RepositoryInterface) event.Handler {
   return InstallHandler{
      repository: repository,
   }
}

func (h InstallHandler) Handle(ctx context.Context, e event.Event, result chan bool) {
   eventData := e.Data().(*event.InstallData)

   // construct the event to be saved
   install := model.Install{
      AppID:     eventData.AppID,
      DeviceID:  eventData.DeviceID,
      UserUUID:  eventData.UserUUID,
      CreatedAt: eventData.CreatedAt,
      Platform:  eventData.Platform,
   }

   // save the event to DynamoDB
   err := h.repository.SaveItemWithContext(ctx, install.Key(), install)
   if err != nil {
      result <- false
      return
   }
   result <- true
}

To replicate SNS/SQS fan-out behavior in Kafka, we create one consumer group per subscriber for each Kafka topic. Each partition acts as an ordered, immutable sequence of events that is continuously appended to. Kafka guarantees ordering within a partition, meaning messages are delivered in the same order they were produced – providing FIFO consumption per partition by default.
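
Purely to illustrate the fan-out, here is a consumer-group sketch using the segmentio/kafka-go client (broker addresses, topic, and group names are placeholders, and this is not necessarily the client library we use):

// each subscribing service runs its own consumer group on the same topic,
// so every subscriber receives the full event stream independently
ctx := context.Background()

reader := kafka.NewReader(kafka.ReaderConfig{
   Brokers: []string{"kafka-1:9092", "kafka-2:9092"},
   Topic:   "install-events",
   GroupID: "statistics-service", // one group per subscriber
})
defer reader.Close()

for {
   // within a partition, messages arrive in the order they were produced
   msg, err := reader.ReadMessage(ctx)
   if err != nil {
      break
   }
   _ = msg.Value // decode (e.g. protobuf) and hand over to the event handler
}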

However, publishing to Kafka isn’t as reliable as publishing to SNS. To address this, we implemented a fallback mechanism: for each Kafka topic, we define fallback SQS and dead-letter SQS queues to ensure no message is lost in case of publishing failures.

Message deduplication in Kafka requires additional development effort. To address this, we generate a deduplication key for each event and use a Redis cache to identify and skip publishing duplicate events.

//go:generate mockgen -package mocks -source publisher.go -destination ../../../mocks/streaming/kafka/publisher.go
type PublisherI[T any] interface {
   Publish(ctx context.Context, v *T) error
}

type TransformerI[T any] interface {
   Transform(*T) (*Message, error)
}

type Publisher[T any] struct {
   producerKafka kafkaproducer.KafkaProducerI
   transformer   TransformerI[T]

   // optional fields set via options
   cache            cache.CacheClientInterface
   cascadePublisher messaging.SQSClientInterface
   cascadeEventName event.EventName
}

func NewPublisher[T any](
   producerKafka kafkaproducer.KafkaProducerI,
   transformer TransformerI[T],
   opts ...Option[T],
) *Publisher[T] {
   p := &Publisher[T]{
      producerKafka: producerKafka,
      transformer:   transformer,
   }

   for _, opt := range opts {
      opt(p)
   }

   return p
}

func (p *Publisher[T]) Publish(ctx context.Context, v *T) (err error) {
   m, err := p.transformer.Transform(v)
   if err != nil {
      return errors.Wrap(err, "transforming struct")
   }

   // if publishing ultimately fails, fall back to the cascade (SQS) queue
   defer func() {
      if err != nil {
         err = p.tryCascade(ctx, m, err)
      }
   }()

   // deduplicate via Redis: only the first attempt for a given key gets published
   key := m.DeduplicationKey()
   if p.cache != nil {
      success, err := p.cache.SetNX(ctx, key, "", m.Expiration()).Result()
      if err != nil {
         return errors.Wrap(err, "saving entity deduplication key to cache")
      }

      if !success {
         log.Debug("skip publishing entity because there has already been an attempt to do it")
         return nil
      }
   }

   err = p.producerKafka.Produce(ctx, m.Key(), m.Message())
   if err == nil {
      return nil
   }

   // publishing failed: release the deduplication key so a retry can publish again
   if p.cache != nil {
      log.WithError(err).Warn("entity publish failed, deleting deduplication key from the cache")
      if _, err := p.cache.Delete(ctx, key).Result(); err != nil {
         return errors.Wrap(err, "deleting entity deduplication key from cache")
      }
   }

   return errors.Wrap(err, "producing to kafka")
}

func (p *Publisher[T]) tryCascade(ctx context.Context, m *Message, err error) error {
   if p.cascadePublisher == nil {
      return err
   }

   e := event.ReserveQueueEvent{
      Message:          m.Message(),
      Key:              m.Key(),
      DeduplicationKey: m.DeduplicationKey(),
      Expiration:       m.Expiration(),
   }

   data, err := json.Marshal(e)
   if err != nil {
      return errors.Wrap(err, "marshaling reserve queue event")
   }

   msg := messaging.Message{Body: data}
   queueName := messaging.QueueName(env.EnvServicePrefix(string(p.cascadeEventName)))

   if _, err := p.cascadePublisher.PublishMessageWithContext(ctx, queueName, msg); err != nil {
      return errors.Wrap(err, "publishing to cascade queue")
   }

   return nil
}
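
To give an idea of how such a typed publisher gets wired up and used – note that the option constructors (WithCache, WithCascade) and variable names below are hypothetical, not the exact API in our codebase:

// hypothetical wiring for install events
installPublisher := NewPublisher[event.InstallData](
   kafkaProducer,      // kafkaproducer.KafkaProducerI implementation
   installTransformer, // TransformerI[event.InstallData] implementation
   WithCache[event.InstallData](redisCache),            // hypothetical option: Redis deduplication
   WithCascade[event.InstallData](sqsClient, "install"), // hypothetical option: SQS fallback queue
)

if err := installPublisher.Publish(ctx, &installData); err != nil {
   log.WithError(err).Error("publishing install event")
}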

As part of the Kafka refactoring, we integrated Prometheus for monitoring. The Go client for Prometheus automatically provides default metrics, but we also added custom metrics and alerts specifically for the messaging queues.
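
As an example of what such a custom metric can look like with the standard client_golang library (the metric and label names are illustrative), a counter per topic and result can be registered alongside the default Go metrics:

// counter for consumed messages, labelled by topic and result;
// uses github.com/prometheus/client_golang/prometheus
var messagesProcessed = prometheus.NewCounterVec(
   prometheus.CounterOpts{
      Name: "kafka_messages_processed_total",
      Help: "Number of consumed Kafka messages by topic and result.",
   },
   []string{"topic", "result"},
)

func init() {
   prometheus.MustRegister(messagesProcessed)
}

// called from the consumer after each message is handled
func recordProcessed(topic string, ok bool) {
   result := "success"
   if !ok {
      result = "failure"
   }
   messagesProcessed.WithLabelValues(topic, result).Inc()
}

An alert on the failure rate (or on the dead-letter queue depth) then catches stuck consumers early.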

The Key to Event-Driven Success 

If there’s one key lesson we’ve learned from running an event-driven architecture in production, it’s this: prioritize observability from the very start. Debugging incidents in an event-based system without the ability to correlate events is a huge challenge. All three pillars of observability are essential: logging helps you understand why your system is in a certain state, metrics show you how long it’s been there, and tracing reveals what’s affected by that state.

In short, while event-driven architecture provides unparalleled scalability and flexibility, it comes with its own set of challenges. By focusing on observability, avoiding common pitfalls like the “Big Ball of Mud” (unstructured code from poorly defined events), and selecting the right tools and infrastructure, we’ve built a resilient, self-healing system that handles massive traffic loads. The road hasn’t always been smooth, but the rewards are clear: we now have a backend that’s not only adaptable but also capable of scaling with our growing business.
