
Building a Resilient Backend: Lessons from an Event-Driven Architecture

When designing a backend to handle high load, choosing the right architectural approach isn’t just important – it’s mission-critical. At adjoe, our systems manage half a billion requests every single day. At this scale, relying on synchronous communication between services simply isn’t an option. We needed to build a backend that rose to the challenge and could:

  • Communicate asynchronously, ensuring smooth operations even during traffic surges.
  • Handle traffic spikes seamlessly, processing every request.
  • Scale effortlessly to meet growing demand.
  • Maintain loosely coupled services, enabling independent scaling.

The response to these demands? An event-driven backend architecture. Instead of calling services directly, events are published to record what has happened.

Well-designed events act as facts within the service domain – for example, a user clicking on an ad or installing an app. The service that publishes the event doesn’t need to know – or care – what happens when another service consumes it. In this model, the traditional HTTP call to a service is replaced by the simple act of publishing an event.
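To make the idea concrete, here’s a minimal sketch using the InstallData struct shown below – the EventBus interface and topic name are hypothetical, not adjoe’s actual interfaces:

import "context"

// EventBus is a hypothetical abstraction over the message broker.
type EventBus interface {
   Publish(ctx context.Context, topic string, event interface{}) error
}

// RecordInstall publishes a fact instead of calling the consuming
// service over HTTP; it neither knows nor cares who subscribes.
func RecordInstall(ctx context.Context, bus EventBus, install InstallData) error {
   return bus.Publish(ctx, "install-received", install)
}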

Event-driven patterns might seem complex at first, especially compared to a simple backend with an asynchronous API and a database. However, once you understand the concept, event-driven architecture actually makes designing distributed systems easier.

Choosing the Right Message Broker

The first step when working with events is selecting a broker, or message queue – the infrastructure component, also known as a pub/sub mechanism – that will handle message publishing. Events are published to the event bus (i.e. the message broker), which allows multiple consumers to subscribe and receive those messages.

Since our infrastructure relies on AWS, we initially chose SNS/SQS as our message broker for its high availability and practically unlimited scaling out of the box. Events are simple value objects that represent a system’s state. With the backend built as a pure Go monorepo, event definitions are shared across services as Go structs, which are then marshalled into JSON.

// InstallData is the event published when a user installs an app.
type InstallData struct {
   AppID     string
   DeviceID  string
   UserUUID  string
   CreatedAt t.AdjoeTime
   Platform  p.Platform `json:",omitempty"`
}
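As a usage sketch (the field values here are made up), publishing an event boils down to serializing such a struct:

// a made-up instance of the event, marshalled to JSON for publishing
payload, err := json.Marshal(InstallData{
   AppID:    "com.example.app",
   DeviceID: "device-123",
   UserUUID: "c0ffee00-0000-0000-0000-000000000000",
})
if err != nil {
   return err
}
// payload is now ready to be handed to the message broker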

Event Handling with AWS SNS/SQS

The producing service publishes events to an SNS topic using the AWS SDK for Go. The implementation includes optional features such as compression, message deduplication, and batching to optimize performance and reduce costs, since with SNS/SQS you also pay for the amount of data transferred.

Events published to an SNS topic have a limited lifespan – they are either consumed or automatically discarded after expiration.

func (s *SNSClient) PublishMessage(
   topicName string,
   subject string,
   message *Message,
   options ...PublishOptions) (*string, error) {

   // apply publish options (compression, batching, etc.)
   for _, option := range options {
      option(topicName, message)
   }

   // add the .fifo suffix if the topic uses FIFO features
   isFIFO := message.GroupID != "" || message.DeduplicationID != ""
   topicName = GetTopicNameFIFO(topicName, isFIFO)

   // get the topic ARN
   hasDeduplicationID := message.DeduplicationID != ""
   arn, err := s.GetTopicArn(topicName, hasDeduplicationID)
   if err != nil {
      return nil, err
   }

   // marshal the message to JSON
   data, err := helper.MarshalJSONWithOptions(message, isFIFO)
   if err != nil {
      return nil, err
   }

Each service that subscribes to a particular event has an SQS queue linked to the SNS topic for that event. SQS provides at-least-once delivery: a message is redelivered until the consumer acknowledges it by deleting it from the queue. Additionally, queues can be set up to process events in the exact order they arrive (FIFO queues), and SNS subscription filter policies can restrict which messages a queue receives.
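The wiring itself isn’t shown in this post; as a rough sketch with the AWS SDK for Go v1 (ARNs and the filter policy are placeholders, and the filter assumes Platform is sent as a message attribute):

// subscribe an SQS queue to an SNS topic to fan out events
_, err := snsClient.Subscribe(&sns.SubscribeInput{
   TopicArn: aws.String("arn:aws:sns:eu-central-1:123456789012:install-received"),
   Protocol: aws.String("sqs"),
   Endpoint: aws.String("arn:aws:sqs:eu-central-1:123456789012:rewards-install-received"),
   Attributes: map[string]*string{
      // optional: deliver only messages matching specific criteria
      "FilterPolicy": aws.String(`{"Platform": ["android"]}`),
   },
})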

On the consumer side, the HTTP handler is replaced by an event handler that processes the event payload and typically stores the data in a DynamoDB table for further use (a full handler example appears below).

The rest of PublishMessage builds the message attributes, encodes the payload (compressing it when requested), and publishes:

   var messageString string

   // flag batched and/or compressed messages via message attributes
   messageAttributes := map[string]*sns.MessageAttributeValue{}
   if message.IsBatch {
      messageAttributes[constant.IsBatch] = &sns.MessageAttributeValue{
         DataType:    aws.String("String"),
         StringValue: aws.String("true"),
      }
   }

   // gzip compress and base64 encode the message, if required
   if message.IsCompressed {
      messageAttributes[constant.IsCompressed] = &sns.MessageAttributeValue{
         DataType:    aws.String("String"),
         StringValue: aws.String("true"),
      }
      data, err = helper.Gzip(data)
      if err != nil {
         return nil, err
      }
      messageString = base64.StdEncoding.EncodeToString(data)
   } else {
      messageString = string(data)
   }

   // prepare parameters for publishing
   params := &sns.PublishInput{
      Message:           aws.String(messageString),
      MessageAttributes: messageAttributes,
      Subject:           aws.String(subject),
      TopicArn:          arn,
   }

   // FIFO ordering and deduplication
   if message.GroupID != "" {
      params.MessageGroupId = &message.GroupID
   }
   if message.DeduplicationID != "" {
      params.MessageDeduplicationId = &message.DeduplicationID
   }

   // publish the message to the SNS topic
   result, err := s.svc.Publish(params)
   if err != nil {
      return nil, err
   }
   return result.MessageId, nil
}

Important to note: in an event-driven design, the system operates with eventual consistency. This means that, at any given moment, different services might be in different states depending on which events they’ve processed so far. While eventual consistency is a common challenge in distributed systems – especially in event-driven architectures – it’s an acceptable trade-off for adjoe’s business model. 

One challenge of event-driven architecture is ensuring events are consumed and processed successfully. If processing fails, the event will be redelivered until handled, allowing the system to recover over time. However, unprocessed messages lingering too long often indicate a deeper issue. 

For example, we display statistical data based on user clicks and installs; if there are issues with publishing, consuming, or aggregating these events, dashboards will show incorrect data until it is eventually updated. Another example is notifying another service that a user install has been received: if the event notification fails, the user’s settings may not be updated in real time, but that will not break the system.

This makes observability crucial, with the number of unprocessed messages being a key metric to monitor. At adjoe, we address this by configuring AWS SQS queues to send unprocessed messages to dead-letter queues (DLQs), where they can be retried later. We also use CloudWatch metrics and alarms to monitor unprocessed messages in real time, ensuring issues are quickly identified and resolved.
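The DLQ wiring itself is plain queue configuration. A hedged sketch with the AWS SDK for Go v1 (the ARN, retention period, and receive count are placeholder values):

// move messages that fail processing five times to a dead-letter queue
redrivePolicy := `{"deadLetterTargetArn":"arn:aws:sqs:eu-central-1:123456789012:install-received-dlq","maxReceiveCount":"5"}`

_, err := sqsClient.SetQueueAttributes(&sqs.SetQueueAttributesInput{
   QueueUrl: aws.String(queueURL),
   Attributes: map[string]*string{
      sqs.QueueAttributeNameRedrivePolicy: aws.String(redrivePolicy),
      // keep unconsumed messages for four days before they are discarded
      sqs.QueueAttributeNameMessageRetentionPeriod: aws.String("345600"),
   },
})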

Migrating from AWS Queues to Self-Hosted Kafka

As our event volume grew, AWS pricing for SNS/SQS pub/sub – despite its stability and convenience – became a concern. As a rough idea, the backend manages around 130 different events published to a similar number of SNS topics, with over 300 SQS queues subscribed. This number goes up to 500 if we also count dead-letter queues. The busiest queues process around half a million messages daily. To address rising costs, we began migrating to a self-hosted Kafka setup as our new pub/sub mechanism.

This migration required a few changes. First, we now define events using protobuf schemas, which are published to a schema registry to maintain consistency and compatibility across services.
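The schemas themselves aren’t shown here; as an illustration, the publish path now serializes a generated protobuf type instead of a shared Go struct marshalled to JSON (pb stands for a hypothetical generated package, and the values are placeholders):

import (
   "google.golang.org/protobuf/proto"

   pb "github.com/adjoe/events/pb" // hypothetical generated package
)

func marshalInstall() ([]byte, error) {
   // the generated struct replaces the shared Go struct + JSON
   return proto.Marshal(&pb.InstallData{
      AppId:    "com.example.app", // placeholder values
      DeviceId: "device-123",
   })
}

On the consumer side, the handler shape is unchanged: an event handler processes the payload and persists it, as in this install handler that writes to DynamoDB.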

type InstallHandler struct {
   repository djoemo.RepositoryInterface
}

// NewInstallHandler initializes the handler
func NewInstallHandler(repository djoemo.RepositoryInterface) event.Handler {
   return InstallHandler{
      repository: repository,
   }
}

func (h InstallHandler) Handle(ctx context.Context, e event.Event, result chan bool) {
   eventData := e.Data().(*event.InstallData)

   // construct the model to be saved
   install := model.Install{
      AppID:     eventData.AppID,
      DeviceID:  eventData.DeviceID,
      UserUUID:  eventData.UserUUID,
      CreatedAt: eventData.CreatedAt,
      Platform:  eventData.Platform,
   }

   // save the event to DynamoDB; signal success or failure to the consumer loop
   err := h.repository.SaveItemWithContext(ctx, install.Key(), install)
   if err != nil {
      result <- false
      return
   }
   result <- true
}

To replicate SNS/SQS fan-out behavior in Kafka, we create one consumer group per subscriber for each Kafka topic. Each partition acts as an ordered, immutable sequence of events that is continuously appended to. Kafka ensures message ordering within partitions, meaning messages are delivered in the same order they were produced – providing FIFO consumption by default.
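The post doesn’t name the Kafka client library; as a sketch with github.com/segmentio/kafka-go (brokers, group, and topic names are placeholders), one consumer group per subscribing service reproduces the SNS/SQS fan-out:

import (
   "context"
   "log"

   "github.com/segmentio/kafka-go"
)

func consumeInstalls(ctx context.Context) {
   // each subscribing service uses its own GroupID, so every
   // service receives its own copy of the topic (SNS-style fan-out)
   r := kafka.NewReader(kafka.ReaderConfig{
      Brokers: []string{"kafka-1:9092", "kafka-2:9092"},
      GroupID: "rewards-service", // one group per subscriber
      Topic:   "install-received",
   })
   defer r.Close()

   for {
      // messages within a partition arrive in production order;
      // ReadMessage commits offsets automatically for consumer groups
      msg, err := r.ReadMessage(ctx)
      if err != nil {
         return
      }
      log.Printf("install event: key=%s", msg.Key)
   }
}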

However, publishing to Kafka isn’t as reliable as publishing to SNS. To address this, we implemented a fallback mechanism: for each Kafka topic, we define fallback SQS and dead-letter SQS queues to ensure no message is lost in case of publishing failures.

Message deduplication in Kafka requires additional development effort. To address this, we generate a deduplication key for each event and use a Redis cache to identify and skip publishing duplicate events.

//go:generate mockgen -package mocks -source publisher.go -destination ../../../mocks/streaming/kafka/publisher.go
type PublisherI[T any] interface {
   Publish(ctx context.Context, v *T) error
}

type TransformerI[T any] interface {
   Transform(*T) (*Message, error)
}

type Publisher[T any] struct {
   producerKafka kafkaproducer.KafkaProducerI
   transformer   TransformerI[T]

   // optional fields set via options
   cache            cache.CacheClientInterface
   cascadePublisher messaging.SQSClientInterface
   cascadeEventName event.EventName
}

func NewPublisher[T any](
   producerKafka kafkaproducer.KafkaProducerI,
   transformer TransformerI[T],
   opts ...Option[T],
) *Publisher[T] {
   p := &Publisher[T]{
      producerKafka: producerKafka,
      transformer:   transformer,
   }

   for _, opt := range opts {
      opt(p)
   }

   return p
}

func (p *Publisher[T]) Publish(ctx context.Context, v *T) (err error) {
   m, err := p.transformer.Transform(v)
   if err != nil {
      return errors.Wrap(err, "transforming struct")
   }

   // if publishing fails, fall back to the cascade SQS queue
   defer func() {
      if err != nil {
         err = p.tryCascade(ctx, m, err)
      }
   }()

   // deduplicate: only the first attempt to set the key may publish
   key := m.DeduplicationKey()
   if p.cache != nil {
      success, err := p.cache.SetNX(ctx, key, "", m.Expiration()).Result()
      if err != nil {
         return errors.Wrap(err, "saving entity deduplication key to cache")
      }

      if !success {
         log.Debug("skip publishing entity because there has already been an attempt to do it")
         return nil
      }
   }

   err = p.producerKafka.Produce(ctx, m.Key(), m.Message())
   if err == nil {
      return nil
   }

   // remove the deduplication key so a later retry can publish again
   if p.cache != nil {
      log.WithError(err).Warn("entity publish failed, deleting deduplication key from the cache")
      if _, err := p.cache.Delete(ctx, key).Result(); err != nil {
         return errors.Wrap(err, "deleting entity deduplication key from cache")
      }
   }

   return errors.Wrap(err, "producing to kafka")
}

func (p *Publisher[T]) tryCascade(ctx context.Context, m *Message, err error) error {
   if p.cascadePublisher == nil {
      return err
   }

   e := event.ReserveQueueEvent{
      Message:          m.Message(),
      Key:              m.Key(),
      DeduplicationKey: m.DeduplicationKey(),
      Expiration:       m.Expiration(),
   }

   data, err := json.Marshal(e)
   if err != nil {
      return errors.Wrap(err, "marshaling reserve queue event")
   }

   msg := messaging.Message{Body: data}
   queueName := messaging.QueueName(env.EnvServicePrefix(string(p.cascadeEventName)))

   if _, err := p.cascadePublisher.PublishMessageWithContext(ctx, queueName, msg); err != nil {
      return errors.Wrap(err, "publishing to cascade queue")
   }

   return nil
}

As part of the Kafka refactoring, we integrated Prometheus for monitoring. The Go client for Prometheus automatically provides default metrics, but we also added custom metrics and alerts specifically for the messaging queues.
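As an example of such a custom metric (the metric name and label are illustrative, not adjoe’s actual instrumentation), a counter for failed publishes per topic with client_golang:

import (
   "github.com/prometheus/client_golang/prometheus"
   "github.com/prometheus/client_golang/prometheus/promauto"
)

// counts events that could not be published, labeled by topic
var publishFailures = promauto.NewCounterVec(prometheus.CounterOpts{
   Name: "kafka_publish_failures_total",
   Help: "Number of events that failed to publish to Kafka.",
}, []string{"topic"})

// called from the publisher's error path:
// publishFailures.WithLabelValues(topicName).Inc()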

The Key to Event-Driven Success 

If there’s one key lesson we’ve learned from running an event-driven architecture in production, it’s this: prioritize observability from the very start. Debugging incidents in an event-based system without the ability to correlate events is a huge challenge. All three pillars of observability are essential: logging helps you understand why your system is in a certain state, metrics show you how long it’s been there, and tracing reveals what’s affected by that state.
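One inexpensive way to make events correlatable – a sketch, not necessarily how adjoe does it – is to wrap every payload in an envelope carrying a correlation ID that handlers copy onto any follow-up events they publish:

import (
   "encoding/json"
   "time"
)

// Envelope wraps every event payload with tracing metadata.
type Envelope struct {
   CorrelationID string          `json:"correlation_id"` // stable across the whole event chain
   EmittedAt     time.Time       `json:"emitted_at"`
   Payload       json.RawMessage `json:"payload"`
}

Logs, metrics, and traces can then all be joined on the same ID.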

In short, while event-driven architecture provides unparalleled scalability and flexibility, it comes with its own set of challenges. By focusing on observability, avoiding common pitfalls like the “Big Ball of Mud” (unstructured code from poorly defined events), and selecting the right tools and infrastructure, we’ve built a resilient, self-healing system that handles massive traffic loads. The road hasn’t always been smooth, but the rewards are clear: we now have a backend that’s not only adaptable but also capable of scaling with our growing business.
