adjoe Engineers’ Blog

How to: Cut Observability Costs by 80% Without Losing Visibility

Observability helps maintain a healthy distributed system, but as you scale, it can quickly become one of your highest engineering expenses. At adjoe, we always consider the financial impact of our architecture without sacrificing reliability.

In our programmatic advertising platform, adjoe Ads, we are responsible for deciding in real time which ad will be shown to a user. 

At the current scale, adjoe receives more than 150 million ad requests daily. To make the best recommendation, our backend architecture relies heavily on machine learning models and a highly distributed microservices setup.

However, this scale comes with an enormous data volume.
Our systems produce around 5 billion Kafka events and 8 billion logs daily, along with 1 million active metric series. At one point, we watched our logging infrastructure costs increase by 6x in a single week.

In this article, I’ll cover how our team completely overhauled the observability stack across four pillars – logs, metrics, traces, and profiling – to regain deep technical insight and cut our observability bills drastically.

The Setup Before the Switch

Initially, we relied heavily on managed services and standard enterprise solutions to get off the ground quickly. But as traffic surged, so did our cloud costs.

We had to take immediate, quick-win actions, such as reducing our log retention from 30 days down to 7 days and demoting non-critical logs to debug levels.

While this stopped the bleeding, having limited retention severely impacted our ability to investigate issues. We needed a sustainable, long-term solution.

Pillar 1: Logs 


We run our microservices on an AWS Elastic Kubernetes Service (EKS) cluster. To gather the logs, we use Fluent Bit as our lightweight log processor, utilizing the tail plugin to read directly from the /var/log/containers/* paths on our EC2 nodes.

Initially, Fluent Bit routed everything to Elasticsearch. While Elasticsearch is incredibly fast for querying, indexing 8 billion structured logs a day is resource-heavy and expensive.


We experimented with Grafana Loki, which achieved significant storage savings by using S3, but developers missed the fast querying capabilities of Elasticsearch.

Eventually, we discovered OpenObserve, an open-source alternative that stores logs as Parquet files in S3.

We implemented a hybrid approach. 

  • Filtering at the Edge: We use grep filters and custom Lua scripts within Fluent Bit to drop unnecessary logs and retag specific streams before they hit the buffering stage.

  • Cheap Long-Term Storage: The bulk of our logs, including debug logs, are routed via Fluent Bit’s HTTP output plugin to OpenObserve.
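As an illustration, a trimmed-down Fluent Bit configuration following this pattern might look like the sketch below. The paths, filter patterns, Lua script name, and the OpenObserve endpoint are placeholders, not our production values:

```ini
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*

[FILTER]
    # Drop noisy lines (e.g. health checks) before they hit the buffer
    Name              grep
    Match             kube.*
    Exclude           log healthcheck

[FILTER]
    # Custom retagging logic for specific streams
    Name              lua
    Match             kube.*
    script            retag.lua
    call              retag

[OUTPUT]
    # Ship everything that survives filtering to OpenObserve over HTTP
    Name              http
    Match             kube.*
    Host              openobserve.internal.example
    Port              5080
    URI               /api/default/default/_json
    Format            json
```

The key point is that filtering happens before buffering and shipping, so dropped logs never generate network or storage costs.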

🏆 The Win

We brought our log retention back up to 60 days and achieved a 60% cost reduction overall. Better yet, because OpenObserve uses Parquet files, the same format as our data lake, we can now seamlessly join system logs with business events.

Pillar 2: Metrics


Since we use a managed Kubernetes Cluster (EKS) on AWS, starting with CloudWatch for our metrics was the natural choice. It integrates natively and gives you base metrics for free. However, CloudWatch operates on a push-based model where you pay for every PutMetricData API call and every custom dashboard/alert.

To avoid the high costs, we migrated our custom application metrics to a self-hosted Prometheus setup. Prometheus uses a pull-based approach (scraping designated endpoints periodically), which eliminates per-request API billing.

Migrating to the new metric backend was straightforward. Because our framework uses an interface to abstract the writing logic, we simply needed to implement a compatible Prometheus writer.

type Writer interface {
    GetPriority() int
    Write(ctx context.Context, batch Data)
    WriteOne(ctx context.Context, data *Datum)
}

🏆 The Win 

Moving to Prometheus resulted in a 40% reduction in our custom metric costs. Furthermore, it made us vendor-agnostic and allowed us to easily monitor open-source deployments like our TensorFlow Serving instances.

Pillar 3: Traces


In a distributed environment, knowing that an error occurred isn’t enough; you need to know where it originated. Tracing helps us visualize service dependencies and identify bottlenecks (e.g., waiting on an external S3 write vs. a slow database query).

We had previously used AWS X-Ray, which was supported by our framework, but knowing the expected traffic load for our new bidding product, we opted for Grafana Tempo.

Tempo is open-source, supports OpenTelemetry standards, and stores its trace data cheaply in S3.
The only thing we had to do was implement a compatible tracer for the framework we use. For implementation details, you can view the tracer_otel.go of the gosoline framework created by our sister company.
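Under the OpenTelemetry standard, trace context travels between services in the W3C traceparent header. As a minimal illustration of that format (not our actual tracer), here is a sketch of how such a header value is constructed in Go:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newTraceParent builds a W3C traceparent header value of the form
// "version-traceid-parentid-flags", as used by OpenTelemetry propagators.
func newTraceParent(sampled bool) (string, error) {
	traceID := make([]byte, 16) // 128-bit trace ID shared across services
	spanID := make([]byte, 8)   // 64-bit ID of the current span

	if _, err := rand.Read(traceID); err != nil {
		return "", err
	}
	if _, err := rand.Read(spanID); err != nil {
		return "", err
	}

	// The flags byte signals the sampling decision downstream.
	flags := "00"
	if sampled {
		flags = "01"
	}

	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(traceID),
		hex.EncodeToString(spanID),
		flags), nil
}
```

In practice, the OpenTelemetry SDK generates and propagates these headers for you; the sketch only shows what rides on the wire between services and ends up in Tempo.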

🏆 The Win

We estimate an 80% cost saving by using Tempo backed by S3, compared to what we would have spent scaling AWS X-Ray.

Pillar 4: The Missing Link – Continuous Profiling


Logs, metrics, and traces are fantastic, but they don’t always give you the full picture. For instance, detecting the root cause of a memory leak or a CPU spike from an inefficient function is incredibly difficult with traditional tools.

To solve this, we introduced the fourth pillar of observability: Continuous Profiling using Grafana Pyroscope.

Unlike traditional profiling tools that require manual triggering during a load test, continuous profiling is always on in production. It operates with low overhead and gives us code-level insights via flame graphs. 

We deployed Pyroscope using the Grafana Pyroscope Helm chart. Sending profiles from your Go application is a straightforward process using the Pyroscope Go client.

func Run(ctx context.Context) error {
    // Opt in to mutex and block profiling; the Go runtime
    // disables both by default due to their overhead.
    runtime.SetMutexProfileFraction(5)
    runtime.SetBlockProfileRate(5)

    p, err := pyroscope.Start(pyroscope.Config{
        ApplicationName: "bidder",
        ServerAddress:   "http://pyroscope-distributor.pyroscope.svc.cluster.local.:4040",
        Tags: map[string]string{
            "hostname": os.Getenv("HOSTNAME"),
            "env":      os.Getenv("ENV"),
        },
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
        },
    })
    if err != nil {
        return err
    }

    // Block until the service shuts down, then flush any
    // buffered profiles before stopping the profiler.
    <-ctx.Done()
    p.Flush(false)

    return p.Stop()
}

🏆 The Win

Profiling directly helps us reduce our overall infrastructure costs. We can proactively find and fix CPU-heavy or memory-leaking code snippets and optimize our applications to use fewer compute resources. 

Additionally, having code-level insights greatly reduces our investigation times, especially for the tricky ones that are hard to reproduce.

Cost-Effective Observability: Things to Consider


Observability is the key to driving efficiency in modern IT systems. But as we scale to process billions of events and logs daily, it quickly becomes a delicate trade-off between costs and benefits. We cannot afford to lose visibility into our algorithms, which means we must constantly adapt our architecture as our traffic grows.

We try to take advantage of cost-effective infrastructure, like AWS spot instances and S3-backed storage. 

If you doubt that a service can run on spot instances or alternative open-source backends, you can always experiment and evaluate your ideas.

Do not settle for your initial, easy-to-start setup. Re-evaluate your tech solutions to match your current scale, design an architecture that can withstand unexpected disruptions, and always look for ways to unlock better visibility at lower cost.

We’re continuously exploring new ways to reduce spend. Check out the adjoe engineers’ blog for more insights!
