
Behind the Scenes: How We Built an Anomaly Detection System

Identifying the Need for an Anomaly Detection System 

Managing data from 500+ global app publishers comes with challenges—one of the biggest is detecting anomalies quickly and accurately. 

With exploding data volumes and granularity, monitoring even the smallest data slices for anomalies is critical before issues impact performance.

To solve this, adjoe has developed a powerful anomaly detection algorithm using machine learning and statistical analysis. It identifies and reports data irregularities in real time, alerting our teams before small issues turn into major bottlenecks.

With every distributed campaign, the user experiences a view, may then click on the advertised app, and ultimately may install it. The health of this funnel, i.e., metrics like the view-to-click and click-to-install rates, helps us measure the performance of adjoe’s services, indicating how well ads engage users and drive app installations.

Our monitoring system detects anomalies in these conversion metrics across slices of data. A “slice” refers to a specific segment, defined by dimensions such as advertiser app, marketing campaign, SDK version, country, and advertiser.
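To make this concrete, here is a minimal sketch of how such per-slice conversion rates can be derived. The column names and numbers are purely illustrative, not our actual schema:

    import pandas as pd

    # Hypothetical event counts per row; column names are illustrative only.
    events = pd.DataFrame({
        'appid':    ['app_1', 'app_1', 'app_2'],
        'country':  ['US', 'US', 'DE'],
        'views':    [1000, 800, 500],
        'clicks':   [50, 32, 10],
        'installs': [5, 4, 1],
    })

    # Aggregate per slice and derive the funnel KPIs.
    slices = events.groupby(['appid', 'country'], as_index=False)[['views', 'clicks', 'installs']].sum()
    slices['view_to_click'] = slices['clicks'] / slices['views']
    slices['click_to_install'] = slices['installs'] / slices['clicks']
    print(slices)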

Let’s explore how this system helps us spot unusual data patterns and improve accuracy. 

Defining the Approach for Detecting Anomalies 

First, we need to define tumbling windows of our KPIs and take the latest window. We’ll call it the “recent distribution” and the rest the “old distributions”.

Each “recent” value will represent a certain percentile of the corresponding old distribution.

The extreme values in the old distribution, i.e. the top and bottom X%, can be considered extraordinary.

The task is now to define this X for each range of recent windows. If a recent value falls into that extreme X% of the old distribution for the given slice, we consider it an outlier. Once we flag and identify outliers, we report them to the responsible team via simple webhooks.
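As a minimal sketch of the idea (the window size, the percentile threshold, and the function name are illustrative, not our production settings):

    import numpy as np
    import pandas as pd

    def flag_recent_window(kpi_series: pd.Series, window: str = '1D', x_percent: float = 5.0):
        """Split a KPI time series (DatetimeIndex) into tumbling windows, treat the
        latest window as the 'recent distribution' and the rest as the 'old
        distribution', and flag the recent value if it lands in the extreme X% tails."""
        windows = kpi_series.resample(window).mean().dropna()  # one value per tumbling window
        old, recent = windows.iloc[:-1], windows.iloc[-1]

        lower = np.percentile(old, x_percent)         # bottom X% of the old distribution
        upper = np.percentile(old, 100 - x_percent)   # top X% of the old distribution
        is_outlier = recent < lower or recent > upper
        return recent, (lower, upper), is_outlier

If the recent value for a slice lands outside those bounds, it is flagged and forwarded to the alerting step described later.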

Overcoming the Challenges

One of the key challenges during this exercise was ensuring that each data slice contained enough samples, with the old and new data sets representing the same distribution. 

To achieve this, we implemented a loop that generated additional randomly sampled data for each slice. During each iteration, we used the Kolmogorov–Smirnov (KS) test to assess whether the two data sets came from the same distribution.

Illustration of the Kolmogorov–Smirnov statistic: the red line is a model CDF, the blue line is an empirical CDF, and the black arrow is the KS statistic. (Image source: Wikipedia)

Intuitively, the KS test provides a way to quantitatively answer the question: “How likely is it that we would see two sets of samples like this if they were drawn from the same (but unknown) probability distribution?”

The p-value from the KS test helps us determine if there is a significant difference between the two data distributions. A small p-value indicates that the two groups likely come from different populations, meaning their medians, variability, or overall distribution shapes may differ. Conversely, a high p-value suggests that the two groups are likely from the same distribution.

Our goal was to ensure that at least 95% of the slices showed a high p-value, indicating that the old and new data were sufficiently similar. If the p-value for any slice was low, we ran additional bootstrapping iterations to increase the sample size until we reached this threshold. This is also how we decided how much bootstrapping is “enough” for each slice.
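Per slice, that loop looks roughly like the sketch below. The function and parameter names are ours for illustration; in production, the stopping criterion is the 95%-of-slices threshold described above:

    import numpy as np
    from scipy.stats import ks_2samp

    def grow_until_similar(old_values, recent_values, alpha=0.05, max_rounds=50):
        """Bootstrap additional samples for a slice until the KS test can no longer
        distinguish the recent data from the old data (p-value above alpha)."""
        recent = np.asarray(recent_values, dtype=float)
        for _ in range(max_rounds):
            p_value = ks_2samp(old_values, recent).pvalue
            if p_value > alpha:  # high p-value: cannot reject "same distribution"
                return recent, p_value
            # Add another round of randomly resampled points and test again.
            extra = np.random.choice(recent, size=len(recent), replace=True)
            recent = np.concatenate([recent, extra])
        return recent, p_value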

Based on the insights from this research phase, we proceeded with collecting the data and defined and stored the “recent” and “old” datasets for further data analysis. 

Next comes the issue of identifying the X% of our data that we consider abnormal, given the highly varying behaviour of our data. We need to account for the different statistical behaviour of each KPI in every slice, keep the noise low, and make sure we are alerted at the right time.

So, we use a combination of two different statistical methods:

  1. Z-test: The test calculates a Z-score, which indicates how many standard deviations the sample mean is from the population mean. If this Z-score falls beyond a certain threshold (determined by the chosen significance level), the difference is considered statistically significant.
  2. Interquartile range: The interquartile range (IQR) is a measure of statistical dispersion. It represents the range within which the central 50% of data values fall and is calculated by subtracting the first quartile (Q1) from the third quartile (Q3).
Boxplot (with an interquartile range) and a probability density function of a normal N(0, σ²) population. (Image source: Wikipedia)

Comparison of the 2 methods

Feature | Z-Test | IQR
--- | --- | ---
Assumption of normality | Assumes a normal distribution | Does not assume normality; works with skewed data
Sensitivity to outliers | Sensitive to outliers (they distort the mean and std dev) | Robust to outliers (focuses on quartiles)
Data distribution | Best for normally distributed data | Works well with non-normal/skewed data
Precision of anomaly measure | Provides a precise measure (Z-score) | Only indicates whether a data point is an outlier (binary)
Ease of implementation | Requires calculation of mean and std dev | Simple calculation of quartiles
Sample size requirement | Better with large sample sizes | Works with small sample sizes
Threshold flexibility | Uses a standard significance level (e.g., 0.05) | Fixed 1.5 × IQR threshold; may need adjustment based on data sensitivity
Heavy-tailed data | Struggles with heavy-tailed data | Handles heavy-tailed distributions well

If the value of the KPI for the recent window fails either of these two tests, we say that it is an anomalous value. The KPI fails the tests if the Z-score is greater than 2 or less than -2, or if the value falls outside the range [Q1 - 2 × IQR, Q3 + 2 × IQR].
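On a toy sample, the two checks boil down to a few lines (using the same thresholds, 2 standard deviations and 2 × IQR; the numbers are made up for illustration):

    import numpy as np

    old_values = np.array([0.042, 0.051, 0.047, 0.049, 0.044, 0.046])  # e.g. old click-to-install rates
    recent_value = 0.021                                               # value from the recent window

    # Z-score check: how many standard deviations is the recent value from the old mean?
    z_score = (recent_value - old_values.mean()) / old_values.std()
    passes_z_test = abs(z_score) <= 2

    # IQR check: does the recent value fall inside [Q1 - 2*IQR, Q3 + 2*IQR]?
    q1, q3 = np.percentile(old_values, [25, 75])
    iqr = q3 - q1
    passes_iqr_test = (q1 - 2 * iqr) <= recent_value <= (q3 + 2 * iqr)

    print(passes_z_test, passes_iqr_test)  # anomalous if either check fails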

Why Use Both?

  1. Coverage of Different Data Types:
    • The Z-score works well for normally distributed data, so it helps identify outliers in symmetric distributions.
    • The IQR works better when data is skewed or has non-normal distributions, as it doesn’t rely on the assumption of normality.
  2. Increasing Accuracy with Dual Testing:
    • Using both tests adds an extra layer of validation. If a value passes one test but fails the other, that alone can indicate something interesting about the value, and combining the two reduces the likelihood of false positives and false negatives.
  3. Robustness:
    • In some cases, data might have a mix of normal and non-normal distributions. The Z-score might flag some points that the IQR would not, and vice versa. By applying both, you can be more confident in identifying true outliers.

Here’s how we achieve the aforementioned 2-test solution: 

    import numpy as np
    import pandas as pd

    # Assumed to be defined elsewhere: the KPI column under test and the number of
    # bootstrap iterations, e.g. kpi = 'click_to_install_rate', n_iterations = 1000.

    def bootstrap_sample(data, n_iterations):
        """Resample `data` with replacement and return the mean of each resample."""
        bootstrapped = []
        for _ in range(n_iterations):
            sample = np.random.choice(data, size=len(data), replace=True)
            bootstrapped.append(np.mean(sample))
        return bootstrapped


    def bootstrap_kpi(df1, df2, n_iterations):
        """Compare each recent-window KPI value (df1) against the bootstrapped
        distribution of the old-window values (df2) for the same slice."""
        results = []

        for _, row in df1.iterrows():
            # Restrict the old data to the same slice (country, SDK, app, partner).
            dimensions_filter = (df2['country'] == row['country']) & (df2['sdkhash'] == row['sdkhash']) & \
                                (df2['appid'] == row['appid']) & (df2['partner_name'] == row['partner_name'])

            kpi_values = df2[dimensions_filter][kpi].values
            original_kpi_value = row[kpi]

            bootstrapped_means = bootstrap_sample(kpi_values, n_iterations)
            mean_kpi = np.mean(bootstrapped_means)
            bootstrap_std = np.std(bootstrapped_means)

            # Z-score check: is the recent value within 2 std devs of the bootstrapped mean?
            z_score = (original_kpi_value - mean_kpi) / bootstrap_std
            within_z_score_range = np.abs(z_score) <= 2

            # IQR check: is the recent value within [Q1 - 2*IQR, Q3 + 2*IQR] of the bootstrapped means?
            Q1 = np.percentile(bootstrapped_means, 25)
            Q3 = np.percentile(bootstrapped_means, 75)
            IQR = Q3 - Q1
            iqr_lower_bound = Q1 - 2 * IQR
            iqr_upper_bound = Q3 + 2 * IQR
            within_iqr_range = iqr_lower_bound < original_kpi_value < iqr_upper_bound

            results.append({
                'date': row['date'],
                'country': row['country'],
                'sdkhash': row['sdkhash'],
                'appid': row['appid'],
                'partner_name': row['partner_name'],
                'actual_value': original_kpi_value,
                'bootstrapped_mean': mean_kpi,
                'within_z_score_range': within_z_score_range,
                'within_iqr_range': within_iqr_range,
            })

        return pd.DataFrame(results)


    result_df = bootstrap_kpi(recent_data, old_data, n_iterations)

Taking Actions: Automating Alerts for Anomaly Detection

Once the system has flagged the respective data slices as displaying anomalous behaviour, we make sure that the responsible person, whether the dev team or the account manager, is alerted to the issues.
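For instance, here is a minimal sketch of how the flagged rows in result_df could be collected into alert messages (the message format is simplified for illustration):

    # Keep only the slices that failed at least one of the two checks.
    anomalies = result_df[~result_df['within_z_score_range'] | ~result_df['within_iqr_range']]

    # One human-readable line per anomalous slice.
    messages = [
        f"{row['date']} | {row['appid']} | {row['country']} | {row['partner_name']}: "
        f"actual {row['actual_value']:.4f} vs. bootstrapped mean {row['bootstrapped_mean']:.4f}"
        for _, row in anomalies.iterrows()
    ]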

The alert itself is sent using a simple webhook:

    import requests

    # `messages` is the list of alert strings built from the flagged slices;
    # `webhook_url` is the Google Chat webhook URL configured for the team.
    if messages:
        message_body = {"text": "\n".join(messages)}
        response = requests.post(webhook_url, json=message_body)

        if response.status_code == 200:
            print(f"Alert sent successfully to Google Chat! {message_body}")
        else:
            print(f"Failed to send message. Status code: {response.status_code}")

Looking at the Future  

At adjoe, our backend processes thousands of views every minute, which makes anomaly detection essential for maintaining healthy systems at scale. Our approach is designed for anyone looking to strengthen their detection systems, without needing deep adtech expertise.

The anomaly detection process teaches you how to adapt to the diverse behavior of each data slice for every KPI. The alarms work as intended at the moment, but as the old saying goes, “You’ll never reach perfection because there’s always room for improvement. Yet, along the way to perfection, you’ll get better.”  

We aim to keep on monitoring the results of this project and improve upon them to ensure the stability of our business ecosystem. 

In the future, we’re going one step further and designing an anomaly detection algorithm that is triggered every few minutes. That system is intended to handle outlier detection for our core KPIs, such as views, clicks, and postbacks.

We expect to receive alerts as soon as something goes wrong and we see any of these KPIs deviate significantly from their usual levels. As we continue refining our methods, our goal remains the same: staying ahead of anomalies to keep the performance strong and predictable. 

Now that we’ve covered how to analyze data patterns, let’s make it fun to share with your team! Check out our expert post on data visualization tools for AdTech.

References: 

  1. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
  2. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
  3. https://www.graphpad.com/guides/prism/latest/statistics/interpreting_results_kolmogorov-smirnov_test.htm 
  4. https://www.digitalocean.com/community/tutorials/bootstrap-sampling-in-python
  5. https://www.geeksforgeeks.org/z-test/
