adjoe Engineers’ Blog

How Does a Deep and Cross Network Enable Us to Replace Over 100 Individual Models?

Welcome to part two of our article on Deep and Cross Networks.

In this section, I’ll share how we developed a single model that replaced over 100 individual models—enhancing accuracy, speed, and coverage while significantly reducing maintenance costs.

Our Model's Architecture

At adjoe, we process millions of predictions daily across 100+ countries. Managing separate models fine-tuned for different data slices eventually became unsustainable due to growing complexity. We needed a scalable, generalized solution—and that’s where deep and cross networks came into the picture.

Let’s look at how our user engagement prediction system previously worked in detail before jumping into our new approach.

We used to have over 100 models based on gradient-boosted trees, each responsible for a specific slice of the data. This was necessary because each slice has its own unique properties. For example, each ad targets a different audience, and with traditional model architectures, no single model could effectively capture the differences between these audiences. We chose tree-based models for their speed and efficiency.
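To make the old setup concrete, here is a minimal sketch of per-slice routing. All names are illustrative (this is not adjoe's actual code), and a lambda stands in for a trained gradient-boosted model:

```python
# Hypothetical sketch of the old setup: one gradient-boosted model per
# data slice (e.g. per country or per ad audience).
class SliceRouter:
    """Routes each prediction request to the model trained for its slice."""
    def __init__(self):
        self.models = {}  # slice key -> trained model

    def register(self, slice_key, model):
        self.models[slice_key] = model

    def predict(self, slice_key, features):
        model = self.models.get(slice_key)
        if model is None:
            # New slice: no model yet, so no prediction until enough
            # data has been collected -- the coverage gap described below.
            return None
        return model(features)

router = SliceRouter()
router.register(("US", "ad_123"), lambda feats: 0.04)  # stand-in for a GBT model
print(router.predict(("US", "ad_123"), {"ads_seen": 3}))  # 0.04
print(router.predict(("DE", "ad_999"), {"ads_seen": 1}))  # None: unseen slice
```

The `None` branch is the crux: every new slice starts with zero coverage until its own model can be trained.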

However, one challenge with this approach was that we needed sufficient data from each slice to train a model effectively. If a new data slice appeared, we couldn't make any predictions until enough data had been gathered; at the same time, we couldn't create a single model to handle everything either.

Looking at Users and App Install Data

Let's analyze one example of our model's input features.

In this case, we aimed to estimate the value on the Y-axis, representing the probability of installation (as a percentage). The X-axis shows the number of ads a user has been exposed to. As we can see, the likelihood of installation decreases as the number of ads viewed increases. This is intuitive: excessive ad exposure typically leads to diminishing interest, with users becoming less engaged and less likely to install the advertised app.

The blue line represents users who have never installed an app from an ad, while the red line indicates users who have previously installed one app by clicking on an ad. As shown, users with prior installs have nearly double the likelihood of installation at every point on the graph.

Furthermore, the likelihood of installation increases with each additional app installed. The data reveals a clear multiplicative pattern: after two installs, the likelihood doubles, and after three, it doubles once more. This is exactly the core issue discussed in my previous article (predicting x²).

We can see that previous installs have a multiplicative effect on the installation likelihood. An interesting pattern emerges: we would expect users with three previous installs to have the highest chance of installing among all the groups shown. However, even at their peak, their measured likelihood is still lower than that of users with two previous installs.

We would expect the purple line to follow the dotted purple trajectory, but what the model takes from the data is that the green line has the highest value.

This discrepancy arises because we don’t have enough data for users who have installed three apps. It’s clear that you can’t install three apps after only seeing two ads, even if you install every app shown. The data just doesn’t align. Additionally, it’s uncommon for a user to see three ads and install all of them. As a result, we need more time to collect enough data for this scenario.

If we had more data for users with three installs, that line would likely sit higher on the graph. With limited information, traditional models, including plain neural networks, fall back on simplifying assumptions, such as treating two installs as more valuable than three. This is a common pitfall of standard models. In contrast, deep and cross networks, with their automatic feature crossing, are designed to avoid exactly this mistake.
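The mechanism behind that automatic feature crossing can be sketched in a few lines. This is a simplified DCNv2-style cross layer (shapes and values are illustrative, not our production code): x_{l+1} = x0 * (W @ xl + b) + xl. The elementwise product with x0 is what builds explicit crosses such as "previous installs × ads seen", so multiplicative patterns don't have to be inferred from sparse corners of the data:

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    # DCNv2-style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl
    return x0 * (W @ xl + b) + xl

x0 = np.array([1.0, 2.0, 3.0])
W = np.eye(3)          # identity weights make the math easy to check
b = np.zeros(3)

x1 = cross_layer(x0, x0, W, b)   # x0 * x0 + x0: degree-2 terms appear
print(x1)  # [ 2.  6. 12.]
```

With identity weights the output is x0² + x0, showing how each stacked layer raises the polynomial degree of the interactions the model can represent directly.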

Training Our Deep-and-Cross-Based Model

Now equipped with the power of deep and cross networks, we were no longer forced to maintain many different models, so we chose to train a single deep-and-cross-based model using all the data we had. By adding the slice identifier as a feature to the model input, we created a model that could make predictions across all possible scenarios.
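A hypothetical sketch of that idea (not adjoe's actual pipeline, and the feature names are invented): the slice keys that used to select one of 100+ models become ordinary categorical input features, so a single model is trained on all slices at once:

```python
import zlib

def to_features(country, ad_id, ads_seen, previous_installs):
    return {
        # categorical slice identifiers, hashed into fixed buckets
        "country_bucket": zlib.crc32(country.encode()) % 100,
        "ad_bucket": zlib.crc32(ad_id.encode()) % 1000,
        # behavioral features like the ones discussed above
        "ads_seen": ads_seen,
        "previous_installs": previous_installs,
    }

rows = [
    # (country, ad_id, ads_seen, previous_installs, installed?)
    ("US", "ad_123", 3, 1, 1),
    ("DE", "ad_123", 7, 0, 0),
    ("BR", "ad_456", 2, 2, 1),
]
X = [to_features(c, a, s, p) for c, a, s, p, _ in rows]
y = [label for *_, label in rows]
print(len(X), y)  # 3 [1, 0, 1]
```

Because the slice identity is now just another feature, the cross layers can learn interactions between a slice and the behavioral features, which is what the 100+ separate models used to capture implicitly.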

What benefits did this switch bring us?

  • This model considers all of the available data, leading to improved accuracy: we saw gains both in aggregated metrics and in the metrics for each individual slice.
  • Having a single model to maintain significantly reduced costs, both in infrastructure and in the time developers spent on maintenance.
  • With just one model, we could batch all our ads and requests together and process them in a single operation. This batching led to faster inference times.
  • Previously, when we onboarded a new app advertiser, we couldn't start making predictions right away due to the lack of data. Because the new model was trained on all available data and designed to handle a wide range of scenarios, we could start using it for any new case immediately. Its versatility allowed us to significantly increase coverage without any delays.
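The batching win can be sketched as follows. This is illustrative only (a logistic stand-in replaces the trained network): with a single model, requests for many different ads can be stacked into one matrix and scored in one vectorized call, instead of being routed to 100+ separate models:

```python
import numpy as np

def predict_batch(model_fn, feature_rows):
    X = np.stack(feature_rows)   # (n_requests, n_features)
    return model_fn(X)           # one forward pass for the whole batch

# stand-in for the trained model: a logistic score over summed features
fake_model = lambda X: 1.0 / (1.0 + np.exp(-X.sum(axis=1)))

scores = predict_batch(fake_model, [np.zeros(4), np.ones(4)])
print(scores.shape)  # (2,)
```

On GPUs or vectorized CPU code, one large forward pass is far cheaper than many small ones, which is where the faster inference time comes from.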

Final Thoughts: What to Consider When Using DCNs

When you want to use DCNs, there are some key points to keep in mind.

  1. Don’t overdo it. While adding layers to deep and cross networks might seem tempting, more isn’t always better. The vanishing gradient problem, common in neural networks, is even more pronounced here. Based on both my experience and the literature, keeping the depth to three or four layers tends to give the best results. Adding more layers can actually hurt performance.
  2. These models are fast at making predictions, but the training process can be much slower. Expect potential delays during training.
  3. There are variations of DCN worth considering. What we’ve covered is based on DCNv2, but for better results, check out GDCN (gated deep and cross networks). A new version of the DCN paper (v3) has also been released, which I haven’t covered here—be sure to check it out.

If you are working with structured data, and especially if you suspect that your models could learn more from it, deep and cross networks might be your solution.

Sources:

  1. Ruoxi Wang, Gang Fu, Bin Fu, and Mingliang Wang.
    Deep & Cross Network for Ad Click Predictions.
    Stanford University and Google Inc.
  2. Fangye Wang, Tun Lu, Hansu Gu, Dongsheng Li, Peng Zhang, and Ning Gu.
    Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction.
    Fudan University, Microsoft Research Asia, and Independent (Seattle).
  3. Honghao Li, Hanwei Li, Yiwen Zhang, Lei Sang, Yi Zhang, and Jieming Zhu.
    DCNv3: Towards Next Generation Deep Cross Network for Click-Through Rate Prediction.
    Anhui University and Huawei Noah’s Ark Lab.
