adjoe Engineers’ Blog

Topic Modeling App Genres with NLP & Latent Dirichlet Allocation

I’m Björn-Elmar Macek and I’ve been working as a data scientist at adjoe for seven years. 

Alongside data engineering and carrying out analyses, my main focus is on creating models that help us better understand our users. One of our projects involved topic modeling with NLP – specifically, Latent Dirichlet Allocation. I’ll demonstrate how we did this by:

  • giving you a brief introduction to our adtech product
  • explaining why we need certain app usage information from our users
  • taking you through the topic modeling process – from preparation to our findings

What Is Playtime and Its App Usage Data?

Let’s start with an introduction to how our Playtime product works. It’s a rewarded ad unit that mobile users choose to engage with. It serves (mostly) gaming ads that we consider to be of interest to these users. 

Since we reward users for their engagement (based on time spent in the app or levels reached), our users need to accept permissions to benefit from our rewarding mechanism. These permissions allow us to gain deeper insights into their likes and dislikes, which can in turn help us to serve them ads for games they would enjoy.

Why Is This App Usage Information Important?

Think of app usage information as a list of IDs that are used by Google to identify apps. This list is enriched by some additional information such as the timestamp identifying when an app was last used. WhatsApp messenger, for example, has the ID com.whatsapp; YouTube’s ID is com.google.android.youtube. 

These app lists work as a kind of fingerprint: they can help you understand your users’ preferences. To do this, it’s important for us to be able to group a user’s installed applications into categories – such as app genre. You can scrape this and other information, such as an app’s description, from the app store using libraries like Google-Play-Scraper. But even though the app store provides this information, its genres often contain a wide range of quite different apps.

Take the lifestyle genre, for instance. This category contains the following three entries:

  • Kasa Smart allows you to configure and control your smart home devices
  • Pinterest allows users to post, browse, and pin images to their own boards
  • H&M is a shopping app for clothes and accessories

To improve how we categorize apps, we decided to use app descriptions from the Google Play Store for more granular classification. This is when we decided to use natural language processing with Python.

Preparing Data before Topic Modeling

Before our team started topic modeling, we first had to clean the data. We wanted to focus on English, so we filtered out all rows of our scraped data whose app descriptions were written in another language. We used polyglot for language detection and applied it to the description column (descr).

import pandas as pd
import regex
from polyglot.detect import Detector

# remove control and surrogate characters that trip up the detector
def remove_bad_chars(text):
    return regex.compile(r"\p{Cc}|\p{Cs}").sub("", text)

# detect the (most likely) language of a text
def detectLang(x):
    languages = Detector(remove_bad_chars(x), quiet=True).languages
    if len(languages) == 0:
        return "__"
    max_conf = max(lan.confidence for lan in languages)
    return [lan.code for lan in languages if lan.confidence == max_conf][0]

data = pd.read_json("playstore_raw.json", lines=True)
data["lang"] = data.descr.apply(detectLang)
# .copy() so later column assignments don't trigger SettingWithCopyWarning
data_en = data[data.lang == "en"].copy()

At this point, we had plain English descriptions and had eliminated special characters, such as smiley faces. Moving forward, we decided to keep only nouns, proper nouns, and verbs – and to get rid of all other word types, such as adjectives, adverbs, auxiliary words, and numbers.

Keeping other word types might, of course, have been beneficial (depending on the use case), but we decided to go with these. The next step involved using the pre-trained language model en_core_web_sm in the spaCy library.

import spacy


def getWordType(words, allowed, nlp):
    doc = nlp(words)
    # keep only tokens whose part-of-speech tag is in `allowed`
    res = [token.text for token in doc if token.pos_ in allowed]
    return ' '.join(res)

nlp = spacy.load("en_core_web_sm")

data_en["text_nouns_propns_verbs"] = data_en.descr.apply(
    lambda x: getWordType(str(x), ["NOUN", "PROPN", "VERB"], nlp)
)

At this stage, we were nearly done – we just had to complete a final stemming step. This normalized different forms of the same word: “play” and “plays,” for example. Although the two strings are not identical, they reference the same word. We used the NLTK library’s snowball stemmer to do this.

import nltk


def stemWords(x, sno):
    # stem each whitespace-separated token and re-join
    return ' '.join(sno.stem(i) for i in x.split())

sno = nltk.stem.SnowballStemmer('english')
data_en["text_nouns_propns_verbs_stemmed"] = data_en.text_nouns_propns_verbs.apply(
    lambda x: stemWords(str(x), sno)
)
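As a quick sanity check of the stemming step, you can run the snowball stemmer on a few inflected forms and watch different surface forms collapse to a single stem:

```python
import nltk

sno = nltk.stem.SnowballStemmer('english')
print([sno.stem(w) for w in ["play", "plays", "playing", "games"]])
# → ['play', 'play', 'play', 'game']
```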

The data was then ready to use for training.

Topic Modeling with LDA

There is a wide range of algorithms you can employ. The approaches we consider here expect a set of documents (also known as a “corpus”) as input.

In our example, each app description (descr-column) is considered a document. A document is interpreted in its bag-of-words representation: Just think of it as a simple vector, in which every word is represented in exactly one dimension, and the value in that dimension is equal to the number of times this word occurs in the respective document. 

These bag-of-words vectors are naturally very sparse: since only a limited number of words appear in any one sentence or document, the array contains many zeros. Depending on how the vectors are used, this can be an unpleasant property – one that we automatically overcome when applying topic modeling.

Nonnegative matrix factorization (NMF) interprets the corpus as a matrix, in which each document is contained in a row with its bag-of-words. This matrix is decomposed into two matrices, which provide

  • insights into the topics/genres that were assigned to a document/app (“MAPPING APP to GENRE” in the figure below).
  • an indication of the extent to which a word suggests that an app has a genre (“MAPPING WORD to GENRE”).

The diagram below illustrates the process.

diagram of nonnegative matrix factorization
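As a toy illustration of this decomposition (a sketch with made-up counts, not the code we ran on real data), scikit-learn’s NMF can factor a small document-term matrix into the two mappings shown in the diagram:

```python
import numpy as np
from sklearn.decomposition import NMF

# toy document-term matrix: 4 "apps" x 5 "words" (hypothetical counts)
X = np.array([
    [3, 1, 0, 0, 1],
    [2, 2, 0, 1, 0],
    [0, 0, 4, 2, 0],
    [0, 1, 3, 3, 1],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # MAPPING APP to GENRE  (4 apps x 2 genres)
H = model.components_        # MAPPING WORD to GENRE (2 genres x 5 words)

print(W.shape, H.shape)      # (4, 2) (2, 5)
# W @ H approximately reconstructs X, with all entries nonnegative
```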

Latent Dirichlet Allocation (LDA) achieves something very similar to what you can see in the diagram above. Its basic assumption is that each document consists of a set of topics (in our case a topic corresponds to a genre), while a topic is actually considered an assignment of probabilities to words. When given a number of final topics (often denoted as “k”) and a corpus, it will estimate “the probability” of each word being associated with a certain topic in a way that makes the existence of the documents in the corpus most likely.

In our example here, we used Latent Dirichlet Allocation to try to identify subgenres within the adventure game genre, relying on the Gensim implementation.

from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pandas as pd

def takeSecond(elem):
    return elem[1]

GENRE = "GAME_ADVENTURE"
data = data_en[data_en.genreId == GENRE].copy()

texts = data.text_nouns_propns_verbs_stemmed.apply(lambda x: x.split(" "))
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=6, passes=50)

# infer each app's topic distribution once, sorted by probability (descending),
# then derive the most likely topic and its probability from that
data["clusters"] = data.text_nouns_propns_verbs_stemmed.apply(
    lambda x: sorted(lda[common_dictionary.doc2bow(x.split(" "))], key=takeSecond, reverse=True)
)
data["cluster"] = data.clusters.apply(lambda x: x[0][0])
data["cluster_prob"] = data.clusters.apply(lambda x: x[0][1])

What Happened Next?

It’s not easy to give you quick insights into the quality of the results, so instead I’ll show you some screenshots of the apps that were grouped together.

cluster of games under subcategory adventure games via topic modeling
cluster of racing games and hidden object and riddle games under topic modeling

One of the resulting subgenres hardly contained any relevant apps, and the racing game subgenre also contained a few outliers, such as a rollercoaster game, which did not belong there. I should also mention that several apps were assigned to multiple subgenres (with comparably low probabilities of belonging to each). In the end, we only considered apps that clearly belonged to one subgenre.

For Google Play Store’s lifestyle category mentioned toward the beginning of this article, we could identify the following subgenres:

  • Spirituality
  • Wallpaper & Themes (for smartphones)
  • DIY and inspirational apps (knitting, crafting, interior and garden design)
  • Fashion, hairstyle, tattoos, make-up, and self-care
  • Horoscope
  • Hinduism and Islam

We found overall that Latent Dirichlet Allocation did a nice job of creating a topic model for apps based on their descriptions. We were able to refine the Google Play Store categories and gain a deeper understanding of what a user is interested in.

What’s Next in Topic Modeling?

Going forward, the team still has things to do. We need to further improve the coverage of apps for which we can provide a proper subgenre. This also means we need to detect clusters that we know exist but have not yet been identified. 

Our next step is to keep different kinds of words in the description and try out other topic modeling techniques like LDA2Vec, which combines Word2Vec’s strength in understanding plain text with LDA’s effective topic modeling.

Senior Data Scientist (f/m/d)

  • adjoe
  • Playtime Data Science
  • Full-time
adjoe is a leading mobile ad platform developing cutting-edge advertising and monetization solutions that take its app partners’ business to the next level. Part of the applike group ecosystem, adjoe is home to an advanced tech stack, powerful financial backing from Bertelsmann, and a highly motivated workforce to be reckoned with.

Meet Your Team: Playtime
Playtime is a time- and event-based ad unit that continuously rewards users with in-app currency – for the time they spend and events completed while playing mobile games. We connect advertisers to 200+ million Playtime users and serve 2bn requests per day at low latency. We ensure that all parties involved have a positive experience. Advertisers get more users for their apps. Monetizers earn revenue for users on their platforms. Users play fun games while simultaneously getting rewarded. Our data science team powers the engine that distributes our ads. They solve multiple tasks such as developing algorithms to provide the most relevant ads for users, predicting user interests and inclinations, and dynamically adjusting pricing based on these predictions. Because the user base is very diverse, we use deep learning models that we have shown serve the best ads to users.

Within the Playtime team you will be responsible for services and models that we use to provide automated solutions for our advertisers.
What you will do:
  • Build, maintain & develop new and existing models (classifications, recommendations, etc.).
  • Dive into state-of-the-art algorithms and deep learning models to create recommendation systems, predict user behavior, optimize user retention, advertiser’s ROAS (return on advertising spend) and other dynamic values.
  • Drill into data from various sources to generate insights and discuss them with your colleagues.
  • Act as an advocate for data-related topics in the company and become the go to person in your area of expertise.
Who you are:
  • You have 5+ years of professional experience in the Data Science field.
  • You have a strong knowledge of Python, R, Scala, Julia or similar typical programming languages for Data Science.
  • You have experience drilling into large amounts of data coming from various sources – including AWS Athena, Kafka, Spark, Flink, S3, MySQL.
  • You have experience developing deep learning models and have already applied them in a production environment with large amounts of traffic (>1 million predictions per day).
  • You are able to dive deep into mathematical foundations and explain complex topics in a simple way.
  • You are a strong team player and enjoy helping others.
  • Plus: Tech & Infrastructure knowledge: Airflow, hosting models, deploying models, etc. Plus: Experience in the AdTech industry.

  • Fuel for the Journey: Benefits to Support Your Ambitions
  • Invest in Your Future: Regular feedback and our development program support your growth, helping you expand your skill set and achieve your career goals.
  • Easy Arrival to adjoe: From signing to settling in Hamburg, we’ve got you covered. Need a visa? No problem. Ready to build your new life and career at (company name) in Hamburg? We support every ambition—from learning German to a relocation bonus that helps you settle in and make Hamburg feel like home.
  • Live Your Best Life, at Work and Beyond: We work in a hybrid setup with 3 core office days, plus flexible working hours. Enjoy 30 vacation days, 3 weeks of remote work per year, and free access to an in-house gym with lots of different fitness classes and mental health support through our Employee Assistance Program (EAP).
  • Thrive Where You Work: Enjoy the Alster lake view from our central office with top-notch equipment, fun open spaces, and a large variety of snacks and drinks.
  • Join the Community! Participate in regular team and company events, including hackathons and social gatherings. We work together, and we celebrate together, too.