adjoe Engineers’ Blog

Topic Modeling App Genres with NLP & Latent Dirichlet Allocation

I’m Björn-Elmar Macek and I’ve been working as a data scientist at adjoe for seven years. 

Next to data engineering and carrying out analyses, my main focus is on creating models that help us better understand our users. One of our projects involved topic modeling with NLP – specifically, Latent Dirichlet Allocation. I’ll demonstrate how we did this by:

  • giving you a brief introduction to our adtech product
  • explaining why we need certain app usage information from our users
  • taking you through the topic modeling process – from preparation to our findings

What Is Playtime and Its App Usage Data?

Let’s start with an introduction to how our Playtime product works. It’s a rewarded ad unit that mobile users choose to engage with. It serves (mostly) gaming ads that we consider to be of interest to these users. 

Since we reward users for their engagement (based on time spent in the app or levels reached), our users need to accept permissions to benefit from our rewarding mechanism. These permissions allow us to gain deeper insights into their likes and dislikes, which can in turn help us to serve them ads for games they would enjoy.

Why Is This App Usage Information Important?

Think of app usage information as a list of IDs that are used by Google to identify apps. This list is enriched by some additional information such as the timestamp identifying when an app was last used. WhatsApp messenger, for example, has the ID com.whatsapp; YouTube’s ID is com.google.android.youtube. 
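As an illustration, such an enriched app list might look like the following in Python – a hypothetical sketch of the data shape, not our actual schema:

```python
from datetime import datetime, timezone

# Hypothetical illustration: Google app IDs enriched with a
# last-used timestamp (made-up values, not our real schema).
app_usage = {
    "com.whatsapp": datetime(2024, 5, 1, 18, 30, tzinfo=timezone.utc),
    "com.google.android.youtube": datetime(2024, 5, 2, 9, 15, tzinfo=timezone.utc),
}

# Sort app IDs so the most recently used app comes first.
recently_used = sorted(app_usage, key=app_usage.get, reverse=True)
print(recently_used[0])  # com.google.android.youtube
```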

These app lists work as a kind of fingerprint; they can help you understand your users’ preferences. To do this, it’s important for us to be able to group a user’s existing applications into categories – such as app genre. You can scrape this and other information, such as the description, from the app store using libraries like Google-Play-Scraper. But even though the app store provides this information, genres often contain a wide range of quite different apps. 

Take the lifestyle genre, for instance. This category contains the following three entries:

  • Kasa Smart allows you to configure and control your smart home devices
  • Pinterest allows users to post, browse, and pin images to their own boards
  • H&M is a shopping app for clothes and accessories

To categorize apps more accurately, we decided to use app descriptions from the Google Play Store for more granular classification. This is where natural language processing with Python came in.

Preparing Data before Topic Modeling

Before our team started topic modeling, we first had to clean the data. We wanted to focus on English, so we filtered out all rows of our scraped data containing app descriptions written in other languages. We used polyglot for language detection and applied it to the description column (descr).

import pandas as pd
import regex
from polyglot.detect import Detector

# remove control and surrogate characters that trip up the detector
def remove_bad_chars(text):
    return regex.compile(r"\p{Cc}|\p{Cs}").sub("", text)

# detect the (most likely) language
def detectLang(x):
    languages = Detector(remove_bad_chars(x), quiet=True).languages
    if len(languages) == 0:
        return "__"
    max_conf = max([lan.confidence for lan in languages])
    lang = [lan.code for lan in languages if lan.confidence == max_conf][0]
    return lang

data = pd.read_json("playstore_raw.json", lines=True)
data["lang"] = data.descr.apply(lambda x: detectLang(x))
data_en = data[data.lang == "en"]

At this point, we had plain English descriptions and had eliminated special characters, such as smiley faces. Moving forward, we decided to keep only nouns, proper nouns, and verbs – and to get rid of all other word types, such as adjectives, adverbs, auxiliary words, numbers, etc. 

Keeping other word types might, of course, have been beneficial (depending on the use case), but we decided to go with these. The next step involved using the pre-trained language model en_core_web_sm in the spaCy library.

import spacy


def getWordType(words, allowed, nlp):
    res = []
    doc = nlp(words)
    for token in doc:
        if token.pos_ in allowed:
            res.append(token.text)
    return ' '.join(res)

nlp = spacy.load("en_core_web_sm")

data_en["text_nouns_propns_verbs"] = data_en.descr.apply(lambda x: getWordType(str(x), ["NOUN", "PROPN", "VERB"], nlp))

At this stage, we were nearly done – we just had to complete a final stemming step. This normalized different forms of the same word: “play” and “plays,” for example. Although both strings are not the same, they reference the same word. We used the NLTK library’s snowball stemmer to do this.

import nltk


def stemWordArray(x, sno):
    # stem each word (not each character), then rejoin
    return ' '.join([sno.stem(i) for i in x.split()])

sno = nltk.stem.SnowballStemmer('english')
data_en["text_nouns_propns_verbs_stemmed"] = data_en.text_nouns_propns_verbs.apply(lambda x: stemWordArray(str(x), sno))

The data was then ready to use for training.

Topic Modeling with LDA

There is a wide range of topic modeling algorithms to choose from. The approaches we consider here expect a set of documents (also known as a “corpus”) as input. 

In our example, each app description (descr-column) is considered a document. A document is interpreted in its bag-of-words representation: Just think of it as a simple vector, in which every word is represented in exactly one dimension, and the value in that dimension is equal to the number of times this word occurs in the respective document. 

These bag-of-words vectors are naturally very sparse: since only a limited number of words appear in any one sentence or document, the array contains many zeros. Depending on how these vectors are used, this can be an unpleasant property – one that we automatically overcome when applying topic modeling.
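To make the bag-of-words representation and its sparsity concrete, here is a minimal sketch in plain Python with two toy documents (illustrative only):

```python
# Two toy "documents" standing in for app descriptions.
docs = [
    "play adventure game",
    "control smart home devices",
]

# Vocabulary: every word gets exactly one dimension.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a count vector over the full vocabulary.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

# Most entries are zero: each document only uses a few of the vocabulary's words.
print(vocab)    # ['adventure', 'control', 'devices', 'game', 'home', 'play', 'smart']
print(vectors)  # [[1, 0, 0, 1, 0, 1, 0], [0, 1, 1, 0, 1, 0, 1]]
```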

Nonnegative matrix factorization interprets the corpus as a matrix, in which each document is contained in a row with its bag-of-words. This matrix is decomposed into two matrices, which provides

  • insights into the topics/genres that were assigned to a document/app (“MAPPING APP to GENRE” in the figure below).
  • the indication of the extent to which a word insinuates that an app has a genre (“MAPPING WORD to GENRE”).

The diagram below illustrates the process.

diagram of nonnegative matrix factorization
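The decomposition above can be sketched with scikit-learn’s NMF on a toy corpus matrix – this was not part of our pipeline, and the matrix values are made up purely for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy corpus matrix: 4 "documents" (rows) x 6 "words" (columns) of
# bag-of-words counts. Illustrative data, not real app descriptions.
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 2, 3, 0, 1],
    [0, 1, 3, 2, 0, 1],
], dtype=float)

# Decompose X ≈ W @ H into two nonnegative factors:
#   W: document -> topic weights ("MAPPING APP to GENRE")
#   H: topic -> word weights     ("MAPPING WORD to GENRE")
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)
H = model.components_

print(W.shape, H.shape)  # (4, 2) (2, 6)
```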

Latent Dirichlet Allocation (LDA) achieves something very similar to what you can see in the diagram above. Its basic assumption is that each document consists of a set of topics (in our case a topic corresponds to a genre), while a topic is actually considered an assignment of probabilities to words. When given a number of final topics (often denoted as “k”) and a corpus, it will estimate “the probability” of each word being associated with a certain topic in a way that makes the existence of the documents in the corpus most likely.
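LDA’s generative assumption can be sketched in a few lines of NumPy – the topics and mixture weights below are toy numbers, purely to illustrate the model; LDA itself solves the inverse problem of recovering such topics from an observed corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# A topic is a probability distribution over words (k = 2 toy topics).
vocab = ["race", "car", "puzzle", "riddle", "explore"]
topics = np.array([
    [0.45, 0.45, 0.02, 0.03, 0.05],  # a "racing" topic
    [0.05, 0.02, 0.45, 0.43, 0.05],  # a "riddle" topic
])

# A document is a mixture of topics (made-up weights).
doc_topic_mix = np.array([0.7, 0.3])

# Generate a document: for each word, first pick a topic from the
# document's mixture, then pick a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(2, p=doc_topic_mix)
    doc.append(vocab[rng.choice(len(vocab), p=topics[z])])
print(doc)
```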

In our example here, we use Latent Dirichlet Allocation to try to identify subgenres within the adventure game genre. We actually used the Gensim implementation here.

from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pandas as pd

def takeSecond(elem):
    return elem[1]

GENRE = "GAME_ADVENTURE"
data = data_en[data_en.genreId == GENRE]

texts = data.text_nouns_propns_verbs_stemmed.apply(lambda x: x.split(" "))
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=6, passes=50)

# infer each app's topic distribution once, sorted by descending probability
data["clusters"] = data.text_nouns_propns_verbs_stemmed.apply(lambda x: sorted(lda[common_dictionary.doc2bow(x.split(" "))], key=takeSecond, reverse=True))
# the most likely topic and its probability
data["cluster"] = data.clusters.apply(lambda x: x[0][0])
data["cluster_prob"] = data.clusters.apply(lambda x: x[0][1])

What Happened Next?

Since it’s not easy to convey the quality of the results quickly, here are some screenshots of the apps that were grouped together.

cluster of games under subcategory adventure games via topic modeling
cluster of racing games and hidden object and riddle games under topic modeling

One of the resulting subgenres hardly contained any relevant apps, and the racing game subgenre also contained a few outliers, such as a rollercoaster game, which did not belong there. I should mention that several apps were assigned to multiple genres (with the probability of belonging to each genre being comparably low). In the end, we only considered apps that could clearly be assigned to one subgenre.
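This “clear assignment” filter can be sketched with the cluster_prob column from the code above – the threshold value and app IDs here are hypothetical:

```python
import pandas as pd

# Hypothetical sketch of the final filtering step: keep only apps whose
# top topic probability is high enough to count as a clear assignment.
# The 0.6 threshold and the rows below are made up for illustration.
df = pd.DataFrame({
    "appId": ["app.a", "app.b", "app.c"],
    "cluster": [0, 1, 2],
    "cluster_prob": [0.85, 0.34, 0.72],
})

clear = df[df.cluster_prob >= 0.6]
print(list(clear.appId))  # ['app.a', 'app.c']
```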

For Google Play Store’s lifestyle category mentioned toward the beginning of this article, we could identify the following subgenres:

  • Spirituality
  • Wallpaper & Themes (for smartphones)
  • DIY and inspirational apps (knitting, crafting, interior and garden design)
  • Fashion, hairstyle, tattoos, make-up, and self-care
  • Horoscope
  • Hinduism and Islam

We found overall that Latent Dirichlet Allocation did a nice job of creating a topic model for apps based on their descriptions. We were able to refine the Google Play Store categories and gain a deeper understanding of what a user is interested in.

What’s Next in Topic Modeling?

Going forward, the team still has things to do. We need to further improve the coverage of apps for which we can provide a proper subgenre. This also means we need to detect clusters that we know exist but have not yet been identified. 

Our next step is to keep different kinds of words in the description and try out other topic modeling techniques like LDA2Vec, which combines Word2Vec’s strength in understanding plain text with LDA’s effective topic modeling.
