adjoe Engineers’ Blog

The Feedback Loop: Drive App Growth with User Complaints 

As data analysts, how do you handle user feedback and turn it into meaningful insights? 

Reports show that 96% of unhappy customers don’t file a complaint, assuming the time and effort won’t lead to quality support. Most users churn without saying a word, leaving hidden issues buried in unstructured data.

With 500,000 daily app installs, adjoe deals with an ocean of users in the in-app advertising ecosystem. To provide a flawless user experience, we prioritize user engagement and customer satisfaction.

On average, we receive complaints from 1-2% of the app users. Despite this small amount of feedback, we want to ensure that users’ pain points are addressed and shared with our partners & app publishers on a macro scale.

This article shows how to offer the best customer experience by reviewing user feedback data, whether filed by users directly or shared by the app publisher, and taking the necessary actions to resolve the issues raised.

We’ll cover the steps taken by our data analytics team, looking at a code showcase where we dive into the data challenges and solutions. We hope these processes will be useful for your product or service. 

Our Approach: App User Complaint Data Analysis

Step 1: Data collection

Here, we set up an automated process that imports the data from our publishers regularly. At this stage, we do not do any pre-processing of the imported data; we just ensure that we have the data we need. I share some details about the architecture behind this import automation later in the article.

Step 2: Data cleaning and pre-processing

This is the step where we parse the data, load the JSON, and normalize our text.

According to a data integration survey, respondents spent 50% to 90% of their time just on the Extract, Transform, and Load (ETL) process. 

That means you lose valuable time cleaning and preparing data rather than analyzing it. Unstructured and inconsistent data (e.g., user-generated text, JSON blobs) makes this process slow, error-prone, and hard to scale.

In the next section, we’ll tackle this and show how to automate the data parsing and normalization process using Python.
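As a tiny illustration of what this step does (the field names here are invented for the example), pandas can parse raw JSON blobs and normalize the text in a few lines:

```python
import json

import pandas as pd

# Hypothetical raw ticket export: one JSON string per row
raw_rows = [
    '{"ticket_id": 1, "user": {"id": "ABC", "locale": "de"}, "text": "  Reward MISSING  "}',
    '{"ticket_id": 2, "user": {"id": "XYZ", "locale": "fr"}, "text": "App crashed"}',
]

# Parse each JSON string and flatten nested fields into columns
df = pd.json_normalize([json.loads(r) for r in raw_rows])

# Basic text normalization: trim whitespace and lowercase
df["text"] = df["text"].str.strip().str.lower()

print(df[["ticket_id", "user.id", "text"]])
```

The real pipeline handles far messier input, but the pattern stays the same: parse, flatten, normalize.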

Step 3: Gather relevant behavioral metrics

In addition to the complaint text submitted by the user, we also look at behavioral metrics, product feedback, feature requests, and so on.

Surprise! This is also automated. The Airflow DAG behind the import automation mentioned above takes care of this as well. Once we have the information about the user and the app, we interact with our data lake using AWS Athena. 

Some of the important KPIs we collect include app and campaign information and the number of days since the user's postback. In a later step, we classify each complaint into a "reason" category.

Doing this helps us understand the core reason why the complaint was raised. 
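Our actual classification logic is more involved, but as a rough sketch (the thresholds, column names, and categories below are purely illustrative), a rule-based pass over these KPIs could look like:

```python
import pandas as pd

def classify_reason(row):
    # Illustrative rules only; the real categories and thresholds differ
    if row["fraud_flag"]:
        return "possible_fraud"
    if row["days_since_postback"] > 30:
        return "stale_postback"
    if row["milestones_achieved"] == 0:
        return "postback_issue"
    return "event_setup_issue"

tickets = pd.DataFrame({
    "fraud_flag": [False, False, True],
    "days_since_postback": [2, 45, 1],
    "milestones_achieved": [0, 3, 5],
})

# Attach a reason category to every complaint
tickets["reason"] = tickets.apply(classify_reason, axis=1)
print(tickets["reason"].tolist())
```

Encoding the reasons as a deterministic function of the behavioral KPIs is what makes the later aggregation and alerting reproducible.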

Step 4: Analyzing user feedback 

“Data and analytics manage data to support all its uses, to improve decisions, business processes, and outcomes.” (Gartner)

Keeping this classic definition in mind, we begin looking at the data that we have collected.

Our primary goal? Improving user experience. We do this by looking at trends and noticing when the number of complaints for an app increases. Once we identify such an increase, we start digging deeper into why it could be happening. 

Surprise! We have automated this classification of complaints. 

Combine that with powerful visualizations, and what you get is an easy-to-navigate yet insightful dashboard that helps us pinpoint the issue.

👉 Read more tips and tricks on creating data visualizations in one of our other blog posts here.

Step 5: Raising alarms

Now that we have identified the root cause of our problems, we relay this information to the responsible team, specifically the account managers in case of integration problems or the tech teams in case of technical issues. 

With the approach outlined, let’s get into the details. 

Collecting & Parsing User Feedback Data 

One of the biggest challenges for a data analyst is data preparation: in our case, collecting and structuring user feedback. 

Combine that with the fact that we are talking about data collected from human interaction and not computer-generated facts and figures, and the result can look like this: 

[Image: a fully processed complaint request]

This image shows a fully processed request. Below, we talk about how to parse the files, map the information, and collect more data. Before diving in, it's worth noting that it's not just about gathering and processing everything we want; we also need to keep the performance of the system in mind.

To optimize the process, we have a logging system in place. It records every complaint that we have already processed.

While importing a file, we check these logs to see whether the tickets being read into memory have already been processed. Doing an incremental load only is important because it leads to faster processing, cheaper queries, and less load on the server. 
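A minimal sketch of this incremental filter (the ticket IDs and the shape of the log are simplified for illustration):

```python
import pandas as pd

def filter_unprocessed(df, processed_log):
    # processed_log: the set of ticket IDs already handled in earlier runs
    return df[~df["ticketid"].isin(processed_log)]

# Hypothetical log of previously processed tickets
processed_log = {"t-001", "t-002"}
incoming = pd.DataFrame({"ticketid": ["t-001", "t-003", "t-004"]})

# Only the unseen tickets move on to parsing, keeping queries cheap
new_tickets = filter_unprocessed(incoming, processed_log)
print(new_tickets["ticketid"].tolist())
```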

Surprise! This is not the only automation in place. Everything we discussed is fully automated: once we receive the files from our publishers every week, we trigger a job that automatically reads all the files, processes them, and updates the final result. This is done via Airflow DAGs interacting with Athena and S3 buckets.

In the end, we have an Amazon QuickSight dashboard. Here's a snapshot of what the summary page looks like: 

[Image: summary page of the QuickSight dashboard]

How to Parse User Conversations with the Publisher's Chatbot

When the user starts to file a complaint, they have an interaction with an automated chatbot. 

Our publishers share the entire conversation with us. Our job is to parse this conversation and extract the information that is important for the analysis.

The challenge? Each publisher is sharing an entirely different format with us. Some need just the app name matching, while others need us to parse an entire conversation.

Our system dynamically reads the files, adapts to whatever steps need to be taken, and processes each file independently.
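One way to implement such format-dependent processing, sketched here with hypothetical column names and parsers, is a small dispatcher that inspects each file and routes it to the right parsing function:

```python
import pandas as pd

def parse_simple_export(df):
    # Format A: the file already carries an app name column
    return df.rename(columns={"Game Name": "appname"})

def parse_chat_export(df):
    # Format B: the app name must be extracted from a conversation blob
    df = df.copy()
    df["appname"] = df["Message"].str.extract(r"Enter Game Name:\s*(\w+)", expand=False)
    return df

def process_file(df):
    # Route each file to a parser based on the columns it contains
    if "Game Name" in df.columns:
        return parse_simple_export(df)
    if "Message" in df.columns:
        return parse_chat_export(df)
    raise ValueError("Unknown file format")

chat_file = pd.DataFrame({"Message": ["Enter Game Name: CoinQuest"]})
print(process_file(chat_file)["appname"].tolist())
```

The benefit of this pattern is that supporting a new publisher format means adding one parser function, not touching the rest of the pipeline.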

import json

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Define mappings for multilingual versions of field prompts
column_groups = {
    'appname': [
        'Enter Game Name', 'Which game?', 'What is the game name?',
        'Entrez le nom du jeu', 'Welches Spiel?', 'Geben Sie den Spielnamen ein',
        'Introducir el nombre del juego', 'Voer de spelnaam in', '¿Que juego?',
        'Quel jeu ?', 'Welk spel?', '¿Cómo se llama el juego?',
        'Wie lautet der Name des Spiels?', 'Quel est le nom du jeu ?',
        'Qual é o nome do jogo?', 'Hoe heet het spel?', 'Qual è il nome del gioco?'
    ],
    'milestone_claimed': [
        'What milestone have you achieved?', 'Quelle étape avez-vous franchie ?',
        'Welchen Meilenstein haben Sie erreicht?', '¿Qué hito has alcanzado?',
        'Welke mijlpaal heb je bereikt?', '¿Qué hito has alcanzado?',
        'Quelle étape avez-vous franchie ?', 'Welke mijlpaal heb je bereikt?',
        'Quelle étape avez-vous franchie ?', 'Que marco você alcançou?'
    ]
}

# The logic below runs inside our pipeline; it is wrapped in a function
# here (the name is illustrative) so the trailing `return df` is valid.
def parse_complaint_export(df):
    # Identify the column containing messages (name should contain 'Message')
    main_col = [col for col in df.columns if 'Message' in col]
    if len(main_col) != 1:
        raise ValueError("Unsupported column layout: expected exactly one Message column.")

    # Function to parse message JSON from message string
    def parse_messages(x):
        if pd.isna(x):
            # Return default structure if message is missing
            return {
                'ID': None,
                'BODY': '<br>Enter Game Name<br><strong>Unknown App</strong><br><br>What milestone have you achieved?<br><strong>Unknown</strong>',
                'CREATED_AT': None
            }
        try:
            # Truncate at last 'helpshift' entry to get valid JSON and parse
            json_str = x[:x.rfind('helpshift') + 9] + '"}]'
            return json.loads(json_str)[0]
        except Exception:
            # Fallback to raw message if parsing fails
            return {
                'ID': None,
                'BODY': x,
                'CREATED_AT': None
            }

    # Apply message parser to the message column and normalize JSON structure
    df_expanded = df[main_col[0]].apply(parse_messages).apply(pd.json_normalize)
    df_expanded = pd.concat(df_expanded.tolist(), axis=0).set_index(df.index)

    # Drop original message column and merge expanded data
    if any('BODY' in col for col in df.columns):
        df_expanded = pd.concat([df.drop(columns='BODY'), df_expanded], axis=1)
    elif any('Message' in col for col in df.columns):
        df_expanded = pd.concat([df.drop(columns='Message'), df_expanded], axis=1)

    # Check for possible message columns (BODY variants)
    body_cols = [f'BODY ({i})' for i in range(1, 21)] + ['BODY']

    # Flatten list of multilingual keys
    all_multilingual_keys = set(sum(column_groups.values(), []))

    # Function to find the most relevant body text containing known multilingual prompts
    def find_relevant_body_text(row):
        for col in body_cols:
            if col in row and pd.notna(row[col]):
                text = row[col].lower()
                if any(key.lower() in text for key in all_multilingual_keys):
                    return row[col] 
        return np.nan

    # Add column with the most relevant parsed body from message
    df_expanded['relevant_body'] = df_expanded.apply(find_relevant_body_text, axis=1)

    # Function to parse HTML-like body content and extract (prompt, value) pairs
    def parse_body(html):
        if pd.isna(html):
            return {}
        soup = BeautifulSoup(html, "html.parser")
        pairs = []
        header = ''
        for tag in soup.contents:
            if tag.name is None:
                # It's a text node (prompt)
                header = tag.strip()
            elif tag.name == "strong":
                # It's a value associated with last header
                value = tag.get_text(strip=True)
                pairs.append((header, value))
        return dict(pairs)

    # Expand parsed body dictionary into separate columns
    df_expanded_body = df_expanded['relevant_body'].apply(parse_body).apply(pd.Series)

    # Combine expanded data with parsed fields
    df = pd.concat([df_expanded, df_expanded_body], axis=1)

    # Combine multilingual columns into unified English-named columns
    for english_col, multilingual_cols in column_groups.items():
        if english_col not in df.columns:
            df[english_col] = np.nan
        for col in multilingual_cols:
            if col in df.columns:
                # Fill missing values in English column with available translations
                df[english_col] = df[english_col].combine_first(df[col])

    # Rename standardized columns
    df.rename(columns={
        'Player_ID': 'userid', 
        'Player ID': 'userid', 
        'CREATED_AT': 'complaint_at',
        'Created At': 'complaint_at',
        'Date': 'complaint_at',
        'ID': 'ticketid',
        'BODY': 'complaint_text'
    }, inplace=True)

    # Deduplicate duplicate columns by keeping the first non-empty one
    duplicate_cols = df.columns[df.columns.duplicated()].unique()
    for col in duplicate_cols:
        cols_with_name = df.loc[:, df.columns == col]
        good_col = None

        # Pick first column that is not entirely NaN/None
        for i in range(cols_with_name.shape[1]):
            col_data = cols_with_name.iloc[:, i]
            if not col_data.isna().all() and not all(val is None for val in col_data):
                good_col = col_data
                break
        if good_col is None:
            good_col = cols_with_name.iloc[:, 0]

        # Drop all duplicates and reassign good column
        df = df.drop(columns=col)
        df[col] = good_col

    # Add empty columns for platform and last_milestone (if needed for schema)
    df['platform'] = np.nan
    df['last_milestone'] = np.nan

    # Function to robustly convert date-like values into proper date objects
    def convert_to_date(val):
        try:
            dt = pd.to_datetime(val, format='mixed', errors='coerce')
            if pd.isna(dt) and isinstance(val, (int, float)):
                # Fall back to epoch timestamps, inferring the unit from magnitude
                val = float(val)
                if val > 1e18:
                    dt = pd.to_datetime(val, unit='ns', errors='coerce')
                elif val > 1e14:
                    dt = pd.to_datetime(val, unit='ms', errors='coerce')
                elif val > 1e10:
                    dt = pd.to_datetime(val, unit='us', errors='coerce')
                else:
                    dt = pd.to_datetime(val, unit='s', errors='coerce')
            return dt.date() if pd.notna(dt) else pd.NaT
        except Exception:
            return pd.NaT

    # Convert complaint date values to proper date objects, element-wise
    df['complaint_at'] = df['complaint_at'].apply(convert_to_date)

    # Standardize user IDs (lowercase)
    df['userid'] = df['userid'].str.lower()

    # Reorder and select final columns
    df = df[['ticketid', 'userid', 'complaint_at', 'appname', 'platform',
             'last_milestone', 'milestone_claimed', 'complaint_text', 'relevant_body']]

    # Return final cleaned DataFrame
    return df

As a result, we get a clean dataframe with only the relevant information needed to move forward with the analysis.

Once we have this, as a next step, we need to map the app names to the corresponding app IDs.

A Multi-level Approach: Mapping App Names to App IDs 

The data we receive from some of our publishers includes the app name regarding which the complaint was made. This value is entered by the user, which increases the possibility of human errors. 

Some examples of the app names entered by the user for the app “Coin Master”: “coinmaster”, “conmaster”, “coinmasters”, “coin and master”, “master coin”, among others.

We need to match these app names with the corresponding app ID in our data lake to be able to process them further.

We implemented a four-step matching process for this:

  1. Pre-processing

As a rule of thumb, we first normalize all the app names by removing all punctuation and special characters. 

We remove trailing and leading whitespace, convert the names to lowercase, and do the same to all the app names that we pull from our data lake. 

import re

def normalize(name):
    return re.sub(r'[^\w\s]', '', str(name)).lower().strip().replace(' ', '')

  2. Full match: at first, we try to match the names as-is, in case the user entered the correct name.
  3. App title match using the normalized app title on another platform: here, we attempt to match the IDs based on the bundle ID of the app. The bundle ID is consistent across platforms, so even if the app title differs on one platform, making it difficult for us to map, we can use the bundle ID – app title mapping from the other platform to get the relevant information.
  4. Fuzzy match

We do this in two sequential steps:

Step 1:

We attempt to match the names using fuzzy matching, also known as approximate string matching.

We first filter our dataset for the platform in question and then use the process.extractBests() function to do the matching. 

What's interesting here is that we keep the score_cutoff dynamic: shorter names require a stricter match (a higher cutoff).

import numpy as np
from thefuzz import process

def find_matches(app_name, platform):
    possible_titles = df[df['platform'] == platform]
    matches = process.extractBests(
        app_name,
        possible_titles['re_app_title'],
        score_cutoff=max(60, 80 - len(app_name)),
        limit=1
    )
    if matches:
        matched_title = matches[0][0]
        matched_row = possible_titles[possible_titles['re_app_title'] == matched_title]
        return matched_row['appid'].values[0] if not matched_row.empty else np.nan
    return np.nan

Setting limit=1 ensures that the function returns only the best match.

Step 2: 

If, after step 1, some apps remain unmapped, we become a little more lenient and use a different scorer to map them: partial_ratio, which is part of the thefuzz library (the maintained fork of FuzzyWuzzy).

from thefuzz import fuzz, process

def fuzzy_match_names_partial_ratio(df1, df2, threshold=100):
    matched_ids = []

    for _, row in df1.iterrows():
        appname = row['re_appname']
        platform = row['platform']

        df2_filtered = df2[df2['platform'] == platform]

        if df2_filtered.empty:
            matched_ids.append(None)
            continue

        # Use partial_ratio as the scorer here
        match = process.extractOne(
            query=appname,
            choices=df2_filtered['re_app_title'],
            scorer=fuzz.partial_ratio
        )

        if match and match[1] >= threshold:
            matched_name = match[0]
            matched_id = df2_filtered[df2_filtered['re_app_title'] == matched_name]['appid'].values[0]
            matched_ids.append(matched_id)
        else:
            matched_ids.append(None)

    # Add matched_id column to df1
    df1['appid'] = matched_ids
    return df1

Once we have the data parsed & processed, resulting in a single data set with all the relevant information, we proceed to collect further information needed for analysis.

We then interact with our data lake to extract various KPIs per user. 

For our use case, the most important ones are: 

  • User’s last achieved milestone
  • The list of events that the user achieved and what the user was rewarded for
  • Number of postbacks per app, and any fraud-related information
  • Multiple boolean flags, e.g. whether postbacks were delayed for the app or whether an event achieved by the user was received after the event-sharing window, among others. 

The result is a visual dashboard that gives us trends such as:

  • Number of daily complaints by the date of complaints, with the date of postback
  • Number of user complaints by the last milestone achieved
  • Complaint rate by the reason for the user complaint
  • The complaint rate by the number of days since postback, etc. 
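As an example of building one of these trend views (the numbers and bucket edges below are made up), complaints can be bucketed by days since postback with pandas:

```python
import pandas as pd

# Hypothetical complaint records with a precomputed days_since_postback KPI
complaints = pd.DataFrame({
    "ticketid": ["t1", "t2", "t3", "t4", "t5"],
    "days_since_postback": [0, 1, 1, 7, 30],
})

# Bucket the delay and count complaints per bucket
buckets = pd.cut(
    complaints["days_since_postback"],
    bins=[-1, 1, 7, 90],
    labels=["0-1 days", "2-7 days", "8+ days"],
)
summary = complaints.groupby(buckets, observed=True).size()
print(summary)
```

A bar chart over this summary is essentially what one panel of the dashboard shows.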

Configure Automated Alerts for Rising App User Complaints

After classifying the complaints into different categories depending on what the issue seems to be, whether integration problems or possible fraud, for example, we add automated comments on an app level.

These comments are, in a nutshell, the “run-rate” of complaints per app. 

This run-rate is calculated by looking at the difference between the number of complaints per app and per reason (as classified earlier) in the last 7 days vs. the last 30 days.

We have two alarm rules set up. The first one looks at four factors:

  • the run-rate mentioned above is >10%
  • the number of complaints in the run-rate difference is >=5
  • the number of complaints per 1,000 postbacks is >1
  • the complaining users have been achieving in-game milestones

If all four conditions hold, we say that this app has an alarming increase in the number of complaints and that the app-event setup needs to be looked at.

The second type of alarm is only slightly different. The first three rules apply here as well, but if the complaining users do not seem to have achieved any in-game milestones, we infer that this points to a postback issue, and hence the tech integration is what needs to be assessed. 
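To make the two rules concrete, here is a sketch in pandas. The column names are invented, and the run-rate is interpreted here as the 7-day share of the 30-day complaint count; the exact internal formula may differ:

```python
import pandas as pd

def evaluate_alarms(stats):
    # stats: one row per (app, reason) with hypothetical precomputed columns
    s = stats.copy()
    # Run-rate: 7-day complaint count relative to the 30-day count
    s["run_rate"] = s["complaints_7d"] / s["complaints_30d"]
    base = (
        (s["run_rate"] > 0.10)
        & (s["complaints_7d"] >= 5)
        & (s["complaints_per_1k_postbacks"] > 1)
    )
    # Rule 1: users do reach milestones -> review the app-event setup
    s["alarm_event_setup"] = base & s["users_achieve_milestones"]
    # Rule 2: users do not reach milestones -> points to a postback issue
    s["alarm_postback"] = base & ~s["users_achieve_milestones"]
    return s

stats = pd.DataFrame({
    "app": ["A", "B"],
    "complaints_7d": [8, 2],
    "complaints_30d": [20, 40],
    "complaints_per_1k_postbacks": [2.5, 0.4],
    "users_achieve_milestones": [True, False],
})
result = evaluate_alarms(stats)
print(result[["app", "alarm_event_setup", "alarm_postback"]])
```

Keeping the two alarms as boolean columns makes it trivial to route each flagged app to the right team downstream.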

Read here: How adjoe uses an in-house anomaly detection system to monitor and flag some of our other core KPIs. 

Looking into the Future

With over 200 advertiser partnerships and counting, managing user feedback is a critical part of the adtech business. 

To improve the quality of our user journey, we aim to add more modules to our data analysis. As a next step, we plan to collect more information about the health of the app, in addition to the user interaction with our system. 

Beyond analysis, we also want to be more proactive in raising alarms. We aim to make our automated alarms more sophisticated, covering more metrics to identify and report further possible bugs in the system.

Stay tuned for our data analytics journey on the engineer’s blog, where we cover interesting adtech solutions.
