
Building Bridges: Collaborative Cloud Infrastructure for Data Science Teams

adjoe operates in over 150 countries, processing 600 TB of data to train models and handling 5 billion API requests daily.  Naturally, this demands robust cloud engineering for scalability and uptime, and advanced data science infrastructure for faster analytics.

That’s why adjoe’s Cloud Engineering team shared at the bit summit in Hamburg how we redefined the cloud infrastructure for our data science team.


We designed a fresh AWS-based environment to improve and speed up the machine learning development workflow for our BI & Data Science team.

In this article, we discuss the follow-up improvements we made to this environment and some lessons learned over the past couple of months.

Evolving Through Collaboration

After releasing this environment, we made sure there was clear communication between the data science and cloud engineering teams so that we could support the work of data scientists and analysts seamlessly. Throughout this journey, the feedback we received shaped the improvements described below.

Here’s what we did for follow-up improvements:

Development Environment for the JupyterHub Solution

Local Development Flow Diagram

Having the JupyterHub instances on Kubernetes helped our teams develop, train, and analyze their models and data quickly by providing a lot of flexibility.  

To improve this process, we implemented a way to connect to the remote instances from the local environment. This lets users work with their local IDEs and leverage all of their power.

It was as simple as generating an API token from the remote server and connecting the IDE to it using the server URL and the token.
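
As a quick sanity check before pointing the IDE at the server, a small script can confirm that the URL and token work. This is a minimal sketch: the server URL and token below are placeholders, and /api/status is the standard Jupyter Server REST API endpoint.

```python
# Minimal sketch: check that a remote Jupyter server accepts the generated API token.
# The URL and token are placeholders for your own JupyterHub user server.
import requests

SERVER_URL = "https://jupyterhub.example.internal/user/alice"  # hypothetical user server URL
API_TOKEN = "paste-the-token-generated-on-the-hub-here"        # hypothetical token

resp = requests.get(
    f"{SERVER_URL}/api/status",
    headers={"Authorization": f"token {API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # basic server info such as start time and number of kernels
```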

But we needed more than this: we only had a one-way sync from Git to the remote server, and the instances also contain static files and notebooks that users create or upload directly inside them.

We wanted to ensure that this sync would also happen from the local machine to the remote server.

For that, we wrote a script that syncs the files via kubectl’s cp command. It runs on every save of a notebook or script on the local machine, triggered by the Run on Save VSCode extension.

To keep the cp command lightweight, we use the templated variables provided by the extension to determine which file has changed and sync only that one, as sketched below.
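
Here is a minimal sketch of such a save hook, assuming a fixed namespace and pod name and a local checkout that mirrors the remote home directory; all names and paths are placeholders, and the Run on Save extension would pass the changed file path as the first argument (for example via its ${file} variable).

```python
# sync_on_save.py -- copy a single changed file into the remote JupyterHub pod.
# Namespace, pod name, and paths are hypothetical placeholders.
import pathlib
import subprocess
import sys

NAMESPACE = "jupyterhub"                                      # hypothetical namespace
POD = "jupyter-alice"                                         # hypothetical single-user server pod
LOCAL_ROOT = pathlib.Path.home() / "repos" / "ds-notebooks"   # hypothetical local checkout
REMOTE_ROOT = "/home/jovyan/work"                             # hypothetical remote working directory

changed_file = pathlib.Path(sys.argv[1]).resolve()
relative = changed_file.relative_to(LOCAL_ROOT)  # raises if the file is outside the checkout

# kubectl cp <local path> <namespace>/<pod>:<remote path>
subprocess.run(
    [
        "kubectl", "cp",
        str(changed_file),
        f"{NAMESPACE}/{POD}:{REMOTE_ROOT}/{relative.as_posix()}",
    ],
    check=True,
)
print(f"synced {relative}")
```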

We also added functionality to run terminal commands from the local machine on the remote machine, so users can run their commands directly from their IDE. For this feature, we combined VSCode tasks with kubectl.
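
A VSCode task can call a small helper like the following sketch, which simply forwards a command string to the pod via kubectl exec; the namespace and pod name are again hypothetical placeholders.

```python
# remote_exec.py -- run a shell command inside the remote JupyterHub pod.
# A VSCode task could invoke it as:  python remote_exec.py "pip list"
import subprocess
import sys

NAMESPACE = "jupyterhub"   # hypothetical namespace
POD = "jupyter-alice"      # hypothetical single-user server pod

command = sys.argv[1] if len(sys.argv) > 1 else "echo 'no command given'"

subprocess.run(
    ["kubectl", "exec", "-n", NAMESPACE, POD, "--", "sh", "-c", command],
    check=True,
)
```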

Development Procedure for the Staging Environment of Airflow

Old vs New Setup for Testing DAGs on Airflow Staging

As discussed in our talk, we provided a way to sync different branches from the Airflow DAGs repository on GitLab to the staging and production instances of Airflow. This works perfectly for the production environment, as we merge to the main branch in a separate merge request once the DAGs are fully tested and approved.

But having the develop branch always synced to the staging instance made the users’ testing process a little slow: they would have to merge every small change into the develop branch and create yet another merge request from a different branch whenever a new change was needed.

We wanted to make this process faster and more efficient. 

So we implemented a solution that would work by creating a branch out of the main branch (which always contains the latest code) and testing the DAG changes directly from the feature branch. 

But how? The initial idea was to somehow sync the DAGs from the newly created branch directly to the Staging Airflow. 

For this, we wrote a custom sidecar container for the Airflow scheduler and a custom CI job for the merge request push pipelines. The CI job pushes the DAGs from the branch to an S3 bucket, using the branch name as the folder name, and the sidecar container periodically fetches the changes.
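
The CI-side push could look roughly like the following sketch, which uploads every DAG file under a branch-named prefix; the bucket name and folder layout are placeholders, and CI_COMMIT_REF_SLUG is the sanitized branch name GitLab CI exposes to jobs.

```python
# push_dags_to_s3.py -- upload the repository's DAGs under a branch-named S3 prefix.
# Bucket name and folder layout are hypothetical placeholders.
import os
import pathlib

import boto3

BUCKET = "airflow-staging-dags"            # hypothetical bucket
branch = os.environ["CI_COMMIT_REF_SLUG"]  # sanitized branch name provided by GitLab CI
dags_dir = pathlib.Path("dags")

s3 = boto3.client("s3")
for path in dags_dir.rglob("*.py"):
    key = f"{branch}/{path.relative_to(dags_dir).as_posix()}"
    s3.upload_file(str(path), BUCKET, key)
    print(f"uploaded s3://{BUCKET}/{key}")
```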

One thing we needed to ensure was that the DAG IDs would also be modified with a prefix or suffix so that the DAGs from different branches remain distinguishable (Airflow ignores the folder structure when parsing DAGs).
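
One way to do this, sketched below for recent Airflow 2.x, is to derive the suffix from the branch folder the DAG file was synced into; the DAG itself is a hypothetical placeholder.

```python
# daily_report.py -- example DAG whose ID carries the branch folder it lives in,
# e.g. .../dags/feature-new-report/daily_report.py -> dag_id "daily_report__feature-new-report"
import pathlib
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

branch = pathlib.Path(__file__).resolve().parent.name  # branch-named folder created by the sync

with DAG(
    dag_id=f"daily_report__{branch}",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    EmptyOperator(task_id="placeholder")
```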

We also wrote an init container for the Airflow workers, since they need to load the DAGs as well. We wanted the workers to load only the necessary DAGs, not those from all the different branches. For that, we exported an environment variable containing the DAG ID and, based on it, synced only the relevant branch folder into the worker.
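
The init step could look roughly like this sketch, which reads the DAG ID from a placeholder environment variable, extracts the branch suffix, and downloads only that prefix from the bucket; the variable, bucket, and path names are assumptions.

```python
# init_sync_dags.py -- download only the branch folder a worker's DAG actually needs.
# The environment variable name, bucket, and paths are hypothetical placeholders.
import os
import pathlib

import boto3

BUCKET = "airflow-staging-dags"                 # hypothetical bucket
DAGS_FOLDER = pathlib.Path("/opt/airflow/dags")

dag_id = os.environ["AIRFLOW_DAG_ID"]           # e.g. "daily_report__feature-new-report"
branch = dag_id.rsplit("__", 1)[-1]             # suffix added by the branch-aware DAG IDs

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{branch}/"):
    for obj in page.get("Contents", []):
        target = DAGS_FOLDER / obj["Key"]
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], str(target))
        print(f"fetched {obj['Key']}")
```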

We also have clean-up procedures to make sure the branch folders don’t pile up inside the bucket: lifecycle policies expire old objects, and a branch’s folder is removed once it gets merged to main.

Make GPU Tasks Scale Horizontally Using KubeRay

One of the improvements we explored to run GPU tasks more smoothly was an open-source tool called KubeRay.

KubeRay is a Kubernetes operator for managing Ray applications. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads.

It distributes GPU workloads among different pods, which cuts down training hours and also helps us manage training on spot instances.
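
From the user’s perspective, the distribution is mostly a matter of annotating functions; the sketch below assumes a reachable RayCluster head service (the address and resource sizes are placeholders).

```python
# Minimal sketch: fan GPU work out across Ray worker pods.
# The head service address and the number of shards are hypothetical placeholders.
import ray

ray.init(address="ray://raycluster-head-svc:10001")  # hypothetical Ray client endpoint

@ray.remote(num_gpus=1)
def train_shard(shard_id: int) -> str:
    # Each call is scheduled onto whichever worker pod has a free GPU.
    return f"shard {shard_id} trained"

results = ray.get([train_shard.remote(i) for i in range(4)])
print(results)
```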

KubeRay on Kubernetes

We provided a Ray cluster that is in charge of distributing Ray jobs among different pods. Using the exposed web dashboard, users can follow their jobs’ status and take control of them.
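
Jobs can also be submitted through the same dashboard endpoint, so their status shows up in the UI; the following sketch uses Ray’s job submission client with a placeholder dashboard address, entrypoint, and working directory.

```python
# Minimal sketch: submit a training job to the Ray cluster and check its status.
# Dashboard address, entrypoint, and working directory are hypothetical placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    entrypoint="python train.py",               # hypothetical training script
    runtime_env={"working_dir": "./training"},  # local folder shipped to the cluster
)
print(job_id, client.get_job_status(job_id))
```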

KubeRay Cluster Dashboard

The BI & DS team is currently testing this feature, and the adjoe engineering team is excited to share more learnings as we discover them in upcoming articles.

Increase Cost Transparency with Kubecost

KubeCost dashboard example from KubeCost.com

One piece of feedback we got along the way was about cost transparency: both the BI & DS and Cloud Engineering teams wanted to optimize costs on their ends.

We had provided some tooling for that purpose but felt the need to align on the options. So we wrote documentation and organized a knowledge-sharing session to discuss which tools we could use for cost monitoring.

We are currently using Kubecost as our primary tool to monitor costs within our EKS cluster.

Our services, which include the EKS cluster as well as JupyterHub, Airflow, Spark, MLflow, and others, require regular cost monitoring by all data scientists. It ensures we have a clear understanding of the ongoing costs associated with our workloads over time.
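
Beyond the dashboard, costs can also be pulled programmatically for recurring reports. This minimal sketch queries Kubecost’s Allocation API, assuming the cost-analyzer service is port-forwarded to localhost; the address, window, and aggregation are placeholders.

```python
# Minimal sketch: pull per-namespace cost totals from Kubecost's Allocation API.
# Assumes something like:
#   kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090
import requests

resp = requests.get(
    "http://localhost:9090/model/allocation",
    params={"window": "7d", "aggregate": "namespace"},
    timeout=30,
)
resp.raise_for_status()

for allocation_set in resp.json().get("data", []):
    for namespace, allocation in allocation_set.items():
        print(namespace, round(allocation.get("totalCost", 0.0), 2))
```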

We also use ready-made AWS cost templates to monitor the AWS costs of the DS environment. They cover spot and on-demand instance costs, the costs of different storage volume types, and some network costs.

Since Athena is one of the AWS services the BI & DS team uses most (for running different analyses), we prepared weekly reports on query costs per workgroup. Users can check which queries generate the most cost and how they can be improved.
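
A report along these lines can be built from Athena’s query execution statistics. The sketch below sums the data scanned for one placeholder workgroup and converts it to an approximate cost, assuming the usual price of roughly $5 per TB scanned; a real weekly report would also filter executions by submission time and paginate past the first 50 results.

```python
# Minimal sketch: estimate Athena spend per workgroup from data-scanned statistics.
# The workgroup name is a hypothetical placeholder; pricing is a rough assumption.
import boto3

PRICE_PER_TB = 5.0             # approximate on-demand Athena price per TB scanned
WORKGROUP = "bi-data-science"  # hypothetical workgroup

athena = boto3.client("athena")
query_ids = athena.list_query_executions(WorkGroup=WORKGROUP, MaxResults=50)["QueryExecutionIds"]

scanned_bytes = 0
if query_ids:
    executions = athena.batch_get_query_execution(QueryExecutionIds=query_ids)["QueryExecutions"]
    for execution in executions:
        scanned_bytes += execution.get("Statistics", {}).get("DataScannedInBytes", 0)

estimated_cost = scanned_bytes / 1_000_000_000_000 * PRICE_PER_TB
print(f"{WORKGROUP}: ~{scanned_bytes / 1e9:.1f} GB scanned, ~${estimated_cost:.2f}")
```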


By providing all this tooling and documentation, we ensure that unpredictable costs aren’t hidden in our workloads and that users can analyze their spend and lower the costs of their workloads. The adjoe Cloud Engineering team also monitors costs and helps them optimize their workloads.

Aligning Data Science and Cloud Engineering

The intersection of great tech and collaboration drives everything we build. The reason for these successful improvements is the ongoing feedback loop between the BI & Data Science team and the Cloud Engineering team. 

Before focusing on technical support, it’s important to make sure there’s a clear and easy way for teams to share feedback and work together. 

One of the main lessons here isn’t just about the technical side, but about how important teamwork and good communication are, something we want to keep building on and improving over time. For more insights on building great tech solutions, explore the adjoe Engineers’ Blog.
