
Reproducible Machine Learning and Experiment Tracking Pipeline with Python and DVC

Deep Learning, Machine Learning, DVC, Reproducibility · 5 min read


TL;DR Learn how to build a reproducible ML pipeline using DVC and Python. You’ll build an end-to-end example with 2 experiments and compare model evaluation metrics between them.

In this tutorial, you’ll build a complete reproducible ML pipeline with Python and DVC. The approach is ML library/toolkit agnostic, but we’ll use scikit-learn.

  • Source code on GitHub

Here’s what we’ll go over:

  • Why your work must be reproducible
  • Overview of DVC
  • Create a new ML project from scratch
  • Add the first (baseline) experiment
  • Add DVC to the project
  • Build a complete ML pipeline
  • Add a second experiment
  • Compare the evaluation metrics between experiments

Reproducibility crisis?

Imagine a paper proposing a new method for solving a task, improving the main metric by 10%. WOW! New SOTA! Or is it?

Reproducing the experiments is the only way to see for yourself. As a bonus, you’ll get a deeper understanding of the method. But how easy is it to do?

Unfortunately, many authors don’t include their source code when publishing a paper. The reproducibility crisis is real! To combat this, some major ML conferences (NeurIPS and ICML) have introduced requirements to ensure reproducibility. The reproducibility checklist is one effort to summarise the main points. Things are getting better, but improvements are still needed.

Experimenting with ML boils down to writing and reading (a lot of) code. And what do you do when you want to find the truth? You go to the source! The source code (Yeah, I am watching too much Dom Mazzetti).

Reproducibility in the real world (a.k.a. your work)

All of this is great, but should you care? After all, you’re using ML in the real world!

You should care even more! The only good way to check if a piece of code is doing what the author intended is to show it to a lot of people. ML projects involve a lot more than “regular” code, though. Making your experiments hard to reproduce is a sure way to make someone give up on the review and go with a “f*ck it, I am out”.

Ok, how do you make your experiments reproducible?

Reproducing ML experiments with DVC

DVC stands for Data Version Control. It is a free and open-source project that helps you version control your experiments, store large files (on a variety of storage services), track metrics, and build completely reproducible pipelines.

Remotes

DVC doesn’t store your big files itself. Instead, it keeps small metafiles that point to the location of the actual files. Those storage locations are known as remotes. Here are some of the remotes that DVC supports:

  • Amazon S3
  • Microsoft Azure Blob Storage
  • Google Drive
  • Google Cloud Storage
  • SSH
  • Hadoop Distributed File System (HDFS)
  • HTTP and HTTPS protocols
  • Directory on your file system (local)

End-to-end example

We’ll have a look at a complete ML experiment and integrate it with DVC.

The data we’re going to use is a set of Udemy course listings - 3,682 courses from 4 different subjects. The objective is to predict the number of students for each course.

Pretty much every ML pipeline can be boiled down to the following steps (this can be a never-ending cycle):

  • Create dataset
  • Create features
  • Train a model
  • Evaluate the model
  • Deploy the model (if better than previous)

In this example, we’ll skip the deployment altogether and focus on experimenting.

The first experiment

One of the good things about DVC is that you can put off the integration until the very end of your first experiment. We’ll do just that - start with a plain old Python project.

Here’s the initial file structure:

.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor (dir)

The studentpredictor directory will hold the source code, while assets will contain data and DVC related files.

We’ll manage the dependencies using Pipenv. Here are the contents of the Pipfile:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
black = "==19.10b0"
isort = "*"
flake8 = "*"

[packages]
dvc = "*"
gdown = "*"
pandas = "*"
scikit-learn = "*"

[requires]
python_version = "3.8"

Run this command in the root of your project once you add the file:

pipenv install --dev

We’ll store the config as source code in the studentpredictor/config.py file:

from pathlib import Path


class Config:
    RANDOM_SEED = 42
    ASSETS_PATH = Path("./assets")
    ORIGINAL_DATASET_FILE_PATH = ASSETS_PATH / "original_dataset" / "udemy_courses.csv"
    DATASET_PATH = ASSETS_PATH / "data"
    FEATURES_PATH = ASSETS_PATH / "features"
    MODELS_PATH = ASSETS_PATH / "models"
    METRICS_FILE_PATH = ASSETS_PATH / "metrics.json"

Create your dataset

The first step is to get the dataset. I’ve already uploaded the CSV file to Google Drive. Add the studentpredictor/create_dataset.py file with the following contents:

import gdown
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from config import Config

np.random.seed(Config.RANDOM_SEED)

Config.ORIGINAL_DATASET_FILE_PATH.parent.mkdir(parents=True, exist_ok=True)
Config.DATASET_PATH.mkdir(parents=True, exist_ok=True)

gdown.download(
    "https://drive.google.com/uc?id=1gkYBOIMm8pAGunRoI3OzQHQrgOdaRjfC",
    str(Config.ORIGINAL_DATASET_FILE_PATH),
)

df = pd.read_csv(str(Config.ORIGINAL_DATASET_FILE_PATH))

df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=Config.RANDOM_SEED,
)

df_train.to_csv(str(Config.DATASET_PATH / "train.csv"), index=None)
df_test.to_csv(str(Config.DATASET_PATH / "test.csv"), index=None)

We make all necessary directories and split the data into train and test. The resulting data frames are saved as CSV.
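If you want to double-check the split before moving on, here is a quick, optional sanity check (not part of the pipeline). It assumes you run it from the project root after create_dataset.py has finished:

import pandas as pd

# The split should be roughly 80/20 and the target column should be present
train_df = pd.read_csv("assets/data/train.csv")
test_df = pd.read_csv("assets/data/test.csv")

print("train:", train_df.shape, "test:", test_df.shape)
print(train_df.num_subscribers.describe())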

Create features

We’ll do some simple feature engineering to keep this part easy to understand. Create the studentpredictor/create_features.py file and fill it with this:

from datetime import date

import pandas as pd

from config import Config

Config.FEATURES_PATH.mkdir(parents=True, exist_ok=True)

train_df = pd.read_csv(str(Config.DATASET_PATH / "train.csv"))
test_df = pd.read_csv(str(Config.DATASET_PATH / "test.csv"))


def extract_features(df):
    df["published_timestamp"] = pd.to_datetime(df.published_timestamp).dt.date
    df["days_since_published"] = (date.today() - df.published_timestamp).dt.days
    return df[["num_lectures", "price", "days_since_published", "content_duration"]]


train_features = extract_features(train_df)
test_features = extract_features(test_df)

train_features.to_csv(str(Config.FEATURES_PATH / "train_features.csv"), index=None)
test_features.to_csv(str(Config.FEATURES_PATH / "test_features.csv"), index=None)

train_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "train_labels.csv"), index=None
)
test_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "test_labels.csv"), index=None
)

The only real feature we’re creating is the days_since_published. We get it from the published date of the course. We’re saving the features and labels as CSV files.
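If the date arithmetic looks unfamiliar, here is a tiny standalone sketch of the same idea - it uses a couple of made-up timestamps (not from the dataset) and timezone-aware datetimes instead of plain dates:

import pandas as pd

# Hypothetical publish timestamps, just to illustrate the computation
demo = pd.DataFrame(
    {"published_timestamp": ["2017-01-18T20:58:58Z", "2019-06-05T01:22:20Z"]}
)

published = pd.to_datetime(demo.published_timestamp, utc=True)
# Elapsed days between "now" and each publish date
days_since_published = (pd.Timestamp.now(tz="UTC") - published).dt.days
print(days_since_published)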

Train a model

We’ll start with a baseline model. In this case - Linear Regression. Put this into studentpredictor/train_model.py:

import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

model = LinearRegression()
model = model.fit(X_train, y_train.to_numpy().ravel())

pickle.dump(model, open(str(Config.MODELS_PATH / "model.pickle"), "wb"))

We dump the trained model with pickle. Ready to evaluate that bad boy!
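Before that, if you want to poke at the trained model outside the pipeline, a small smoke test could look like this (again, run from the project root, after train_model.py has finished):

import pickle

import pandas as pd

# Load the pickled model and predict for a few test rows - just a smoke test
with open("assets/models/model.pickle", "rb") as f:
    model = pickle.load(f)

X_test = pd.read_csv("assets/features/test_features.csv")
print(model.predict(X_test.head()))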

Evaluation

We’ll focus on two metrics: RMSE and R². Here is the studentpredictor/evaluate_model.py file:

import json
import pickle

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

from config import Config

X_test = pd.read_csv(str(Config.FEATURES_PATH / "test_features.csv"))
y_test = pd.read_csv(str(Config.FEATURES_PATH / "test_labels.csv"))

model = pickle.load(open(str(Config.MODELS_PATH / "model.pickle"), "rb"))

r_squared = model.score(X_test, y_test)

y_pred = model.predict(X_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

with open(str(Config.METRICS_FILE_PATH), "w") as outfile:
    json.dump(dict(r_squared=r_squared, rmse=rmse), outfile)
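For reference, the RMSE we store is just the square root of the mean squared error, which keeps the metric in the same units as the target (number of subscribers):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$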

We’re writing the resulting metrics to a JSON file. How are we going to use it? More on that later.

The project structure should now look like this:

.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor
    ├── config.py
    ├── create_dataset.py
    ├── create_features.py
    ├── evaluate_model.py
    └── train_model.py

Adding DVC

You’ll interact with DVC mostly via the CLI. It is a tool that plays nice with Git (it understands tags and branches) and is language agnostic.

Initialize DVC

dvc init

and add remote storage (local in this case)

dvc remote add -d localremote /tmp/dvc-storage

disable analytics (optional)

dvc config core.analytics false

This is a good place for a checkpoint:

git add .
git commit -m "Add DVC config"
git push

Building a pipeline

We’re ready to build the pipeline. DVC creates a graph with dependencies and outputs for each stage.

We’ll use dvc run to make each step reproducible. Let’s start with the dataset:

dvc run -f assets/data.dvc \
    -d studentpredictor/create_dataset.py \
    -o assets/data \
    python studentpredictor/create_dataset.py

Let’s dissect what is happening here:

  • -f assets/data.dvc saves the metafile used by DVC to reproduce this step
  • -d studentpredictor/create_dataset.py adds this script as a dependency for this step
  • -o assets/data tells DVC that the outputs of this step will be stored in that directory

Finally, we invoke the script that will do the actual work.

The stage for feature creation looks like this:

dvc run -f assets/features.dvc \
    -d studentpredictor/create_features.py \
    -d assets/data \
    -o assets/features \
    python studentpredictor/create_features.py

Importantly, we add assets/data as a dependency for this step. This chains the stages together - if the data changes, DVC knows this step must be rerun.

You can probably figure out the training stage:

dvc run -f assets/models.dvc \
    -d studentpredictor/train_model.py \
    -d assets/features \
    -o assets/models \
    python studentpredictor/train_model.py

The final stage - evaluation:

dvc run -f assets/evaluate.dvc \
    -d studentpredictor/evaluate_model.py \
    -d assets/features \
    -d assets/models \
    -M assets/metrics.json \
    python studentpredictor/evaluate_model.py

You’ll notice that this step doesn’t specify any outputs. Instead, we pass -M assets/metrics.json, which tells DVC that this is a metrics file (JSON and text files are currently supported).

Your first DVC pipeline is complete. Let’s save the progress:

git add .
git commit -m "Linear Regression experiment with DVC"
git push

We’ll also create a tag for the experiment (you’ll see why in a second):

git tag -a "lr-experiment" -m "Experiment with Linear Regression"

Now we can use some DVC magic to see the evaluation metrics for our model:

dvc metrics show -T

This should output something like this:

lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257

Experimenting with Random Forest

Why did we do all this work? Was it all worth it?

Let’s start a second experiment with a Random Forest regressor. Replace the contents of studentpredictor/train_model.py:

import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

model = RandomForestRegressor(
    n_estimators=150, max_depth=6, random_state=Config.RANDOM_SEED
)
model = model.fit(X_train, y_train.to_numpy().ravel())

pickle.dump(model, open(str(Config.MODELS_PATH / "model.pickle"), "wb"))

Let’s reproduce the complete pipeline using the new regressor:

dvc repro assets/evaluate.dvc

DVC is smart enough to rerun only the steps that have changed and rewrite its internal graph.

Let’s save the second experiment:

git add .
git commit -m "Add Random Forest experiment"
git push

and create a tag for it:

git tag -a "rf-experiment" -m "Experiment with Random Forest"

We can now compare the two experiments:

dvc metrics show -T

lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257
rf-experiment:
    assets/metrics.json:
        r_squared: 0.15391037892455683
        rmse: 6348.533500735664

You can do the same thing with branches, too (if that is your thing).

Summary

You can now build complete, reproducible ML pipelines with Python and DVC. Note that you can do it with any ML library/toolkit. How would you apply this to your experiments?

Here’s what we did:

  • Discussed why your work must be reproducible
  • Got an overview of DVC
  • Created a new ML project from scratch
  • Added the first (baseline) experiment
  • Added DVC to the project
  • Built a complete ML pipeline
  • Added a second experiment
  • Compared the evaluation metrics between experiments

Do you make your experiments reproducible? How do you do it? How do you track your metrics? I’m waiting for your answers in the comments below!
