Building an End-to-End Machine Learning Prototype

TL;DR: Build a complete Machine Learning project skeleton that serves as a baseline for future improvements.

Every Machine Learning project follows a similar set of steps to (hopefully) deliver the goods. Many attempts will fail, but the successful ones go through the following steps, iteratively:

The lifecycle of a Machine Learning project

  • Planning/choosing a goal

  • Data collection & labeling

  • Creating features and preprocessing

  • Training and optimization

  • Deployment and testing

Planning

Choosing what to work on and how you’ll measure success is the most important part of the project.

Get help here! Ask domain experts and business people, or meditate on it. Do spend some time thinking about it, but don’t linger for too long. Analysis paralysis is a very common phenomenon in the real world!

Prototyping a baseline model

Once you’ve chosen your goal, you’re ready to get your hands dirty.

Jupyter notebooks are a great prototyping/experimentation tool. You can use a notebook (or several) to get a quick read on the feasibility and performance of a model.

After you get some results, you’ll proceed to create a full-blown project containing the baseline model. This will be a lot of work, but the rewards include bug fixes, new ideas, and (hopefully) something with real-world impact.

Next, we’ll go over an example task of automating the decision of whether a bank customer is a good or bad credit risk.

Data collection

If you haven’t solved any real-world ML problems yet, you might believe that most datasets are stored as CSV files. While this format is great when learning, you’ll often need to work with SQL databases, pickle, HDF5, Parquet, and many other formats.
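To make this concrete, here’s a sketch of loading a few of those formats with pandas. The file names and the customers table are hypothetical; Parquet and HDF5 support need the pyarrow and pytables extras, respectively:

import pandas as pd
import sqlite3

# Hypothetical file names - swap in your own data sources
df_csv = pd.read_csv("customers.csv")                  # plain text, great for learning
df_parquet = pd.read_parquet("customers.parquet")      # columnar, compressed (needs pyarrow)
df_hdf = pd.read_hdf("customers.h5", key="customers")  # hierarchical, partial reads (needs pytables)
df_pickle = pd.read_pickle("customers.pkl")            # Python-only, fragile across versions

# Reading from a SQL database (here, a local SQLite file)
with sqlite3.connect("bank.db") as connection:
  df_sql = pd.read_sql("SELECT * FROM customers", connection)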

Labeling

Sometimes your data will have labels, but it might not be exactly the data you need. Other times, you’ll be missing labels altogether. What can you do?

Creating a labeling infrastructure will be time well spent, for sure. Increasing the size of your dataset and reducing the noise in it (having fewer mislabeled examples) will dramatically increase the predictive power of your model(s).

You will always want more and cleaner data. So keep collecting it, slowly but surely. At some point, you’ll start getting diminishing returns. Depending on the problem you’re solving, it might be a good idea to focus on other issues in the project.

Looking at the data

This might sound boring or like complete nonsense, but go through different examples from the data. Can you figure out the label for each one? Are the labels correct? Are the labels consistent? (We’ll do a quick version of this right after loading the data below.)

Remember, feeding your model with crappy data will give you garbage results. At this stage of the project, you don’t want to deal with crap - there will be plenty later on.

Get the data

We’ll build a complete prototype of a system that predicts the credit risk for a given bank customer. The data describes the clients using a set of attributes of mixed types (similar to a real-world scenario).

We’ll use the dataset download utility from scikit-learn:

from sklearn import datasets

features, targets = datasets.fetch_openml(
    name="credit-g", 
    version=1, 
    return_X_y=True, 
    as_frame=True
)

features.shape, targets.shape
((1000, 20), (1000,))

1,000 examples aren’t really that much. Hopefully, you’ll have a lot more data when solving real-world problems. Let’s have a look:

features.head()
| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | 4.0 | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes |
targets.head()
0    good
1     bad
2    good
3    good
4     bad
Name: class, dtype: category
Categories (2, object): ['good', 'bad']
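Following the advice from the previous section, let’s also eyeball a few random examples with features and label side by side (a quick sketch):

import pandas as pd

# A few random examples with features and label in one frame
examples = pd.concat([features, targets], axis=1)
examples.sample(n=5)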

Let’s have a quick look at the class distribution:

targets.value_counts()
good    700
bad     300
Name: class, dtype: int64

The classes are split 70/30, so we can’t trust a simple accuracy score to evaluate our model. But we’ll delve deeper into this in later parts.
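To make the problem concrete: a “model” that always predicts good already scores 70% accuracy. scikit-learn’s DummyClassifier gives you such sanity-check baselines (a minimal sketch - DummyClassifier ignores the input features entirely):

from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class ("good")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, targets)
baseline.score(features, targets)
0.7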

Note on reproducibility

Making your results completely reproducible can be hard work. You might start with something like this:

import random
import numpy as np
import torch

RANDOM_SEED = 42

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED + 1)
torch.manual_seed(RANDOM_SEED + 2)
<torch._C.Generator at 0x7f1968267ae0>

But randomness can bite you when you least expect it:

  • Your dataset generation scripts might not generate the same data

  • Some records might be updated/deleted

  • Training on different/multiple GPUs can produce different results

Can you fix all of this? Probably, but some steps are more important to reproduce than others. For example, you’d better make sure you’re testing your models on the exact same dataset every time.
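For the dataset part, one cheap safeguard is to fingerprint the data and assert the fingerprint in your scripts; for GPUs, PyTorch exposes determinism switches. A minimal sketch of both (the digest value will differ on your machine):

import pandas as pd

# Fingerprint the dataset - if a single value changes, so does the digest
dataset_digest = pd.util.hash_pandas_object(features, index=True).sum()

# Trade speed for determinism when training on CUDA
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False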

Feature engineering

One of the advantages of Deep Neural Networks is that they automate the process of feature engineering. At least, that was the grand promise.

In practice, adding manual features might significantly improve the performance of your model. But creating good features is black magic - ideas for them almost always come from spending absurd amounts of time with the raw data.

Start by thinking of a couple of features and encoding them. Use classical ML algorithms (like Random Forest) to evaluate their importance - we’ll do exactly that right after training the Random Forest below. Those features will be prime candidates for inclusion in your Deep Learning model later on.

FEATURES = [
  "duration", 
  "credit_amount", 
  "age", 
  "existing_credits", 
  "residence_since"
]
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(
  features[FEATURES],
  targets,
  test_size=0.2,
  random_state=RANDOM_SEED  # fix the split, as discussed in the reproducibility note
)

label_encoder = preprocessing.LabelEncoder()
label_encoder = label_encoder.fit(y_train)

X_train = X_train.to_numpy()
y_train = label_encoder.transform(y_train)

X_test = X_test.to_numpy()
y_test = label_encoder.transform(y_test)
X_train.shape, y_train.shape
((800, 5), (800,))
X_test.shape, y_test.shape
((200, 5), (200,))

Training and evaluation

Training a Deep Neural Net using any of the popular Deep Learning libraries is relatively straightforward. That is, as long as you keep playing with toy examples.

In practice, the training might include a lot of hacks that change the generic process just a bit - enough to introduce bugs and write tons of incomprehensible code.

Using a library like scikit-learn is a great first choice for building a baseline model. It takes very little time, the code is easier to understand, and you can gain a lot of insight into the problem you’re solving.

Here’s a quick example of how you can use a Random Forest classifier:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model = model.fit(X_train, y_train)
model.score(X_test, y_test)
0.68
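As promised in the feature engineering section, the trained Random Forest also hands us feature importances for free - a quick look:

import pandas as pd

# Impurity-based importance of each hand-picked feature
importances = pd.Series(model.feature_importances_, index=FEATURES)
importances.sort_values(ascending=False)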

But you might’ve heard that Neural Networks are better. They can be, but they require a deeper understanding.

Training a Deep Neural Net in PyTorch

Let’s create a simple Neural Network using PyTorch:

from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader

class CreditTypeDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels
      
  def __getitem__(self, index):
    return (
      torch.from_numpy(self.features[index]).float(), 
      self.labels[index]
    )

  def __len__(self):
    return len(self.features)

The Dataset is a helper that allows us to convert our data to Tensors. Wrapping it in a DataLoader lets you get batches, shuffle the data, and more:

dataset = CreditTypeDataset(X_train, y_train)
data_loader = DataLoader(dataset, batch_size=4, shuffle=True)
features_batch, labels_batch = next(iter(data_loader))

Let’s look at a sample batch:

features_batch
tensor([[6.0000e+00, 1.4555e+04, 2.3000e+01, 1.0000e+00, 2.0000e+00],
        [2.4000e+01, 2.3330e+03, 2.9000e+01, 1.0000e+00, 2.0000e+00],
        [3.0000e+01, 3.4410e+03, 2.1000e+01, 1.0000e+00, 4.0000e+00],
        [1.5000e+01, 1.4780e+03, 3.3000e+01, 2.0000e+00, 3.0000e+00]])
labels_batch
tensor([0, 1, 0, 1])

Our baseline model is a really simple one:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CreditTypeClassifierNet(nn.Module):

  def __init__(self, n_features, n_credit_types):
    super(CreditTypeClassifierNet, self).__init__()
    self.fc1 = nn.Linear(n_features, n_features * 2)
    self.fc2 = nn.Linear(n_features * 2, n_credit_types)

  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  def create_optimizer(self):
    return optim.Adam(self.parameters(), lr=0.01)

  def create_criterion(self):
    return nn.CrossEntropyLoss(reduction="none")

The two additional methods create_optimizer() and create_criterion() help us keep those two components around, wherever the model is needed.

model = CreditTypeClassifierNet(len(FEATURES), 2)
model
CreditTypeClassifierNet(
  (fc1): Linear(in_features=5, out_features=10, bias=True)
  (fc2): Linear(in_features=10, out_features=2, bias=True)
)

Training

Libraries like Keras give you an easy-to-use training interface for TensorFlow. There are libraries that give you something similar for PyTorch, too! But I am a fan of going raw as much as possible (just make sure you stay safe). If you’re not - you can use something like PyTorch Lightning.

We’ll start with a couple of helpers:

class Phase:
  TRAIN = "train"
  TEST = "test"

PHASES = [Phase.TRAIN, Phase.TEST]

class Evaluator:

  def __init__(self, criterion):
    self.criterion = criterion

  def eval(self, model, X, y, phase: Phase):
    with torch.set_grad_enabled(phase == Phase.TRAIN):
      model = model.train() if phase == Phase.TRAIN else model.eval()
      outputs = model(X)
      loss = self.criterion(outputs, y)

      _, predictions = outputs.max(dim=1)
      correct_count = torch.sum(predictions == y)

      return loss, correct_count

The current phase dictates the behavior of both helpers. The job of the Evaluator is to calculate the current loss and the number of correct predictions (note that this will only work for classification tasks).

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Progress:
  losses: list
  correct_predictions: int

  def average_loss(self):
    return torch.cat(self.losses, dim=0).mean().item()

  def accuracy(self, dataset_size):
    return self.correct_predictions.double() / dataset_size

class ProgressLogger:

  def __init__(self, dataset_sizes):
    self.progress = defaultdict(
      lambda: {Phase.TRAIN: Progress([], 0), Phase.TEST: Progress([], 0)}
    )
    self.dataset_sizes = dataset_sizes

  @staticmethod
  def _round(value, precision=3):
    return np.round(value, precision)

  def save_progress(self, epoch, phase, loss, correct_predictions):
    self.progress[epoch][phase].losses.append(loss.detach())
    self.progress[epoch][phase].correct_predictions += correct_predictions

  def log(self, epoch):
    print(f"Epoch {epoch + 1}")

    train_progress = self.progress[epoch][Phase.TRAIN]
    train_loss = ProgressLogger._round(train_progress.average_loss())
    train_accuracy = ProgressLogger._round(train_progress.accuracy(self.dataset_sizes[Phase.TRAIN]))
    print(f"Train: loss {train_loss} accuracy {train_accuracy}")

    test_progress = self.progress[epoch][Phase.TEST]
    test_loss = ProgressLogger._round(test_progress.average_loss())
    test_accuracy = ProgressLogger._round(test_progress.accuracy(self.dataset_sizes[Phase.TEST]))
    print(f"Test: loss {test_loss} accuracy {test_accuracy}")
    print()

The ProgressLogger stores the losses and correct predictions for each training example. The log() method prints out the evaluations for the current epoch and training phase.

class Trainer:

  def __init__(self, train_dataset, test_dataset, batch_size):

    train_data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # shuffle the training examples each epoch
    test_data_loader = DataLoader(test_dataset, batch_size=1)

    self.data_loaders = {
        Phase.TRAIN: train_data_loader,
        Phase.TEST: test_data_loader
    }

    self.dataset_sizes = {
        Phase.TRAIN: len(train_dataset),
        Phase.TEST: len(test_dataset)
    }

    self.logger = ProgressLogger(self.dataset_sizes)

  def train(self, model, n_epochs):
    optimizer = model.create_optimizer()
    evaluator = Evaluator(model.create_criterion())
    
    for epoch in range(n_epochs):
    
      for phase in PHASES:

        for inputs, labels in self.data_loaders[phase]:

          loss, correct_count = evaluator.eval(model, inputs, labels, phase)          
          self.logger.save_progress(epoch, phase, loss, correct_count)
          
          if phase == Phase.TRAIN:

            optimizer.zero_grad()
            loss.mean().backward()

            optimizer.step()

      self.logger.log(epoch)

    return model

The Trainer is where all the magic happens. It takes train and test datasets and delivers a trained model. Note that we’re using no reduction (reduction="none") during loss calculation, so we store the loss for each example in the dataset. This lets us calculate the average loss for the epoch when logging our progress.

train_dataset = CreditTypeDataset(X_train, y_train)
test_dataset = CreditTypeDataset(X_test, y_test)

trainer = Trainer(train_dataset, test_dataset, batch_size=8)
model = trainer.train(model, n_epochs=10)
Epoch 1
Train: loss 10.909 accuracy 0.576
Test: loss 10.847 accuracy 0.675

Epoch 2
Train: loss 7.691 accuracy 0.589
Test: loss 10.621 accuracy 0.35

Epoch 3
Train: loss 6.96 accuracy 0.61
Test: loss 8.438 accuracy 0.365

Epoch 4
Train: loss 7.173 accuracy 0.578
Test: loss 16.883 accuracy 0.335

Epoch 5
Train: loss 5.299 accuracy 0.601
Test: loss 7.084 accuracy 0.675

Epoch 6
Train: loss 5.07 accuracy 0.596
Test: loss 0.937 accuracy 0.63

Epoch 7
Train: loss 6.687 accuracy 0.59
Test: loss 5.513 accuracy 0.39

Epoch 8
Train: loss 6.026 accuracy 0.612
Test: loss 4.589 accuracy 0.39

Epoch 9
Train: loss 4.321 accuracy 0.605
Test: loss 6.343 accuracy 0.385

Epoch 10
Train: loss 4.126 accuracy 0.621
Test: loss 3.21 accuracy 0.675

Evaluation

How well will your model do in production? To answer this question, you need answers to the following two questions:

  • What resources (CPU, GPU, RAM, and disk space) does my model need to run? What is the expected response time?

  • How well will the predicted values match the real ones?

You can usually answer the first question using a variety of tools from the software development world (like time, top, and htop). However, you need to take the size of the input data into account. If you’re loading large text or image data into memory, it might exhaust the available RAM and crash your program. Make sure you know the bounds of your data.

How good are your model’s predictions? A wide variety of statistical tests is available for evaluating the performance of different models, and they are very good at what they do. But having large amounts of data changes the game a little bit. You can use simple tools like accuracy, the confusion matrix, precision, and recall, and apply appropriate thresholding. Proper evaluation of your model can be done only if you’re intimately familiar with the domain.
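As a sketch of what that might look like here, we can re-fit the Random Forest baseline (the name model now refers to the PyTorch net) and print its confusion matrix and per-class precision/recall:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Re-fit the Random Forest baseline on the same split
rf_model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
predictions = rf_model.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions, target_names=label_encoder.classes_))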

One critical step in the process is looking at errors. Where does your model make errors? You should manually go through some of them to get a feel for what’s going wrong. How do you fix those errors? One simple and effective way to make your model better is to add more data matching the conditions under which the model makes errors.
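A quick way to collect those errors for manual review - a sketch reusing predictions from above:

import pandas as pd

# Pull out the misclassified test examples
mask = predictions != y_test
errors = pd.DataFrame(X_test[mask], columns=FEATURES)
errors["predicted"] = label_encoder.inverse_transform(predictions[mask])
errors["actual"] = label_encoder.inverse_transform(y_test[mask])
errors.head()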

Deployment

Deploying your model gets your work in front of your users. It might be used by millions (if you work at a company like Google) or just by you. Either way, you’ll need to make your model available to others.

The most common way of deploying your model is behind a REST API. You can also embed it in a user’s device (e.g., in a mobile app for iOS or Android).
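To give you a taste before the next part, here’s a minimal sketch of serving the PyTorch model behind a REST API with Flask. The /predict endpoint and the payload format are assumptions for illustration, not the final design:

import torch
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
  # Hypothetical payload: {"features": [duration, credit_amount, age, existing_credits, residence_since]}
  payload = request.get_json()
  inputs = torch.tensor([payload["features"]]).float()

  # model and label_encoder come from the notebook above
  model.eval()
  with torch.no_grad():
    outputs = model(inputs)

  prediction = outputs.argmax(dim=1).item()
  return jsonify({"credit_risk": label_encoder.inverse_transform([prediction])[0]})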

Most likely, your model will not be the only thing that requires computational resources. How much time does it take to make a prediction? How much RAM/VRAM is needed? How can you avoid blocking the user while your model is making its prediction? We’ll answer those questions in the next part(s).

Next, we’ll pack our prototype into a complete Python project. We’ll deploy the baseline model and serve its predictions using a REST API.

What to do when your model isn’t working

While you can use many technical tricks to improve your model, you should start with something simpler. Ask for help! But not just anyone - ask a domain expert for their input. How do they make predictions for this task? What data do they need for those predictions? How accurate are they?

Use the insights to build a better model and gather more data. Hopefully, this will give you acceptable performance before you resort to hardcore optimization.