{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "02.building-a-prototype.ipynb", "provenance": [], "collapsed_sections": [], "authorship_tag": "ABX9TyMFVaIGYTI7PcTbwKmLwaLW" }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "o1NCK5Omuu9c" }, "source": [ "# Building an End-to-End Machine Learning Prototype\n", "\n", "> TL;DR Build a complete Machine Learning project skeleton that serves as a baseline for future improvements\n", "\n", "* [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/16AOmMyAInAcLRo0_wuAST9-HpAixb-XO?usp=sharing)\n", "\n", "Every Machine Learning project follows a similar set of steps to (hopefully) deliver the goods. Many attempts will fail, but the successful ones go through the following steps, iteratively:\n", "\n", "> The lifecycle of a Machine Learning project\n", "\n", "- Planning/choosing a goal\n", "- Data collection & labeling\n", "- Creating features and preprocessing\n", "- Training and optimization\n", "- Deployment and testing\n", "\n", "## Planning\n", "\n", "Choosing what to work on and how to measure success is the most important part of the project.\n", "\n", "Get help here! Ask domain experts, business people, meditate on it. Do spend some time thinking about it. But don't linger for too long. Analysis paralysis is a very common phenomenon in the real world!\n", "\n", "### Prototyping a baseline model\n", "\n", "Once you've chosen your goal, you're ready to get your hands dirty.\n", "\n", "Jupyter notebooks are a great prototyping/experimentation tool. You can use one or more notebooks to get some quick ideas about the feasibility and performance of a model.\n", "\n", "After you get some results, you'll proceed to create a full-blown project containing the baseline model. 
This will be a lot of work, but some rewards you might expect are bug fixes, new ideas, and developing something that (hopefully) has a real-world impact.\n", "\n", "Next, we'll go over an example task of automating the decision of whether a bank customer has good or bad credit risk.\n", "\n", "## Data collection\n", "\n", "If you haven't solved any real-world ML problems yet, you might believe that most datasets get stored as CSV files. While this format is great when learning, you'll often need to work with SQL databases, pickle, HDF5, Parquet, and many more formats.\n", "\n", "### Labeling\n", "\n", "Sometimes your data will have labels, but it might not be exactly the data you need. Other times, the labels will be missing altogether. What can you do?\n", "\n", "Creating a labeling infrastructure will be time well spent, for sure. Increasing the data size and reducing the noise in your data (having fewer wrong examples) can substantially increase the predictive power of your model(s).\n", "\n", "You will always want more and cleaner data. So keep getting it, slowly but surely. At some point, you'll notice you start getting diminishing returns. Depending on the problem you're solving, it might be a good idea to focus on other issues in the project.\n", "\n", "#### Looking at the data\n", "\n", "This might sound boring, or even like complete nonsense, but go through different examples from the data. Can you figure out the labels for each one? Are the labels correct? Are the labels consistent?\n", "\n", "Remember, feeding your model with crappy data will give you garbage results. At this stage of the project, you don't want to deal with crap - there will be plenty later on.\n", "\n", "### Get the data\n", "\n", "We'll build a complete prototype of a system that predicts the credit risk for a given bank customer. 
[The data](https://www.openml.org/d/31) describes the clients using a set of attributes of mixed types (similar to a real-world scenario).\n", "\n", "We'll use the dataset download utility from scikit-learn:" ] }, { "cell_type": "code", "metadata": { "id": "zLWr26p1hL3N", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "92887666-76db-431e-dea9-4532d3c607b1" }, "source": [ "from sklearn import datasets\n", "\n", "features, targets = datasets.fetch_openml(\n", "    name=\"credit-g\",\n", "    version=1,\n", "    return_X_y=True,\n", "    as_frame=True\n", ")\n", "\n", "features.shape, targets.shape" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((1000, 20), (1000,))" ] }, "metadata": { "tags": [] }, "execution_count": 1 } ] }, { "cell_type": "markdown", "metadata": { "id": "Effp7EARN0gp" }, "source": [ "1,000 examples aren't really that much. Hopefully, you'll have a lot more data when solving real-world problems. Let's have a look:" ] }, { "cell_type": "code", "metadata": { "id": "g0Gh4sIYjIYp", "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "outputId": "01ad3a34-2070-4d3a-f6e6-19d11893ee1f" }, "source": [ "features.head()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>checking_status</th><th>duration</th><th>credit_history</th><th>purpose</th><th>credit_amount</th><th>savings_status</th><th>employment</th><th>installment_commitment</th><th>personal_status</th><th>other_parties</th><th>residence_since</th><th>property_magnitude</th><th>age</th><th>other_payment_plans</th><th>housing</th><th>existing_credits</th><th>job</th><th>num_dependents</th><th>own_telephone</th><th>foreign_worker</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>&lt;0</td><td>6.0</td><td>critical/other existing credit</td><td>radio/tv</td><td>1169.0</td><td>no known savings</td><td>&gt;=7</td><td>4.0</td><td>male single</td><td>none</td><td>4.0</td><td>real estate</td><td>67.0</td><td>none</td><td>own</td><td>2.0</td><td>skilled</td><td>1.0</td><td>yes</td><td>yes</td></tr>\n",
"    <tr><th>1</th><td>0&lt;=X&lt;200</td><td>48.0</td><td>existing paid</td><td>radio/tv</td><td>5951.0</td><td>&lt;100</td><td>1&lt;=X&lt;4</td><td>2.0</td><td>female div/dep/mar</td><td>none</td><td>2.0</td><td>real estate</td><td>22.0</td><td>none</td><td>own</td><td>1.0</td><td>skilled</td><td>1.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>2</th><td>no checking</td><td>12.0</td><td>critical/other existing credit</td><td>education</td><td>2096.0</td><td>&lt;100</td><td>4&lt;=X&lt;7</td><td>2.0</td><td>male single</td><td>none</td><td>3.0</td><td>real estate</td><td>49.0</td><td>none</td><td>own</td><td>1.0</td><td>unskilled resident</td><td>2.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>3</th><td>&lt;0</td><td>42.0</td><td>existing paid</td><td>furniture/equipment</td><td>7882.0</td><td>&lt;100</td><td>4&lt;=X&lt;7</td><td>2.0</td><td>male single</td><td>guarantor</td><td>4.0</td><td>life insurance</td><td>45.0</td><td>none</td><td>for free</td><td>1.0</td><td>skilled</td><td>2.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>4</th><td>&lt;0</td><td>24.0</td><td>delayed previously</td><td>new car</td><td>4870.0</td><td>&lt;100</td><td>1&lt;=X&lt;4</td><td>3.0</td><td>male single</td><td>none</td><td>4.0</td><td>no known property</td><td>53.0</td><td>none</td><td>for free</td><td>2.0</td><td>skilled</td><td>2.0</td><td>none</td><td>yes</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " checking_status duration ... own_telephone foreign_worker\n", "0 <0 6.0 ... yes yes\n", "1 0<=X<200 48.0 ... none yes\n", "2 no checking 12.0 ... none yes\n", "3 <0 42.0 ... none yes\n", "4 <0 24.0 ... none yes\n", "\n", "[5 rows x 20 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 2 } ] }, { "cell_type": "code", "metadata": { "id": "-OwBps8WjnFd", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "75bcd159-8c48-4949-ca2c-a3cb22774458" }, "source": [ "targets.head()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 good\n", "1 bad\n", "2 good\n", "3 good\n", "4 bad\n", "Name: class, dtype: category\n", "Categories (2, object): ['good', 'bad']" ] }, "metadata": { "tags": [] }, "execution_count": 3 } ] }, { "cell_type": "markdown", "metadata": { "id": "EbYXMMhHVBVm" }, "source": [ "Let's have a quick look at the class distribution:" ] }, { "cell_type": "code", "metadata": { "id": "BPGdO9NDjRvj", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "a19cd49b-3b20-4ef5-cb50-0f8cbe3c6abe" }, "source": [ "targets.value_counts()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "good 700\n", "bad 300\n", "Name: class, dtype: int64" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "cell_type": "markdown", "metadata": { "id": "_OdRJPU6b6Ps" }, "source": [ "The classes are imbalanced: a model that always predicts `good` would already be 70% accurate. That should tell you that we can't trust a simple accuracy score to evaluate our model. But we'll delve deeper into this in later parts." ] }, { "cell_type": "markdown", "metadata": { "id": "UKdbVd65hRB3" }, "source": [ "### Note on reproducibility\n", "\n", "Making your results completely reproducible can be hard work. 
You might start with something like this:" ] }, { "cell_type": "code", "metadata": { "id": "NjHmi34rhUHs", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "a4c6da13-198f-4247-bf82-1f4c3ed21210" }, "source": [ "import random\n", "import numpy as np\n", "import torch\n", "\n", "RANDOM_SEED = 42\n", "\n", "# Seed every random number generator we rely on\n", "random.seed(RANDOM_SEED)\n", "np.random.seed(RANDOM_SEED + 1)\n", "torch.manual_seed(RANDOM_SEED + 2)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 5 } ] }, { "cell_type": "markdown", "metadata": { "id": "KGaMew3Pfj1u" }, "source": [ "But randomness can bite you when you least expect it:\n", "\n", "- Your dataset generation scripts might not generate the same data\n", "- Some records might be updated/deleted\n", "- Training on different/multiple GPUs can produce different results\n", "\n", "Can you fix all of this? Probably, but some steps are more important to reproduce than others. For example, you'd better test your models on the same dataset every time." ] }, { "cell_type": "markdown", "metadata": { "id": "LPsMlYol0xR8" }, "source": [ "## Feature engineering\n", "\n", "One of the advantages of Deep Neural Networks is that they automate the process of feature engineering. At least, that was the grand promise.\n", "\n", "In practice, adding manual features might significantly improve the performance of your model. But creating good features is black magic. Ideas for them almost always come from spending absurd amounts of time with the raw data.\n", "\n", "Start by thinking of a couple of features and encode them. Use classical ML algorithms (like Random Forest) to evaluate their importance. Those features will be prime candidates for inclusion in your Deep Learning model later on."
] }, { "cell_type": "code", "metadata": { "id": "VN9NAzHGRmv-" }, "source": [ "FEATURES = [\n", "    \"duration\",\n", "    \"credit_amount\",\n", "    \"age\",\n", "    \"existing_credits\",\n", "    \"residence_since\"\n", "]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "VjpbflwHTiIk" }, "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn import preprocessing\n", "\n", "# Fix the split so that every model is tested on the same examples\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    features[FEATURES],\n", "    targets,\n", "    test_size=0.2,\n", "    random_state=RANDOM_SEED\n", ")\n", "\n", "# Encode the string labels as integers\n", "label_encoder = preprocessing.LabelEncoder()\n", "label_encoder = label_encoder.fit(y_train)\n", "\n", "X_train = X_train.to_numpy()\n", "y_train = label_encoder.transform(y_train)\n", "\n", "X_test = X_test.to_numpy()\n", "y_test = label_encoder.transform(y_test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "AdNUJ8vcV6tG", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "11f83cf7-d3d2-4e80-fb04-7dfc8aae3dce" }, "source": [ "X_train.shape, y_train.shape" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((800, 5), (800,))" ] }, "metadata": { "tags": [] }, "execution_count": 8 } ] }, { "cell_type": "code", "metadata": { "id": "H8JOKrC_Ba5p", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "ddb2ffe0-1d26-4534-c669-cb4d5c3305f7" }, "source": [ "X_test.shape, y_test.shape" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((200, 5), (200,))" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "markdown", "metadata": { "id": "AuCyNBEtCJKN" }, "source": [ "## Training and evaluation\n", "\n", "Training a Deep Neural Net using any of the popular libraries for Deep Learning is relatively straightforward. 
That is, as long as you keep playing with toy examples.\n", "\n", "In practice, the training might include a lot of hacks that change the generic process just a bit - enough to introduce bugs and produce tons of incomprehensible code.\n", "\n", "Using a library like scikit-learn is a great first choice for building a baseline model. It takes very little time, the code is easier to understand, and you can gain a lot of insight into the problem you're solving.\n", "\n", "Here's a quick example of how you can use a Random Forest classifier:" ] }, { "cell_type": "code", "metadata": { "id": "AtLtUfIacGkY", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9e5bdf83-092c-4f2a-e2ae-b5d309eded86" }, "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model = RandomForestClassifier(n_estimators=200)\n", "model = model.fit(X_train, y_train)\n", "model.score(X_test, y_test)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.68" ] }, "metadata": { "tags": [] }, "execution_count": 10 } ] }, { "cell_type": "markdown", "metadata": { "id": "tNjnpKzk2Q5b" }, "source": [ "But you might've heard that Neural Networks are better. They can be, but they require a deeper understanding."
] }, { "cell_type": "markdown", "metadata": { "id": "1t107ZfNb3NC" }, "source": [ "### Training a Deep Neural Net in PyTorch\n", "\n", "Let's create a simple Neural Network using PyTorch. We'll start with a dataset wrapper:" ] }, { "cell_type": "code", "metadata": { "id": "sdmVbSnCt5uk" }, "source": [ "from torch.utils.data.dataset import Dataset\n", "from torch.utils.data import DataLoader\n", "\n", "class CreditTypeDataset(Dataset):\n", "    def __init__(self, features, labels):\n", "        self.features = features\n", "        self.labels = labels\n", "\n", "    def __getitem__(self, index):\n", "        # Return one (features, label) pair; the features become a float tensor\n", "        return (\n", "            torch.from_numpy(self.features[index]).float(),\n", "            self.labels[index]\n", "        )\n", "\n", "    def __len__(self):\n", "        return len(self.features)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zCnBWgxrp9Dt" }, "source": [ "The `CreditTypeDataset` is a helper that converts our data to Tensors. Wrapping it in a `DataLoader` lets you get batches, shuffle the data, and more:" ] }, { "cell_type": "code", "metadata": { "id": "fxTyOFkSt7-2" }, "source": [ "dataset = CreditTypeDataset(X_train, y_train)\n", "data_loader = DataLoader(dataset, batch_size=4, shuffle=True)\n", "features_batch, labels_batch = next(iter(data_loader))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Fd_JIMZ9qEWi" }, "source": [ "Let's look at a sample batch:" ] }, { "cell_type": "code", "metadata": { "id": "LflHvaHluQbZ", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "6f0e7c5d-a0a2-4c59-88dc-56b9b4a9c1ac" }, "source": [ "features_batch" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "tensor([[6.0000e+00, 1.4555e+04, 2.3000e+01, 1.0000e+00, 2.0000e+00],\n", "        [2.4000e+01, 2.3330e+03, 2.9000e+01, 1.0000e+00, 2.0000e+00],\n", "        [3.0000e+01, 3.4410e+03, 2.1000e+01, 1.0000e+00, 4.0000e+00],\n", "        [1.5000e+01, 1.4780e+03, 3.3000e+01, 2.0000e+00, 3.0000e+00]])" ] }, "metadata": { "tags": [] }, "execution_count": 13 } ] }, { "cell_type": "code", "metadata": { "id": "wSa9OdhIuSDp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3c557fb6-d2da-4d49-db12-8a9b68125389" }, "source": [ "labels_batch" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "tensor([0, 1, 0, 1])" ] }, "metadata": { "tags": [] }, "execution_count": 14 } ] }, { "cell_type": "markdown", "metadata": { "id": "WCqkhI6iz5ma" }, "source": [ "Our baseline model is a really simple one:" ] }, { "cell_type": "code", "metadata": { "id": "YEanO7_kb4Ka" }, "source": [ "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "\n", "class CreditTypeClassifierNet(nn.Module):\n", "\n", "    def __init__(self, n_features, n_credit_types):\n", "        super(CreditTypeClassifierNet, self).__init__()\n", "        self.fc1 = nn.Linear(n_features, n_features * 2)\n", "        self.fc2 = nn.Linear(n_features * 2, n_credit_types)\n", "\n", "    def forward(self, x):\n", "        x = F.relu(self.fc1(x))\n", "        # Return raw logits; CrossEntropyLoss applies log-softmax itself\n", "        x = self.fc2(x)\n", "        return x\n", "\n", "    def create_optimizer(self):\n", "        return optim.Adam(self.parameters(), lr=0.01)\n", "\n", "    def create_criterion(self):\n", "        # reduction=\"none\" keeps one loss value per example\n", "        return nn.CrossEntropyLoss(reduction=\"none\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IonPG-NR1WBN" }, "source": [ "The two additional methods `create_optimizer()` and `create_criterion()` keep the optimizer and the loss function bundled with the model, so they're available wherever the model is needed."
] }, { "cell_type": "code", "metadata": { "id": "Si1cR-r5cnRo", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "aa427e09-c3b6-4194-daf7-8a44dd245c94" }, "source": [ "model = CreditTypeClassifierNet(len(FEATURES), 2)\n", "model" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "CreditTypeClassifierNet(\n", "  (fc1): Linear(in_features=5, out_features=10, bias=True)\n", "  (fc2): Linear(in_features=10, out_features=2, bias=True)\n", ")" ] }, "metadata": { "tags": [] }, "execution_count": 16 } ] }, { "cell_type": "markdown", "metadata": { "id": "AhvazoFHz_oh" }, "source": [ "### Training\n", "\n", "Libraries like Keras give you an easy-to-use training interface for TensorFlow. There are libraries that give you something similar for PyTorch, too! But I am a fan of going raw as much as possible (just make sure you stay safe). If you're not - you can use something like [PyTorch Lightning](https://pytorchlightning.ai/).\n", "\n", "We'll start with a couple of helpers:" ] }, { "cell_type": "code", "metadata": { "id": "F9c1_YnIZRUq" }, "source": [ "class Phase:\n", "    TRAIN = \"train\"\n", "    TEST = \"test\"\n", "\n", "PHASES = [Phase.TRAIN, Phase.TEST]\n", "\n", "class Evaluator:\n", "\n", "    def __init__(self, criterion):\n", "        self.criterion = criterion\n", "\n", "    def eval(self, model, X, y, phase: str):\n", "        # Track gradients only while training\n", "        with torch.set_grad_enabled(phase == Phase.TRAIN):\n", "            model = model.train() if phase == Phase.TRAIN else model.eval()\n", "            outputs = model(X)\n", "            loss = self.criterion(outputs, y)\n", "\n", "            # The predicted class is the index of the largest logit\n", "            _, predictions = outputs.max(dim=1)\n", "            correct_count = torch.sum(predictions == y)\n", "\n", "        return loss, correct_count" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TE_1JSZrD7Za" }, "source": [ "The current phase dictates the behavior of both helpers. 
The job of the `Evaluator` is to calculate the current loss and the number of correct predictions (note that this only works for classification tasks)." ] }, { "cell_type": "code", "metadata": { "id": "t9iV5bB6f7yL" }, "source": [ "from collections import defaultdict\n", "from dataclasses import dataclass\n", "\n", "@dataclass\n", "class Progress:\n", "    losses: list\n", "    # Starts as 0 and becomes a tensor once batch counts are added\n", "    correct_predictions: int\n", "\n", "    def average_loss(self):\n", "        return torch.cat(self.losses, dim=0).mean().item()\n", "\n", "    def accuracy(self, dataset_size):\n", "        return self.correct_predictions.double() / dataset_size\n", "\n", "class ProgressLogger:\n", "\n", "    def __init__(self, dataset_sizes):\n", "        self.progress = defaultdict(\n", "            lambda: {Phase.TRAIN: Progress([], 0), Phase.TEST: Progress([], 0)}\n", "        )\n", "        self.dataset_sizes = dataset_sizes\n", "\n", "    @staticmethod\n", "    def _round(value, precision=3):\n", "        return np.round(value, precision)\n", "\n", "    def save_progress(self, epoch, phase, loss, correct_predictions):\n", "        self.progress[epoch][phase].losses.append(loss.detach())\n", "        self.progress[epoch][phase].correct_predictions += correct_predictions\n", "\n", "    def log(self, epoch):\n", "        print(f\"Epoch {epoch + 1}\")\n", "\n", "        train_progress = self.progress[epoch][Phase.TRAIN]\n", "        train_loss = ProgressLogger._round(train_progress.average_loss())\n", "        train_accuracy = ProgressLogger._round(train_progress.accuracy(self.dataset_sizes[Phase.TRAIN]))\n", "        print(f\"Train: loss {train_loss} accuracy {train_accuracy}\")\n", "\n", "        test_progress = self.progress[epoch][Phase.TEST]\n", "        test_loss = ProgressLogger._round(test_progress.average_loss())\n", "        test_accuracy = ProgressLogger._round(test_progress.accuracy(self.dataset_sizes[Phase.TEST]))\n", "        print(f\"Test: loss {test_loss} accuracy {test_accuracy}\")\n", "        print()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yy8Elpv57EvE" }, "source": [ "The `ProgressLogger` stores the 
losses and correct predictions for each epoch and phase. The `log()` method prints out the evaluation results for the current epoch, for both the train and the test phase." ] }, { "cell_type": "code", "metadata": { "id": "cYLsgRRZCYLH" }, "source": [ "class Trainer:\n", "\n", "    def __init__(self, train_dataset, test_dataset, batch_size):\n", "\n", "        train_data_loader = DataLoader(train_dataset, batch_size=batch_size)\n", "        test_data_loader = DataLoader(test_dataset, batch_size=1)\n", "\n", "        self.data_loaders = {\n", "            Phase.TRAIN: train_data_loader,\n", "            Phase.TEST: test_data_loader\n", "        }\n", "\n", "        self.dataset_sizes = {\n", "            Phase.TRAIN: len(train_dataset),\n", "            Phase.TEST: len(test_dataset)\n", "        }\n", "\n", "        self.logger = ProgressLogger(self.dataset_sizes)\n", "\n", "    def train(self, model, n_epochs):\n", "        optimizer = model.create_optimizer()\n", "        evaluator = Evaluator(model.create_criterion())\n", "\n", "        for epoch in range(n_epochs):\n", "\n", "            for phase in PHASES:\n", "\n", "                for inputs, labels in self.data_loaders[phase]:\n", "\n", "                    loss, correct_count = evaluator.eval(model, inputs, labels, phase)\n", "                    self.logger.save_progress(epoch, phase, loss, correct_count)\n", "\n", "                    if phase == Phase.TRAIN:\n", "\n", "                        optimizer.zero_grad()\n", "                        # Average the per-example losses into a scalar before backprop\n", "                        loss.mean().backward()\n", "\n", "                        optimizer.step()\n", "\n", "            self.logger.log(epoch)\n", "\n", "        return model" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TjQCY-eDMho0" }, "source": [ "The `Trainer` is where all the magic happens. It takes train and test datasets and returns a trained model. Note that we apply no reduction when calculating the loss, so we can store the loss for each example in the dataset. This lets us calculate the average loss (the loss for the epoch) when logging our progress."
] }, { "cell_type": "code", "metadata": { "id": "WulZLFHmECmq", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "92eff243-7b6b-4a34-8458-199ecc2ef1a6" }, "source": [ "train_dataset = CreditTypeDataset(X_train, y_train)\n", "test_dataset = CreditTypeDataset(X_test, y_test)\n", "\n", "trainer = Trainer(train_dataset, test_dataset, batch_size=8)\n", "model = trainer.train(model, n_epochs=10)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Epoch 1\n", "Train: loss 10.909 accuracy 0.576\n", "Test: loss 10.847 accuracy 0.675\n", "\n", "Epoch 2\n", "Train: loss 7.691 accuracy 0.589\n", "Test: loss 10.621 accuracy 0.35\n", "\n", "Epoch 3\n", "Train: loss 6.96 accuracy 0.61\n", "Test: loss 8.438 accuracy 0.365\n", "\n", "Epoch 4\n", "Train: loss 7.173 accuracy 0.578\n", "Test: loss 16.883 accuracy 0.335\n", "\n", "Epoch 5\n", "Train: loss 5.299 accuracy 0.601\n", "Test: loss 7.084 accuracy 0.675\n", "\n", "Epoch 6\n", "Train: loss 5.07 accuracy 0.596\n", "Test: loss 0.937 accuracy 0.63\n", "\n", "Epoch 7\n", "Train: loss 6.687 accuracy 0.59\n", "Test: loss 5.513 accuracy 0.39\n", "\n", "Epoch 8\n", "Train: loss 6.026 accuracy 0.612\n", "Test: loss 4.589 accuracy 0.39\n", "\n", "Epoch 9\n", "Train: loss 4.321 accuracy 0.605\n", "Test: loss 6.343 accuracy 0.385\n", "\n", "Epoch 10\n", "Train: loss 4.126 accuracy 0.621\n", "Test: loss 3.21 accuracy 0.675\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "GOZBSAhhj4do" }, "source": [ "### Evaluation\n", "\n", "How well will your model do in production? To answer this question, you need answers to the following two:\n", "\n", "- What resources (CPU, GPU, RAM and disk space) does my model need to run? What is the expected response time?\n", "- How well will the predicted values match the real ones?\n", "\n", "You can usually answer the first question using a variety of tools from the software development world (like time, top and htop). 
However, you need to take into account the size of the input data. If you load large text or image data into memory, it might exhaust the available RAM and crash your program. Make sure you know the bounds of your data.\n", "\n", "How good are your model's predictions? A wide variety of statistical tests are available to evaluate the performance of different models. And they are very good at what they do. But having large amounts of data changes the game a little bit. You can use simple tools like accuracy, confusion matrix, precision, and recall, and apply appropriate thresholding. Proper evaluation of your model can be done only if you're intimately familiar with the domain.\n", "\n", "One critical step in the process is looking at errors. Where does your model make errors? You should manually go through some errors to get a feel for them. How do you solve those? One simple and effective way to make your model better is to add more data that matches the conditions where the model makes errors." ] }, { "cell_type": "markdown", "metadata": { "id": "kQFeb6yF033N" }, "source": [ "## Deployment\n", "\n", "Deploying your model allows you to get your work to your users. It might be that millions will use it (given you work at a company like Google) or just you. Either way, you'll need to make your model available for others.\n", "\n", "The most common way of deploying your model is behind a REST API. You can also embed it into a user's device (building a mobile app for iOS or Android).\n", "\n", "Most likely, your model will not be the only thing that requires computational resources. How much time does it take to make a prediction? How much RAM/VRAM is needed? How can you avoid blocking the user while your model is making its prediction? We'll answer those questions in the next part(s).\n", "\n", "Next, we'll pack our prototype into a complete Python project. We'll deploy the baseline model and serve its predictions using a REST API."
] }, { "cell_type": "markdown", "metadata": { "id": "lhXMzlIQoSZb" }, "source": [ "## What to do when your model isn't working\n", "\n", "While you can use many technical tricks to improve your model, you should start with something simpler. Ask for help! But not just anyone - ask a domain expert for their input. How do they make predictions for this task? What data do they need for those predictions? How accurate are they?\n", "\n", "Use the insights to build a better model and gather more data. Hopefully, this should give you an acceptable performance before doing hardcore optimization." ] } ] }