{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "02.building-a-prototype.ipynb", "provenance": [], "collapsed_sections": [], "authorship_tag": "ABX9TyMFVaIGYTI7PcTbwKmLwaLW" }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "o1NCK5Omuu9c" }, "source": [ "# Building an End-to-End Machine Learning Prototype\n", "\n", "> TL;DR Build a complete Machine Learning project skeleton that serves as a baseline for future improvements\n", "\n", "* [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/16AOmMyAInAcLRo0_wuAST9-HpAixb-XO?usp=sharing)\n", "\n", "Every Machine Learning project follows a similar set of steps to (hopefully) deliver the goods. Many attempts will fail, but the successful ones go through the following steps, iteratively:\n", "\n", "> The lifecycle of a Machine Learning project\n", "\n", "- Planning/choosing a goal\n", "- Data collection & labeling\n", "- Creating features and preprocessing\n", "- Training and optimization\n", "- Deployment and testing\n", "\n", "## Planning\n", "\n", "Choosing what to work on and how to measure success is the most important part of the project.\n", "\n", "Get help here! Ask domain experts and business people, and meditate on it. Do spend some time thinking about it, but don't linger for too long. Analysis paralysis is a very common phenomenon in the real world!\n", "\n", "### Prototyping a baseline model\n", "\n", "Once you've chosen your goal, you're ready to get your hands dirty.\n", "\n", "Jupyter notebooks are a great prototyping/experimentation tool. You can use one or more notebooks to get a quick idea of the feasibility and performance of a model.\n", "\n", "After you get some results, you'll proceed to create a full-blown project containing the baseline model. 
This will be a lot of work, but some rewards you might expect are bug fixes, new ideas, and developing something that (hopefully) has a real-world impact.\n", "\n", "Next, we'll go over an example task of automating the decision of whether a bank customer has good or bad credit risk.\n", "\n", "## Data collection\n", "\n", "If you haven't solved any real-world ML problems yet, you might believe that most datasets get stored as CSV files. While this format is great when learning, you'll often need to work with SQL databases, pickle, HDF5, Parquet, and many other formats.\n", "\n", "### Labeling\n", "\n", "Sometimes your data will have labels, but it might not be exactly the data you need. Other times, the labels will be missing altogether. What can you do?\n", "\n", "Creating a labeling infrastructure will be time well spent, for sure. Increasing the data size and reducing the noise in your data (having fewer incorrect examples) can dramatically increase the predictive power of your model(s).\n", "\n", "You will always want more and cleaner data. So keep getting it, slowly but surely. At some point, you'll notice you start getting diminishing returns. At that point, depending on the problem you're solving, it might be a better idea to focus on other issues in the project.\n", "\n", "#### Looking at the data\n", "\n", "This might sound boring and like complete nonsense, but go through different examples from the data. Can you figure out the labels for each one? Are the labels correct? Are the labels consistent?\n", "\n", "Remember, feeding your model with crappy data will give you garbage results. At this stage of the project, you don't want to deal with crap - there will be plenty later on.\n", "\n", "### Get the data\n", "\n", "We'll build a complete prototype of a system that predicts the credit risk for a given bank customer. 
[The data](https://www.openml.org/d/31) describes the clients using a set of attributes of mixed types (similar to a real-world scenario).\n", "\n", "We'll use the dataset download utility from scikit-learn:" ] }, { "cell_type": "code", "metadata": { "id": "zLWr26p1hL3N", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "92887666-76db-431e-dea9-4532d3c607b1" }, "source": [ "from sklearn import datasets\n", "\n", "features, targets = datasets.fetch_openml(\n", "    name=\"credit-g\",\n", "    version=1,\n", "    return_X_y=True,\n", "    as_frame=True\n", ")\n", "\n", "features.shape, targets.shape" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((1000, 20), (1000,))" ] }, "metadata": { "tags": [] }, "execution_count": 1 } ] }, { "cell_type": "markdown", "metadata": { "id": "Effp7EARN0gp" }, "source": [ "1,000 examples are really not that much. Hopefully, you'll have a lot more data when solving real-world problems. Let's have a look:" ] }, { "cell_type": "code", "metadata": { "id": "g0Gh4sIYjIYp", "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "outputId": "01ad3a34-2070-4d3a-f6e6-19d11893ee1f" }, "source": [ "features.head()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes |\n",
"| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes |\n",
"| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes |\n",
"| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes |\n",
"| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | 4.0 | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes |\n", "