{"cells":[{"cell_type":"markdown","id":"ef3f373c-e39a-4b5f-9bb8-a920d161f1b6","metadata":{},"source":["
\n","
\n"]},{"cell_type":"markdown","id":"8dbdecec-cd11-4a57-9df4-82c8a80a78eb","metadata":{},"source":["# **Credit Card Fraud Detection using Scikit-Learn and Snap ML**\n"]},{"cell_type":"markdown","id":"ec071c2d-d1d4-4614-aa2a-189fc7c98f94","metadata":{},"source":["Estimated time needed: **30** minutes\n"]},{"cell_type":"markdown","id":"0b1690b3-6524-4109-bf72-1a4daf11b4c7","metadata":{},"source":["In this exercise session you will consolidate your machine learning (ML) modeling skills by using two popular classification models to recognize fraudulent credit card transactions. These models are: Decision Tree and Support Vector Machine. You will use a real dataset to train each of these models. The dataset includes information about \n","transactions made by credit cards in September 2013 by European cardholders. You will use the trained model to assess if a credit card transaction is legitimate or not.\n","\n","In the current exercise session, you will practice not only the Scikit-Learn Python interface, but also the Python API offered by the Snap Machine Learning (Snap ML) library. Snap ML is a high-performance IBM library for ML modeling. It provides highly-efficient CPU/GPU implementations of linear models and tree-based models. Snap ML not only accelerates ML algorithms through system awareness, but it also offers novel ML algorithms with best-in-class accuracy. 
For more information, please visit the [snapml](https://ibm.biz/BdPfxy) information page.\n"]},{"cell_type":"markdown","id":"6f509f1e-49cf-4411-a4b8-b06a9e61ad6d","metadata":{},"source":["## Objectives\n"]},{"cell_type":"markdown","id":"676d912b-2c5d-47d8-9cf8-2d15db8d8503","metadata":{},"source":["After completing this lab you will be able to:\n"]},{"cell_type":"markdown","id":"c1b4e289-2670-476a-8746-9f4a1db8dadd","metadata":{},"source":["* Perform basic data preprocessing in Python\n","* Model a classification task using the Scikit-Learn and Snap ML Python APIs\n","* Train Support Vector Machine and Decision Tree models using Scikit-Learn and Snap ML\n","* Run inference and assess the quality of the trained models\n"]},{"cell_type":"markdown","id":"27a382a6-b52f-4adb-be30-5754e81775b1","metadata":{},"source":["## Table of Contents\n"]},{"cell_type":"markdown","id":"cf840df9-6d1a-490a-b8c8-e2d413329748","metadata":{},"source":["
\n","
    \n","
  1. Introduction\n","
  2. Import Libraries\n","
  3. Dataset Analysis\n","
  4. Dataset Preprocessing\n","
  5. Dataset Train/Test Split\n","
  6. Build a Decision Tree Classifier model with Scikit-Learn\n","
  7. Build a Decision Tree Classifier model with Snap ML\n","
  8. Evaluate the Scikit-Learn and Snap ML Decision Tree Classifiers\n","
  9. Build a Support Vector Machine model with Scikit-Learn\n","
  10. Build a Support Vector Machine model with Snap ML\n","
  11. Evaluate the Scikit-Learn and Snap ML Support Vector Machine Models\n","
\n","
\n","
\n","
\n"]},{"cell_type":"markdown","id":"d4427c75-08bc-41f9-97d3-7354e7ab6001","metadata":{},"source":["
\n","

Introduction

\n","
Imagine that you work for a financial institution and part of your job is to build a model that predicts if a credit card transaction is fraudulent or not. You can model the problem as a binary classification problem. A transaction belongs to the positive class (1) if it is a fraud, otherwise it belongs to the negative class (0).\n","
\n","
You have access to transactions that occurred over a certain period of time. The majority of the transactions are normally legitimate and only a small fraction are non-legitimate. Thus, typically you have access to a dataset that is highly unbalanced. This is also the case for the current dataset: only 492 transactions out of 284,807 are fraudulent (the positive class - the frauds - accounts for 0.172% of all transactions).\n","
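To see why such an imbalance matters, the short sketch below works through the arithmetic using the counts quoted above. It is an illustration only: the variable names are made up for this example, and the always-legitimate classifier is a hypothetical baseline, not one of the models trained in this lab.

```python
# Counts quoted in the text above
n_total = 284807
n_fraud = 492

# Share of fraudulent transactions, in percent
fraud_share = 100.0 * n_fraud / n_total

# Hypothetical baseline: label every transaction as legitimate (class 0).
# It is right on every legitimate transaction and wrong on every fraud.
baseline_accuracy = 100.0 * (n_total - n_fraud) / n_total
baseline_frauds_caught = 0

print('fraud share: {:.2f}%'.format(fraud_share))
print('baseline accuracy: {:.2f}%'.format(baseline_accuracy))
print('frauds caught by the baseline:', baseline_frauds_caught)
```

A classifier that never flags anything would look very accurate here while catching no fraud at all, which is one reason a ranking metric such as ROC-AUC, used later in this lab, tends to be more informative than plain accuracy on this kind of data.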
\n","
To train the model you can use part of the input dataset and the remaining data can be used to assess the quality of the trained model. First, let's download the dataset.\n","
\n","
\n"]},{"cell_type":"code","execution_count":52,"id":"b63e0611-cc91-414d-97db-c24950151d52","metadata":{},"outputs":[{"name":"stderr","output_type":"stream","text":["83240.57s - pydevd: Sending message related to process being replaced timed-out after 5 seconds\n"]},{"name":"stdout","output_type":"stream","text":["Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple\n","Requirement already satisfied: opendatasets in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (0.1.22)\n","Requirement already satisfied: tqdm in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from opendatasets) (4.66.1)\n","Requirement already satisfied: click in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from opendatasets) (8.1.7)\n","Requirement already satisfied: kaggle in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from opendatasets) (1.5.16)\n","Requirement already satisfied: six>=1.10 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (1.16.0)\n","Requirement already satisfied: certifi in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (2023.7.22)\n","Requirement already satisfied: python-dateutil in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (2.8.2)\n","Requirement already satisfied: requests in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (2.31.0)\n","Requirement already satisfied: python-slugify in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (8.0.1)\n","Requirement already satisfied: urllib3 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from 
kaggle->opendatasets) (2.0.7)\n","Requirement already satisfied: bleach in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from kaggle->opendatasets) (6.1.0)\n","Requirement already satisfied: webencodings in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from bleach->kaggle->opendatasets) (0.5.1)\n","Requirement already satisfied: text-unidecode>=1.3 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from python-slugify->kaggle->opendatasets) (1.3)\n","Requirement already satisfied: idna<4,>=2.5 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from requests->kaggle->opendatasets) (3.4)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from requests->kaggle->opendatasets) (3.3.1)\n","\u001b[33mWARNING: You are using pip version 22.0.4; however, version 23.3 is available.\n","You should consider upgrading via the '/home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/bin/python3.10 -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n","\u001b[0mPlease provide your Kaggle credentials to download this dataset. 
Learn more: http://bit.ly/kaggle-creds\n","Your Kaggle username:Your Kaggle Key:Downloading creditcardfraud.zip to ./creditcardfraud\n"]},{"name":"stderr","output_type":"stream","text":["100%|██████████| 66.0M/66.0M [00:01<00:00, 53.3MB/s]\n"]},{"name":"stdout","output_type":"stream","text":["\n"]}],"source":["# install the opendatasets package\n","!pip install opendatasets\n","\n","import opendatasets as od\n","\n","# download the dataset (this is a Kaggle dataset)\n","# during download you will be prompted for your Kaggle username and API key\n","od.download(\"https://www.kaggle.com/mlg-ulb/creditcardfraud\")"]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[{"data":{"text/plain":["'/home/bbearce/Documents/code-journal/docs/notes/machine_learning/Coursera/JupyterNotebooks'"]},"execution_count":5,"metadata":{},"output_type":"execute_result"}],"source":["import os; os.getcwd()"]},{"cell_type":"markdown","id":"92cf3cd1-1413-4e75-a88e-2b6413de6c58","metadata":{},"source":["__Did you know?__ When it comes to Machine Learning, you will most likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](https://ibm.biz/BdPfxf)\n"]},{"cell_type":"markdown","id":"c44bc0d4-2b4c-4a91-883a-35fe50a7c070","metadata":{},"source":["
\n","

Import Libraries

\n","
\n"]},{"cell_type":"code","execution_count":6,"id":"9e5df4a7-505b-4921-9f45-cad5c0fffd79","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple\n","Collecting snapml\n"," Downloading snapml-1.14.2-cp310-cp310-manylinux_2_28_x86_64.whl (7.3 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.3/7.3 MB\u001b[0m \u001b[31m19.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n","\u001b[?25hRequirement already satisfied: scikit-learn in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from snapml) (1.3.1)\n","Requirement already satisfied: numpy>=1.21.3 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from snapml) (1.26.1)\n","Requirement already satisfied: scipy in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from snapml) (1.11.3)\n","Requirement already satisfied: joblib>=1.1.1 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from scikit-learn->snapml) (1.3.2)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/lib/python3.10/site-packages (from scikit-learn->snapml) (3.2.0)\n","Installing collected packages: snapml\n","Successfully installed snapml-1.14.2\n","\u001b[33mWARNING: You are using pip version 22.0.4; however, version 23.3 is available.\n","You should consider upgrading via the '/home/bbearce/.pyenv/versions/3.10.4/envs/venv3.10.4/bin/python3.10 -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n","\u001b[0m"]}],"source":["# Snap ML is available on PyPI. 
To install it simply run the pip command below.\n","!pip install snapml"]},{"cell_type":"code","execution_count":53,"id":"f0b103cc-6ffd-4bb2-8baf-55d246bfdd66","metadata":{},"outputs":[],"source":["# Import the libraries we need to use in this lab\n","from __future__ import print_function\n","import numpy as np\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","%matplotlib inline\n","from sklearn.model_selection import train_test_split\n","from sklearn.preprocessing import normalize, StandardScaler\n","from sklearn.utils.class_weight import compute_sample_weight\n","from sklearn.metrics import roc_auc_score\n","import time\n","import warnings\n","warnings.filterwarnings('ignore')"]},{"cell_type":"markdown","id":"cfd4973c-1def-4a5b-aeb1-dffd117a515f","metadata":{},"source":["
\n","

Dataset Analysis

\n","
\n"]},{"cell_type":"markdown","id":"3cc91238-9566-4282-a711-a2dc01ef2752","metadata":{},"source":["In this section you will read the dataset in a Pandas dataframe and visualize its content. You will also look at some data statistics. \n","\n","Note: A Pandas dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. For more information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html. \n"]},{"cell_type":"code","execution_count":54,"id":"d81a695f-54e6-469a-8af1-80a682c98003","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["There are 284807 observations in the credit card fraud dataset.\n","There are 31 variables in the dataset.\n"]},{"data":{"text/html":["
"],"text/plain":[" Time V1 V2 V3 V4 V5 V6 V7 \\\n","0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 \n","2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 \n","3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 \n","4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 \n","\n"," V8 V9 ... V21 V22 V23 V24 V25 \\\n","0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 \n","2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 \n","3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 \n","4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 \n","\n"," V26 V27 V28 Amount Class \n","0 -0.189115 0.133558 -0.021053 149.62 0 \n","1 0.125895 -0.008983 0.014724 2.69 0 \n","2 -0.139097 -0.055353 -0.059752 378.66 0 \n","3 -0.221929 0.062723 0.061458 123.50 0 \n","4 0.502292 0.219422 0.215153 69.99 0 \n","\n","[5 rows x 31 columns]"]},"execution_count":54,"metadata":{},"output_type":"execute_result"}],"source":["# read the input data\n","raw_data = pd.read_csv('creditcardfraud/creditcard.csv')\n","print(\"There are \" + str(len(raw_data)) + \" observations in the credit card fraud dataset.\")\n","print(\"There are \" + str(len(raw_data.columns)) + \" variables in the dataset.\")\n","\n","# display the first rows in the dataset\n","raw_data.head()"]},{"cell_type":"code","execution_count":null,"id":"269cfea0-24a8-49b7-8932-656bd4b95547","metadata":{},"outputs":[],"source":["#Uncomment the following lines if you are unable to download the dataset using the Kaggle website.\n","\n","#url= 
\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/creditcard.csv\"\n","#raw_data=pd.read_csv(url)\n","#print(\"There are \" + str(len(raw_data)) + \" observations in the credit card fraud dataset.\")\n","#print(\"There are \" + str(len(raw_data.columns)) + \" variables in the dataset.\")\n","#raw_data.head()"]},{"cell_type":"markdown","id":"0e7a1db2-d2a1-4948-b47a-ef55ac8c249c","metadata":{},"source":["In practice, a financial institution may have access to a much larger dataset of transactions. To simulate such a case, we will inflate the original one 10 times.\n"]},{"cell_type":"code","execution_count":55,"id":"b856cd22-57e4-4774-9bb9-ff9d2e1ae5ec","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["There are 2848070 observations in the inflated credit card fraud dataset.\n","There are 31 variables in the dataset.\n"]},{"data":{"text/html":["
"],"text/plain":[" Time V1 V2 V3 V4 V5 V6 V7 \\\n","0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","1 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","2 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","3 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","4 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n","\n"," V8 V9 ... V21 V22 V23 V24 V25 \\\n","0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","1 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","2 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","3 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","4 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n","\n"," V26 V27 V28 Amount Class \n","0 -0.189115 0.133558 -0.021053 149.62 0.0 \n","1 -0.189115 0.133558 -0.021053 149.62 0.0 \n","2 -0.189115 0.133558 -0.021053 149.62 0.0 \n","3 -0.189115 0.133558 -0.021053 149.62 0.0 \n","4 -0.189115 0.133558 -0.021053 149.62 0.0 \n","\n","[5 rows x 31 columns]"]},"execution_count":55,"metadata":{},"output_type":"execute_result"}],"source":["n_replicas = 10\n","\n","# inflate the original dataset\n","big_raw_data = pd.DataFrame(np.repeat(raw_data.values, n_replicas, axis=0), columns=raw_data.columns)\n","\n","print(\"There are \" + str(len(big_raw_data)) + \" observations in the inflated credit card fraud dataset.\")\n","print(\"There are \" + str(len(big_raw_data.columns)) + \" variables in the dataset.\")\n","\n","# display first rows in the new dataset\n","big_raw_data.head()"]},{"cell_type":"markdown","id":"02909172-1b5b-4118-83d1-79df94fb32b3","metadata":{},"source":["Each row in the dataset represents a credit card transaction. As shown above, each row has 31 variables. One variable (the last variable in the table above) is called Class and represents the target variable. 
Your objective will be to train a model that uses the other variables to predict the value of the Class variable. Let's first retrieve basic statistics about the target variable.\n","\n","Note: For confidentiality reasons, the original names of most features are anonymized as V1, V2, ..., V28. The values of these features are the result of a PCA transformation and are numerical. The feature 'Class' is the target variable and it takes two values: 1 in case of fraud and 0 otherwise. For more information about the dataset please visit this webpage: https://www.kaggle.com/mlg-ulb/creditcardfraud.\n"]},{"cell_type":"code","execution_count":56,"id":"b6eea1c2-9a0f-488a-b06f-2f19d1c11c51","metadata":{},"outputs":[{"data":{"image/png":"","text/plain":["
"]},"metadata":{},"output_type":"display_data"}],"source":["# get the set of distinct classes\n","labels = big_raw_data.Class.unique()\n","\n","# get the count of each class\n","sizes = big_raw_data.Class.value_counts().values\n","\n","# plot the class value counts\n","fig, ax = plt.subplots()\n","ax.pie(sizes, labels=labels, autopct='%1.3f%%')\n","ax.set_title('Target Variable Value Counts')\n","plt.show()"]},{"cell_type":"markdown","id":"0345ac2b-7b3e-4379-be48-5db5df114308","metadata":{},"source":["As shown above, the Class variable has two values: 0 (the credit card transaction is legitimate) and 1 (the credit card transaction is fraudulent). Thus, you need to model a binary classification problem. Moreover, the dataset is highly unbalanced: the target variable classes are not represented equally. This case requires special attention both when training a model and when evaluating its quality. One way of handling this case at train time is to bias the model to pay more attention to the samples in the minority class. The models in this lab will be configured to take the class weights of the samples into account at train/fit time.\n"]},{"cell_type":"markdown","id":"76a77289-16aa-4861-a36e-0d5df963bb35","metadata":{},"source":["### Practice\n"]},{"cell_type":"markdown","id":"7a580e4a-13b9-4033-944a-022f5871dc3c","metadata":{},"source":["The credit card transactions have different amounts. Could you plot a histogram that shows the distribution of these amounts? What is the range of these amounts (min/max)? Could you print the 90th percentile of the amount values?\n"]},{"cell_type":"code","execution_count":29,"id":"e8f64dfb-12a5-423b-be7f-4d1c018b12a5","metadata":{},"outputs":[{"data":{"text/plain":[""]},"execution_count":29,"metadata":{},"output_type":"execute_result"},{"data":{"image/png":"","text/plain":["
"]},"metadata":{},"output_type":"display_data"}],"source":["# your code here\n","big_raw_data['Amount'].hist(bins=75)\n","\n"]},{"cell_type":"code","execution_count":37,"metadata":{},"outputs":[{"data":{"text/plain":["count 2.848070e+06\n","mean 8.834962e+01\n","std 2.501197e+02\n","min 0.000000e+00\n","25% 5.600000e+00\n","50% 2.200000e+01\n","75% 7.717000e+01\n","max 2.569116e+04\n","Name: Amount, dtype: float64"]},"execution_count":37,"metadata":{},"output_type":"execute_result"}],"source":["big_raw_data['Amount'].describe()"]},{"cell_type":"code","execution_count":43,"metadata":{},"outputs":[{"data":{"text/plain":["203.0"]},"execution_count":43,"metadata":{},"output_type":"execute_result"}],"source":["big_raw_data['Amount'].quantile(0.90)"]},{"cell_type":"code","execution_count":69,"id":"915a7737-d86d-4f47-99bb-6ed85aa316d1","metadata":{},"outputs":[{"data":{"image/png":"","text/plain":["
"]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Minimum amount value is -0.3532293929665258\n","Maximum amount value is 102.36224270927762\n","90% of the transactions have an amount less or equal than 203.0\n"]}],"source":["# we provide our solution here\n","plt.hist(big_raw_data.Amount.values, 6, histtype='bar', facecolor='g')\n","plt.show()\n","\n","print(\"Minimum amount value is \", np.min(big_raw_data.Amount.values))\n","print(\"Maximum amount value is \", np.max(big_raw_data.Amount.values))\n","print(\"90% of the transactions have an amount less or equal than \", np.percentile(raw_data.Amount.values, 90))"]},{"cell_type":"markdown","id":"bc974abe-8deb-4921-a72d-dfdcf6317c7e","metadata":{},"source":["
\n","
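The class-weighting idea from the Dataset Analysis section can be sketched in plain Python. The helper below, `balanced_sample_weights`, is a hypothetical name written for this illustration; it applies the formula n_samples / (n_classes * class_count) that Scikit-Learn documents for its 'balanced' mode, so that each class ends up carrying the same total weight.

```python
from collections import Counter

def balanced_sample_weights(labels):
    # Illustrative helper (not a library function): one weight per sample,
    # computed as n_samples / (n_classes * size of the class of the sample).
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]

# Toy labels: 8 legitimate transactions (0) and 2 frauds (1)
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_sample_weights(toy_labels)

# Total weight carried by each class
total_legit = sum(w for w, label in zip(weights, toy_labels) if label == 0)
total_fraud = sum(w for w, label in zip(weights, toy_labels) if label == 1)
print(weights[0], weights[-1], total_legit, total_fraud)
```

In the lab itself, weights of this kind are obtained with `compute_sample_weight('balanced', y_train)` and handed to the `fit` call through its `sample_weight` argument.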

Dataset Preprocessing

\n","
\n"]},{"cell_type":"markdown","id":"3ab66dad-f870-45bf-aecf-a45f2231d684","metadata":{},"source":["In this subsection you will prepare the data for training. \n"]},{"cell_type":"code","execution_count":61,"id":"fafe2ed1-84a3-4593-a278-9bf2d259fa4b","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["X.shape= (2848070, 29) y.shape= (2848070,)\n"]}],"source":["# data preprocessing such as scaling/normalization is typically useful for \n","# linear models to accelerate the training convergence\n","\n","# standardize features by removing the mean and scaling to unit variance\n","big_raw_data.iloc[:, 1:30] = StandardScaler().fit_transform(big_raw_data.iloc[:, 1:30])\n","\n","# data_matrix must be defined before it is sliced below\n","data_matrix = big_raw_data.values\n","\n","# X: feature matrix (for this analysis, we exclude the Time variable from the dataset)\n","X = data_matrix[:, 1:30]\n","\n","# y: labels vector\n","y = data_matrix[:, 30]\n","\n","# data normalization: scale each sample (row) to unit L1 norm\n","X = normalize(X, norm=\"l1\")\n","\n","# print the shape of the features matrix and the labels vector\n","print('X.shape=', X.shape, 'y.shape=', y.shape)"]},{"cell_type":"markdown","id":"578fedad-8101-4bd1-9427-12a075ee4256","metadata":{},"source":["
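The two preprocessing operations named in the cell above act along different axes: standardization rescales each feature (column) to zero mean and unit variance, while L1 normalization rescales each sample (row) so that its absolute values sum to 1. The plain-Python sketch below illustrates both on a tiny made-up matrix; the function names are hypothetical and only mimic what `StandardScaler` and `normalize(X, norm='l1')` compute.

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    # Per column: subtract the mean, divide by the population standard deviation
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def l1_normalize_rows(rows):
    # Per row: divide every value by the sum of absolute values in that row
    result = []
    for row in rows:
        norm = sum(abs(v) for v in row)
        result.append([v / norm for v in row])
    return result

# Made-up data: 3 samples, 2 features on very different scales
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_columns(toy)
normalized = l1_normalize_rows(scaled)

for row in normalized:
    print(row, sum(abs(v) for v in row))
```

The sketch skips edge cases such as constant columns or all-zero rows, which the library functions handle and this illustration does not.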
\n","

Dataset Train/Test Split

\n","
\n"]},{"cell_type":"markdown","id":"cf03ffcf-efc6-42ad-87e6-190c00c1d657","metadata":{},"source":["Now that the dataset is ready for building the classification models, you need to first divide the pre-processed dataset into a subset to be used for training the model (the train set) and a subset to be used for evaluating the quality of the model (the test set).\n"]},{"cell_type":"code","execution_count":64,"id":"ebf3c04a-5eee-4000-ab41-8628a18cbf4f","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["X_train.shape= (1993649, 29) Y_train.shape= (1993649,)\n","X_test.shape= (854421, 29) Y_test.shape= (854421,)\n"]}],"source":["X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) \n","print('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)\n","print('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)"]},{"cell_type":"markdown","id":"d21e2582-33b9-46c9-9b3a-7fbba2c98750","metadata":{},"source":["
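The `stratify=y` argument used above asks `train_test_split` to keep the class proportions the same in both subsets, which matters when one class is as rare as fraud is here. The sketch below shows the idea in plain Python on synthetic labels; `stratified_split` is a hypothetical helper written for this illustration and is not how Scikit-Learn implements the split internally.

```python
import random

def stratified_split(labels, test_size, seed):
    # Illustrative helper: split the indices of each class separately so that
    # every class keeps roughly the same share in the train and test subsets.
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(test_size * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic labels: 990 legitimate (0) and 10 fraudulent (1)
toy_labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels, test_size=0.3, seed=42)

frauds_in_train = sum(toy_labels[i] for i in train_idx)
frauds_in_test = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), frauds_in_train, frauds_in_test)
```

Without stratification, a purely random 30% sample of these toy labels could end up with very few or even no frauds in the test set, which would make the evaluation unreliable.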
\n","

Build a Decision Tree Classifier model with Scikit-Learn

\n","
\n"]},{"cell_type":"code","execution_count":65,"id":"f47fb031-dd33-40f8-93a4-198d9e12e095","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["[Scikit-Learn] Training time (s): 33.24686\n"]}],"source":["# compute the sample weights to be used as input to the train routine so that \n","# it takes into account the class imbalance present in this dataset\n","w_train = compute_sample_weight('balanced', y_train)\n","\n","# import the Decision Tree Classifier Model from scikit-learn\n","from sklearn.tree import DecisionTreeClassifier\n","\n","# for reproducible output across multiple function calls, set random_state to a given integer value\n","sklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)\n","\n","# train a Decision Tree Classifier using scikit-learn\n","t0 = time.time()\n","sklearn_dt.fit(X_train, y_train, sample_weight=w_train)\n","sklearn_time = time.time()-t0\n","print(\"[Scikit-Learn] Training time (s): {0:.5f}\".format(sklearn_time))"]},{"cell_type":"markdown","id":"bfaf66ee-a51c-43a2-8c33-5e38fa16e6c1","metadata":{},"source":["
\n","

Build a Decision Tree Classifier model with Snap ML

\n","
\n"]},{"cell_type":"code","execution_count":66,"id":"52f544f8-fdf0-4414-b5b1-551298a47235","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["[Snap ML] Training time (s): 2.49522\n"]}],"source":["# if not already computed, \n","# compute the sample weights to be used as input to the train routine so that \n","# it takes into account the class imbalance present in this dataset\n","# w_train = compute_sample_weight('balanced', y_train)\n","\n","# import the Decision Tree Classifier Model from Snap ML\n","from snapml import DecisionTreeClassifier\n","\n","# Snap ML offers multi-threaded CPU/GPU training of decision trees, unlike scikit-learn\n","# to use the GPU, set the use_gpu parameter to True\n","# snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, use_gpu=True)\n","\n","# to set the number of CPU threads used at training time, set the n_jobs parameter\n","# for reproducible output across multiple function calls, set random_state to a given integer value\n","snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)\n","\n","# train a Decision Tree Classifier model using Snap ML\n","t0 = time.time()\n","snapml_dt.fit(X_train, y_train, sample_weight=w_train)\n","snapml_time = time.time()-t0\n","print(\"[Snap ML] Training time (s): {0:.5f}\".format(snapml_time))"]},{"cell_type":"markdown","id":"863dac88-0c61-4ed5-b145-0b886cdaf59d","metadata":{},"source":["
\n","

Evaluate the Scikit-Learn and Snap ML Decision Tree Classifier Models

\n","
\n"]},{"cell_type":"code","execution_count":67,"id":"854e529e-6fd6-4643-bf5d-638328aa9c66","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : 13.32x \n","[Scikit-Learn] ROC-AUC score : 0.966\n","[Snap ML] ROC-AUC score : 0.966\n"]}],"source":["# Snap ML vs Scikit-Learn training speedup\n","training_speedup = sklearn_time/snapml_time\n","print('[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : {0:.2f}x '.format(training_speedup))\n","\n","# run inference and compute the probabilities of the test samples \n","# to belong to the class of fraudulent transactions\n","sklearn_pred = sklearn_dt.predict_proba(X_test)[:,1]\n","\n","# compute the Area Under the Receiver Operating Characteristic \n","# Curve (ROC-AUC) score from the predicted probabilities\n","sklearn_roc_auc = roc_auc_score(y_test, sklearn_pred)\n","print('[Scikit-Learn] ROC-AUC score : {0:.3f}'.format(sklearn_roc_auc))\n","\n","# run inference and compute the probabilities of the test samples\n","# to belong to the class of fraudulent transactions\n","snapml_pred = snapml_dt.predict_proba(X_test)[:,1]\n","\n","# compute the Area Under the Receiver Operating Characteristic\n","# Curve (ROC-AUC) score from the predicted probabilities\n","snapml_roc_auc = roc_auc_score(y_test, snapml_pred) \n","print('[Snap ML] ROC-AUC score : {0:.3f}'.format(snapml_roc_auc))"]},{"cell_type":"markdown","id":"bdfd6017-e6b3-4ae5-8622-d6042a00c703","metadata":{},"source":["As shown above, both decision tree models achieve the same ROC-AUC score (0.966) on the test dataset. However, in the run recorded above, Snap ML completed the training routine about 13x faster than Scikit-Learn; the exact speedup depends on the hardware used. This is one of the advantages of using Snap ML: acceleration of training of classical machine learning models, such as linear and tree-based models. 
For more Snap ML examples, please visit [snapml-examples](https://ibm.biz/BdPfxP).\n"]},{"cell_type":"markdown","id":"93ba40ba-1bb0-4fee-be5d-30393685744e","metadata":{},"source":["
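The ROC-AUC score reported above can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs; `pairwise_roc_auc` is a hypothetical helper for illustration only, and `roc_auc_score` remains the function to use on real data, since the pairwise loop is far too slow for millions of samples.

```python
def pairwise_roc_auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 2 legitimate (0) and 2 fraudulent (1) transactions
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ordered correctly
print(pairwise_roc_auc(toy_y, toy_scores))
```

A value of 0.5 corresponds to a ranking no better than chance, and 1.0 to a ranking that places every fraud above every legitimate transaction.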
\n","

Build a Support Vector Machine model with Scikit-Learn

\n","
\n"]},{"cell_type":"code","execution_count":null,"id":"0aaa921a-6e17-4aa9-b501-d8c4eac4d746","metadata":{},"outputs":[],"source":["# import the linear Support Vector Machine (SVM) model from Scikit-Learn\n","from sklearn.svm import LinearSVC\n","\n","# instantiate a scikit-learn SVM model\n","# to indicate the class imbalance at fit time, set class_weight='balanced'\n","# for reproducible output across multiple function calls, set random_state to a given integer value\n","sklearn_svm = LinearSVC(class_weight='balanced', random_state=31, loss=\"hinge\", fit_intercept=False)\n","\n","# train a linear Support Vector Machine model using Scikit-Learn\n","t0 = time.time()\n","sklearn_svm.fit(X_train, y_train)\n","sklearn_time = time.time() - t0\n","print(\"[Scikit-Learn] Training time (s): {0:.2f}\".format(sklearn_time))"]},{"cell_type":"markdown","id":"f2eb1ece-234e-4490-9d71-10c534b9eec0","metadata":{},"source":["
\n","

Build a Support Vector Machine model with Snap ML

\n","
\n"]},{"cell_type":"code","execution_count":null,"id":"2078a824-1839-4c36-9a0d-2c19c39a01d4","metadata":{},"outputs":[],"source":["# import the Support Vector Machine model (SVM) from Snap ML\n","from snapml import SupportVectorMachine\n","\n","# in contrast to scikit-learn's LinearSVC, Snap ML offers multi-threaded CPU/GPU training of SVMs\n","# to use the GPU, set the use_gpu parameter to True\n","# snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, use_gpu=True, fit_intercept=False)\n","\n","# to set the number of threads used at training time, one needs to set the n_jobs parameter\n","snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, n_jobs=4, fit_intercept=False)\n","# print(snapml_svm.get_params())\n","\n","# train an SVM model using Snap ML\n","t0 = time.time()\n","model = snapml_svm.fit(X_train, y_train)\n","snapml_time = time.time() - t0\n","print(\"[Snap ML] Training time (s): {0:.2f}\".format(snapml_time))"]},{"cell_type":"markdown","id":"fd2825c3-05d2-4177-956d-7c2723c13449","metadata":{},"source":["
\n","

Evaluate the Scikit-Learn and Snap ML Support Vector Machine Models

\n","
\n"]},{"cell_type":"code","execution_count":null,"id":"13d141f3-fc01-4fff-9225-7ee5b8cdb13d","metadata":{},"outputs":[],"source":["# compute the Snap ML vs Scikit-Learn training speedup\n","training_speedup = sklearn_time/snapml_time\n","print('[Support Vector Machine] Snap ML vs. Scikit-Learn training speedup : {0:.2f}x '.format(training_speedup))\n","\n","# run inference using the Scikit-Learn model\n","# get the confidence scores for the test samples\n","sklearn_pred = sklearn_svm.decision_function(X_test)\n","\n","# compute the ROC-AUC score on the test set\n","acc_sklearn = roc_auc_score(y_test, sklearn_pred)\n","print(\"[Scikit-Learn] ROC-AUC score: {0:.3f}\".format(acc_sklearn))\n","\n","# run inference using the Snap ML model\n","# get the confidence scores for the test samples\n","snapml_pred = snapml_svm.decision_function(X_test)\n","\n","# compute the ROC-AUC score on the test set\n","acc_snapml = roc_auc_score(y_test, snapml_pred)\n","print(\"[Snap ML] ROC-AUC score: {0:.3f}\".format(acc_snapml))"]},{"cell_type":"markdown","id":"d9c9c383-8488-4c4e-9ec9-2c87b60332a4","metadata":{},"source":["As shown above, both SVM models achieve the same ROC-AUC score on the test dataset. However, as in the case of decision trees, Snap ML runs the training routine faster than Scikit-Learn. Moreover, not only does Snap ML seamlessly accelerate scikit-learn applications, but the library's Python API is also compatible with scikit-learn metrics and data preprocessors. For more Snap ML examples, please visit [snapml-examples](https://ibm.biz/BdPfxP).\n"]},{"cell_type":"markdown","id":"bb0ef68e-9160-4472-8072-aace3e649ecf","metadata":{},"source":["### Practice\n"]},{"cell_type":"markdown","id":"6ec6d773-355a-45df-834a-497678a83797","metadata":{},"source":["In this section you will evaluate the quality of the SVM models trained above using the hinge loss metric (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html). 
Run inference on the test set using both Scikit-Learn and Snap ML models. Compute the hinge loss metric for both sets of predictions. Print the hinge losses of Scikit-Learn and Snap ML.\n"]},{"cell_type":"code","execution_count":null,"id":"2b4cdadb-0dc7-46df-a96a-c03a70023704","metadata":{},"outputs":[],"source":["# your code goes here"]},{"cell_type":"code","execution_count":null,"id":"255c8bf9-979b-46dd-8ab1-80c6026e26c0","metadata":{},"outputs":[],"source":["# get the confidence scores for the test samples\n","sklearn_pred = sklearn_svm.decision_function(X_test)\n","snapml_pred = snapml_svm.decision_function(X_test)\n","\n","# import the hinge_loss metric from scikit-learn\n","from sklearn.metrics import hinge_loss\n","\n","# evaluate the hinge loss metric from the predictions\n","loss_sklearn = hinge_loss(y_test, sklearn_pred)\n","print(\"[Scikit-Learn] Hinge loss: {0:.3f}\".format(loss_sklearn))\n","\n","# evaluate the hinge loss from the predictions\n","loss_snapml = hinge_loss(y_test, snapml_pred)\n","print(\"[Snap ML] Hinge loss: {0:.3f}\".format(loss_snapml))\n","\n","\n","# the two models should give the same Hinge loss"]},{"cell_type":"markdown","id":"ac1c96f3-aa3e-4994-badd-135ba2f262ea","metadata":{},"source":["## Authors\n"]},{"cell_type":"markdown","id":"48d1f5f9-a1f1-4237-99b2-ec5365924ce8","metadata":{},"source":["Andreea Anghel\n"]},{"cell_type":"markdown","id":"ee595903-8c72-458b-83ac-4f97d7f0d474","metadata":{},"source":["### Other Contributors\n"]},{"cell_type":"markdown","id":"d04171cd-fad2-4bf4-a519-2e6821d5a5e3","metadata":{},"source":["Joseph 
Santarcangelo\n"]},{"cell_type":"markdown","id":"7f3815b6-3b71-4310-aac3-28cd8277a85f","metadata":{},"source":["## Change Log\n"]},{"cell_type":"markdown","id":"f64915d1-9b98-4de7-840e-b9018b59ee1d","metadata":{},"source":["| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2021-08-31 | 0.1 | AAN | Created Lab Content |\n"]},{"cell_type":"markdown","id":"54920996-2b56-4130-b1d4-14262bc459e0","metadata":{},"source":[" Copyright © 2021 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).\n"]}],"metadata":{"kernelspec":{"display_name":"venv3.10.4","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.4"}},"nbformat":4,"nbformat_minor":4}