GitHub Repository: DanielBarnes18/IBM-Data-Science-Professional-Certificate
Path: blob/main/09. Machine Learning with Python/Final Project/Machine Learning with Python - The Best Classifier.ipynb
Classification with Python

In this notebook we practice the classification algorithms covered in this course.

We load a dataset with the pandas library, apply each of the algorithms listed further down, and use accuracy evaluation metrics to find the best one for this specific dataset.

Let's first load required libraries:

import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

About dataset

This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loans have already been paid off or have defaulted. It includes the following fields:

Loan_status: Whether a loan is paid off or in collection
Principal: Basic principal loan amount at the
Terms: Origination terms, which can be a weekly (7 days), biweekly, or monthly payoff schedule
Effective_date: When the loan was originated and took effect
Due_date: Since it is a one-time payoff schedule, each loan has a single due date
Age: Age of applicant
Education: Education of applicant
Gender: Gender of applicant

Load Data From CSV File

df = pd.read_csv('')
df.head()
(346, 10)

Convert to date time object

df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

Data visualization and pre-processing

Let's see how many of each class are in our data set

PAIDOFF       260
COLLECTION     86
Name: loan_status, dtype: int64

260 people have paid off the loan on time, while 86 have gone into collection.
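The cell that produced these counts is not shown. A minimal sketch of how such a tally is usually obtained with pandas follows; the `loan_status` Series here is a toy stand-in built from the two counts above, not the real CSV.

```python
import pandas as pd

# Toy stand-in for df['loan_status'] (assumption: rebuilt from the counts shown above)
loan_status = pd.Series(["PAIDOFF"] * 260 + ["COLLECTION"] * 86, name="loan_status")

counts = loan_status.value_counts()                # absolute count per class
shares = loan_status.value_counts(normalize=True)  # proportion per class

print(counts)
print(shares.round(3))
```

Roughly three quarters of the loans are PAIDOFF, so a model that always predicts PAIDOFF would already score about 75% accuracy. That is a useful baseline to keep in mind when reading the accuracies later on.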

Let's plot some columns to understand the data better:

import seaborn as sns

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
[Figure: histograms of Principal by Gender, colored by loan_status]

bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
[Figure: histograms of age by Gender, colored by loan_status]

Pre-processing: Feature selection/extraction

Let's look at the day of the week people get the loan

df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
[Figure: histograms of dayofweek by Gender, colored by loan_status]

We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization with a threshold: loans originated after day 3 (dayofweek > 3, i.e. Friday through Sunday, since Monday is 0) are flagged as weekend.

df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
df.head()
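To make the threshold concrete, here is a small sketch on four hand-picked dates (the dates and the `toy` frame are illustrative assumptions, not rows from the loan file). It uses a vectorized comparison, which gives the same result as the `apply` with a lambda.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Friday, a Sunday and a Monday
dates = pd.to_datetime(pd.Series(["2016-09-08", "2016-09-09", "2016-09-11", "2016-09-12"]))
toy = pd.DataFrame({"effective_date": dates})

# pandas numbers weekdays Monday=0 ... Sunday=6
toy["dayofweek"] = toy["effective_date"].dt.dayofweek

# Binarize: 1 for Friday, Saturday, Sunday (dayofweek > 3), else 0
toy["weekend"] = (toy["dayofweek"] > 3).astype(int)

print(toy)
```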

Convert Categorical features to numerical values

Let's look at gender:

Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64

About 87% of females pay off their loans, while only 73% of males do.
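The cell behind the gender table is not shown. A sketch of the usual way to get per-group proportions follows; the toy frame is an assumption, rebuilt from counts that reproduce the proportions above (45 of 52 females and 215 of 294 males paid off), not the real data file.

```python
import pandas as pd

# Toy frame rebuilt from counts consistent with the proportions above (assumption)
rows = (
    [("female", "PAIDOFF")] * 45 + [("female", "COLLECTION")] * 7
    + [("male", "PAIDOFF")] * 215 + [("male", "COLLECTION")] * 79
)
toy = pd.DataFrame(rows, columns=["Gender", "loan_status"])

# Share of each loan_status within each gender
props = toy.groupby("Gender")["loan_status"].value_counts(normalize=True)
print(props)
```

Note that the female group is small (52 rows in this reconstruction), so the gap between the two groups carries more uncertainty than the raw percentages suggest.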

Let's convert male to 0 and female to 1:

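# Assign the result back instead of calling inplace=True on a column selection,
# which may not modify df in recent pandas versions
df['Gender'] = df['Gender'].replace(to_replace=['male','female'], value=[0,1])
df.head()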

One Hot Encoding

How about education?

education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64

Features before One Hot Encoding


Use the one-hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame

Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
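As a small illustration of what `pd.get_dummies` produces for the education column, the sketch below encodes one row per education level (the `toy` frame is an assumption; the level names, including the dataset's own spelling "Bechalor", are taken from the table above). `dtype=int` is passed so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# One hypothetical row per education level seen in the data
toy = pd.DataFrame(
    {"education": ["college", "Bechalor", "High School or Below", "Master or Above"]}
)

# One indicator column per level, sorted by level name
dummies = pd.get_dummies(toy["education"], dtype=int)

# Drop one level; rows with that level become all zeros in the remaining columns
dummies = dummies.drop(columns=["Master or Above"])
print(dummies)
```

Dropping one level does not lose information: a "Master or Above" row is still identifiable as the row where all remaining indicators are 0.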

Feature Selection

Let's define feature sets, X:

X = Feature
X[0:5]

What are our labels?

y = df['loan_status'].values
y[0:5]
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

Normalize Data

Data standardization gives the data zero mean and unit variance (strictly, the scaler should be fit after the train/test split, on the training portion only)

X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[ 0.51578458,  0.92071769,  2.33152555, -0.42056004, -1.20577805, -0.38170062,  1.13639374, -0.86968108],
       [ 0.51578458,  0.92071769,  0.34170148,  2.37778177, -1.20577805,  2.61985426, -0.87997669, -0.86968108],
       [ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805, -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.48739188,  2.37778177,  0.82934003, -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.3215732 , -0.42056004,  0.82934003, -0.38170062, -0.87997669,  1.14984679]])
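The remark about standardizing after the split can be sketched as follows. This is an illustration on synthetic data, not the notebook's own procedure: `X_raw`, `y_toy` and the split variables are made-up names, and the scaler is fit on the training rows only, then applied unchanged to the held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (assumption)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, so the held-out rows play no part in estimating mean and variance
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_toy, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # same shift and scale applied to held-out rows

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

The same fitted `scaler` object could later be reused on any separate evaluation file, so that all sets share one scale.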


Now it is your turn: use the training set to build an accurate model, then use the test set to report the accuracy of the model. You should use the following algorithms:

  • K Nearest Neighbor(KNN)

  • Decision Tree

  • Support Vector Machine

  • Logistic Regression

Notice:

  • You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.

  • You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.

  • You should include the code of the algorithm in the following cells.

K Nearest Neighbor(KNN)

Notice: You should find the best k to build the model with the best accuracy. Warning: Do not use loan_test.csv to find the best k. Instead, you can split Loan_train.csv into train and test sets to find the best k.

#Import Libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#Split data set into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
Train set: (276, 8) (276,)
Test set: (70, 8) (70,)
#Determine K value through Accuracy Evaluation:
from sklearn import metrics

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    #Train Model and Predict
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc
array([0.67142857, 0.65714286, 0.71428571, 0.68571429, 0.75714286, 0.71428571, 0.78571429, 0.75714286, 0.75714286])
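The `std_acc` term in the loop is the standard error of the accuracy estimate: for a 0/1 "prediction correct" indicator with mean p over n rows, `np.std(...) / np.sqrt(n)` equals sqrt(p(1-p)/n). A small check, using a made-up indicator with 55 correct predictions out of 70 (which corresponds to the 0.7857 accuracy reported for k=7):

```python
import numpy as np

# Hypothetical correctness indicator: 55 of 70 predictions right
correct = np.array([True] * 55 + [False] * 15)

p = correct.mean()        # accuracy
n = correct.shape[0]      # number of held-out rows

se_numpy = np.std(correct) / np.sqrt(n)     # form used in the loop
se_formula = np.sqrt(p * (1 - p) / n)       # binomial standard error

print(p, se_numpy, se_formula)
```

With only 70 held-out rows the standard error is roughly 0.05, so differences of a few percentage points between neighboring values of k are within noise.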
#quick check that predicted values are as expected (either PAIDOFF or COLLECTION):
yhat = neigh.predict(X_test)
yhat[0:5]
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)
# Plot the model accuracy for a different number of neighbors
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
print("The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1)
[Figure: accuracy versus number of neighbors (K)]
The best accuracy was with 0.7857142857142857 with k= 7
#Build the model this time using the k value that produced the highest accuracy
knn = KNeighborsClassifier(n_neighbors = mean_acc.argmax()+1)

#Fit the model with the training set
knn.fit(X_train, y_train)

#Make some predictions
knn_yhat = knn.predict(X_test)
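Choosing k from a single 80/20 split depends on which 70 rows happen to be held out. One common alternative, sketched here on synthetic data rather than the loan data, is k-fold cross-validation. The names `X_toy`, `y_toy` and `cv_acc` are illustrative, and the scaler is placed inside a pipeline so that it is refit on the training folds each time.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem with 8 features (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

cv_acc = {}
for k in range(1, 10):
    # Scaling happens inside each fold, so no held-out rows leak into the scaler
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_acc[k] = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()

best_k = max(cv_acc, key=cv_acc.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_acc[best_k], 3))
```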

Decision Tree

#Comment these out when installed
#!conda install -c conda-forge pydotplus -y
#!conda install -c conda-forge python-graphviz -y

#Import Libraries
from sklearn.tree import DecisionTreeClassifier
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
#We must find the optimum depth to choose:
ds = 10
mean_acc = np.zeros((ds-1))
std_acc = np.zeros((ds-1))

for d in range(1,ds):
    #Train Model and Predict
    dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = d).fit(X_train, y_train)
    #Predict the response for the test dataset
    yhat = dt.predict(X_test)
    #Calculate the accuracy score
    mean_acc[d-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[d-1] = np.std(yhat==y_test)/np.sqrt(yhat.shape[0])
    print("For depth = {} the accuracy score is {} ".format(d, mean_acc[d-1]))
For depth = 1 the accuracy score is 0.7857142857142857
For depth = 2 the accuracy score is 0.7857142857142857
For depth = 3 the accuracy score is 0.6142857142857143
For depth = 4 the accuracy score is 0.6142857142857143
For depth = 5 the accuracy score is 0.6428571428571429
For depth = 6 the accuracy score is 0.7714285714285715
For depth = 7 the accuracy score is 0.7571428571428571
For depth = 8 the accuracy score is 0.7571428571428571
For depth = 9 the accuracy score is 0.6571428571428571
# Plot the model accuracy for different depths
plt.plot(range(1,ds),mean_acc,'g')
plt.fill_between(range(1,ds),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,ds),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Depth (d)')
plt.tight_layout()
#Depths 1 and 2 tie for the best accuracy; among tied depths, report the largest
best_d = np.flatnonzero(mean_acc == mean_acc.max()).max() + 1
print("The best accuracy was with", mean_acc.max(), "with d=", best_d)
[Figure: accuracy versus tree depth (d)]
The best accuracy was with 0.7857142857142857 with d= 2
#Depths 1 and 2 give the same accuracy; depth 2 is used below
#Build the model this time using the d value that produced the highest accuracy
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=2)

#Fit the model with the training set
loanTree.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=2)

#make some predictions:
predTree = loanTree.predict(X_test)
dot_data = StringIO()
filename = "loantree.png"
#Use the columns of the Feature frame, which match the 8 columns the tree was trained on
featureNames = Feature.columns
out = tree.export_graphviz(loanTree, feature_names=featureNames, out_file=dot_data, class_names=np.unique(y_train), filled=True, special_characters=True, rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
[Figure: rendered decision tree (loantree.png)]
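If installing graphviz and pydotplus is inconvenient, scikit-learn can also print a fitted tree as plain text with `sklearn.tree.export_text`. The sketch below is an illustration on synthetic data; `X_toy`, `y_toy`, `names` and `toy_tree` are made-up names, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four features (assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

toy_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=2, random_state=0
).fit(X_toy, y_toy)

# Text rendering of the split rules; needs no external graph libraries
rules = export_text(toy_tree, feature_names=names)
print(rules)
```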

Support Vector Machine

#Import Libraries
from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix

#Use the Radial Basis Function (the default)
svm_model = svm.SVC(kernel='rbf')

#Fit the model with the training set
svm_model.fit(X_train, y_train)

#Make some predictions
yhat = svm_model.predict(X_test)
yhat[0:5]
array(['COLLECTION', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)
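The cell imports `confusion_matrix` and `classification_report` but does not go on to call them. As a sketch of how they could be applied to predictions like the ones above, here they are on six hand-made labels (`y_true` and `y_pred` are illustrative, not model output):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF"]
y_pred = ["PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF", "PAIDOFF"]
labels = ["PAIDOFF", "COLLECTION"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred, labels=labels))
```

On an imbalanced problem like this one, the COLLECTION row of the matrix shows directly how many defaulted loans a model misses, which overall accuracy hides.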

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

#choose which solver to use
for k in ('lbfgs', 'saga', 'liblinear', 'newton-cg', 'sag'):
    LR_model = LogisticRegression(C = 0.01, solver = k).fit(X_train, y_train)
    LR_yhat = LR_model.predict(X_test)
    y_prob = LR_model.predict_proba(X_test)
    print('Solver: {}, logloss: {}'.format(k, log_loss(y_test, y_prob)))
Solver: lbfgs, logloss: 0.4920179847937498
Solver: saga, logloss: 0.49201948568367027
Solver: liblinear, logloss: 0.5772287609479654
Solver: newton-cg, logloss: 0.492017801467927
Solver: sag, logloss: 0.4920289144344473
#Lower log loss is better. liblinear gives the highest (worst) log loss here; the other four solvers are all close to 0.492
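For reference, log loss is the average negative log of the probability the model assigned to the true class, so smaller values mean better-calibrated, more confident correct predictions. A sketch that computes it by hand on four made-up rows and compares with `sklearn.metrics.log_loss` (the probabilities and labels are illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array(["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF"])

# Hypothetical predicted probabilities; columns follow the sorted class order:
# column 0 = P(COLLECTION), column 1 = P(PAIDOFF)
y_prob = np.array([[0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9]])
labels = ["COLLECTION", "PAIDOFF"]

# Probability assigned to the class that actually occurred, per row
p_true = np.where(y_true == "PAIDOFF", y_prob[:, 1], y_prob[:, 0])

manual = -np.mean(np.log(p_true))
from_sklearn = log_loss(y_true, y_prob, labels=labels)

print(manual, from_sklearn)
```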
#Train and fit the model with the training set
LR_model = LogisticRegression(solver = 'liblinear', C=0.01).fit(X_train,y_train)
LR_model
LogisticRegression(C=0.01, solver='liblinear')

#Make some predictions
yhat = LR_model.predict(X_test)
yhat[0:5]
array(['COLLECTION', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

Model Evaluation using Test set

from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

Load Test set for evaluation

test_df = pd.read_csv('')
test_df.head()
#convert date types to date time objects
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek

#flag loans originated after day 3 (Friday to Sunday) as weekend
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)

#Convert Categorical features to numerical values (assign back rather than inplace on a column)
test_df['Gender'] = test_df['Gender'].replace(to_replace=['male','female'], value=[0,1])

#one hot encoding for education
test_feature = test_df[['Principal','terms','age','Gender','weekend']]
test_feature = pd.concat([test_feature,pd.get_dummies(test_df['education'])], axis=1)
test_feature.drop(['Master or Above'], axis = 1,inplace=True)

# Testing feature
X_loan_test = test_feature

# Normalizing Test Data
# Note: this fits a new scaler on the test set; reusing a scaler fitted on the training data would keep both sets on the same scale
X_loan_test = preprocessing.StandardScaler().fit(X_loan_test).transform(X_loan_test)

# Target result
y_loan_test = test_df['loan_status'].values
#METRICS

# KNN
knn_yhat = knn.predict(X_loan_test)
#Jaccard Score:
knn_js = round(jaccard_score(y_loan_test, knn_yhat, pos_label = "PAIDOFF"), 2)
#F1 Score:
knn_f1 = round(f1_score(y_loan_test, knn_yhat, average = 'weighted'), 2)

# Decision Tree
loanTree_yhat = loanTree.predict(X_loan_test)
#Jaccard Score:
loanTree_js = round(jaccard_score(y_loan_test, loanTree_yhat, pos_label = "PAIDOFF"), 2)
#F1 Score:
loanTree_f1 = round(f1_score(y_loan_test, loanTree_yhat, average = 'weighted'), 2)

# Support Vector Machine
svm_model_yhat = svm_model.predict(X_loan_test)
#Jaccard Score:
svm_model_js = round(jaccard_score(y_loan_test, svm_model_yhat, pos_label = "PAIDOFF"), 2)
#F1 Score:
svm_model_f1 = round(f1_score(y_loan_test, svm_model_yhat, average = 'weighted'), 2)

# Logistic Regression
LR_model_yhat = LR_model.predict(X_loan_test)
#Jaccard Score:
LR_model_js = round(jaccard_score(y_loan_test, LR_model_yhat, pos_label = "PAIDOFF"), 2)
#F1 Score:
LR_model_f1 = round(f1_score(y_loan_test, LR_model_yhat, average = 'weighted'), 2)
#LogLoss: evaluated on the loaded test set, like the other metrics (not on the earlier validation split)
LR_model_logloss = round(log_loss(y_loan_test, LR_model.predict_proba(X_loan_test)), 2)
Jaccard_scores = [knn_js, loanTree_js, svm_model_js, LR_model_js]
F1_scores = [knn_f1, loanTree_f1, svm_model_f1, LR_model_f1]
LogLoss_scores = ['NA', 'NA', 'NA', LR_model_logloss]
all_values = [Jaccard_scores, F1_scores, LogLoss_scores]

algorithms = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
#named metric_names so it does not shadow the sklearn `metrics` module imported earlier
metric_names = ['Jaccard', 'F1-score', 'Logloss']
accuracy_df = pd.DataFrame(all_values, index = metric_names, columns = algorithms)
accuracy_df.transpose()
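To make the two headline metrics concrete, here is a small worked example on six hand-made labels (`y_true` and `y_pred` are illustrative, not the notebook's predictions). With PAIDOFF as the positive class there are 3 true positives, 1 false positive and 1 false negative, so Jaccard = 3 / (3 + 1 + 1) = 0.6. The weighted F1 averages the per-class F1 scores (0.75 for PAIDOFF, 0.5 for COLLECTION) by class frequency, giving 2/3.

```python
from sklearn.metrics import f1_score, jaccard_score

# Hypothetical true and predicted labels
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF"]
y_pred = ["PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF", "PAIDOFF"]

# Jaccard for the positive class: TP / (TP + FP + FN)
js = jaccard_score(y_true, y_pred, pos_label="PAIDOFF")

# F1 per class, averaged with weights proportional to each class's true count
f1w = f1_score(y_true, y_pred, average="weighted")

print(round(js, 2), round(f1w, 2))
```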