Dataset Information¶

Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan eligibility process in real time, based on the details the customer provides in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, the task is to identify the customer segments that are eligible for a loan amount, so that these customers can be targeted specifically.

This is a standard supervised classification task: we have to predict whether a loan will be approved or not. The dataset attributes and their descriptions are listed below.

Variable: Description
Loan_ID: Unique Loan ID
Gender: Male / Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant education (Graduate / Not Graduate)
Self_Employed: Self-employed (Y/N)
ApplicantIncome: Applicant income
CoapplicantIncome: Co-applicant income
LoanAmount: Loan amount (in thousands)
Loan_Amount_Term: Term of the loan (in months)
Credit_History: Credit history meets guidelines (1/0)
Property_Area: Urban / Semi Urban / Rural
Loan_Status: Loan approved (Y/N)

Import modules¶

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Loading the dataset¶

In [2]:
df = pd.read_csv("Loan Prediction Dataset.csv")
df.head()
Out[2]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N
2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0 360.0 1.0 Urban Y
3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.0 360.0 1.0 Urban Y
4 LP001008 Male No 0 Graduate No 6000 0.0 141.0 360.0 1.0 Urban Y
In [3]:
df.describe()
Out[3]:
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
count 614.000000 614.000000 592.000000 600.00000 564.000000
mean 5403.459283 1621.245798 146.412162 342.00000 0.842199
std 6109.041673 2926.248369 85.587325 65.12041 0.364878
min 150.000000 0.000000 9.000000 12.00000 0.000000
25% 2877.500000 0.000000 100.000000 360.00000 1.000000
50% 3812.500000 1188.500000 128.000000 360.00000 1.000000
75% 5795.000000 2297.250000 168.000000 360.00000 1.000000
max 81000.000000 41667.000000 700.000000 480.00000 1.000000
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Preprocessing the dataset¶

In [5]:
# find the null values
df.isnull().sum()
Out[5]:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
In [6]:
# fill the missing values for numerical terms - mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())
In [7]:
# fill the missing values for categorical terms - mode
df['Gender'] = df["Gender"].fillna(df['Gender'].mode()[0])
df['Married'] = df["Married"].fillna(df['Married'].mode()[0])
df['Dependents'] = df["Dependents"].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df["Self_Employed"].fillna(df['Self_Employed'].mode()[0])
In [8]:
df.isnull().sum()
Out[8]:
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

Exploratory Data Analysis¶

In [9]:
# categorical attributes visualization
sns.countplot(df['Gender'])
Out[9]:
<AxesSubplot:xlabel='Gender', ylabel='count'>
In [10]:
sns.countplot(df['Married'])
Out[10]:
<AxesSubplot:xlabel='Married', ylabel='count'>
In [11]:
sns.countplot(df['Dependents'])
Out[11]:
<AxesSubplot:xlabel='Dependents', ylabel='count'>
In [12]:
sns.countplot(df['Education'])
Out[12]:
<AxesSubplot:xlabel='Education', ylabel='count'>
In [13]:
sns.countplot(df['Self_Employed'])
Out[13]:
<AxesSubplot:xlabel='Self_Employed', ylabel='count'>
In [14]:
sns.countplot(df['Property_Area'])
Out[14]:
<AxesSubplot:xlabel='Property_Area', ylabel='count'>
In [15]:
sns.countplot(df['Loan_Status'])
Out[15]:
<AxesSubplot:xlabel='Loan_Status', ylabel='count'>
In [16]:
# numerical attributes visualization
sns.distplot(df["ApplicantIncome"])
Out[16]:
<AxesSubplot:xlabel='ApplicantIncome', ylabel='Density'>
In [17]:
sns.distplot(df["CoapplicantIncome"])
Out[17]:
<AxesSubplot:xlabel='CoapplicantIncome', ylabel='Density'>
In [18]:
sns.distplot(df["LoanAmount"])
Out[18]:
<AxesSubplot:xlabel='LoanAmount', ylabel='Density'>
In [19]:
sns.distplot(df['Loan_Amount_Term'])
Out[19]:
<AxesSubplot:xlabel='Loan_Amount_Term', ylabel='Density'>
In [20]:
sns.distplot(df['Credit_History'])
Out[20]:
<AxesSubplot:xlabel='Credit_History', ylabel='Density'>

Creation of new attributes¶

In [21]:
# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df.head()
Out[21]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status Total_Income
0 LP001002 Male No 0 Graduate No 5849 0.0 146.412162 360.0 1.0 Urban Y 5849.0
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.000000 360.0 1.0 Rural N 6091.0
2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.000000 360.0 1.0 Urban Y 3000.0
3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.000000 360.0 1.0 Urban Y 4941.0
4 LP001008 Male No 0 Graduate No 6000 0.0 141.000000 360.0 1.0 Urban Y 6000.0

Log Transformation¶

In [22]:
# apply log transformation to the attribute
df['ApplicantIncomeLog'] = np.log(df['ApplicantIncome']+1)
sns.distplot(df["ApplicantIncomeLog"])
Out[22]:
<AxesSubplot:xlabel='ApplicantIncomeLog', ylabel='Density'>
In [23]:
df['CoapplicantIncomeLog'] = np.log(df['CoapplicantIncome']+1)
sns.distplot(df["CoapplicantIncomeLog"])
Out[23]:
<AxesSubplot:xlabel='CoapplicantIncomeLog', ylabel='Density'>
In [24]:
df['LoanAmountLog'] = np.log(df['LoanAmount']+1)
sns.distplot(df["LoanAmountLog"])
Out[24]:
<AxesSubplot:xlabel='LoanAmountLog', ylabel='Density'>
In [25]:
df['Loan_Amount_Term_Log'] = np.log(df['Loan_Amount_Term']+1)
sns.distplot(df["Loan_Amount_Term_Log"])
Out[25]:
<AxesSubplot:xlabel='Loan_Amount_Term_Log', ylabel='Density'>
In [26]:
df['Total_Income_Log'] = np.log(df['Total_Income']+1)
sns.distplot(df["Total_Income_Log"])
Out[26]:
<AxesSubplot:xlabel='Total_Income_Log', ylabel='Density'>

Correlation Matrix¶

In [27]:
corr = df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot = True, cmap="BuPu")
Out[27]:
<AxesSubplot:>
In [28]:
df.head()
Out[28]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status Total_Income ApplicantIncomeLog CoapplicantIncomeLog LoanAmountLog Loan_Amount_Term_Log Total_Income_Log
0 LP001002 Male No 0 Graduate No 5849 0.0 146.412162 360.0 1.0 Urban Y 5849.0 8.674197 0.000000 4.993232 5.888878 8.674197
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.000000 360.0 1.0 Rural N 6091.0 8.430327 7.319202 4.859812 5.888878 8.714732
2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.000000 360.0 1.0 Urban Y 3000.0 8.006701 0.000000 4.204693 5.888878 8.006701
3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.000000 360.0 1.0 Urban Y 4941.0 7.857094 7.765993 4.795791 5.888878 8.505525
4 LP001008 Male No 0 Graduate No 6000 0.0 141.000000 360.0 1.0 Urban Y 6000.0 8.699681 0.000000 4.955827 5.888878 8.699681
In [29]:
# drop unnecessary columns
cols = ['ApplicantIncome', 'CoapplicantIncome', "LoanAmount", "Loan_Amount_Term", "Total_Income", 'Loan_ID', 'CoapplicantIncomeLog']
df = df.drop(columns=cols, axis=1)
df.head()
Out[29]:
Gender Married Dependents Education Self_Employed Credit_History Property_Area Loan_Status ApplicantIncomeLog LoanAmountLog Loan_Amount_Term_Log Total_Income_Log
0 Male No 0 Graduate No 1.0 Urban Y 8.674197 4.993232 5.888878 8.674197
1 Male Yes 1 Graduate No 1.0 Rural N 8.430327 4.859812 5.888878 8.714732
2 Male Yes 0 Graduate Yes 1.0 Urban Y 8.006701 4.204693 5.888878 8.006701
3 Male Yes 0 Not Graduate No 1.0 Urban Y 7.857094 4.795791 5.888878 8.505525
4 Male No 0 Graduate No 1.0 Urban Y 8.699681 4.955827 5.888878 8.699681

Label Encoding¶

In [30]:
from sklearn.preprocessing import LabelEncoder
cols = ['Gender',"Married","Education",'Self_Employed',"Property_Area","Loan_Status","Dependents"]
le = LabelEncoder()
for col in cols:
    df[col] = le.fit_transform(df[col])
In [31]:
df.head()
Out[31]:
Gender Married Dependents Education Self_Employed Credit_History Property_Area Loan_Status ApplicantIncomeLog LoanAmountLog Loan_Amount_Term_Log Total_Income_Log
0 1 0 0 0 0 1.0 2 1 8.674197 4.993232 5.888878 8.674197
1 1 1 1 0 0 1.0 0 0 8.430327 4.859812 5.888878 8.714732
2 1 1 0 0 1 1.0 2 1 8.006701 4.204693 5.888878 8.006701
3 1 1 0 1 0 1.0 2 1 7.857094 4.795791 5.888878 8.505525
4 1 0 0 0 0 1.0 2 1 8.699681 4.955827 5.888878 8.699681
In [32]:
import warnings
warnings.filterwarnings('ignore')

Train-Test Split¶

In [33]:
# specify input and output attributes
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']
In [34]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# note: the test split is scaled with its own statistics here;
# reusing the train statistics for the test set would be the more usual choice
x_train_norm = (x_train - x_train.mean())/(x_train.max() - x_train.min())
x_test_norm = (x_test - x_test.mean())/(x_test.max() - x_test.min())

Model Training¶

In [35]:
# classify function
from sklearn.model_selection import cross_val_score
def classify(clf, x, y):
    # hold-out evaluation on a fixed 80/20 split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    clf.fit(x_train, y_train)
    print("Accuracy is", clf.score(x_test, y_test)*100)
    # cross-validation gives a more robust estimate:
    # with cv=5, each fold trains on 4 parts of the data and tests on the remaining 1
    score = cross_val_score(clf, x, y, cv=5)
    print("Cross validation is", np.mean(score)*100)
In [36]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
classify(clf, X, y)
Accuracy is 78.86178861788618
Cross validation is 80.9462881514061
In [37]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
classify(clf, X, y)
Accuracy is 68.29268292682927
Cross validation is 72.31507397041183
In [38]:
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
clf = RandomForestClassifier()
classify(clf, X, y)
Accuracy is 78.04878048780488
Cross validation is 78.50459816073571
In [39]:
clf = ExtraTreesClassifier()
classify(clf, X, y)
Accuracy is 72.35772357723577
Cross validation is 77.20378515260562
In [40]:
from sklearn.svm import SVC
clf = SVC()
classify(clf, X, y)
Accuracy is 65.04065040650406
Cross validation is 69.70545115287219
In [41]:
from xgboost import XGBClassifier
clf = XGBClassifier(eval_metric='mlogloss')
classify(clf, X, y)
Accuracy is 74.79674796747967
Cross validation is 75.5631080900973

Feature Engineering¶

Univariate Feature Selection and XGBoost¶

In [42]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

select_feature = SelectKBest(chi2, k=11).fit(x_train, y_train)

print('Score List: ', select_feature.scores_)
print('Feature List: ', x_train.columns)
Score List:  [7.19500797e-03 2.06871125e+00 7.71256305e-01 1.52247417e+00
 7.78194275e-03 2.03768514e+01 1.11421163e-01 4.03909601e-03
 4.53042155e-02 8.22114016e-04 2.97424549e-03]
Feature List:  Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area', 'ApplicantIncomeLog',
       'LoanAmountLog', 'Loan_Amount_Term_Log', 'Total_Income_Log'],
      dtype='object')
In [43]:
x_train_2 = select_feature.transform(x_train)
x_test_2 = select_feature.transform(x_test)

clf = XGBClassifier(eval_metric='mlogloss').fit(x_train_2, y_train)

print("Accuracy is", clf.score(x_test_2, y_test)*100)
score = cross_val_score(clf, X, y, cv=5)
print("Cross validation is",np.mean(score)*100)
Accuracy is 74.79674796747967
Cross validation is 75.5631080900973

Recursive Feature Elimination with Cross-Validation¶

In [44]:
from sklearn.feature_selection import RFECV

clf = XGBClassifier(eval_metric='mlogloss')
rfecv = RFECV(estimator=clf, step=1, cv=5, scoring='accuracy', n_jobs=1).fit(x_train, y_train)

print('Optimal number of features: ', rfecv.n_features_)
print('Best features: ', x_train.columns[rfecv.support_])
Optimal number of features:  1
Best features:  Index(['Credit_History'], dtype='object')
In [45]:
print('Accuracy is: ', accuracy_score(y_test, rfecv.predict(x_test)))
Accuracy is:  0.7886178861788617
In [46]:
num_features = [i for i in range(1,(rfecv.grid_scores_.shape[0]+1))]
cv_scores = [np.mean(score) for score in rfecv.grid_scores_]
ax = sns.lineplot(x=num_features, y=cv_scores)
ax.set(xlabel='No. of selected features', ylabel='CV scores')
Out[46]:
[Text(0.5, 0, 'No. of selected features'), Text(0, 0.5, 'CV scores')]

Principal Component Analysis¶

In [47]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(x_train_norm)

plt.figure(1, figsize=(10,8))
sns.lineplot(data=np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('No. of components')
Out[47]:
Text(0.5, 0, 'No. of components')
In [48]:
x_best = x_train[x_train.columns[rfecv.support_]]

Hyperparameter tuning¶

In [49]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

k_fold = KFold(n_splits=5, shuffle = True)
In [50]:
# Logistic Regression   
clf = GridSearchCV(LogisticRegression(),{
    'penalty':['l1', 'l2', 'elasticnet', 'none'],
    'C':[1,10,20],
    'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best,y_train, cv=k_fold,return_times=True)

LR = pd.DataFrame(clf.cv_results_)
LR.sort_values(by='rank_test_score').head()
Out[50]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_penalty param_solver params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
29 0.000867 0.000034 0.000336 0.000008 10 l2 saga {'C': 10, 'penalty': 'l2', 'solver': 'saga'} 0.818182 0.744898 0.877551 0.826531 0.806122 0.814657 0.04254 1
25 0.001275 0.000048 0.000350 0.000006 10 l2 newton-cg {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'} 0.818182 0.744898 0.877551 0.826531 0.806122 0.814657 0.04254 1
26 0.001283 0.000042 0.000356 0.000007 10 l2 lbfgs {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'} 0.818182 0.744898 0.877551 0.826531 0.806122 0.814657 0.04254 1
27 0.000557 0.000016 0.000330 0.000005 10 l2 liblinear {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'} 0.818182 0.744898 0.877551 0.826531 0.806122 0.814657 0.04254 1
28 0.001012 0.000038 0.000335 0.000003 10 l2 sag {'C': 10, 'penalty': 'l2', 'solver': 'sag'} 0.818182 0.744898 0.877551 0.826531 0.806122 0.814657 0.04254 1

Logistic Regression Learning Curve¶

In [51]:
# logistic regression learning curve
estimator = LogisticRegression(C=10, penalty='l2',solver='liblinear')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[51]:
[<matplotlib.lines.Line2D at 0x1263a8640>]
In [52]:
# Decision Tree
clf = GridSearchCV(DecisionTreeClassifier(),{
    'criterion':['gini', 'entropy', 'log_loss'],
    'splitter':['best','random'],
    'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)

DT = pd.DataFrame(clf.cv_results_)
DT.sort_values(by='rank_test_score').head()
Out[52]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_features param_splitter params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.000743 0.000271 0.000542 0.000157 gini sqrt best {'criterion': 'gini', 'max_features': 'sqrt', ... 0.838384 0.77551 0.867347 0.816327 0.77551 0.814616 0.035796 1
11 0.000461 0.000030 0.000328 0.000006 entropy None random {'criterion': 'entropy', 'max_features': None,... 0.838384 0.77551 0.867347 0.816327 0.77551 0.814616 0.035796 1
10 0.000450 0.000006 0.000323 0.000001 entropy None best {'criterion': 'entropy', 'max_features': None,... 0.838384 0.77551 0.867347 0.816327 0.77551 0.814616 0.035796 1
9 0.000458 0.000007 0.000333 0.000004 entropy log2 random {'criterion': 'entropy', 'max_features': 'log2... 0.838384 0.77551 0.867347 0.816327 0.77551 0.814616 0.035796 1
7 0.000457 0.000007 0.000335 0.000001 entropy sqrt random {'criterion': 'entropy', 'max_features': 'sqrt... 0.838384 0.77551 0.867347 0.816327 0.77551 0.814616 0.035796 1

Decision Tree Learning Curve¶

In [53]:
# decision tree learning curve
estimator = DecisionTreeClassifier(criterion='entropy', max_features='sqrt',splitter='best')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[53]:
[<matplotlib.lines.Line2D at 0x12649f2e0>]
In [54]:
# Random Forest
clf = GridSearchCV(RandomForestClassifier(),{
    'n_estimators':[10,60,100],
    'criterion':['gini', 'entropy', 'log_loss'],
    'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)

RF = pd.DataFrame(clf.cv_results_)
RF.sort_values(by='rank_test_score').head()
Out[54]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_features param_n_estimators params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.005674 0.001162 0.000911 0.000116 gini sqrt 10 {'criterion': 'gini', 'max_features': 'sqrt', ... 0.79798 0.846939 0.806122 0.826531 0.795918 0.814698 0.019417 1
17 0.038177 0.000060 0.002952 0.000022 entropy None 100 {'criterion': 'entropy', 'max_features': None,... 0.79798 0.846939 0.806122 0.826531 0.795918 0.814698 0.019417 1
16 0.023206 0.000079 0.001944 0.000003 entropy None 60 {'criterion': 'entropy', 'max_features': None,... 0.79798 0.846939 0.806122 0.826531 0.795918 0.814698 0.019417 1
15 0.004446 0.000129 0.000707 0.000008 entropy None 10 {'criterion': 'entropy', 'max_features': None,... 0.79798 0.846939 0.806122 0.826531 0.795918 0.814698 0.019417 1
14 0.039015 0.000137 0.003074 0.000112 entropy log2 100 {'criterion': 'entropy', 'max_features': 'log2... 0.79798 0.846939 0.806122 0.826531 0.795918 0.814698 0.019417 1

Random Forest Learning Curve¶

In [55]:
# random forest learning curve
estimator = RandomForestClassifier(n_estimators=60,criterion='gini', max_features='log2')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[55]:
[<matplotlib.lines.Line2D at 0x126567e20>]
In [56]:
# Extra Trees
clf = GridSearchCV(ExtraTreesClassifier(),{
    'n_estimators':[10,60,100],
    'criterion':['gini', 'entropy', 'log_loss'],
    'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)

ET = pd.DataFrame(clf.cv_results_)
ET.sort_values(by='rank_test_score').head()
Out[56]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_features param_n_estimators params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.004407 0.000581 0.000990 0.000117 gini sqrt 10 {'criterion': 'gini', 'max_features': 'sqrt', ... 0.808081 0.826531 0.836735 0.77551 0.826531 0.814677 0.021657 1
17 0.028284 0.000431 0.003190 0.000118 entropy None 100 {'criterion': 'entropy', 'max_features': None,... 0.808081 0.826531 0.836735 0.77551 0.826531 0.814677 0.021657 1
16 0.017020 0.000095 0.001989 0.000013 entropy None 60 {'criterion': 'entropy', 'max_features': None,... 0.808081 0.826531 0.836735 0.77551 0.826531 0.814677 0.021657 1
15 0.003338 0.000004 0.000720 0.000037 entropy None 10 {'criterion': 'entropy', 'max_features': None,... 0.808081 0.826531 0.836735 0.77551 0.826531 0.814677 0.021657 1
14 0.028078 0.000053 0.002999 0.000004 entropy log2 100 {'criterion': 'entropy', 'max_features': 'log2... 0.808081 0.826531 0.836735 0.77551 0.826531 0.814677 0.021657 1

Extra Trees Learning Curve¶

In [57]:
# extra trees learning curve
estimator = ExtraTreesClassifier(n_estimators=60,criterion='entropy', max_features='log2')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[57]:
[<matplotlib.lines.Line2D at 0x1265f33a0>]
In [58]:
# SVM
clf = GridSearchCV(SVC(),{
    'C':[1,10,20],
    'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
    'degree':[2,3,4]
    # 'gamma':['auto', 'scale'],
    # 'coef0':[0,1,2,3],
    # 'shrinking':['True','False'],
    # 'probability':['True','False'],
    # 'class_weight':['balanced'],
    # 'max_iter':[-1],
    # 'random_state':[42]
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)

SVM = pd.DataFrame(clf.cv_results_)
SVM.sort_values(by='rank_test_score').head()
Out[58]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_degree param_kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.001592 0.000252 0.000721 0.000096 1 2 linear {'C': 1, 'degree': 2, 'kernel': 'linear'} 0.828283 0.77551 0.867347 0.77551 0.826531 0.814636 0.035122 1
32 0.001195 0.000029 0.000488 0.000008 20 4 linear {'C': 20, 'degree': 4, 'kernel': 'linear'} 0.828283 0.77551 0.867347 0.77551 0.826531 0.814636 0.035122 1
31 0.001596 0.000043 0.000570 0.000011 20 3 sigmoid {'C': 20, 'degree': 3, 'kernel': 'sigmoid'} 0.828283 0.77551 0.867347 0.77551 0.826531 0.814636 0.035122 1
30 0.001317 0.000038 0.000863 0.000026 20 3 rbf {'C': 20, 'degree': 3, 'kernel': 'rbf'} 0.828283 0.77551 0.867347 0.77551 0.826531 0.814636 0.035122 1
29 0.025659 0.002409 0.000513 0.000009 20 3 poly {'C': 20, 'degree': 3, 'kernel': 'poly'} 0.828283 0.77551 0.867347 0.77551 0.826531 0.814636 0.035122 1

SVM Learning Curve¶

In [59]:
# svm learning curve
estimator = SVC(C=1,degree=2, kernel='linear')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[59]:
[<matplotlib.lines.Line2D at 0x12666cb20>]
In [60]:
# XGBClassifier
clf = GridSearchCV(XGBClassifier(),{
    'n_estimators':[1,10,20],
    'booster':['gbtree', 'gblinear', 'dart'],
    'eval_metric':['mlogloss']
} ,cv=k_fold)

clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)

XGB = pd.DataFrame(clf.cv_results_)
XGB.sort_values(by='rank_test_score').head()
Out[60]:
mean_fit_time std_fit_time mean_score_time std_score_time param_booster param_eval_metric param_n_estimators params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.003111 0.000685 0.002186 0.001226 gbtree mlogloss 1 {'booster': 'gbtree', 'eval_metric': 'mlogloss... 0.79798 0.826531 0.816327 0.795918 0.836735 0.814698 0.015877 1
1 0.003099 0.001133 0.000854 0.000112 gbtree mlogloss 10 {'booster': 'gbtree', 'eval_metric': 'mlogloss... 0.79798 0.826531 0.816327 0.795918 0.836735 0.814698 0.015877 1
2 0.006327 0.002770 0.000760 0.000013 gbtree mlogloss 20 {'booster': 'gbtree', 'eval_metric': 'mlogloss... 0.79798 0.826531 0.816327 0.795918 0.836735 0.814698 0.015877 1
4 0.001549 0.000675 0.000705 0.000154 gblinear mlogloss 10 {'booster': 'gblinear', 'eval_metric': 'mloglo... 0.79798 0.826531 0.816327 0.795918 0.836735 0.814698 0.015877 1
5 0.001426 0.000060 0.000594 0.000007 gblinear mlogloss 20 {'booster': 'gblinear', 'eval_metric': 'mloglo... 0.79798 0.826531 0.816327 0.795918 0.836735 0.814698 0.015877 1

XGB Learning Curve¶

In [61]:
# xgb learning curve
estimator = XGBClassifier(booster='gblinear', n_estimators=10, eval_metric='mlogloss')

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)

plt.plot(train_sizes,np.mean(train_scores,axis=1))
plt.plot(train_sizes,np.mean(test_scores,axis=1))
Out[61]:
[<matplotlib.lines.Line2D at 0x1266e8520>]

Confusion Matrix¶

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values and broken down by class. It gives insight not only into the errors a classifier makes but, more importantly, into the types of errors being made.
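As a minimal sketch of how these counts relate to the scores reported below (assuming a classifier clf that has already been fitted on the Credit_History column, as in the following cells):

from sklearn.metrics import confusion_matrix

# rows = actual classes, columns = predicted classes (after label encoding: 0 = N, 1 = Y)
cm = confusion_matrix(y_test, clf.predict(x_test[['Credit_History']]))
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy :", (tp + tn) / cm.sum())
print("precision:", tp / (tp + fp))
print("recall   :", tp / (tp + fn))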

In [62]:
# logistic regression score
clf = LogisticRegression(C=10, penalty='l2',solver='liblinear')
clf.fit(x_best,y_train)
Out[62]:
LogisticRegression(C=10, solver='liblinear')
In [63]:
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
Out[63]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x12669ad90>
In [64]:
accuracy_score(y_test,y_pred)
Out[64]:
0.7886178861788617
In [65]:
precision_score(y_test,y_pred)
Out[65]:
0.7596153846153846
In [66]:
recall_score(y_test,y_pred)
Out[66]:
0.9875
In [69]:
# svm score
clf = SVC(C=1,degree=2, kernel='linear')
clf.fit(x_best,y_train)
Out[69]:
SVC(C=1, degree=2, kernel='linear')
In [70]:
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
Out[70]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x12454d0d0>
In [71]:
accuracy_score(y_test,y_pred)
Out[71]:
0.7886178861788617
In [72]:
precision_score(y_test,y_pred)
Out[72]:
0.7596153846153846
In [73]:
recall_score(y_test,y_pred)
Out[73]:
0.9875
In [77]:
# xgb score
clf = XGBClassifier(booster='gblinear', n_estimators=10, eval_metric='mlogloss')
clf.fit(x_best,y_train)
Out[77]:
XGBClassifier(base_score=0.5, booster='gblinear', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None,
              enable_categorical=False, eval_metric='mlogloss', gamma=None,
              gpu_id=-1, importance_type=None, interaction_constraints=None,
              learning_rate=0.5, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=10, n_jobs=8, num_parallel_tree=None, predictor=None,
              random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
              subsample=None, tree_method=None, validate_parameters=1,
              verbosity=None)
In [78]:
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
Out[78]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1267a2220>
In [79]:
accuracy_score(y_test,y_pred)
Out[79]:
0.7886178861788617
In [80]:
precision_score(y_test,y_pred)
Out[80]:
0.7596153846153846
In [81]:
recall_score(y_test,y_pred)
Out[81]:
0.9875
In [82]:
from rfpimp import *   # pip install rfpimp

from sklearn import tree

import dtreeviz
from dtreeviz import clfviz
In [83]:
X = x_train[['Total_Income_Log','LoanAmountLog']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Total_Income_Log','LoanAmountLog'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Total_Income_Log','LoanAmountLog'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()
In [84]:
X = x_train[['Gender','Credit_History']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Gender','Credit_History'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Gender','Credit_History'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()
In [85]:
X = x_train[['Total_Income_Log']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Total_Income_Log'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Total_Income_Log'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()
In [86]:
X = x_train[['LoanAmountLog']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['LoanAmountLog'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['LoanAmountLog'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()
In [87]:
X = x_train[['Gender']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Gender'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Gender'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()
In [88]:
X = x_train[['Credit_History']]
y = y_train

lr = LogisticRegression(C=10, penalty='l2',solver='liblinear')
lr.fit(X, y)

svm = SVC(C=1,degree=2, kernel='linear', probability=True)
svm.fit(X,y)


fig,axes = plt.subplots(1,2, figsize=(8,4), dpi=300)
clfviz(lr, X, y, ax=axes[0],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Credit_History'], target_name='Loan_Status')
clfviz(svm, X, y, ax=axes[1],
       # show classification regions not probabilities
       show=['instances', 'boundaries', 'misclassified'], 
       feature_names=['Credit_History'], target_name='Loan_Status')
plt.title(label='Logistic Regression vs Support Vector Machine')
plt.show()