Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility. Dream Housing Finance wants to automate the loan-eligibility process in real time, based on the details customers provide in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, they want to identify the customer segments that are eligible for a loan so that these customers can be targeted specifically.
This is a standard supervised classification problem: we must predict whether a loan will be approved (Y/N). The dataset attributes are described below.
Variable | Description |
---|---|
Loan_ID | Unique Loan ID |
Gender | Male/ Female |
Married | Applicant married (Y/N) |
Dependents | Number of dependents |
Education | Applicant Education (Graduate/ Under Graduate) |
Self_Employed | Self employed (Y/N) |
ApplicantIncome | Applicant income |
CoapplicantIncome | Coapplicant income |
LoanAmount | Loan amount in thousands |
Loan_Amount_Term | Term of loan in months |
Credit_History | Credit history meets guidelines (1 = yes, 0 = no) |
Property_Area | Urban/ Semi Urban/ Rural |
Loan_Status | Loan approved (Y/N) |
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("Loan Prediction Dataset.csv")
df.head()
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
df.describe()
ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | |
---|---|---|---|---|---|
count | 614.000000 | 614.000000 | 592.000000 | 600.00000 | 564.000000 |
mean | 5403.459283 | 1621.245798 | 146.412162 | 342.00000 | 0.842199 |
std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
min | 150.000000 | 0.000000 | 9.000000 | 12.00000 | 0.000000 |
25% | 2877.500000 | 0.000000 | 100.000000 | 360.00000 | 1.000000 |
50% | 3812.500000 | 1188.500000 | 128.000000 | 360.00000 | 1.000000 |
75% | 5795.000000 | 2297.250000 | 168.000000 | 360.00000 | 1.000000 |
max | 81000.000000 | 41667.000000 | 700.000000 | 480.00000 | 1.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            614 non-null    object
 1   Gender             601 non-null    object
 2   Married            611 non-null    object
 3   Dependents         599 non-null    object
 4   Education          614 non-null    object
 5   Self_Employed      582 non-null    object
 6   ApplicantIncome    614 non-null    int64
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object
 12  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
# find the null values
df.isnull().sum()
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
# fill missing values in the numerical attributes with the column mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())
# fill missing values in the categorical attributes with the column mode
df['Gender'] = df["Gender"].fillna(df['Gender'].mode()[0])
df['Married'] = df["Married"].fillna(df['Married'].mode()[0])
df['Dependents'] = df["Dependents"].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df["Self_Employed"].fillna(df['Self_Employed'].mode()[0])
df.isnull().sum()
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
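One caveat worth flagging: Credit_History is a 0/1 indicator, so filling with the mean (about 0.84, per describe() above) introduces fractional values. If a strictly binary column is preferred, a mode fill is the usual alternative. A minimal sketch, meant as a replacement for the mean fill above:
# alternative: keep Credit_History binary by filling with its most frequent value
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])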
# categorical attributes visualization
sns.countplot(x='Gender', data=df)
plt.show()
sns.countplot(x='Married', data=df)
plt.show()
sns.countplot(x='Dependents', data=df)
plt.show()
sns.countplot(x='Education', data=df)
plt.show()
sns.countplot(x='Self_Employed', data=df)
plt.show()
sns.countplot(x='Property_Area', data=df)
plt.show()
sns.countplot(x='Loan_Status', data=df)
plt.show()
# numerical attributes visualization
# note: distplot is deprecated in recent seaborn; histplot(..., kde=True) is the modern equivalent
sns.distplot(df["ApplicantIncome"])
plt.show()
sns.distplot(df["CoapplicantIncome"])
plt.show()
sns.distplot(df["LoanAmount"])
plt.show()
sns.distplot(df['Loan_Amount_Term'])
plt.show()
sns.distplot(df['Credit_History'])
plt.show()
# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df.head()
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | Total_Income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | Urban | Y | 5849.0 |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.000000 | 360.0 | 1.0 | Rural | N | 6091.0 |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.000000 | 360.0 | 1.0 | Urban | Y | 3000.0 |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.000000 | 360.0 | 1.0 | Urban | Y | 4941.0 |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.000000 | 360.0 | 1.0 | Urban | Y | 6000.0 |
# apply a log transformation to reduce the right skew of these attributes
df['ApplicantIncomeLog'] = np.log(df['ApplicantIncome']+1)
sns.distplot(df["ApplicantIncomeLog"])
plt.show()
df['CoapplicantIncomeLog'] = np.log(df['CoapplicantIncome']+1)
sns.distplot(df["CoapplicantIncomeLog"])
plt.show()
df['LoanAmountLog'] = np.log(df['LoanAmount']+1)
sns.distplot(df["LoanAmountLog"])
plt.show()
df['Loan_Amount_Term_Log'] = np.log(df['Loan_Amount_Term']+1)
sns.distplot(df["Loan_Amount_Term_Log"])
plt.show()
df['Total_Income_Log'] = np.log(df['Total_Income']+1)
sns.distplot(df["Total_Income_Log"])
plt.show()
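To quantify what the transform buys, compare a column's skewness before and after; pandas' skew() makes this a one-liner (the exact values depend on the data):
# right skew should drop sharply after the log transform
print(df['ApplicantIncome'].skew(), df['ApplicantIncomeLog'].skew())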
# correlation heatmap of the numeric attributes
corr = df.corr()  # note: pandas >= 2.0 requires df.corr(numeric_only=True) here
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap="BuPu")
plt.show()
df.head()
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | Total_Income | ApplicantIncomeLog | CoapplicantIncomeLog | LoanAmountLog | Loan_Amount_Term_Log | Total_Income_Log | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | Urban | Y | 5849.0 | 8.674197 | 0.000000 | 4.993232 | 5.888878 | 8.674197 |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.000000 | 360.0 | 1.0 | Rural | N | 6091.0 | 8.430327 | 7.319202 | 4.859812 | 5.888878 | 8.714732 |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.000000 | 360.0 | 1.0 | Urban | Y | 3000.0 | 8.006701 | 0.000000 | 4.204693 | 5.888878 | 8.006701 |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.000000 | 360.0 | 1.0 | Urban | Y | 4941.0 | 7.857094 | 7.765993 | 4.795791 | 5.888878 | 8.505525 |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.000000 | 360.0 | 1.0 | Urban | Y | 6000.0 | 8.699681 | 0.000000 | 4.955827 | 5.888878 | 8.699681 |
# drop unnecessary columns
cols = ['ApplicantIncome', 'CoapplicantIncome', "LoanAmount", "Loan_Amount_Term", "Total_Income", 'Loan_ID', 'CoapplicantIncomeLog']
df = df.drop(columns=cols, axis=1)
df.head()
Gender | Married | Dependents | Education | Self_Employed | Credit_History | Property_Area | Loan_Status | ApplicantIncomeLog | LoanAmountLog | Loan_Amount_Term_Log | Total_Income_Log | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Male | No | 0 | Graduate | No | 1.0 | Urban | Y | 8.674197 | 4.993232 | 5.888878 | 8.674197 |
1 | Male | Yes | 1 | Graduate | No | 1.0 | Rural | N | 8.430327 | 4.859812 | 5.888878 | 8.714732 |
2 | Male | Yes | 0 | Graduate | Yes | 1.0 | Urban | Y | 8.006701 | 4.204693 | 5.888878 | 8.006701 |
3 | Male | Yes | 0 | Not Graduate | No | 1.0 | Urban | Y | 7.857094 | 4.795791 | 5.888878 | 8.505525 |
4 | Male | No | 0 | Graduate | No | 1.0 | Urban | Y | 8.699681 | 4.955827 | 5.888878 | 8.699681 |
from sklearn.preprocessing import LabelEncoder
cols = ['Gender',"Married","Education",'Self_Employed',"Property_Area","Loan_Status","Dependents"]
le = LabelEncoder()
for col in cols:
df[col] = le.fit_transform(df[col])
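LabelEncoder assigns integer codes in sorted label order (Female -> 0, Male -> 1; N -> 0, Y -> 1), which matches the encoded table below. If the mappings need to stay inspectable, keep one fitted encoder per column instead of reusing a single instance. A small sketch, written as a drop-in replacement for the loop above:
# one encoder per column, so each mapping can be recovered later
encoders = {}
for col in cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
# e.g. encoders['Loan_Status'].classes_ is array(['N', 'Y']): N -> 0, Y -> 1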
df.head()
Gender | Married | Dependents | Education | Self_Employed | Credit_History | Property_Area | Loan_Status | ApplicantIncomeLog | LoanAmountLog | Loan_Amount_Term_Log | Total_Income_Log | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 2 | 1 | 8.674197 | 4.993232 | 5.888878 | 8.674197 |
1 | 1 | 1 | 1 | 0 | 0 | 1.0 | 0 | 0 | 8.430327 | 4.859812 | 5.888878 | 8.714732 |
2 | 1 | 1 | 0 | 0 | 1 | 1.0 | 2 | 1 | 8.006701 | 4.204693 | 5.888878 | 8.006701 |
3 | 1 | 1 | 0 | 1 | 0 | 1.0 | 2 | 1 | 7.857094 | 4.795791 | 5.888878 | 8.505525 |
4 | 1 | 0 | 0 | 0 | 0 | 1.0 | 2 | 1 | 8.699681 | 4.955827 | 5.888878 | 8.699681 |
# specify input and output attributes
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']
from sklearn.model_selection import train_test_split
# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the replacement there
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# mean-normalize using statistics from the training split only, to avoid test-set leakage
x_train_norm = (x_train - x_train.mean())/(x_train.max() - x_train.min())
x_test_norm = (x_test - x_train.mean())/(x_train.max() - x_train.min())
# classify function
from sklearn.model_selection import cross_val_score
def classify(clf, x, y):
    # use the x passed in (the original accidentally referenced the global X here)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    clf.fit(x_train, y_train)
    print("Accuracy is", clf.score(x_test, y_test)*100)
    # cross-validation gives a more reliable estimate: with cv=5, each round
    # trains on 4 folds and tests on the held-out fifth
    score = cross_val_score(clf, x, y, cv=5)
    print("Cross validation is", np.mean(score)*100)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
classify(clf, X, y)
Accuracy is 78.86178861788618
Cross validation is 80.9462881514061
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
classify(clf, X, y)
Accuracy is 68.29268292682927
Cross validation is 72.31507397041183
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
clf = RandomForestClassifier()
classify(clf, X, y)
Accuracy is 78.04878048780488
Cross validation is 78.50459816073571
clf = ExtraTreesClassifier()
classify(clf, X, y)
Accuracy is 72.35772357723577
Cross validation is 77.20378515260562
from sklearn.svm import SVC
clf = SVC()
classify(clf, X, y)
Accuracy is 65.04065040650406
Cross validation is 69.70545115287219
from xgboost import XGBClassifier
clf = XGBClassifier(eval_metric='mlogloss')
classify(clf, X, y)
Accuracy is 74.79674796747967
Cross validation is 75.5631080900973
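For a side-by-side view, the same comparison can be collapsed into a single loop over the estimators already imported. A quick sketch:
# cross-validated accuracy of each baseline in one pass
models = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'ExtraTrees': ExtraTreesClassifier(),
    'SVC': SVC(),
    'XGBoost': XGBClassifier(eval_metric='mlogloss'),
}
for name, model in models.items():
    print(name, round(np.mean(cross_val_score(model, X, y, cv=5)) * 100, 2))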
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
select_feature = SelectKBest(chi2, k=11).fit(x_train, y_train)
print('Score List: ', select_feature.scores_)
print('Feature List: ', x_train.columns)
Score List:  [7.19500797e-03 2.06871125e+00 7.71256305e-01 1.52247417e+00
 7.78194275e-03 2.03768514e+01 1.11421163e-01 4.03909601e-03 4.53042155e-02
 8.22114016e-04 2.97424549e-03]
Feature List:  Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area', 'ApplicantIncomeLog', 'LoanAmountLog',
       'Loan_Amount_Term_Log', 'Total_Income_Log'],
      dtype='object')
x_train_2 = select_feature.transform(x_train)
x_test_2 = select_feature.transform(x_test)
clf = XGBClassifier(eval_metric='mlogloss').fit(x_train_2, y_train)
# score on the transformed test set (with k=11 every feature is kept, so results match the untransformed run)
print("Accuracy is", clf.score(x_test_2, y_test)*100)
score = cross_val_score(clf, X, y, cv=5)
print("Cross validation is",np.mean(score)*100)
Accuracy is 74.79674796747967
Cross validation is 75.5631080900973
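SelectKBest also exposes the surviving columns directly via get_support(); with k=11 every feature is kept here, so the transform is effectively a pass-through:
# boolean mask of the selected columns
print(x_train.columns[select_feature.get_support()])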
from sklearn.feature_selection import RFECV
clf = XGBClassifier(eval_metric='mlogloss')
rfecv = RFECV(estimator=clf, step=1, cv=5, scoring='accuracy', n_jobs=1).fit(x_train, y_train)
print('Optimal number of features: ', rfecv.n_features_)
print('Best features: ', x_train.columns[rfecv.support_])
Optimal number of features:  1
Best features:  Index(['Credit_History'], dtype='object')
print('Accuracy is: ', accuracy_score(y_test, rfecv.predict(x_test)))
Accuracy is: 0.7886178861788617
# note: grid_scores_ was removed in scikit-learn 1.2; rfecv.cv_results_['mean_test_score'] replaces it
num_features = list(range(1, rfecv.grid_scores_.shape[0] + 1))
cv_scores = [np.mean(score) for score in rfecv.grid_scores_]
ax = sns.lineplot(x=num_features, y=cv_scores)
ax.set(xlabel='No. of selected features', ylabel='CV scores')
plt.show()
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x_train_norm)
plt.figure(figsize=(10,8))
sns.lineplot(data=np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('No. of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()
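A common follow-up is to read off the smallest number of components that clears a variance threshold. A sketch, with an illustrative 95% cutoff:
# index of the first cumulative ratio >= 0.95, plus one to turn it into a count
n_components_95 = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95) + 1
print('Components for 95% variance:', n_components_95)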
x_best = x_train[x_train.columns[rfecv.support_]]
from sklearn.model_selection import KFold
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV
k_fold = KFold(n_splits=5, shuffle = True)
# Logistic Regression
# note: incompatible penalty/solver pairs simply fail to fit and are scored as NaN by GridSearchCV
clf = GridSearchCV(LogisticRegression(), {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [1, 10, 20],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}, cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best,y_train, cv=k_fold,return_times=True)
LR = pd.DataFrame(clf.cv_results_)
LR.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_penalty | param_solver | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 0.000867 | 0.000034 | 0.000336 | 0.000008 | 10 | l2 | saga | {'C': 10, 'penalty': 'l2', 'solver': 'saga'} | 0.818182 | 0.744898 | 0.877551 | 0.826531 | 0.806122 | 0.814657 | 0.04254 | 1 |
25 | 0.001275 | 0.000048 | 0.000350 | 0.000006 | 10 | l2 | newton-cg | {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'} | 0.818182 | 0.744898 | 0.877551 | 0.826531 | 0.806122 | 0.814657 | 0.04254 | 1 |
26 | 0.001283 | 0.000042 | 0.000356 | 0.000007 | 10 | l2 | lbfgs | {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'} | 0.818182 | 0.744898 | 0.877551 | 0.826531 | 0.806122 | 0.814657 | 0.04254 | 1 |
27 | 0.000557 | 0.000016 | 0.000330 | 0.000005 | 10 | l2 | liblinear | {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'} | 0.818182 | 0.744898 | 0.877551 | 0.826531 | 0.806122 | 0.814657 | 0.04254 | 1 |
28 | 0.001012 | 0.000038 | 0.000335 | 0.000003 | 10 | l2 | sag | {'C': 10, 'penalty': 'l2', 'solver': 'sag'} | 0.818182 | 0.744898 | 0.877551 | 0.826531 | 0.806122 | 0.814657 | 0.04254 | 1 |
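Rather than sorting cv_results_ by rank, the fitted search also exposes the winning combination directly:
# best hyper-parameters and their mean CV score
print(clf.best_params_)
print(clf.best_score_)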
# logistic regression score
estimator = LogisticRegression(C=10, penalty='l2',solver='liblinear')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
# Decision Tree
clf = GridSearchCV(DecisionTreeClassifier(),{
'criterion':['gini', 'entropy', 'log_loss'],
'splitter':['best','random'],
'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)
DT = pd.DataFrame(clf.cv_results_)
DT.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_features | param_splitter | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000743 | 0.000271 | 0.000542 | 0.000157 | gini | sqrt | best | {'criterion': 'gini', 'max_features': 'sqrt', ... | 0.838384 | 0.77551 | 0.867347 | 0.816327 | 0.77551 | 0.814616 | 0.035796 | 1 |
11 | 0.000461 | 0.000030 | 0.000328 | 0.000006 | entropy | None | random | {'criterion': 'entropy', 'max_features': None,... | 0.838384 | 0.77551 | 0.867347 | 0.816327 | 0.77551 | 0.814616 | 0.035796 | 1 |
10 | 0.000450 | 0.000006 | 0.000323 | 0.000001 | entropy | None | best | {'criterion': 'entropy', 'max_features': None,... | 0.838384 | 0.77551 | 0.867347 | 0.816327 | 0.77551 | 0.814616 | 0.035796 | 1 |
9 | 0.000458 | 0.000007 | 0.000333 | 0.000004 | entropy | log2 | random | {'criterion': 'entropy', 'max_features': 'log2... | 0.838384 | 0.77551 | 0.867347 | 0.816327 | 0.77551 | 0.814616 | 0.035796 | 1 |
7 | 0.000457 | 0.000007 | 0.000335 | 0.000001 | entropy | sqrt | random | {'criterion': 'entropy', 'max_features': 'sqrt... | 0.838384 | 0.77551 | 0.867347 | 0.816327 | 0.77551 | 0.814616 | 0.035796 | 1 |
# decision tree score
estimator = DecisionTreeClassifier(criterion='entropy', max_features='sqrt',splitter='best')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
# Random Forest
clf = GridSearchCV(RandomForestClassifier(),{
'n_estimators':[10,60,100],
'criterion':['gini', 'entropy', 'log_loss'],
'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)
RF = pd.DataFrame(clf.cv_results_)
RF.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.005674 | 0.001162 | 0.000911 | 0.000116 | gini | sqrt | 10 | {'criterion': 'gini', 'max_features': 'sqrt', ... | 0.79798 | 0.846939 | 0.806122 | 0.826531 | 0.795918 | 0.814698 | 0.019417 | 1 |
17 | 0.038177 | 0.000060 | 0.002952 | 0.000022 | entropy | None | 100 | {'criterion': 'entropy', 'max_features': None,... | 0.79798 | 0.846939 | 0.806122 | 0.826531 | 0.795918 | 0.814698 | 0.019417 | 1 |
16 | 0.023206 | 0.000079 | 0.001944 | 0.000003 | entropy | None | 60 | {'criterion': 'entropy', 'max_features': None,... | 0.79798 | 0.846939 | 0.806122 | 0.826531 | 0.795918 | 0.814698 | 0.019417 | 1 |
15 | 0.004446 | 0.000129 | 0.000707 | 0.000008 | entropy | None | 10 | {'criterion': 'entropy', 'max_features': None,... | 0.79798 | 0.846939 | 0.806122 | 0.826531 | 0.795918 | 0.814698 | 0.019417 | 1 |
14 | 0.039015 | 0.000137 | 0.003074 | 0.000112 | entropy | log2 | 100 | {'criterion': 'entropy', 'max_features': 'log2... | 0.79798 | 0.846939 | 0.806122 | 0.826531 | 0.795918 | 0.814698 | 0.019417 | 1 |
# random forest score
estimator = RandomForestClassifier(n_estimators=60,criterion='gini', max_features='log2')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
# Extra Trees
clf = GridSearchCV(ExtraTreesClassifier(),{
'n_estimators':[10,60,100],
'criterion':['gini', 'entropy', 'log_loss'],
'max_features':['sqrt', 'log2', None]
} ,cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)
ET = pd.DataFrame(clf.cv_results_)  # store under ET so the Random Forest results above are not overwritten
ET.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.004407 | 0.000581 | 0.000990 | 0.000117 | gini | sqrt | 10 | {'criterion': 'gini', 'max_features': 'sqrt', ... | 0.808081 | 0.826531 | 0.836735 | 0.77551 | 0.826531 | 0.814677 | 0.021657 | 1 |
17 | 0.028284 | 0.000431 | 0.003190 | 0.000118 | entropy | None | 100 | {'criterion': 'entropy', 'max_features': None,... | 0.808081 | 0.826531 | 0.836735 | 0.77551 | 0.826531 | 0.814677 | 0.021657 | 1 |
16 | 0.017020 | 0.000095 | 0.001989 | 0.000013 | entropy | None | 60 | {'criterion': 'entropy', 'max_features': None,... | 0.808081 | 0.826531 | 0.836735 | 0.77551 | 0.826531 | 0.814677 | 0.021657 | 1 |
15 | 0.003338 | 0.000004 | 0.000720 | 0.000037 | entropy | None | 10 | {'criterion': 'entropy', 'max_features': None,... | 0.808081 | 0.826531 | 0.836735 | 0.77551 | 0.826531 | 0.814677 | 0.021657 | 1 |
14 | 0.028078 | 0.000053 | 0.002999 | 0.000004 | entropy | log2 | 100 | {'criterion': 'entropy', 'max_features': 'log2... | 0.808081 | 0.826531 | 0.836735 | 0.77551 | 0.826531 | 0.814677 | 0.021657 | 1 |
# extra trees score
estimator = ExtraTreesClassifier(n_estimators=60,criterion='entropy', max_features='log2')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
# SVM
clf = GridSearchCV(SVC(),{
'C':[1,10,20],
'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
'degree':[2,3,4]
# 'gamma':['auto', 'scale'],
# 'coef0':[0,1,2,3],
# 'shrinking':['True','False'],
# 'probability':['True','False'],
# 'class_weight':['balanced'],
# 'max_iter':[-1],
# 'random_state':[42]
} ,cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)
SVM = pd.DataFrame(clf.cv_results_)
SVM.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_degree | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.001592 | 0.000252 | 0.000721 | 0.000096 | 1 | 2 | linear | {'C': 1, 'degree': 2, 'kernel': 'linear'} | 0.828283 | 0.77551 | 0.867347 | 0.77551 | 0.826531 | 0.814636 | 0.035122 | 1 |
32 | 0.001195 | 0.000029 | 0.000488 | 0.000008 | 20 | 4 | linear | {'C': 20, 'degree': 4, 'kernel': 'linear'} | 0.828283 | 0.77551 | 0.867347 | 0.77551 | 0.826531 | 0.814636 | 0.035122 | 1 |
31 | 0.001596 | 0.000043 | 0.000570 | 0.000011 | 20 | 3 | sigmoid | {'C': 20, 'degree': 3, 'kernel': 'sigmoid'} | 0.828283 | 0.77551 | 0.867347 | 0.77551 | 0.826531 | 0.814636 | 0.035122 | 1 |
30 | 0.001317 | 0.000038 | 0.000863 | 0.000026 | 20 | 3 | rbf | {'C': 20, 'degree': 3, 'kernel': 'rbf'} | 0.828283 | 0.77551 | 0.867347 | 0.77551 | 0.826531 | 0.814636 | 0.035122 | 1 |
29 | 0.025659 | 0.002409 | 0.000513 | 0.000009 | 20 | 3 | poly | {'C': 20, 'degree': 3, 'kernel': 'poly'} | 0.828283 | 0.77551 | 0.867347 | 0.77551 | 0.826531 | 0.814636 | 0.035122 | 1 |
# svm score
estimator = SVC(C=1,degree=2, kernel='linear')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
# XGBClassifier
clf = GridSearchCV(XGBClassifier(),{
'n_estimators':[1,10,20],
'booster':['gbtree', 'gblinear', 'dart'],
'eval_metric':['mlogloss']
} ,cv=k_fold)
clf.fit(x_best,y_train)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(clf, x_best, y_train, cv=k_fold,return_times=True)
XGB = pd.DataFrame(clf.cv_results_)
XGB.sort_values(by='rank_test_score').head()
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_booster | param_eval_metric | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.003111 | 0.000685 | 0.002186 | 0.001226 | gbtree | mlogloss | 1 | {'booster': 'gbtree', 'eval_metric': 'mlogloss... | 0.79798 | 0.826531 | 0.816327 | 0.795918 | 0.836735 | 0.814698 | 0.015877 | 1 |
1 | 0.003099 | 0.001133 | 0.000854 | 0.000112 | gbtree | mlogloss | 10 | {'booster': 'gbtree', 'eval_metric': 'mlogloss... | 0.79798 | 0.826531 | 0.816327 | 0.795918 | 0.836735 | 0.814698 | 0.015877 | 1 |
2 | 0.006327 | 0.002770 | 0.000760 | 0.000013 | gbtree | mlogloss | 20 | {'booster': 'gbtree', 'eval_metric': 'mlogloss... | 0.79798 | 0.826531 | 0.816327 | 0.795918 | 0.836735 | 0.814698 | 0.015877 | 1 |
4 | 0.001549 | 0.000675 | 0.000705 | 0.000154 | gblinear | mlogloss | 10 | {'booster': 'gblinear', 'eval_metric': 'mloglo... | 0.79798 | 0.826531 | 0.816327 | 0.795918 | 0.836735 | 0.814698 | 0.015877 | 1 |
5 | 0.001426 | 0.000060 | 0.000594 | 0.000007 | gblinear | mlogloss | 20 | {'booster': 'gblinear', 'eval_metric': 'mloglo... | 0.79798 | 0.826531 | 0.816327 | 0.795918 | 0.836735 | 0.814698 | 0.015877 | 1 |
# xgb score
estimator = XGBClassifier(booster='gblinear', n_estimators=10, eval_metric='mlogloss')
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, x_best, y_train, cv=5,return_times=True)
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='cross-validation score')
plt.legend()
plt.show()
A confusion matrix summarizes the prediction results of a classifier: the counts of correct and incorrect predictions, broken down by class. It shows not just how many errors the classifier makes but, more importantly, which kinds of errors. Note that because RFECV kept only Credit_History, the three tuned models below all learn the same single-feature rule, which is why their confusion matrices and scores turn out identical.
# logistic regression score
clf = LogisticRegression(C=10, penalty='l2',solver='liblinear')
clf.fit(x_best,y_train)
LogisticRegression(C=10, solver='liblinear')
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
plt.show()
accuracy_score(y_test,y_pred)
0.7886178861788617
precision_score(y_test,y_pred)
0.7596153846153846
recall_score(y_test,y_pred)
0.9875
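The accuracy/precision/recall triple above is derived from the four cells of the matrix; confusion_matrix returns the raw counts if you want them numerically (rows are actual classes, columns predicted):
from sklearn.metrics import confusion_matrix
# with the label encoding above, class 0 is 'N' and class 1 is 'Y'
print(confusion_matrix(y_test, y_pred))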
# svm score
clf = SVC(C=1,degree=2, kernel='linear')
clf.fit(x_best,y_train)
SVC(C=1, degree=2, kernel='linear')
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
plt.show()
accuracy_score(y_test,y_pred)
0.7886178861788617
precision_score(y_test,y_pred)
0.7596153846153846
recall_score(y_test,y_pred)
0.9875
# xgb score
clf = XGBClassifier(booster='gblinear', n_estimators=10, eval_metric='mlogloss')
clf.fit(x_best,y_train)
XGBClassifier(base_score=0.5, booster='gblinear', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None,
              enable_categorical=False, eval_metric='mlogloss', gamma=None,
              gpu_id=-1, importance_type=None, interaction_constraints=None,
              learning_rate=0.5, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=10, n_jobs=8, num_parallel_tree=None,
              predictor=None, random_state=0, reg_alpha=0, reg_lambda=0,
              scale_pos_weight=1, subsample=None, tree_method=None,
              validate_parameters=1, verbosity=None)
y_pred = clf.predict(x_test[['Credit_History']])
plot_confusion_matrix(clf, x_test[['Credit_History']], y_test)
plt.show()
accuracy_score(y_test,y_pred)
0.7886178861788617
precision_score(y_test,y_pred)
0.7596153846153846
recall_score(y_test,y_pred)
0.9875
from rfpimp import *  # pip install rfpimp
from sklearn import tree
import dtreeviz
from dtreeviz import clfviz  # note: in dtreeviz 2.x, clfviz was renamed decision_boundaries
# helper: fit LR and SVM on the selected features, then draw their decision
# boundaries side by side (classification regions, instances, misclassified points)
def compare_boundaries(features):
    X = x_train[features]
    y = y_train
    lr = LogisticRegression(C=10, penalty='l2', solver='liblinear').fit(X, y)
    svm = SVC(C=1, degree=2, kernel='linear', probability=True).fit(X, y)
    fig, axes = plt.subplots(1, 2, figsize=(8, 4), dpi=300)
    # show classification regions, not probabilities
    clfviz(lr, X, y, ax=axes[0], show=['instances', 'boundaries', 'misclassified'],
           feature_names=features, target_name='Loan_Status')
    clfviz(svm, X, y, ax=axes[1], show=['instances', 'boundaries', 'misclassified'],
           feature_names=features, target_name='Loan_Status')
    fig.suptitle('Logistic Regression vs Support Vector Machine')
    plt.show()

compare_boundaries(['Total_Income_Log', 'LoanAmountLog'])
compare_boundaries(['Gender', 'Credit_History'])
compare_boundaries(['Total_Income_Log'])
compare_boundaries(['LoanAmountLog'])
compare_boundaries(['Gender'])
compare_boundaries(['Credit_History'])