People analytics¶

Trodecco is a company operating in LA. Over time, the company has found that after recruiting and training employees, many of its brightest and most experienced people leave for other opportunities. Trodecco wants to understand what causes employees to leave and how to retain them. The company compiled data for the past 3 years in the file Retention.csv. These data include (for each employee):

  • Satisfaction: (0.0–1.0) The employee’s self-reported satisfaction level, collected through a recent company survey (1.0 is highest)
  • Last_evaluation: (0.0–1.0) The employee’s latest annual performance evaluation (1.0 is highest)
  • Project: (no. of projects) Average number of projects per month the employee was assigned
  • Monthly_hours: (hours) Average number of hours the employee worked each month
  • Time: (years) Number of years the employee has been with the company
  • Left: (1 = yes, 0 = no) Whether the employee quit
  • Promotion: (1 = yes, 0 = no) Whether the employee was promoted in the last 5 years
  • Department: Department the employee worked in
  • Salary: (low, medium, high) Whether the employee’s salary was in the bottom third, middle third, or top third of salaries for their department

The second file Training_Cost.csv gives the average cost of training a new employee for each department by salary level.

Part 1: Exploratory data analysis (EDA):¶

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as ss
from statsmodels.stats.weightstats import ztest

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
retention = pd.read_csv('Retention.csv', encoding='ISO-8859-1')
retention.head()
Out[3]:
Satisfaction Last_evaluation Project Monthly_hours Time Left Promotion Department Salary
0 0.71 0.36 4 278 4 0 0 product_mng high
1 0.44 0.36 2 136 3 0 0 technical high
2 0.79 0.36 3 114 3 0 0 technical high
3 0.42 0.37 2 284 3 0 0 technical high
4 0.96 0.37 3 111 2 0 0 sales high
In [4]:
trainingCost = pd.read_csv('Training_Cost.csv', encoding='ISO-8859-1')
trainingCost
Out[4]:
Department high medium low
0 accounting 2000 1000 1000
1 hr 1300 900 500
2 IT 4000 2000 1000
3 management 2000 1000 300
4 marketing 4000 3000 600
5 product_mng 3000 2000 1000
6 RandD 2000 2000 1000
7 sales 5000 2000 500
8 support 3000 1000 300
9 technical 4000 2000 1000

How many employees are in the data set?

In [5]:
retention.shape
Out[5]:
(14999, 9)

Are there any missing values?

In [6]:
retention.dropna().shape
Out[6]:
(14999, 9)

Dropping rows with missing values leaves the shape unchanged, so there are none.

Min, average and max satisfaction score in the data set

In [7]:
retention['Satisfaction'].describe()
Out[7]:
count    14999.000000
mean         0.612834
std          0.248631
min          0.090000
25%          0.440000
50%          0.640000
75%          0.820000
max          1.000000
Name: Satisfaction, dtype: float64

Number of departments

In [8]:
retention['Department'].nunique()
Out[8]:
10

How many employees are in each department?

In [9]:
retention.groupby('Department')['Satisfaction'].count()
Out[9]:
Department
IT             1227
RandD           787
accounting      767
hr              739
management      630
marketing       858
product_mng     902
sales          4140
support        2229
technical      2720
Name: Satisfaction, dtype: int64

Average satisfaction score in each department

In [10]:
retention.groupby('Department')['Satisfaction'].mean()
Out[10]:
Department
IT             0.618142
RandD          0.619822
accounting     0.582151
hr             0.598809
management     0.621349
marketing      0.618601
product_mng    0.619634
sales          0.614447
support        0.618300
technical      0.607897
Name: Satisfaction, dtype: float64

Calculate retention rates, i.e., the % of employees that have NOT left, for all departments.

In [11]:
retention_rates = 1 - retention.groupby('Department')['Left'].sum()/retention.groupby('Department')['Left'].count()
retention_rates
Out[11]:
Department
IT             0.777506
RandD          0.846252
accounting     0.734029
hr             0.709066
management     0.855556
marketing      0.763403
product_mng    0.780488
sales          0.755072
support        0.751009
technical      0.743750
Name: Left, dtype: float64
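Since Left is coded 0/1, the same rates can also be computed from the group means; an equivalent one-liner (not run here):

retention_rates = 1 - retention.groupby('Department')['Left'].mean()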
In [12]:
retention['Left'].value_counts()
Out[12]:
0    11428
1     3571
Name: Left, dtype: int64
In [13]:
retention.groupby('Left').mean()
Out[13]:
Satisfaction Last_evaluation Project Monthly_hours Time Promotion
Left
0 0.666810 0.714951 3.786664 199.060203 3.380032 0.026251
1 0.440098 0.720877 3.855503 207.419210 3.876505 0.005321
In [14]:
retention.groupby('Department').mean()
Out[14]:
Satisfaction Last_evaluation Project Monthly_hours Time Left Promotion
Department
IT 0.618142 0.716634 3.816626 202.215974 3.468623 0.222494 0.002445
RandD 0.619822 0.711881 3.853875 200.800508 3.367217 0.153748 0.034307
accounting 0.582151 0.718318 3.825293 201.162973 3.522816 0.265971 0.018253
hr 0.598809 0.710108 3.654939 198.684709 3.355886 0.290934 0.020298
management 0.621349 0.723794 3.860317 201.249206 4.303175 0.144444 0.109524
marketing 0.618601 0.716340 3.687646 199.385781 3.569930 0.236597 0.050117
product_mng 0.619634 0.714823 3.807095 199.965632 3.475610 0.219512 0.000000
sales 0.614447 0.710121 3.776329 200.911353 3.534058 0.244928 0.024155
support 0.618300 0.723194 3.803948 200.758188 3.393001 0.248991 0.008973
technical 0.607897 0.721379 3.877941 202.497426 3.411397 0.256250 0.010294
In [15]:
retention.groupby('Salary').mean()
Out[15]:
Satisfaction Last_evaluation Project Monthly_hours Time Left Promotion
Salary
high 0.637470 0.704325 3.767179 199.867421 3.692805 0.066289 0.058205
low 0.600753 0.717017 3.799891 200.996583 3.438218 0.296884 0.009021
medium 0.621817 0.717927 3.813528 201.338349 3.529010 0.204313 0.028079
In [16]:
pd.crosstab(retention.Department, retention.Left).plot(kind='bar')
plt.title('Turnover Frequency for Department')
plt.xlabel('Department')
plt.ylabel('Frequency of Turnover')
plt.savefig('department_bar_chart')
In [17]:
table=pd.crosstab(retention.Salary, retention.Left)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Salary Level vs Turnover')
plt.xlabel('Salary Level')
plt.ylabel('Proportion of Employees')
plt.savefig('salary_bar_chart')
In [18]:
pd.crosstab(retention.Department, retention.Left)
Out[18]:
Left 0 1
Department
IT 954 273
RandD 666 121
accounting 563 204
hr 524 215
management 539 91
marketing 655 203
product_mng 704 198
sales 3126 1014
support 1674 555
technical 2023 697
In [19]:
num_bins = 10

retention.hist(bins=num_bins, figsize=(20,15))
plt.show()

Which department has the lowest retention rate?

In [20]:
retention_rates.loc['hr']
Out[20]:
0.7090663058186739
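Rather than hard-coding 'hr', the department with the lowest retention rate can be found programmatically; a quick sketch (not run above):

retention_rates.idxmin()   # 'hr'
retention_rates.min()      # about 0.709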

Fill in the training cost for each employee from the Training_Cost data set.

In [21]:
retention['trainingCost'] = trainingCost.set_index('Department').lookup(retention['Department'],retention['Salary'])
retention.head()
Out[21]:
Satisfaction Last_evaluation Project Monthly_hours Time Left Promotion Department Salary trainingCost
0 0.71 0.36 4 278 4 0 0 product_mng high 3000
1 0.44 0.36 2 136 3 0 0 technical high 4000
2 0.79 0.36 3 114 3 0 0 technical high 4000
3 0.42 0.37 2 284 3 0 0 technical high 4000
4 0.96 0.37 3 111 2 0 0 sales high 5000
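Note that DataFrame.lookup has since been removed (pandas 2.0). On newer pandas the same column can be built, as a sketch, by reshaping the cost table to long form and merging:

cost_long = trainingCost.melt(id_vars='Department', var_name='Salary', value_name='trainingCost')
retention = retention.merge(cost_long, on=['Department', 'Salary'], how='left')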

What is the total training cost spent for all the employees?

In [22]:
retention['trainingCost'].sum()
Out[22]:
20567600

What is the percentage spent on those who have left the company?

In [23]:
retention.groupby('Left')['trainingCost'].sum()[1]/retention['trainingCost'].sum()
Out[23]:
0.19397498978976643

Which department spends the largest fraction of its training costs on employees that ultimately leave the company?

In [24]:
retention[retention['Left']==1].groupby('Department')['trainingCost'].sum()/retention.groupby('Department')['trainingCost'].sum()
Out[24]:
Department
IT             0.189955
RandD          0.154545
accounting     0.248514
hr             0.289747
management     0.069547
marketing      0.186834
product_mng    0.209008
sales          0.172518
support        0.174801
technical      0.238479
Name: trainingCost, dtype: float64
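As before, idxmax() can pick out the department directly rather than reading it off the table; a quick sketch (cost_share_left is just an illustrative name, not run above):

cost_share_left = (retention[retention['Left']==1].groupby('Department')['trainingCost'].sum()
                   / retention.groupby('Department')['trainingCost'].sum())
cost_share_left.idxmax()   # 'hr', which spends roughly 29% of its training costs on leavers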

Part 2: A/B testing:¶

There is a belief that the best employees, i.e., the employees with the highest evaluation scores, are the ones leaving the company. Let's take a closer look with A/B testing.

In [25]:
R = retention.loc[retention['Left']==0,'Last_evaluation']
L = retention.loc[retention['Left']==1,'Last_evaluation']
In [26]:
ss.ttest_ind(a=R, b=L, equal_var=False, alternative='greater')
Out[26]:
Ttest_indResult(statistic=-1.6508075726997398, pvalue=0.9505808315672778)

The data do not provide significant evidence that the better-evaluated employees are the ones leaving, at least when all salary levels are pooled together.
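With samples this large, the z-test imported from statsmodels at the top gives essentially the same answer; a quick cross-check (a sketch, not run here, using the pooled-variance z-test and the same one-sided alternative):

ztest(R, L, alternative='larger')   # returns (z statistic, p-value)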

One possibility is that the best employees are leaving because they feel underpaid. If that were true, you might expect a (statistically significant) difference in evaluation scores between the employees who left and the employees who stayed once salary level is taken into account. Start with the “high” salary level.

In [27]:
RH = retention.loc[(retention['Salary']=='high') & (retention['Left']==0), 'Last_evaluation']
LH = retention.loc[(retention['Salary']=='high') & (retention['Left']==1), 'Last_evaluation']
In [28]:
ss.ttest_ind(a=RH, b=LH, equal_var=False, alternative='greater')
Out[28]:
Ttest_indResult(statistic=3.082181370261907, pvalue=0.0013663080445550345)

The data do indicate that we should reject the null here, but note the direction: among high-salary employees, those who stayed have significantly higher evaluation scores than those who left. At the high salary level, then, the employees leaving are not the top performers.

Consider the other salary levels.

In [29]:
RM = retention.loc[(retention['Salary']=='medium') & (retention['Left']==0), 'Last_evaluation']
LM = retention.loc[(retention['Salary']=='medium') & (retention['Left']==1), 'Last_evaluation']
In [30]:
ss.ttest_ind(a=RM, b=LM, equal_var=False, alternative='greater')
Out[30]:
Ttest_indResult(statistic=-3.1063203806486315, pvalue=0.9990382952951038)
In [31]:
RL = retention.loc[(retention['Salary']=='low') & (retention['Left']==0), 'Last_evaluation']
LL = retention.loc[(retention['Salary']=='low') & (retention['Left']==1), 'Last_evaluation']
In [32]:
ss.ttest_ind(a=RL, b=LL, equal_var=False, alternative='greater')
Out[32]:
Ttest_indResult(statistic=0.007113352772731251, pvalue=0.4971624104987519)

The relationship between evaluation scores and turnover differs by salary level, which is why the pooled comparison above was inconclusive.

In other words, "Salary" interacts with the evaluation score: the relationship between Last_evaluation and the dependent variable "Left" runs in opposite directions in different salary groups, so looking at all employees together masked it.

Based on this analysis, the retention problem among high performers is concentrated at the medium salary level: medium-salary employees who left had significantly higher evaluation scores than those who stayed. At the high salary level the pattern reverses (leavers had lower scores), and high-salary employees also leave at the lowest rate overall. This is consistent with the hypothesis that well-evaluated but only moderately paid employees leave because they feel underpaid.

Part 3: Model building:¶

There are two categorical variables in the data set (Department and Salary), and they need to be converted to dummy variables before they can be used for modelling.

In [33]:
retention.head()
Out[33]:
Satisfaction Last_evaluation Project Monthly_hours Time Left Promotion Department Salary trainingCost
0 0.71 0.36 4 278 4 0 0 product_mng high 3000
1 0.44 0.36 2 136 3 0 0 technical high 4000
2 0.79 0.36 3 114 3 0 0 technical high 4000
3 0.42 0.37 2 284 3 0 0 technical high 4000
4 0.96 0.37 3 111 2 0 0 sales high 5000
In [34]:
cat_vars = ['Department', 'Salary']
for var in cat_vars:
    dummies = pd.get_dummies(retention[var], prefix=var)
    retention = retention.join(dummies)

The original categorical columns need to be removed once the dummy variables have been created.

In [35]:
retention.drop(['Department', 'Salary'], axis=1, inplace=True)
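Equivalently, the dummy-variable loop and the drop above can be collapsed into a single call; a sketch (not run here) relying on pandas' default prefixing:

retention = pd.get_dummies(retention, columns=cat_vars)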
In [36]:
retention.columns.values
Out[36]:
array(['Satisfaction', 'Last_evaluation', 'Project', 'Monthly_hours',
       'Time', 'Left', 'Promotion', 'trainingCost', 'Department_IT',
       'Department_RandD', 'Department_accounting', 'Department_hr',
       'Department_management', 'Department_marketing',
       'Department_product_mng', 'Department_sales', 'Department_support',
       'Department_technical', 'Salary_high', 'Salary_low',
       'Salary_medium'], dtype=object)
In [37]:
plt.figure(figsize=(15, 15))
sns.heatmap(retention.corr(), annot=True)
Out[37]:
<AxesSubplot:>
In [38]:
retention_vars=retention.columns.values.tolist()
y=['Left']
X=[i for i in retention_vars if i not in y]
In [39]:
X
Out[39]:
['Satisfaction',
 'Last_evaluation',
 'Project',
 'Monthly_hours',
 'Time',
 'Promotion',
 'trainingCost',
 'Department_IT',
 'Department_RandD',
 'Department_accounting',
 'Department_hr',
 'Department_management',
 'Department_marketing',
 'Department_product_mng',
 'Department_sales',
 'Department_support',
 'Department_technical',
 'Salary_high',
 'Salary_low',
 'Salary_medium']
In [40]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

rfe = RFE(model)
rfe = rfe.fit(retention[X], retention[y])
print(rfe.support_)
print(rfe.ranking_)
[ True  True False False  True  True False False  True False  True  True
 False False False False False  True  True  True]
[ 1  1  2  9  1  1 11  3  1  7  1  1  8  4 10  6  5  1  1  1]
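The boolean support mask can be mapped back onto the candidate column names to list the features RFE kept; a quick sketch (not run above):

selected_features = [col for col, keep in zip(X, rfe.support_) if keep]
print(selected_features)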
In [41]:
cols=['Satisfaction', 'Last_evaluation', 'Time', 
      'Department_IT', 'Department_accounting', 'Department_hr', 'Department_management', 'Salary_high', 'Salary_low', 'Salary_low'] 
X=retention[cols]
y=retention['Left']

Logistic regression model¶

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[43]:
LogisticRegression()
In [44]:
from sklearn.metrics import accuracy_score
print('Logistic regression accuracy: {:.3f}'.format(accuracy_score(y_test, logreg.predict(X_test))))
Logistic regression accuracy: 0.762

Random Forest¶

In [45]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
Out[45]:
RandomForestClassifier()
In [46]:
print('Random Forest Accuracy: {:.3f}'.format(accuracy_score(y_test, rf.predict(X_test))))
Random Forest Accuracy: 0.981

Support Vector Machine¶

In [47]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
Out[47]:
SVC()
In [48]:
print('Support vector machine accuracy: {:.3f}'.format(accuracy_score(y_test, svc.predict(X_test))))
Support vector machine accuracy: 0.898
In [49]:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10)
modelCV = RandomForestClassifier()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
10-fold cross validation average accuracy: 0.980

Precision and recall¶

In [50]:
from sklearn.metrics import classification_report
print(classification_report(y_test, rf.predict(X_test)))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3421
           1       0.96      0.96      0.96      1079

    accuracy                           0.98      4500
   macro avg       0.97      0.97      0.97      4500
weighted avg       0.98      0.98      0.98      4500

In [51]:
y_pred = rf.predict(X_test)
forest_cm = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(forest_cm, annot=True, fmt='d', xticklabels=["Retained", "Left"], yticklabels=["Retained", "Left"])
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('Random Forest')
plt.savefig('random_forest')
In [52]:
print(classification_report(y_test, logreg.predict(X_test)))
              precision    recall  f1-score   support

           0       0.80      0.92      0.85      3421
           1       0.51      0.26      0.34      1079

    accuracy                           0.76      4500
   macro avg       0.65      0.59      0.60      4500
weighted avg       0.73      0.76      0.73      4500

In [53]:
logreg_y_pred = logreg.predict(X_test)
logreg_cm = metrics.confusion_matrix(y_test,logreg_y_pred)
sns.heatmap(logreg_cm, annot=True, fmt='d', xticklabels=["Retained", "Left"], yticklabels=["Retained", "Left"])
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('Logistic Regression')
plt.savefig('logistic_regression')
In [54]:
print(classification_report(y_test, svc.predict(X_test)))
              precision    recall  f1-score   support

           0       0.95      0.91      0.93      3421
           1       0.76      0.85      0.80      1079

    accuracy                           0.90      4500
   macro avg       0.85      0.88      0.87      4500
weighted avg       0.90      0.90      0.90      4500

In [55]:
svc_y_pred = svc.predict(X_test)
svc_cm = metrics.confusion_matrix(y_test, svc_y_pred)
sns.heatmap(svc_cm, annot=True, fmt='d', xticklabels=["Retained", "Left"], yticklabels=["Retained", "Left"])
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('Support Vector Machine')
plt.savefig('support_vector_machine')

When an employee left, how often does the classifier predict that correctly? This measure is called "recall", and a quick look at these diagrams shows that the random forest is clearly the best on this criterion. Out of all the turnover cases, the random forest correctly identified 1035 out of 1079, a turnover recall of about 96%, far better than logistic regression (26%) or the support vector machine (85%).

When a classifier predicts an employee will leave, how often does that employee actually leave? This measure is called "precision". The random forest again outperforms the other two, at about 96% precision (1035 out of 1077), with logistic regression at about 51% (281 out of 556) and the support vector machine at about 76% (917 out of 1213).
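Both measures can be read straight off a confusion matrix: for the positive ("left") class, recall = TP / (TP + FN) and precision = TP / (TP + FP). A short sketch for the random forest (not run above):

tn, fp, fn, tp = metrics.confusion_matrix(y_test, rf.predict(X_test)).ravel()
recall = tp / (tp + fn)      # share of actual leavers the model catches
precision = tp / (tp + fp)   # share of predicted leavers who actually left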

In [56]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])

rf_roc_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('ROC')
plt.show()
In [57]:
feature_labels = np.array(['Satisfaction', 'Last_evaluation', 'Time', 
      'Department_IT', 'Department_accounting', 'Department_hr', 'Department_management', 'Salary_high', 'Salary_low', 'Salary_low'])
importance = rf.feature_importances_
feature_indexes_by_importance = importance.argsort()
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))
Department_management-0.25%
Department_IT-0.25%
Department_hr-0.26%
Department_accounting-0.34%
Salary_low-0.52%
Salary_low-0.59%
Salary_high-0.83%
Last_evaluation-19.60%
Time-25.82%
Satisfaction-51.55%
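Because X_train is a DataFrame, the importances can also be paired with its column names directly, which avoids maintaining a separate label array; a sketch (not run here):

pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)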