Story points | 3 |
Tags | linear-regression data-analysis statistics |
Hard Prerequisites | |
IMPORTANT: Please review these prerequisites, they include important information that will help you with this content. | |
|
This week is all about one-hot encoding and multiple regression.
We will predict employee salaries from different employee characteristics (or features).
Import the data salary.csv to a Jupyter Notebook. A description of the variables is given in [Salary metadata.csv](Salary metadata.csv). You will need the packages matplotlib
/ seaborn
, pandas
and statsmodels
.
Perform some exploratory data analysis (EDA) by creating appropriate plots (e.g scatterplots and histograms) to visualise and investigate relationships between dependent variables and the target/independent variable (salary).
Perform some basic features engineering by one-hot encoding the variable Field into three dummy variables, using HR as the reference category. You can use pandas’ get_dummies()
function for this (refer to “Background materials 1-3”).
Perform correlation and statistical significance analysis to validate the relationship salary to each of the potential predictor variables:
Conduct some basic feature selection tasks by aggregating results from EDA, correlation matrix and p-values. Justify your feature selection decisions.
Train model: Split your data into a training and test set. Fit a multiple linear regression model using a training dataset with corresponding features selected above
Evaluate model: Run your model on the test set.
resid()
) and standardised predicted values (fittedvalues()
).residplot
with predicted values as the x parameter, and the actual values as y, specify lowess=True
.Benchmark with cross-validation model
Data is made up and inspired by Cohen, Cohen, West & Aiken. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition.
All relevant instructions from the simple-linear-regression project also apply here.
EDA means actually analysing the data; it’s not enough to create a graph, there needs to be commentary on what it reveals about the data.
For the one-hot encoding, make sure that HR is used as the reference category.
An understanding of multi-collinearity should be clearly demonstrated and checked with any of the standard techniques.
A common issue is incorrect interpretation of the residuals plot. Make sure interpretation demonstrates understanding for what the residual tell us about the appropriateness of the model.