project: Multivariate Linear Regression

Story points: 3
Tags: linear-regression data-analysis statistics
Hard Prerequisites
IMPORTANT: Please review these prerequisites; they include important information that will help you with this content.
  • PROJECTS: Statistical Thinking
  • PROJECTS: Cross-validation & Simple Linear Regression
  • TOPICS: Data Science Methodology
  • TOPICS: Jupyter notebooks best practices

  • This week is all about one-hot encoding and multiple regression.

    Background materials

    1. Robust One-Hot Encoding in Python
    2. Feature Engineering and Selection ebook
    3. One-hot encoding multicollinearity and the dummy variable trap
    4. Emulating R Regression Plots in Python
    5. Statsmodels Regression Plot
    6. Building and evaluating models
    7. Test/train splits and cross-validation
    8. Interpreting residual plots (Stat Trek and Statwing)

    Assignment

    We will predict employee salaries from a number of employee characteristics (features). Import the data salary.csv into a Jupyter Notebook. A description of the variables is given in [Salary metadata.csv](Salary metadata.csv). You will need the packages matplotlib / seaborn, pandas and statsmodels.

    Steps and questions

    1. Perform some exploratory data analysis (EDA) by creating appropriate plots (e.g. scatterplots and histograms) to visualise and investigate relationships between the independent (predictor) variables and the dependent/target variable (salary).

      • Create a descriptive statistics table to further characterise and describe the population under investigation.
      • Which variables seem like good predictors of salary?
      • Do any of the variables need to be transformed to be able to use them in a linear regression model?
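A minimal sketch of this EDA step is below. The column names (`YearsExperience`, `Salary`) and values are placeholder assumptions, not taken from the actual salary.csv:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for salary.csv -- column names are assumptions.
df = pd.DataFrame({
    "YearsExperience": [1, 3, 5, 7, 10, 12],
    "Salary": [42000, 48000, 55000, 61000, 72000, 80000],
})

# Descriptive statistics table for the numeric variables.
stats = df.describe()
print(stats)

# Scatterplot of a candidate predictor against the target, plus a
# histogram of the target to inspect its distribution.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(df["YearsExperience"], df["Salary"])
ax1.set_xlabel("Years of experience")
ax1.set_ylabel("Salary")
ax2.hist(df["Salary"], bins=5)
ax2.set_xlabel("Salary")
fig.savefig("eda.png")
```

Remember that the plots alone are not the analysis: each one should be accompanied by commentary on what it reveals.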
    2. Perform some basic feature engineering by one-hot encoding the variable Field into three dummy variables, using HR as the reference category. You can use pandas’ get_dummies() function for this (refer to “Background materials 1-3”).
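A sketch of the encoding with `get_dummies()`; the Field levels used here (IT, Finance, Sales alongside HR) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical slice of the data -- the Field levels are assumptions.
df = pd.DataFrame({"Field": ["HR", "IT", "Finance", "Sales", "HR"]})

# One-hot encode Field, then drop the HR column so that HR becomes the
# reference category; keeping every dummy alongside an intercept would
# cause perfect multicollinearity (the dummy variable trap).
dummies = pd.get_dummies(df["Field"], prefix="Field", dtype=int)
df = pd.concat([df, dummies.drop(columns="Field_HR")], axis=1)
print(df)
```

With four Field levels this leaves exactly three dummy columns, and a row from an HR employee is encoded as all zeros.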

    3. Perform correlation and statistical significance analysis to validate the relationship of salary to each of the potential predictor variables:

      • Calculate the Pearson correlation coefficients and plot the corresponding correlation matrix
      • Calculate the p-values associated with the Pearson correlation coefficients
      • Address any problems that may adversely affect the multiple regression (e.g. multicollinearity)
    4. Conduct some basic feature selection by aggregating results from the EDA, the correlation matrix and the p-values. Justify your feature selection decisions.

    5. Train model: Split your data into a training and a test set, then fit a multiple linear regression model on the training set using the features selected above.

      • Use the fitted multiple linear regression model to predict salary on the training dataset.
      • Interpret the standardised coefficients given in the statsmodels output.
      • What are the most important features when predicting employee salary?
    6. Evaluate model: Run your model on the test set.

      • Calculate and explain the significance of the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-squared values for your model
      • Calculate the standardised residuals and standardised predicted values (available on the fitted statsmodels results via the resid and fittedvalues attributes).
      • Plot the residuals versus the predicted values using seaborn’s residplot, with the predicted values as the x parameter and the actual values as y; specify lowess=True.
      • Are there any problems with the regression?
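The evaluation metrics and the residual plot can be sketched as below; the target values and predictions are made-up placeholders, not model output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import seaborn as sns

# Hypothetical test-set targets and model predictions.
y_true = np.array([52000, 61000, 45000, 70000, 58000], dtype=float)
y_pred = np.array([50000, 63000, 47000, 68000, 59000], dtype=float)

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average absolute error, in salary units
mse = np.mean(errors ** 2)      # penalises large errors more heavily
rmse = np.sqrt(mse)             # back in salary units
# R^2: proportion of variance in the target explained by the model.
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"MAE={mae:.0f}, RMSE={rmse:.0f}, R^2={r2:.3f}")

# Residuals vs predicted values; lowess=True overlays a smoothed trend.
ax = sns.residplot(x=y_pred, y=y_true, lowess=True)
```

A well-behaved model shows residuals scattered randomly around zero with no visible pattern; curvature or a funnel shape suggests the linear model is misspecified.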
    7. Benchmark with cross-validation model

      • Perform cross-validation using the training dataset, then test and evaluate the cross-validated model on the test data
      • Compare the performance of the cross-validated model (which is less prone to over-fitting) with that of your original model to determine whether the developed model has overfitted
      • Does it seem like you have a reasonably good model?
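One way to sketch this benchmark is with scikit-learn's cross_val_score (an assumption; the background materials may prescribe a different tool). The data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 20, (100, 1))
y = 30000 + 2500 * X[:, 0] + rng.normal(0, 3000, 100)

# 5-fold cross-validated R^2 on the training data. A mean CV score far
# below the single train-set R^2 is a sign the model has overfitted.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R^2: mean={scores.mean():.3f}, std={scores.std():.3f}")
```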

    References

    The data is fabricated, inspired by Cohen, Cohen, West & Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd edition.

    Instructions for reviewer

    1. All relevant instructions from the simple-linear-regression project also apply here.

    2. EDA means actually analysing the data; it’s not enough to create a graph, there needs to be commentary on what it reveals about the data.

    3. For the one-hot encoding, make sure that HR is used as the reference category.

    4. An understanding of multicollinearity should be clearly demonstrated and checked with any of the standard techniques.

    5. A common issue is incorrect interpretation of the residuals plot. Make sure the interpretation demonstrates an understanding of what the residuals tell us about the appropriateness of the model.

