African Coding Network Syllabus > Projects > Data Wrangling

project: Data Wrangling

Story points	5
Tags	data-wrangling jupyter-notebooks
Hard Prerequisites
IMPORTANT: Please review these prerequisites, they include important information that will help you with this content.
TOPICS: Data validation and quality control TOPICS: Jupyter notebooks best practices TOPICS: Data Ethics and Privacy PROJECTS: Assertive programming helpers for Pandas

Please make use of Jupyter notebooks while doing this project.

Key concepts

merging data frames
filtering data frames
manipulating rows and columns

Background materials

Pandas content - https://www.youtube.com/watch?v=tRKeLrwfUgU&ab_channel=NicholasRenotte
Intro to pandas and data manipulation here
https://towardsdatascience.com/data-wrangling-with-pandas-5b0be151df4e
https://towardsdatascience.com/7-must-know-data-wrangling-operations-with-python-pandas-849438a90d15
How to handle missing data - https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
If you’d like more, complete the DataCamp skills track Data Manipulation with Python

About the dataset and assignment

This data contains personality scores for learners, plus the department they applied for.

Prerequisites

You should be able to write basic functions and for loops for this assignment. You should also be familiar with merging, filtering and creating new columns in pandas.

Optional: As far as possible, use functional programming techniques (map, reduce, apply) instead of loops when writing the functions below.

For example, to modify every column in a data frame (to get a percentage in this case), instead of writing:

for column in df:
  column = column/10*100 #get percentage

use:

def get_percentage(score):
  score/10*100

df.apply(get_percentage, axis = 1) #axis=1 applies the function to all columns

Instructions

Your repo should contain everything needed to replicate your work. It’s good practice to structure your files well, so we’ll expect you to have a separate directory for “data” and “notebook”, so that your final file structure looks something like this:

├──data
│   ├──departments.csv
│   └──personality_scores.csv 
├──notebook
│  └──data_wrangling.ipynb
├──README.md
├──requirements.txt
└──.gitignore

Load the dataset personality_scores.csv into a DataFrame.

Examine the DataFrame for duplicates (based on ID), and drop any duplicates that exist. The data is ordered chronologically, please keep the most recent data entry.
From this recent project: PROJECT: Assertive programming helpers for Pandas , use the methods you practiced to assert that the updated DataFrame has the dimensions you expect it to have.
Perform some final cleaning steps by removing empty columns and cleaning up column names like this: Section 5 of 6 [I am always prepared.] becomes I am always prepared. Store this updated version of the DataFrame in a variable called: personality_score_df.

Create new columns containing the total score of each of the personality test subscales.

Write a function (or functions) to calculate the total score for each subscale as defined in scoring. The new columns should be named Conscientiousness, Emotional stability, Openness to experience, Agreeableness, Extraversion. In other words, for the Conscientiousness total score, all items marked as belonging to that subscale should be summed.
The new data frame should be named personality_score_totals_df and will look something like this:

I am always prepared	I am easily disturbed	I am exacting (demanding) in my work	…	Conscientiousness	Neuroticism
(3, 5)	(4, 5)	(3, 5)	…	10	5
(3, 5)	(4, 1)	(3, 1)	…	6	1
(3, 5)	(4, 3)	(3, 3)	…	8	3

Load the data from departments.csv into a DataFrame.

Merge this DataFrame with personality_score_totals_df.
Again, use assertion techniques from PROJECT: Assertive programming helpers for Pandas to check that the newly created merged DataFrame has the expected number of rows and columns.
The merged DataFrame should be named merged_personality_department_df

Filter the merged DataFrame so that you only see the applicants who scored less than 30 on emotional stability, conscientiousness AND agreeableness. Next, assign these applicants the tag “High risk” in a new column called Risk Status. All other applicants get the tag “Low risk”. The DataFrame here should be named risk_status_df.
Wrangle a new DataFrame with a count of the number of low and high risk applicants within each department. Let each department be a separate column. This new DataFrame should be named risk_status_summary_df. If there are no learners in one of the categories, it should be represented by zero and not a null entry. The DataFrame should look something like this:

Risk Status	Copywriting	Data	Design	Strategy	Web Dev
Low risk	150	123	0	4	6
High risk	40	0	22	67	9