Story points | 13 |
Tags | stats |
Hard Prerequisites | IMPORTANT: Please review these prerequisites; they include important information that will help you with this content. |
At the end of this assignment you should be able to:
Complete the DataCamp courses Statistical Thinking in Python Part 1 and Statistical Thinking in Python Part 2.
The Millennium Development Goals were a set of 8 goals for 2015 that were defined by the United Nations to help improve living conditions and the conditions of our planet. Key indicators were defined for each of these goals, to see whether they were being met. We will have a look at some of the key indicators from Goal 7: Ensure environmental sustainability, namely carbon dioxide (CO2) emissions, protected land and sea areas, and forests. The full dataset can be found at http://mdgs.un.org/ .
It’s good practice to structure your files well, so we expect you to have separate directories for “data” and “notebook”, so that your final file structure looks something like this:
├──data
│ └──MDG_Export_20191227.csv
├──notebook
│ └──statistical_thinking.ipynb
├──README.md
├──requirements.txt
└──.gitignore
`matplotlib`, `numpy`, `seaborn`, `pandas` and `scipy`.
`number_of_countries`.
`missing_values_by_country_df` with column names `Country` and `missing_values_count`
`missing_values_by_year_df` with column names `Year` and `missing_values_count`
`missing_values_by_series_df` with column names `Series` and `missing_values_count`
`mdg_df`.
`top_countries_co2_emmissions_1990_df` and `bottom_countries_co2_emmissions_1990_df` with columns `Country` and `co2_emissions` and order the data from highest to lowest for `top_countries_co2_emmissions_1990_df` and from lowest to highest for `bottom_countries_co2_emmissions_1990_df`. Create similarly named data frames for the emissions in 2011. How have these emissions changed compared with 1990?
`mean_co2_emmisions_1990` and `median_co2_emmisions_1990` respectively. Why do you think these values differ?
`minimum_co2_emmisions_1990`, `maximum_co2_emmisions_1990` and `iqr_co2_emissions_1990` respectively. Using this information, as well as the mean and median calculated previously for this year, explain what this tells us about the distribution of CO2 emissions.
`std_co2_emmisions_1990` and `stderr_co2_emmisions_1990` respectively. How is the standard error different from the standard deviation?
`mean_land_area_covered_forest_1990` and `std_land_area_covered_forest_1990` respectively. Why do you think the standard deviation is so large?
`seaborn.regplot` to show the relationship between the proportion of land area covered by forest and the percentage of area protected in 2000.
`log_transformed_land_area_covered_2000_df` with column names `Country` and `log_transformed_forested_land_area_value`
`log_transformed_protected_area_2000_df` with column names `Country` and `log_transformed_protected_area_value`.
`pearsonr` function from the `scipy.stats` module, calculate the Pearson correlation coefficient and its corresponding p-value. Save this answer in a variable called `pearson_correlation_coefficient_1990`. The p-value here should be saved in a variable called `pearson_p_value_1990`. See `help(pearsonr)` for help on this function.
`spearman_correlation_coefficient_1990`. The p-value here should be saved in a variable called `spearman_p_value_1990`. This test only looks at the rank order of the observations, not their actual values. The Spearman rank-order coefficient is therefore much less influenced by non-normality of variables or by outliers. How do the results of this test compare with the results of the Pearson correlation?
Learners should understand the difference between a missing value and a NaN value. It seems that students use the terminology interchangeably.
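To illustrate the distinction, here is a minimal sketch on made-up data. The column names, the placeholder strings and the `toy_` variable names are assumptions for illustration only and do not reflect the real MDG export. It shows that a value can be missing (an empty string or a dash) without being NaN, and that NaN only appears once pandas is asked to treat the column as numeric.

```python
import pandas as pd

# Made-up data: the layout and the placeholders "" and "-" are assumptions.
toy_raw = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Year": [1990, 2011, 1990, 2011, 1990, 2011],
    "Value": ["1.5", "", "2.0", "-", "3.2", "4.1"],
})

# While the column holds strings, the missing entries are not NaN.
nan_count_before = int(toy_raw["Value"].isna().sum())

# Coerce to numbers: anything that cannot be parsed becomes NaN.
toy_raw["Value"] = pd.to_numeric(toy_raw["Value"], errors="coerce")
nan_count_after = int(toy_raw["Value"].isna().sum())

# Count NaN values per country in a two-column data frame.
toy_missing_by_country = (
    toy_raw["Value"].isna()
    .groupby(toy_raw["Country"])
    .sum()
    .rename("missing_values_count")
    .reset_index()
)

print(nan_count_before, nan_count_after)
print(toy_missing_by_country)
```

A learner who can explain why the two counts differ has understood that NaN is just one way of encoding a missing value.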
Learners need to be able to make the correct deductions following commands such as `.info()` and `.describe()`. Sometimes learners will use these commands since they know it is required of them, but they are not entirely comfortable with what is presented to them by these commands.
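As a rough sketch of the kind of deduction meant here, the example below runs both commands on made-up numbers (not MDG data; all names are hypothetical) and reads the summary statistics directly off the `.describe()` output.

```python
import numpy as np
import pandas as pd

# Made-up, deliberately skewed values with one missing entry.
toy_df = pd.DataFrame({
    "Country": list("ABCDEFG"),
    "co2_emissions": [1.0, 2.0, 2.0, 3.0, 4.0, 100.0, np.nan],
})

# .info() reports dtypes and non-null counts: 6 of 7 emissions are present.
toy_df.info()

# .describe() ignores NaN and uses the sample standard deviation (ddof=1).
summary = toy_df["co2_emissions"].describe()
toy_mean = summary["mean"]
toy_median = summary["50%"]
toy_iqr = summary["75%"] - summary["25%"]
toy_std = summary["std"]
# Standard error of the mean: standard deviation divided by sqrt(n).
toy_stderr = toy_std / np.sqrt(summary["count"])

print(summary)
print("mean > median:", toy_mean > toy_median)
```

Here a single large value pulls the mean far above the median, which is the sort of observation a learner should be able to make from the output without being prompted.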
How do learners answer basic questions such as ‘How many different countries are represented?’ or ‘Which are the top and bottom 5 countries in terms of CO2 emissions in 1990, and what are their emissions?’ If the answer relies on endless `for` loops and `if` statements, the learner is not yet comfortable with standard pandas functionality such as `.nlargest()`, `.str.contains()` and `.groupby()`.
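For comparison, a loop-free approach might look like the sketch below. The long-format layout, the column names and the series labels are assumptions made for this example, so the real export will need its own filtering.

```python
import pandas as pd

# Made-up long-format data; the real file may be laid out differently.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "Series": ["CO2 emissions"] * 4 + ["Forest cover"] * 4,
    "Year": [1990] * 8,
    "Value": [50.0, 400.0, 10.0, 120.0, 30.0, 45.0, 60.0, 5.0],
})

# How many different countries are represented?
toy_number_of_countries = toy["Country"].nunique()

# Select the CO2 rows for 1990 with a boolean mask instead of a loop.
is_co2 = toy["Series"].str.contains("CO2", regex=False)
toy_co2_1990 = (
    toy.loc[is_co2 & (toy["Year"] == 1990), ["Country", "Value"]]
    .rename(columns={"Value": "co2_emissions"})
)

# Top and bottom two emitters, already sorted in the required order.
toy_top = toy_co2_1990.nlargest(2, "co2_emissions").reset_index(drop=True)
toy_bottom = toy_co2_1990.nsmallest(2, "co2_emissions").reset_index(drop=True)

# One aggregate per indicator without writing a loop.
toy_mean_by_series = toy.groupby("Series")["Value"].mean()

print(toy_number_of_countries)
print(toy_top)
print(toy_bottom)
print(toy_mean_by_series)
```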
Any plots should be neat and easily readable. Each plot must have a title, the axis labels should be in bold, and the x and y axes should make sense (add multipliers if necessary).
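One possible way to meet these requirements with `seaborn.regplot` is sketched below. The data is randomly generated and the column names are hypothetical; only the title and label handling is the point of the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real MDG values.
rng = np.random.default_rng(42)
forest = rng.uniform(0, 90, size=50)
protected = 0.2 * forest + rng.normal(0, 5, size=50)
toy_df = pd.DataFrame({"forest_pct": forest, "protected_pct": protected})

fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(data=toy_df, x="forest_pct", y="protected_pct", ax=ax)

# A title, bold axis labels, and units stated on each axis.
ax.set_title("Protected area vs forest cover (made-up data)")
ax.set_xlabel("Land area covered by forest (%)", fontweight="bold")
ax.set_ylabel("Area protected (%)", fontweight="bold")
fig.tight_layout()
```

In a notebook the backend line can be dropped, since the figure is displayed inline.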
Pearson coefficient, Spearman coefficient, correlation coefficient, p-value, null hypothesis and alternative hypothesis: this is where most learners have major difficulty. They can find the answers to the questions through the code easily enough, but they cannot clearly and simply explain what these terms mean and how they all work together. When should a null hypothesis be rejected, and when do we fail to reject it? If the p-value is small, but the null hypothesis is stated as a positive rather than a negative, should it still be rejected? These are the kinds of questions learners should be comfortable answering.
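The sketch below shows how these pieces fit together on made-up data. The significance level of 0.05 and the helper `decide` are assumptions for this example, not part of the assignment. For both `pearsonr` and `spearmanr`, the null hypothesis is that the two variables are not correlated, so a small p-value is evidence against "no correlation".

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: y always increases with x, but not along a straight line.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2)

toy_pearson_r, toy_pearson_p = pearsonr(x, y)
toy_spearman_rho, toy_spearman_p = spearmanr(x, y)


def decide(p_value, alpha=0.05):
    """Hypothetical helper: state the decision about the null hypothesis."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print("Pearson: ", toy_pearson_r, toy_pearson_p, decide(toy_pearson_p))
print("Spearman:", toy_spearman_rho, toy_spearman_p, decide(toy_spearman_p))
```

Because the relationship is perfectly monotonic, the Spearman coefficient is 1, while the Pearson coefficient is lower because the relationship is not linear. Note that the decision rule depends only on the p-value and the chosen significance level; rewording the null hypothesis changes what is being rejected, not the rule itself.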