| Story points | 13 |
| Tags | stats |
| Hard Prerequisites | |
| IMPORTANT: Please review these prerequisites; they include important information that will help you with this content. | |
|
|
At the end of this assignment you should be able to apply the techniques covered in the DataCamp courses Statistical Thinking in Python (Part 1) and Statistical Thinking in Python (Part 2), both of which you should complete before starting.
The Millennium Development Goals were a set of 8 goals for 2015, defined by the United Nations to help improve living conditions and the condition of our planet. Key indicators were defined for each goal to track whether it was being met. We will look at some of the key indicators from Goal 7: Ensure environmental sustainability, namely carbon dioxide (CO2) emissions, protected land and sea areas, and forests. The full dataset can be found at http://mdgs.un.org/.
It’s good practice to structure your files well, so we expect you to have separate `data` and `notebook` directories, with a final file structure that looks something like this:
```
├── data
│   └── MDG_Export_20191227.csv
├── notebook
│   └── statistical_thinking.ipynb
├── README.md
├── requirements.txt
└── .gitignore
```
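The `requirements.txt` should list the libraries the assignment uses (named below); a minimal, unpinned example might look like:

```
matplotlib
numpy
pandas
scipy
seaborn
```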
For this assignment you will need the following libraries: matplotlib, numpy, seaborn, pandas and scipy.

- Load the dataset into a data frame called `mdg_df`.
- Save the number of different countries represented in a variable called `number_of_countries`.
- Summarise the missing values in three data frames: `missing_values_by_country_df` with column names `Country` and `missing_values_count`, `missing_values_by_year_df` with column names `Year` and `missing_values_count`, and `missing_values_by_series_df` with column names `Series` and `missing_values_count`.
- Save the top and bottom countries by CO2 emissions in 1990 in `top_countries_co2_emmissions_1990_df` and `bottom_countries_co2_emmissions_1990_df`, each with columns `Country` and `co2_emissions`, and order the data from highest to lowest for `top_countries_co2_emmissions_1990_df` and from lowest to highest for `bottom_countries_co2_emmissions_1990_df`. Create similarly named data frames for the emissions in 2011. How have these emissions changed compared with 1990?
- Save the mean and median CO2 emissions for 1990 in `mean_co2_emmisions_1990` and `median_co2_emmisions_1990` respectively. Why do you think these values differ?
- Save the minimum, maximum and interquartile range of the 1990 CO2 emissions in `minimum_co2_emmisions_1990`, `maximum_co2_emmisions_1990` and `iqr_co2_emissions_1990` respectively. Using this information, as well as the mean and median calculated previously for this year, explain what this tells us about the distribution of CO2 emissions.
- Save the standard deviation and standard error of the 1990 CO2 emissions in `std_co2_emmisions_1990` and `stderr_co2_emmisions_1990` respectively. How is the standard error different from the standard deviation?
- Save the mean and standard deviation of the proportion of land area covered by forest in 1990 in `mean_land_area_covered_forest_1990` and `std_land_area_covered_forest_1990` respectively. Why do you think the standard deviation is so large?
- Use `seaborn.regplot` to show the relationship between the proportion of land area covered by forest and the percentage of area protected in 2000.
- Log-transform these two variables and save the results in `log_transformed_land_area_covered_2000_df` with column names `Country` and `log_transformed_forested_land_area_value`, and `log_transformed_protected_area_2000_df` with column names `Country` and `log_transformed_protected_area_value`.
- Using the `pearsonr` function from the `scipy.stats` module, calculate the Pearson correlation coefficient and its corresponding p value. Save the coefficient in a variable called `pearson_correlation_coefficient_1990` and the p value in a variable called `pearson_p_value_1990`. See `help(pearsonr)` for help on this function.
- Calculate the Spearman rank-order correlation coefficient and its corresponding p value, and save them in `spearman_correlation_coefficient_1990` and `spearman_p_value_1990` respectively. This test only looks at the order of the observations, not their values, so the Spearman rank-order coefficient is not influenced by non-normality of variables or outliers. How do the results of this test compare with the results of the Pearson correlation?

Learners should understand the difference between a missing value and a NaN value; students often use the terminology interchangeably.
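The missing-value summaries can be built without loops. A minimal sketch, using a toy frame that stands in for the real `mdg_df` (the column layout is assumed and the values are made up):

```python
import pandas as pd

# Toy frame standing in for the real mdg_df loaded from
# data/MDG_Export_20191227.csv (column layout assumed, values made up).
mdg_df = pd.DataFrame({
    "Country": ["Brazil", "Brazil", "Chad"],
    "Year": [1990, 2011, 1990],
    "Series": ["CO2 emissions", "CO2 emissions", "Forest area"],
    "Value": [210.0, None, None],
})

# Missing values are NaN in pandas; .isna() flags them and the groupby
# sums the flags per country. reset_index yields the required columns.
missing_values_by_country_df = (
    mdg_df["Value"].isna()
    .groupby(mdg_df["Country"])
    .sum()
    .reset_index(name="missing_values_count")
)
print(missing_values_by_country_df)
```

The same pattern, grouping by `Year` or `Series` instead of `Country`, produces the other two data frames.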
Learners need to be able to make the correct deductions from the output of commands such as `.info()` and `.describe()`. Sometimes learners use these commands because they know it is required of them, but they are not entirely comfortable with what the commands present to them.
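For reference, a small sketch of what these two commands report (the frame and its values are illustrative, not real MDG figures):

```python
import pandas as pd

# Toy frame standing in for mdg_df; the values are illustrative only.
df = pd.DataFrame({"Country": ["Brazil", "Chad", "Chad"],
                   "co2_emissions": [210.0, None, 1.3]})

df.info()                # dtypes and non-null counts: 2 non-null of 3 rows => 1 missing
summary = df.describe()  # count, mean, std and quartiles for numeric columns
print(summary.loc["count", "co2_emissions"])  # 2.0 — NaN rows are excluded
```

The key deduction is that `.info()` reveals missing data via the non-null counts, while `.describe()` silently drops NaN rows from its statistics.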
Look at how learners answer basic questions such as ‘How many different countries are represented?’ or ‘Which are the top and bottom 5 countries in terms of CO2 emissions in 1990, and what are their emissions?’. If the answer is a tangle of for loops and if statements, the learner is not comfortable using standard pandas functionality such as `.nlargest()`, `.str.contains()` and `.groupby()`.
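For example, the top/bottom-5 question reduces to one call each with `.nlargest()` and `.nsmallest()` (the figures below are made up, not real MDG data):

```python
import pandas as pd

# Illustrative 1990 emissions, standing in for the real data.
co2_1990_df = pd.DataFrame({
    "Country": ["USA", "China", "Russia", "Malta", "Chad", "Japan", "Germany"],
    "co2_emissions": [4800.0, 2400.0, 2200.0, 2.2, 1.3, 1100.0, 950.0],
})

top_df = co2_1990_df.nlargest(5, "co2_emissions")      # highest to lowest
bottom_df = co2_1990_df.nsmallest(5, "co2_emissions")  # lowest to highest
print(top_df["Country"].tolist())
# → ['USA', 'China', 'Russia', 'Japan', 'Germany']
```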
Any plots should be neat and easily readable: each plot must have a title, the axis labels should be in bold, and the x and y axes should make sense (add multipliers to the units if necessary).
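A minimal matplotlib sketch of a plot that meets these requirements (the data and the output file name are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Illustrative numbers only, standing in for the 2000 forest/protected-area series.
forest_pct = [10, 20, 30, 40]
protected_pct = [5.0, 7.5, 6.0, 9.0]

fig, ax = plt.subplots()
ax.scatter(forest_pct, protected_pct)
ax.set_title("Protected area vs forest cover, 2000 (illustrative data)")
ax.set_xlabel("Land area covered by forest (%)", fontweight="bold")
ax.set_ylabel("Protected area (%)", fontweight="bold")
fig.savefig("forest_vs_protected.png")
```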
Pearson coefficient, Spearman coefficient, correlation coefficient, p-value, null hypothesis and alternative hypothesis: this is where most learners have major difficulty. They can find the answers to the questions through code easily enough, but they cannot clearly and simply explain what these terms mean and how they work together. When should a null hypothesis be accepted or rejected? If the p-value is small, but the null hypothesis is stated as a positive claim rather than a negative one, should it still be rejected? These are the kinds of questions learners should be comfortable answering.
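A short sketch of the two tests on synthetic data, showing how the coefficient and p-value relate to the null hypothesis (the data here is generated for illustration, not taken from the MDG dataset):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = 0.5 * x + rng.normal(0, 5, size=50)

r, p = pearsonr(x, y)        # linear correlation on the raw values
rho, p_s = spearmanr(x, y)   # rank-based; robust to outliers and skew

# Null hypothesis: no correlation between x and y. A small p-value
# (conventionally < 0.05) means the observed correlation would be very
# unlikely under that null, so we reject it.
print(f"Pearson r={r:.2f} (p={p:.3g}), Spearman rho={rho:.2f} (p={p_s:.3g})")
```

Because the relationship here is monotonic and roughly linear, both coefficients come out strongly positive; on skewed data with outliers the two can diverge, which is exactly the comparison the assignment asks about.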