End Interactive Session 6
::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
knitr
library(tidyverse)
library(here)
library(janitor)
Data wrangling: tidying with tidyr
, dplyr
and pandas
tidyverse
, here
, and janitor
packagesData for these examples are from:
The data files we’ll use today are in the data
subfolder of the project. They are:
wb_indicators.csv
: a CSV containing data for select development indicators for countries in the World Bank database, from 2001 - 2020wb_indicators_metadata.csv
: a CSV containing metadata information for the development indicatorsALWAYS ALWAYS ALWAYS look at what you’ve done.
We see that years - a single variable - is spread out over multiple columns. We’ll want to reshape this data frame from wide-to-long format so that year is only in a single column to make it Tidy.
tidyr::pivot_longer()
wb_indicators_long <- wb_indicators %>%
pivot_longer(cols = '2001 [YR2001]':'2020 [YR2020]', # Which columns to squish
names_to = "year", # The original column names are squished into this column
values_to = "indicator_value") # The values are correctly aligned in this column
# Check it out (always):
# View(wb_indicators_long) # Why do I comment this out or run it in the Console?
melt()
wb_indicators_long = wb_indicators.melt(id_vars = ['Country Name', 'Country Code', 'Series Name', 'Series Code'],
var_name = 'year',
value_name = 'indicator_value')
# Check it out:
wb_indicators_long.head
<bound method NDFrame.head of Country Name ... indicator_value
0 Afghanistan ... 9.51
1 Afghanistan ... NaN
2 Afghanistan ... 810.00
3 Afghanistan ... NaN
4 Afghanistan ... NaN
... ... ... ...
26695 NaN ... NaN
26696 NaN ... NaN
26697 NaN ... NaN
26698 Data from database: World Development Indicators ... NaN
26699 Last Updated: 07/21/2021 ... NaN
[26700 rows x 6 columns]>
(26700, 6)
We can see that the year is stored in a weird format (e.g. 2018 [YR2018]
) that’s going to make our life difficult if we want to, for example, use year as a number to plot changes in the indicator values over time.
Let’s separate the information in the year
column so that we can just keep the nice 4-digit year as a number.
tidyr::separate()
wb_data_clean <- wb_indicators_long %>%
tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>%
dplyr::select(-year_chr, -'Country Code', -'Series Code') # This drops a few redundant columns (caution here...best to leave things if you're not sure)
head(wb_data_clean)
# A tibble: 6 × 4
`Country Name` `Series Name` year indicator_value
<chr> <chr> <chr> <dbl>
1 Afghanistan Access to clean fuels and technologies f… 2001 9.51
2 Afghanistan Access to clean fuels and technologies f… 2002 10.4
3 Afghanistan Access to clean fuels and technologies f… 2003 11.5
4 Afghanistan Access to clean fuels and technologies f… 2004 12.4
5 Afghanistan Access to clean fuels and technologies f… 2005 13.5
6 Afghanistan Access to clean fuels and technologies f… 2006 14.8
str.split()
wb_indicators_long[['year','year_chr']] = wb_indicators_long.year.str.split(expand=True)
list(wb_indicators_long) # Cool, now there's year and year_chr
['Country Name', 'Country Code', 'Series Name', 'Series Code', 'year', 'indicator_value', 'year_chr']
# Let's also drop some variables we won't use:
wb_data_clean = wb_indicators_long.drop(['Country Code', 'Series Code', 'year_chr'], axis = 1)
list(wb_data_clean)
['Country Name', 'Series Name', 'year', 'indicator_value']
Our data still aren’t quite tidy! Why?
Notice that we have multiple variables that were measured (our different indicators) all in a single column. This is a scenario where there are multiple variables in a single column. To be Tidy, we want each variable to live in just one column.
tidyr::pivot_wider()
That means we’re going to need to widen this data. We’ll do that using tidyr::pivot_wider()
.
wb_data_tidy <- wb_data_clean %>%
tidyr::drop_na('Series Name') %>%
tidyr::pivot_wider(names_from = 'Series Name', values_from = indicator_value) # Pivot to wide format
head(wb_data_tidy)
# A tibble: 6 × 7
`Country Name` year Access to clean fuels and techno…¹ Access to electricit…²
<chr> <chr> <dbl> <dbl>
1 Afghanistan 2001 9.51 NA
2 Afghanistan 2002 10.4 NA
3 Afghanistan 2003 11.5 NA
4 Afghanistan 2004 12.4 NA
5 Afghanistan 2005 13.5 22.3
6 Afghanistan 2006 14.8 28.1
# ℹ abbreviated names:
# ¹`Access to clean fuels and technologies for cooking (% of population)`,
# ²`Access to electricity (% of population)`
# ℹ 3 more variables: `CO2 emissions (kt)` <dbl>,
# `Fossil fuel energy consumption (% of total)` <dbl>,
# `Level of water stress: freshwater withdrawal as a proportion of available freshwater resources` <dbl>
wb_data_tidy = wb_data_clean.pivot_table(index = ['Country Name', 'year'],
columns = 'Series Name',
values = 'indicator_value')
# Indexes back to normal column entries:
wb_data_tidy = wb_data_tidy.reset_index()
# Check the data frame now:
wb_data_tidy.head() # I feel better.
Series Name Country Name ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
0 Afghanistan ... NaN
1 Afghanistan ... 54.757019
2 Afghanistan ... NaN
3 Afghanistan ... NaN
4 Afghanistan ... NaN
[5 rows x 7 columns]
Series Name Country Name ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
0 Afghanistan ... NaN
1 Afghanistan ... 54.757019
2 Afghanistan ... NaN
3 Afghanistan ... NaN
4 Afghanistan ... NaN
[5 rows x 7 columns]
Our column names are now a nightmare. We can reassign all names as follows (in this order):
names(wb_data_tidy) <- c("country", "year", "access_clean_fuels_pp", "access_electricity_pp", "co2_emissions_kt", "fossil_fuel_cons_pt", "water_stress")
head(wb_data_tidy)
# A tibble: 6 × 7
country year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 2001 9.51 NA 810
2 Afghanistan 2002 10.4 NA 1100
3 Afghanistan 2003 11.5 NA 1350
4 Afghanistan 2004 12.4 NA 1130
5 Afghanistan 2005 13.5 22.3 1640
6 Afghanistan 2006 14.8 28.1 1940
# ℹ 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>
Or, we’ll learn how to use dplyr::rename()
soon…
wb_data_tidy = wb_data_tidy.rename(columns = {'Country Name': 'country', 'Access to clean fuels and technologies for cooking (% of population)': 'access_fuels_pp',
'Access to electricity (% of population)': 'access_electricity_pp',
'CO2 emissions (kt)': 'co2_emissions_kt',
'Fossil fuel energy consumption (% of total)': 'fossil_fuel_consumption_pt',
'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources': 'water_stress'})
wb_data_tidy.head()
Series Name country year ... fossil_fuel_consumption_pt water_stress
0 Afghanistan 2001 ... NaN NaN
1 Afghanistan 2002 ... NaN 54.757019
2 Afghanistan 2003 ... NaN NaN
3 Afghanistan 2004 ... NaN NaN
4 Afghanistan 2005 ... NaN NaN
[5 rows x 7 columns]
dplyr
dplyr::filter()
Use dplyr::filter()
to keep or exclude rows based on your conditions.
Some examples:
I only want to keep observations from the dataset above for “United States”:
# A tibble: 6 × 7
country year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
<chr> <chr> <dbl> <dbl> <dbl>
1 United Sta… 2001 100 100 5749250
2 United Sta… 2002 100 100 5594160
3 United Sta… 2003 100 100 5659630
4 United Sta… 2004 100 100 5740030
5 United Sta… 2005 100 100 5756080
6 United Sta… 2006 100 100 5656580
# ℹ 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>
I want to keep observations if the country is “United States” OR “Mexico” OR “Brazil”:
I want to keep observations if the country is “Guatemala” OR the year is 2020:
I want to keep observations in the year is 2010 and CO2 emissions is greater than 10,000kt:
dplyr::select()
Select (or exclude) columns using dplyr::select()
. Put a minus sign (-) in front of a column name or position to exclude it.
Exclude the water_stress and access_electricity_pp columns:
dplyr::rename()
Use dplyr::rename()
to rename one or more columns, in the order new_name = old_name
.
dplyr::mutate()
Use dplyr::mutate()
to add a new column, or transform an existing one.
Example: to change the class of a variable (careful - this overwrites the existing column!)
[1] "character"
wb_data_tidy <- wb_data_tidy %>%
mutate(year = as.numeric(year))
# Check again:
class(wb_data_tidy$year)
[1] "numeric"
Example: Add a new column that has co2 in TONS (instead of kilotons):
# A tibble: 6 × 8
country year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 2001 9.51 NA 810
2 Afghanistan 2002 10.4 NA 1100
3 Afghanistan 2003 11.5 NA 1350
4 Afghanistan 2004 12.4 NA 1130
5 Afghanistan 2005 13.5 22.3 1640
6 Afghanistan 2006 14.8 28.1 1940
# ℹ 3 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>,
# co2_tons <dbl>
dplyr::group_by() %>% summarize()
To perform one or more functions on data by group, returning a nice summary table, use group_by
+ summarize()
.
Example: find the total reported co2 emissions (kt) for 2001 - 2020 from each country:
group_by()
summarize()
) in python with df.groupby.agg
:Example: find the total co2 emissions (kt) across all country for each year from 2001 - 2020:
We’ve learned a bunch of different useful functions for data wrangling in the {tidyverse}. But this may still feel a bit tedious.
Often, for readability and efficiency, we may want to string together different steps into a sequence. We can do that using the pipe operator (%>%
in the tidyverse, or |>
is the sparkly new native pipe in R).
Let’s take our raw data that we initially read in:
# A tibble: 6 × 24
`Country Name` `Country Code` `Series Name` `Series Code` `2001 [YR2001]`
<chr> <chr> <chr> <chr> <dbl>
1 Afghanistan AFG Access to clean f… EG.CFT.ACCS.… 9.51
2 Afghanistan AFG Access to electri… EG.ELC.ACCS.… NA
3 Afghanistan AFG CO2 emissions (kt) EN.ATM.CO2E.… 810
4 Afghanistan AFG Fossil fuel energ… EG.USE.COMM.… NA
5 Afghanistan AFG Level of water st… ER.H2O.FWST.… NA
6 Albania ALB Access to clean f… EG.CFT.ACCS.… 42.7
# ℹ 19 more variables: `2002 [YR2002]` <dbl>, `2003 [YR2003]` <dbl>,
# `2004 [YR2004]` <dbl>, `2005 [YR2005]` <dbl>, `2006 [YR2006]` <dbl>,
# `2007 [YR2007]` <dbl>, `2008 [YR2008]` <dbl>, `2009 [YR2009]` <dbl>,
# `2010 [YR2010]` <dbl>, `2011 [YR2011]` <dbl>, `2012 [YR2012]` <dbl>,
# `2013 [YR2013]` <dbl>, `2014 [YR2014]` <dbl>, `2015 [YR2015]` <dbl>,
# `2016 [YR2016]` <dbl>, `2017 [YR2017]` <dbl>, `2018 [YR2018]` <dbl>,
# `2019 [YR2019]` <dbl>, `2020 [YR2020]` <lgl>
Let’s tidy this up in a single sequence, with the pipe operator between.
REMEMBER: Look at what you’ve done after every step in a sequence.
wb_tidy <- wb_indicators %>%
tidyr::pivot_longer(cols = `2001 [YR2001]`:`2020 [YR2020]`,
names_to = "year",
values_to = "indicator_value") %>%
tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>%
dplyr::select(-'Country Code', -'Series Code', -year_chr) %>%
tidyr::drop_na('Series Name') %>%
tidyr::pivot_wider(names_from = 'Series Name', values_from = 'indicator_value') %>%
dplyr::rename(country = 'Country Name',
year = 'year',
clean_fuels = 'Access to clean fuels and technologies for cooking (% of population)',
access_elec = 'Access to electricity (% of population)',
co2 = 'CO2 emissions (kt)',
fossil_fuels = 'Fossil fuel energy consumption (% of total)',
water_stress = 'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources') %>%
dplyr::mutate(year = as.numeric(year))
# Recall you can get the names of columns easily using names(df)
How long should a piped sequence be before I store the output and start a new piped sequence? Can a piped sequence be too long? Can I always pipe into everything?
End Interactive Session 6