End Interactive Session 7
location | species | maturity |
---|---|---|
lagoon | bobcat | adult |
bluff | coyote | juvenile |
creek | fox | adult |
oaks | squirrel | juvenile |
bluff | bobcat | adult |
Data wrangling continued: joins, easier dates, & working with strings
eds221-m2021-day7-interactive
tidyverse
palmerpenguins
lubridate
Refresher 1: Starting with the penguins
dataset in the palmerpenguins
package, write a single piped sequence in which you:
year
and sex
variablesbody_mass_kg
, with penguin mass converted from grams to kilogramsisland
variable to location
Refresher 2: Staring with the penguins
dataset in the palmerpenguins
package, write a single piped sequence in which you:
flipper_length_mm
is NA
(hint: !is.na()
)sex
mean()
), standard deviation (sd
) and sample size (n
) of flipper_length_mm for male and female Adelie penguins, returned in a nice summary tableLet’s create some data to practice and clarify different types of joins.
datapasta
package in R.animals
sites
location | species | maturity |
---|---|---|
lagoon | bobcat | adult |
bluff | coyote | juvenile |
creek | fox | adult |
oaks | squirrel | juvenile |
bluff | bobcat | adult |
location | full_site_name | jurisdiction |
---|---|---|
beach | Goleta Beach | SB City |
lagoon | UCSB Lagoon | UCSB |
bluff | Ellwood Mesa | SB City |
oaks | Fremont Campground | USFS |
dplyr::full_join()
The dplyr::full_join()
function adds columns from the second df to the first df. It is the safest join - nothing is excluded. When in doubt, full join.
Joining with `by = join_by(location)`
# A tibble: 6 × 5
location species maturity full_site_name jurisdiction
<chr> <chr> <chr> <chr> <chr>
1 lagoon bobcat adult UCSB Lagoon UCSB
2 bluff coyote juvenile Ellwood Mesa SB City
3 creek fox adult <NA> <NA>
4 oaks squirrel juvenile Fremont Campground USFS
5 bluff bobcat adult Ellwood Mesa SB City
6 beach <NA> <NA> Goleta Beach SB City
dplyr::left_join()
The dplyr::left_join(x,y)
function keeps everything in x, and only joins from y (by matching key) if they have a match in x. Otherwise they’re dropped.
Joining with `by = join_by(location)`
# A tibble: 5 × 5
location species maturity full_site_name jurisdiction
<chr> <chr> <chr> <chr> <chr>
1 lagoon bobcat adult UCSB Lagoon UCSB
2 bluff coyote juvenile Ellwood Mesa SB City
3 creek fox adult <NA> <NA>
4 oaks squirrel juvenile Fremont Campground USFS
5 bluff bobcat adult Ellwood Mesa SB City
dplyr::right_join()
Opposite of a left_join()
.
Joining with `by = join_by(location)`
# A tibble: 5 × 5
location species maturity full_site_name jurisdiction
<chr> <chr> <chr> <chr> <chr>
1 lagoon bobcat adult UCSB Lagoon UCSB
2 bluff coyote juvenile Ellwood Mesa SB City
3 oaks squirrel juvenile Fremont Campground USFS
4 bluff bobcat adult Ellwood Mesa SB City
5 beach <NA> <NA> Goleta Beach SB City
dplyr::inner_join()
Rows are only kept if the key matches in both x and y (intersection).
Joining with `by = join_by(location)`
# A tibble: 4 × 5
location species maturity full_site_name jurisdiction
<chr> <chr> <chr> <chr> <chr>
1 lagoon bobcat adult UCSB Lagoon UCSB
2 bluff coyote juvenile Ellwood Mesa SB City
3 oaks squirrel juvenile Fremont Campground USFS
4 bluff bobcat adult Ellwood Mesa SB City
We’ll just look at two filtering join functions (from dplyr documentation):
semi_join()
“return[s] all rows from x with a match in y”anti_join()
“return[s] all rows from x without a match in y”Joining with `by = join_by(location)`
# A tibble: 4 × 3
location species maturity
<chr> <chr> <chr>
1 lagoon bobcat adult
2 bluff coyote juvenile
3 oaks squirrel juvenile
4 bluff bobcat adult
{lubridate}
ISO 8601 for everyone!
Yup. What happens if we give lubridate something that doesn’t make sense?
[1] "1984-12-09"
In other words, parsing dates can fail in multiple ways:
KNOW YOUR DATA.
See the Olson Names: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
Use ymd-hm
to convert this to a date time that R will understand.
Question: Why am I using ymd_hm
here?
[1] "2020-08-12 11:18:00 UTC"
[1] "POSIXct" "POSIXt"
[1] "2020-08-12 04:18:00 PDT"
[1] "2020-08-12 21:18:00 AEST"
[1] 33
[1] 12
[1] 11
[1] 18
[1] 0
You can also get information about your dates using nice built-in lubridate functions.
This can be useful if you want to group your day in different ways for analyses or exploration. Use lubridate functions, in combination with mutate(), to add new columns containing separate pieces of the date, e.g. year, month, day in three separate columns).
For example, let’s just make a little data frame to try this out:
urchin_counts <- tribble(
~date, ~species, ~size_mm,
"10/3/2020", "purple", 55,
"10/4/2020", "red", 48,
"11/17/2020", "red", 67
)
urchin_counts_ymd <- urchin_counts %>%
mutate(date = lubridate::mdy(date)) %>%
mutate(year = year(date),
month = month(date),
day = day(date))
urchin_counts_ymd
# A tibble: 3 × 6
date species size_mm year month day
<date> <chr> <dbl> <dbl> <dbl> <int>
1 2020-10-03 purple 55 2020 10 3
2 2020-10-04 red 48 2020 10 4
3 2020-11-17 red 67 2020 11 17
stringr
str_detect()
to detect a string patternReturns TRUE or FALSE based on whether the pattern is or is not detected.
my_string <- "Teddy loves eating salmon and socks."
# Does the pattern "love" exist within the string?
my_string %>% str_detect("love")
[1] TRUE
[1] FALSE
This also works on vectors (…think ahead - data frame columns!). It is case sensitive (by default):
my_string <- c("burrito", "fish taco", "Taco salad")
# Does the vector element contain the pattern "fish"?
my_string %>% str_detect("fish")
[1] FALSE TRUE FALSE
It is most powerful when used in combination with other functions.
For example, let’s look at the starwars
dataset (in dplyr
):
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
I want to only keep rows where the name column contains the pattern “Skywalker.” Remember: what’s the function to keep or exclude rows based on our conditions? It’s dplyr::filter()
! I can use that with str_detect()
to get the Skywalker family characters for me:
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 Anakin S… 188 84 blond fair blue 41.9 male mascu…
3 Shmi Sky… 163 NA black fair brown 72 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
str_replace()
to replace a string pattern with something elsefirewalkers <- starwars %>%
mutate(name = str_replace(name, pattern = "Sky", replacement = "Fire"))
head(firewalkers)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Fir… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
You can imagine this may be really helpful if there is a repeated spelling error, annoying syntax, or otherwise that you want to update throughout your data frame.
Note: this is very different from text mining and analysis, which involves analyzing textual information to gain insights about patterns, trends, and sentiments - look forward to that in EDS 242!
str_trim()
or str_squish()
to remove excess white spaceEnd Interactive Session 7