In EDS 212, we learned about Boolean operations and logic - the land of TRUE and FALSE. Boolean operations are ubiquitous in scientific programming. Whether looking for string pattern matches, comparing values, or filtering data frames, Boolean operators show up all the time - and even if we aren’t writing them explicitly, are working behind the scenes.
Let’s refresh some basic Boolean (logical) operators for programming:
# Create some objects (let's say these are tree heights)pinyon_pine <-14lodgepole_pine <-46# Some logical expressions: pinyon_pine ==10# Exact match?
[1] FALSE
pinyon_pine < lodgepole_pine # Less than?
[1] TRUE
lodgepole_pine >=46# Greater than or equal?
[1] TRUE
pinyon_pine !=25# Not equal to?
[1] TRUE
2. Conditionals
We write conditionals to tell a computer what do to based on whether or not some conditions that we set are satisfied. For example, we may want to use conditionals to:
Only keep observations from counties with median home value > $600,000
Re-categorize farms as “moderate implementation” if they implement between 4 and 8 best management practices for mitigating nutrient runoff
A basic if statement
The fundamental conditional statement is an if statement, which we can read as “If this condition is met, then do this.”
An example in R:
burrito <-2.4# Assign an object value# Write a short 'if' statement:if (burrito >2) {print("I love burritos!")}
[1] "I love burritos!"
Try changing the value of burritos to 1.5. What is returned when the if statement condition isn’t met?
This is important: for a solely “if” statement, there is no “else” option. So nothing happens if the condition isn’t met.
In Python:
burrito =2.4if burrito >2:print("I love burritos!")
I love burritos!
An example with strings:
Here, we’re using a new function from the stringr package: stringr::str_detect(). First, let’s learn something about how str_detect() works (see ?str_detect() for more information).
str_detect() “detects the presence or absence of a pattern in a string.” For example:
my_ships <-c("Millenium Falcon", "X-wing", "Tie-Fighter", "Death Star")str_detect(my_ships, pattern ="r") # Asks: which elements in the vector contain "r"
[1] FALSE FALSE TRUE TRUE
Notice that it returns a logical TRUE or FALSE for each element in the vector, based on whether or not they contain the string pattern “r”.
Now let’s try using it in a conditional statement. Here, if a phrase contains the word “love”, return the phrase “Big burrito fan!” Otherwise, return nothing.
phrase <-"I love burritos"if (str_detect(phrase, "love")) {print("Big burrito fan!")}
[1] "Big burrito fan!"
Try updating the phrase or string pattern in the code above to test different variations.
A basic if-else statement
In the examples above, there was a lonely “if” statement, which returned something if the condition was met, but otherwise returned nothing. Usually, we’ll want our code to return something else if our condition is not met. In that case, we can write an if-else statement. Here are a couple of examples:
In R:
pika <-89.1if (pika >60) {print("mega pika")} elseprint("normal pika")
[1] "mega pika"
In Python:
pika =89.1if pika >60:print("mega pika")else:print("normal pika")
mega pika
An example with strings:
food <-"I love enchiladas!"if (str_detect(food, "burritos")) {print("yay burritos!")} elseprint("what about burritos?")
[1] "what about burritos?"
More options: if-else if-else statements
Sometimes there aren’t just two outcomes! In that case, you can specify different options using the if-else if-else structure. Several examples are shown below.
In R:
marmot <-2.8if (marmot <0.5) {print("a small marmot!")} elseif (marmot >=0.5& marmot <3) {print("a medium marmot!")} elseprint("a large marmot!")
[1] "a medium marmot!"
In Python:
marmot =2.8if marmot <0.5:print("a small marmot!")elif0.5<= marmot <3: print("a medium marmot!")else: print("a large marmot!")
a medium marmot!
switch statements
A slightly more efficient tool is the switch() function, which allows you to “select one of a list of alternatives.” See ?switch for more information. This can be particularly useful for switching between different character strings, based on a condition or user selection.
In R:
species ="mouse"switch(species,"cat"=print("Meow"),"dog"=print("Woof!"),"mouse"=print("Squeak"))
[1] "Squeak"
Helpful functionals for conditional statements
As we get into Week 2 of EDS 221, we’ll learn other functions that make conditionals a bit more read and writable. For example, the dplyr::case_when() function is very helpful for writing vectorized if-else statements!
3. Introduction to for loops
A general guideline from RStudio Chief Scientist Hadley Wickham is that if we copy something more than twice, we should write a function or a loop. Avoiding redundancy in code can make it more readable, reproducible and efficient.
We write a for loop to iterate through different elements of a data structure (e.g. values in a vector, or columns in a data frame), applying some operation to each, and returning the new output. In this section, we’ll learn some for loop basics.
Basic for loop
Let’s start with a very basic for loop: for each element in a vector, do something to it and return the new thing. Here, we start with a vector of puppy names. Starting with the first name “Teddy”, we enter the loop and add “My dog’s name is” to the beginning of the string. Then we move on to the next name in the vector, continuing until we have applied the updated text to all puppy names.
In R:
dog_names <-c("Teddy", "Khora", "Banjo", "Waffle")for (pupster in dog_names) {print(paste("My dog's name is", pupster))}
[1] "My dog's name is Teddy"
[1] "My dog's name is Khora"
[1] "My dog's name is Banjo"
[1] "My dog's name is Waffle"
# Or similarly (\n is for new line):for (pupster in dog_names) {cat("My dog's name is", pupster, "\n")}
My dog's name is Teddy
My dog's name is Khora
My dog's name is Banjo
My dog's name is Waffle
In Python:
dog_names = ["Teddy", "Khora", "Banjo", "Waffle"]for i in dog_names: print("My dog's name is "+ i)
My dog's name is Teddy
My dog's name is Khora
My dog's name is Banjo
My dog's name is Waffle
Another basic for loop example:
Let’s check out another little example. Here, we create a sequence that ranges from 0 to 3 by increments of 0.5. Then, for each element in that vector, we add 2 to it, then move on to the next element.
In R:
mass <-seq(from =0, to =3, by =0.5)for (i in mass) { new_val = i +2print(new_val)}
[1] 2
[1] 2.5
[1] 3
[1] 3.5
[1] 4
[1] 4.5
[1] 5
Or, using seq_along():
mass <-seq(from =0, to =3, by =0.5)for (i inseq_along(mass)) { new_val = mass[i] +2print(new_val)}
[1] 2
[1] 2.5
[1] 3
[1] 3.5
[1] 4
[1] 4.5
[1] 5
Let’s try another one with seq_along():
For each element in a vector, find the sum of that value plus the next value in the sequence:
tree_height <-c(1,2,6,10)for (i inseq_along(tree_height)) { val = tree_height[i] + tree_height[i +1]print(val)}
[1] 3
[1] 8
[1] 16
[1] NA
4. For loops with conditional statements
Earlier in this session, we learned how to write a conditional if or if-else statement. Sometimes we’ll want to change what a for loop does based on a conditional - so we’ll have a conditional statement within a for loop. Let’s take a look and talk through an example:
A basic conditional for loop in R: Here, we have a vector of animals called animal. Running through each animal in the vector, if the species is “dog” we want to return “I love dogs!”. For all other animals, we’ll return “These are other animals!”
# Create the animals vector:animal <-c("cat", "dog", "dog", "zebra", "dog")# Create the for loop with conditional statement: for (i inseq_along(animal)) {if (animal[i] =="dog") {print("I love dogs!") } elseprint("These are other animals!")}
[1] "These are other animals!"
[1] "I love dogs!"
[1] "I love dogs!"
[1] "These are other animals!"
[1] "I love dogs!"
Or, for a numerical example:
# Animal types:species <-c("dog", "elephant", "goat", "dog", "dog", "elephant")# And their respective ages in human years:age_human <-c(3, 8, 4, 6, 12, 18)# Convert ages to "animal years" using the following:# 1 human year = 7 in dog years# 1 human year = 0.88 in elephant years# 1 human year = 4.7 in goat yearsfor (i inseq_along(species)) {if (species[i] =="dog") { animal_age <- age_human[i] *7 } elseif (species[i] =="elephant") { animal_age <- age_human[i] *0.88 } elseif (species[i] =="goat") { animal_age <- age_human[i] *4.7 }print(animal_age)}
[1] 21
[1] 7.04
[1] 18.8
[1] 42
[1] 84
[1] 15.84
Reminder
Keep this idea in mind when we learn dplyr::case_when()!
5. Storing outputs of a for loop
So far, we’ve returned outputs of a for loop, but we haven’t stored the outputs of a for loop as a new object in our environment.
To store outputs of a for loop, we’ll create an empty vector, then populate it with the for loop elements as they’re created. It’s important to do this - it will make your loops quicker, which is critical once you start working with big data.
# Create the empty vector animal_ages:animal_ages <-vector(mode ="numeric", length =length(species))# Vectors with species and human age: species <-c("dog", "elephant", "goat", "dog", "dog", "elephant")age_human <-c(3, 8, 4, 6, 12, 18)# Same loop as above, with additional piece added# To populate our empty vectorfor (i inseq_along(species)) {if (species[i] =="dog") { animal_age <- age_human[i] *7 } elseif (species[i] =="elephant") { animal_age <- age_human[i] *0.88 } elseif (species[i] =="goat") { animal_age <- age_human[i] *4.7 } animal_ages[i] <- animal_age # Populate our empty vector}
Don’t make your life harder for no reason. What’s the easiest way to fine the big_cats values as calculated above?
For loops across columns of a data frame
Recall from lecture: df[[i]] calls the ith column from the df data frame with simplification (i.e., it is pulled out as a vector, not a 1-d data frame).
Write a loop that iteratively calculates the mean of value of each column in mtcars.
# Create our storage vector# Note: ncol() returns the number of columns in a data framemean_mtcars <-vector(mode ="numeric", length =ncol(mtcars))# Write the loopfor (i in1:ncol(mtcars)) { mean_val <-mean(mtcars[[i]], na.rm =TRUE) mean_mtcars[[i]] <- mean_val}# Tada.
A for loop over columns with a condition
Sometimes you’ll want to iterate over some, but not all, columns in a data frame. Then, you may want to write a for loop with a condition.
For example, starting with the penguins data frame (from the palmerpenguins package), let’s find the median value of all numeric variables.
for (i inseq_along(penguins)) {if (is.numeric(penguins[[i]])) { penguin_med <-median(penguins[[i]], na.rm =TRUE)print(penguin_med) } else {print("non-numeric") }}
For loops are a critical skill for any data scientist to understand. However, you may not end up writing many of them from scratch. That’s because there exist a number of useful tools (functions) to make iteration easier for us. Here, we’ll briefly learn about two:
The apply family of functions for iteration
Making iteration {purrr}
apply
The apply functions simplify iteration over elements of a data structure (e.g., a data frame). To use apply() most simply, we need to tell it 3 things:
What is the data we’re iterating over?
Are we iterating across rows (1) or columns (2)?
What function are we applying to each element we’re iterating over?
For example, let’s say we want to find the mean value of all columns in the mtcars data frame:
apply(X = mtcars, MARGIN =2, FUN = mean)
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
Critical thinking
Take a moment to break down what each argument does in the apply() function above. Keep this in mind - we can also write our own function that we’d want to apply to each element!
There are also variations on apply, like lapply and sapply, that are worth looking into for specific use cases.
dplyr::across(), group_by() and summarize() in combo
If the takeaway is “Whoa, there are a lot of options…” - YES. There are many existing functions that can help you loop over elements of your data and apply a function.
Here’ we’ll learn some of my favorites - to loop over columns and by group - and return a nice table of values.
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
ℹ In group 1: `species = Adelie`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
The documentation for ?purrr::map() is titled “Apply a function to each element of a list or atomic vector.” Which should ring your for loop brain bells.
There are a number of functions within the purrr::map() family to best suit your specific needs.
The equivalent map() approach to the example above (finding the mean of each column in the mtcars data frame) is as follows: