Motivation
Some days back I had a tought to create use cases of all the functions listed in the {dplyr} cheatsheet. Eventhough the cheatsheet shows the syntax on how to use a given function and provides a lucid one line (more or less) explanation along with excellent visual cues, I feel for new users (new to {dplyr} or R) it can be daunting to see all these functions at once and not knowing what to exactly look for to address the problem they might be facing.
What one might need to know beforehand
- . dplyr verb. Direct Spark SQL (DBI). SDF function (Scala API). Export an R DataFrame. Read a file. Read existing Hive table Data Science in Spark with Sparklyr:: CHEAT SHEET Intro Using sparklyr.
- Dplyr functions work with pipes and expect tidy data. In tidy data: pipes x%% f(y) becomes f(x, y). Data Transformation with dplyr:: CHEAT SHEET A B C A B C.
With dplyr and tidyr Cheat Sheet. Dplyr::select(iris, Sepal.Width, Petal.Length, Species) Select columns by name or helper function.
The said cheat sheet is available in the resources section of the RStudio website under the name Data Transformation Cheat Sheet
To demonstrate the use of the functions listed in the cheat sheet I will be using the palmerpenguis
data. Want to know more about this data set? Look at the github page of Allison Horst’s {palmerpenguin}.
Another point that I would like to mention is that this is not a comprehensive resource that necessarily documents use cases of all possible valid combinations of the functions listed in the {dplyr} cheat sheet. The reason for this is that I am lazy and not as skilled as I would like to think and pretend.
I will try and keep adding more functions overtime. I took this wise suggestion from Paul Brennan, hope he does not mind me mentioning him here.
Finally, all the mistake that I make here are mine, all, the ones that are stupid and especially the one that are very stupid. All mine.
Also I am assuming people will know about %>%
operator. Yes, not explaining it. Scroll up and look at the lazy panda gif.
I will use the explanations from the Cheat Sheet and reporduce for the readers’ benefit. These will appear verbatim as shown below.
This is how the explanation from the cheat sheet will be reproduced
Summarise Cases
These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
summarise funciton
Let us use the summarise function to the obtain the mean bill length, bill depth and flipper length of the penguins.
The functions summarise_all
, summarise_at
and summarise_if
have been suspended after the introduction of the across
function in the {dplyr} release of 1.0.0. Though the across
function is not mentioned in the cheat sheet, I will try and attemp to demonstrate a use case.
across function
Since this function is not mentioned in the cheat sheet, I will reporduce the explanations from the documentation of the {dplyr} 1.0.0
across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in summarise() and mutate().
I want average of bill lenght, bill depth and flipper length. I will attempt the across funciton to achieve this.
across
function can also be used within mutate
function.
Group Cases
Use group_by() to create a “grouped” copy of a table. dplyr functions will manipulate each “group” separately and then combine the results.
Assume one want to get the same mean values but for all different specied of penguins. In such cases group_by
functions prove useful.
Notice the message summarise()
ungrouping output (override with .groups
argument). This is a feature of the new {dplyr} 1.0.0 where one doesnot have to explicitly call the ungroup()
Manipulate Cases
Row functions return a subset of rows as a new table.
filter function to find fluffy penguins
Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7)
Say I want a data set that has observations of penguins that are more than 3.5 Kg. In such cases cases the filter function come handy.
Say one is interested in penguins from a particular island (Torgersen) that are fluffy. In that case, multiple conditions can be provided to the filter function.
distinct function
Remove rows with duplicate values.
This funciton can be used to remove duplicate rows from a table. Since penguins
data does not have duplicate rows I will use a dummy data to demonstrate a simple use case of this funciton.
Consider the following data
This table gives a list of dishes from different restaurants and the flavour rating for each dish. However, there is a data entry error, the first and the last dish are the same, from the same restaurant. Its a duplicate entry. I wish to remove the duplicate entry, here is how that can be done using the distinct()
function.
The distint()
can be used to keep observations by using specific variable or column. Say, from the Flavours
data one only wants one observation from each restaurant along with all the variables. In that case, the distinct()
can be used as shown below.
This would give us the first observations for each of the restaurants as they appear in the data.
sample_* functions
sample_frac function
Randomly select fraction of rows.
This function allows us to randomly sample fraction of observation from the data. We can also define if we want the sampling with replacement or without replacement.
In the code below, I randomly sample 50% of the observations without replacement.
sample_n function
Randomly select size rows.
This funcitons allows to select desired number of observations from the data with or without replacement.
Consider the following example where I select 50 rows with replacement.
slice funciton
Select rows by position
This funciton gives us the abilty to select rows by the position in which they appear in the data.
Say, I want observations 25 to 32 from the penguins
data. Below code is how I would do it.
top_n function
Select and order top n entries
This funciton lets one select observations that are the highest or top by a given variable.
Say, I want to select penguins that have mass in the range of the top 5 values that the body_mass_g
varible takes. This is how I would do it.
arrange funciton
Order rows by values of a column or columns (low to high), use with desc() to order from high to low
Say, I want to order the penguins by their body mass.
This arranges the penguins from least fluffy to most fluffy. I can arrange in decreasing order with the use of desc()
within the arrange()
.
add_row funtion
Add one or more rows to a table
Say, I went to Ayub’s again and tried another dish that I want to add to the Flavours
data.
Overview
QuestionsHow can I manipulate dataframes without repeating myself?
To be able to use the six main dataframe manipulation ‘verbs’ with pipes in
dplyr
.To understand how
group_by()
andsummarize()
can be combined to summarize datasets.Be able to analyze a subset of data using logical filtering.
Manipulation of dataframes means many things to many researchers, we oftenselect certain observations (rows) or variables (columns), we often group thedata by a certain variable(s), or we even calculate summary statistics. We cando these operations using the normal base R operations:
But this isn’t very nice because there is a fair bit of repetition. Repeatingyourself will cost you time, both now and later, and potentially introduce somenasty bugs.
The dplyr
package
Luckily, the dplyr
package provides a number of very useful functions for manipulating dataframesin a way that will reduce the above repetition, reduce the probability of makingerrors, and probably even save you some typing. As an added bonus, you mighteven find the dplyr
grammar easier to read.
Here we’re going to cover 5 of the most commonly used functions as well as usingpipes (%>%
) to combine them.
select()
filter()
group_by()
summarize()
mutate()
If you have have not installed this package earlier, please do so:
Now let’s load the package:
Using select()
If, for example, we wanted to move forward with only a few of the variables inour dataframe we could use the select()
function. This will keep only thevariables you select.
If we open up year_country_gdp
we’ll see that it only contains the year,country and gdpPercap. Above we used ‘normal’ grammar, but the strengths ofdplyr
lie in combining several functions using pipes. Since the pipes grammaris unlike anything we’ve seen in R before, let’s repeat what we’ve done aboveusing pipes.
To help you understand why we wrote that in that way, let’s walk through it stepby step. First we summon the gapminder dataframe and pass it on, using the pipesymbol %>%
, to the next step, which is the select()
function. In this casewe don’t specify which data object we use in the select()
function since ingets that from the previous pipe. Fun Fact: There is a good chance you haveencountered pipes before in the shell. In R, a pipe symbol is %>%
while in theshell it is |
but the concept is the same!
Using filter()
If we now wanted to move forward with the above, but only with Europeancountries, we can combine select
and filter
Challenge 1
Write a single command (which can span multiple lines and includes pipes) thatwill produce a dataframe that has the African values for lifeExp
, country
and year
, but not for other Continents. How many rows does your dataframehave and why?
Data Wrangling Dplyr Cheat Sheet
Solution to Challenge 1
As with last time, first we pass the gapminder dataframe to the filter()
function, then we pass the filtered version of the gapminder dataframe to theselect()
function. Note: The order of operations is very important in thiscase. If we used ‘select’ first, filter would not be able to find the variablecontinent since we would have removed it in the previous step.
Using group_by() and summarize()
Now, we were supposed to be reducing the error prone repetitiveness of what canbe done with base R, but up to now we haven’t done that since we would have torepeat the above for each continent. Instead of filter()
, which will only passobservations that meet your criteria (in the above: continent'Europe'
), wecan use group_by()
, which will essentially use every unique criteria that youcould have used in filter.
You will notice that the structure of the dataframe where we used group_by()
(grouped_df
) is not the same as the original gapminder
(data.frame
). Agrouped_df
can be thought of as a list
where each item in the list
is adata.frame
which contains only the rows that correspond to the a particularvalue continent
(at least in the example above).
Using summarize()
The above was a bit on the uneventful side but group_by()
is much moreexciting in conjunction with summarize()
. This will allow us to create newvariable(s) by using functions that repeat for each of the continent-specificdata frames. That is to say, using the group_by()
function, we split ouroriginal dataframe into multiple pieces, then we can run functions(e.g. mean()
or sd()
) within summarize()
.
That allowed us to calculate the mean gdpPercap for each continent, but it getseven better.
Challenge 2
Calculate the average life expectancy per country. Which has the longest average lifeexpectancy and which has the shortest average life expectancy?
Solution to Challenge 2
Another way to do this is to use the dplyr
function arrange()
, whicharranges the rows in a data frame according to the order of one or morevariables from the data frame. It has similar syntax to other functions fromthe dplyr
package. You can use desc()
inside arrange()
to sort indescending order.
The function group_by()
allows us to group by multiple variables. Let’s group by year
and continent
.
That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize()
.
count() and n()
A very common operation is to count the number of observations for eachgroup. The dplyr
package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in thedataset for the year 2002, we can use the count()
function. It takes the nameof one or more columns that contain the groups we are interested in, and we canoptionally sort the results in descending order by adding sort=TRUE
:
If we need to use the number of observations in calculations, the n()
functionis useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectency per continent:
You can also chain together several summary operations; in this case calculating the minimum
, maximum
, mean
and se
of each continent’s per-country life-expectancy:
Using mutate()
Dplyr Cheat Sheet 2020 Excel
We can also create new variables prior to (or even after) summarizing information using mutate()
.
Connect mutate with logical filtering: ifelse
When creating new variables, we can hook this with a logical condition. A simple combination ofmutate()
and ifelse()
facilitates filtering right where it is needed: in the moment of creating something new.This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimensionof the data frame will not change) or for updating values depending on this given condition.
Dplyr Cheat Sheet 2020 Free
Combining dplyr
and ggplot2
First install and load ggplot2:
In the plotting lesson we looked at how to make a multi-panel figure by addinga layer of facet panels using ggplot2
. Here is the code we used (with someextra comments):
This code makes the right plot but it also creates some variables (starts.with
and az.countries
) that we might not have any other uses for. Just as we used%>%
to pipe data along a chain of dplyr
functions we can use it to pass datato ggplot()
. Because %>%
replaces the first argument in a function we don’tneed to specify the data =
argument in the ggplot()
function. By combiningdplyr
and ggplot2
functions we can make the same figure without creating anynew variables or modifying the data.
Using dplyr
functions also helps us simplify things, for example we couldcombine the first two steps:
Advanced Challenge
Calculate the average life expectancy in 2002 of 2 randomly selected countriesfor each continent. Then arrange the continent names in reverse order.Hint: Use the dplyr
functions arrange()
and sample_n()
, they havesimilar syntax to other dplyr functions.
Dplyr Cheat Sheet 2020 Printable
Solution to Advanced Challenge
Dplyr Cheat Sheet 2020 Pdf
Other great resources
Key Points
Use the
dplyr
package to manipulate dataframes.Use
select()
to choose variables from a dataframe.Use
filter()
to choose data based on values.Use
group_by()
andsummarize()
to work with subsets of data.Use
mutate()
to create new variables.