Data carpentry with the tidyverse

The tidyverse

The tidyverse is a collection of R packages designed for data science, as a suite aimed at easening the data analysis in all its steps.

All packages share an underlying design philosophy, grammar, and data structures.

So what’s exactly in the tidyverse?

  • ggplot2 a system for creating graphics, based on the Grammar of Graphics

  • readr a fast and friendly way to read rectangular data (csv, txt…)

  • tibble a tibble is a re-imagining version of the data frame, keeping what time has proven to be effective and throwing out what has not

  • stringr provides a cohesive set of functions designed to make working with strings as easy as possible

  • forcats provides a suite of useful tools that solve common problems with factors

  • dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges

  • tidyr provides a set of functions that help you get to tidy data

  • purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors

tidy data

  • Data in tidy format eases the processing and analysis, particularly in vectorized languages as R.

dplyr

dplyr

Installing:

install.packages("dplyr")
library(dplyr)

dplyr

4 main verbs of dplyr:

  • filter: keep the rows that match a condition
  • select: keep columns by name, index or condition
  • mutate: transform existent variables or create new ones
  • summarise: do some summary statistics and reduce data

dplyr

common structure (for most of the tidyverse)

verb(data, ...)
  • first argument: data (as data.frame or tbl_df)
  • the rest of arguments specify what to do with the data frame
  • output is always another data frame (tbl_df or data.frame)
  • unless we are assigning (<-), never modifies the original data frame

dplyr

Data

Let’s work with some data. We are going to use data from the Olympic Games.
An explanation of the dataset can be found here

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-08-06/olympics.csv') 

dplyr

Data

Let’s work with some data. We are going to use data from the Olympic Games.
An explanation of the dataset can be found here

olympics
# A tibble: 271,116 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

dplyr

4 main verbs of dplyr:

  • filter: keep the rows that match a condition

  • select: keep columns by name

  • mutate: transform existent variables or create new ones

  • summarise: do some summary statistics and reduce data

Selecting rows

filter

filter(olympics, medal == "Gold")
# A tibble: 13,372 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 2    17 Paavo J… M        28    175     64 Finl… FIN   1948…  1948 Summer Lond…
 3    17 Paavo J… M        28    175     64 Finl… FIN   1948…  1948 Summer Lond…
 4    17 Paavo J… M        28    175     64 Finl… FIN   1948…  1948 Summer Lond…
 5    20 Kjetil … M        20    176     85 Norw… NOR   1992…  1992 Winter Albe…
 6    20 Kjetil … M        30    176     85 Norw… NOR   2002…  2002 Winter Salt…
 7    20 Kjetil … M        30    176     85 Norw… NOR   2002…  2002 Winter Salt…
 8    20 Kjetil … M        34    176     85 Norw… NOR   2006…  2006 Winter Tori…
 9    21 Ragnhil… F        27    163     NA Norw… NOR   2008…  2008 Summer Beij…
10    40 Roald E… M        31     NA     NA Norw… NOR   1960…  1960 Winter Squa…
# ℹ 13,362 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Selecting rows

filter

filter(olympics, year > 1996)
# A tibble: 88,863 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     2 A Lamusi M        23    170   60   China CHN   2012…  2012 Summer Lond…
 2     9 Antti S… M        26    186   96   Finl… FIN   2002…  2002 Winter Salt…
 3    12 Jyri Ta… M        31    172   70   Finl… FIN   2000…  2000 Summer Sydn…
 4    13 Minna M… F        34    159   55.5 Finl… FIN   2000…  2000 Summer Sydn…
 5    16 Juhamat… M        28    184   85   Finl… FIN   2014…  2014 Winter Sochi
 6    18 Timo An… M        31    189  130   Finl… FIN   2000…  2000 Summer Sydn…
 7    20 Kjetil … M        26    176   85   Norw… NOR   1998…  1998 Winter Naga…
 8    20 Kjetil … M        26    176   85   Norw… NOR   1998…  1998 Winter Naga…
 9    20 Kjetil … M        26    176   85   Norw… NOR   1998…  1998 Winter Naga…
10    20 Kjetil … M        26    176   85   Norw… NOR   1998…  1998 Winter Naga…
# ℹ 88,853 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Selecting rows

filter

filter(olympics, team %in% c("Spain", "Portugal"))
# A tibble: 6,696 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 2    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 3    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 4    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 5    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 6    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
 7    55 Antonio… M        26    170     65 Spain ESP   2016…  2016 Summer Rio …
 8    63 Jos Lui… M        30    194     87 Spain ESP   2008…  2008 Summer Beij…
 9    69 Tamara … F        19    185     72 Spain ESP   2008…  2008 Summer Beij…
10    80 Jos Mar… M        22    186     83 Spain ESP   1996…  1996 Summer Atla…
# ℹ 6,686 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Selecting rows

filter

filter(
  olympics,
  medal == "Gold",
  year > 1996,
  team %in% c("Spain", "Portugal")
)
# A tibble: 27 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1  2210 Marina … F        26    164     55 Spain ESP   2012…  2012 Summer Lond…
 2  9464 Ruth Be… F        37    191     72 Spain ESP   2016…  2016 Summer Rio …
 3  9750 Mireia … F        25    170     59 Spain ESP   2016…  2016 Summer Rio …
 4 17348 David C… M        21    183     86 Spain ESP   2004…  2004 Summer Athi…
 5 21174 Maialen… F        33    161     55 Spain ESP   2016…  2016 Summer Rio …
 6 23703 Sal Cra… M        23    192     98 Spain ESP   2008…  2008 Summer Beij…
 7 23703 Sal Cra… M        31    192     98 Spain ESP   2016…  2016 Summer Rio …
 8 26983 Gervasi… M        19    166     69 Spain ESP   2000…  2000 Summer Sydn…
 9 26983 Gervasi… M        23    166     69 Spain ESP   2004…  2004 Summer Athi…
10 31296 Fernand… M        35    180     76 Spain ESP   2008…  2008 Summer Beij…
# ℹ 17 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Selecting columns

select

select(olympics, name, sport, games)
# A tibble: 271,116 × 3
   name                     sport         games      
   <chr>                    <chr>         <chr>      
 1 A Dijiang                Basketball    1992 Summer
 2 A Lamusi                 Judo          2012 Summer
 3 Gunnar Nielsen Aaby      Football      1920 Summer
 4 Edgar Lindenau Aabye     Tug-Of-War    1900 Summer
 5 Christine Jacoba Aaftink Speed Skating 1988 Winter
 6 Christine Jacoba Aaftink Speed Skating 1988 Winter
 7 Christine Jacoba Aaftink Speed Skating 1992 Winter
 8 Christine Jacoba Aaftink Speed Skating 1992 Winter
 9 Christine Jacoba Aaftink Speed Skating 1994 Winter
10 Christine Jacoba Aaftink Speed Skating 1994 Winter
# ℹ 271,106 more rows

Selecting columns

select

select(olympics, !name)
# A tibble: 271,116 × 14
      id sex     age height weight team     noc   games  year season city  sport
   <dbl> <chr> <dbl>  <dbl>  <dbl> <chr>    <chr> <chr> <dbl> <chr>  <chr> <chr>
 1     1 M        24    180     80 China    CHN   1992…  1992 Summer Barc… Bask…
 2     2 M        23    170     60 China    CHN   2012…  2012 Summer Lond… Judo 
 3     3 M        24     NA     NA Denmark  DEN   1920…  1920 Summer Antw… Foot…
 4     4 M        34     NA     NA Denmark… DEN   1900…  1900 Summer Paris Tug-…
 5     5 F        21    185     82 Netherl… NED   1988…  1988 Winter Calg… Spee…
 6     5 F        21    185     82 Netherl… NED   1988…  1988 Winter Calg… Spee…
 7     5 F        25    185     82 Netherl… NED   1992…  1992 Winter Albe… Spee…
 8     5 F        25    185     82 Netherl… NED   1992…  1992 Winter Albe… Spee…
 9     5 F        27    185     82 Netherl… NED   1994…  1994 Winter Lill… Spee…
10     5 F        27    185     82 Netherl… NED   1994…  1994 Winter Lill… Spee…
# ℹ 271,106 more rows
# ℹ 2 more variables: event <chr>, medal <chr>

Selecting columns

select

select(olympics, id:games)
# A tibble: 271,116 × 9
      id name                     sex     age height weight team     noc   games
   <dbl> <chr>                    <chr> <dbl>  <dbl>  <dbl> <chr>    <chr> <chr>
 1     1 A Dijiang                M        24    180     80 China    CHN   1992…
 2     2 A Lamusi                 M        23    170     60 China    CHN   2012…
 3     3 Gunnar Nielsen Aaby      M        24     NA     NA Denmark  DEN   1920…
 4     4 Edgar Lindenau Aabye     M        34     NA     NA Denmark… DEN   1900…
 5     5 Christine Jacoba Aaftink F        21    185     82 Netherl… NED   1988…
 6     5 Christine Jacoba Aaftink F        21    185     82 Netherl… NED   1988…
 7     5 Christine Jacoba Aaftink F        25    185     82 Netherl… NED   1992…
 8     5 Christine Jacoba Aaftink F        25    185     82 Netherl… NED   1992…
 9     5 Christine Jacoba Aaftink F        27    185     82 Netherl… NED   1994…
10     5 Christine Jacoba Aaftink F        27    185     82 Netherl… NED   1994…
# ℹ 271,106 more rows

Selecting columns

select

Special functions:

  • starts_with(x): names that start with x
  • ends_with(x): names that end with x
  • contains(x): selects all variables whose name contains x
  • matches(x): selects all variables whose name contains the regular expression x
  • num_range("x", 1:5, width = 2): selects all variables (numerically) from x01 to x05
  • all_of(c("x", "y", "z")): selects variables provided in a character vector
  • any_of(c("x", "y", "z")): selects variables provided in a character vector
  • everything(): selects all variables

Selecting columns

select

select(iris, starts_with("Petal"))
    Petal.Length Petal.Width
1            1.4         0.2
2            1.4         0.2
3            1.3         0.2
4            1.5         0.2
5            1.4         0.2
6            1.7         0.4
7            1.4         0.3
8            1.5         0.2
9            1.4         0.2
10           1.5         0.1
11           1.5         0.2
12           1.6         0.2
13           1.4         0.1
14           1.1         0.1
15           1.2         0.2
16           1.5         0.4
17           1.3         0.4
18           1.4         0.3
19           1.7         0.3
20           1.5         0.3
21           1.7         0.2
22           1.5         0.4
23           1.0         0.2
24           1.7         0.5
25           1.9         0.2
26           1.6         0.2
27           1.6         0.4
28           1.5         0.2
29           1.4         0.2
30           1.6         0.2
31           1.6         0.2
32           1.5         0.4
33           1.5         0.1
34           1.4         0.2
35           1.5         0.2
36           1.2         0.2
37           1.3         0.2
38           1.4         0.1
39           1.3         0.2
40           1.5         0.2
41           1.3         0.3
42           1.3         0.3
43           1.3         0.2
44           1.6         0.6
45           1.9         0.4
46           1.4         0.3
47           1.6         0.2
48           1.4         0.2
49           1.5         0.2
50           1.4         0.2
51           4.7         1.4
52           4.5         1.5
53           4.9         1.5
54           4.0         1.3
55           4.6         1.5
56           4.5         1.3
57           4.7         1.6
58           3.3         1.0
59           4.6         1.3
60           3.9         1.4
61           3.5         1.0
62           4.2         1.5
63           4.0         1.0
64           4.7         1.4
65           3.6         1.3
66           4.4         1.4
67           4.5         1.5
68           4.1         1.0
69           4.5         1.5
70           3.9         1.1
71           4.8         1.8
72           4.0         1.3
73           4.9         1.5
74           4.7         1.2
75           4.3         1.3
76           4.4         1.4
77           4.8         1.4
78           5.0         1.7
79           4.5         1.5
80           3.5         1.0
81           3.8         1.1
82           3.7         1.0
83           3.9         1.2
84           5.1         1.6
85           4.5         1.5
86           4.5         1.6
87           4.7         1.5
88           4.4         1.3
89           4.1         1.3
90           4.0         1.3
91           4.4         1.2
92           4.6         1.4
93           4.0         1.2
94           3.3         1.0
95           4.2         1.3
96           4.2         1.2
97           4.2         1.3
98           4.3         1.3
99           3.0         1.1
100          4.1         1.3
101          6.0         2.5
102          5.1         1.9
103          5.9         2.1
104          5.6         1.8
105          5.8         2.2
106          6.6         2.1
107          4.5         1.7
108          6.3         1.8
109          5.8         1.8
110          6.1         2.5
111          5.1         2.0
112          5.3         1.9
113          5.5         2.1
114          5.0         2.0
115          5.1         2.4
116          5.3         2.3
117          5.5         1.8
118          6.7         2.2
119          6.9         2.3
120          5.0         1.5
121          5.7         2.3
122          4.9         2.0
123          6.7         2.0
124          4.9         1.8
125          5.7         2.1
126          6.0         1.8
127          4.8         1.8
128          4.9         1.8
129          5.6         2.1
130          5.8         1.6
131          6.1         1.9
132          6.4         2.0
133          5.6         2.2
134          5.1         1.5
135          5.6         1.4
136          6.1         2.3
137          5.6         2.4
138          5.5         1.8
139          4.8         1.8
140          5.4         2.1
141          5.6         2.4
142          5.1         2.3
143          5.1         1.9
144          5.9         2.3
145          5.7         2.5
146          5.2         2.3
147          5.0         1.9
148          5.2         2.0
149          5.4         2.3
150          5.1         1.8
select(iris, ends_with("Width"))
    Sepal.Width Petal.Width
1           3.5         0.2
2           3.0         0.2
3           3.2         0.2
4           3.1         0.2
5           3.6         0.2
6           3.9         0.4
7           3.4         0.3
8           3.4         0.2
9           2.9         0.2
10          3.1         0.1
11          3.7         0.2
12          3.4         0.2
13          3.0         0.1
14          3.0         0.1
15          4.0         0.2
16          4.4         0.4
17          3.9         0.4
18          3.5         0.3
19          3.8         0.3
20          3.8         0.3
21          3.4         0.2
22          3.7         0.4
23          3.6         0.2
24          3.3         0.5
25          3.4         0.2
26          3.0         0.2
27          3.4         0.4
28          3.5         0.2
29          3.4         0.2
30          3.2         0.2
31          3.1         0.2
32          3.4         0.4
33          4.1         0.1
34          4.2         0.2
35          3.1         0.2
36          3.2         0.2
37          3.5         0.2
38          3.6         0.1
39          3.0         0.2
40          3.4         0.2
41          3.5         0.3
42          2.3         0.3
43          3.2         0.2
44          3.5         0.6
45          3.8         0.4
46          3.0         0.3
47          3.8         0.2
48          3.2         0.2
49          3.7         0.2
50          3.3         0.2
51          3.2         1.4
52          3.2         1.5
53          3.1         1.5
54          2.3         1.3
55          2.8         1.5
56          2.8         1.3
57          3.3         1.6
58          2.4         1.0
59          2.9         1.3
60          2.7         1.4
61          2.0         1.0
62          3.0         1.5
63          2.2         1.0
64          2.9         1.4
65          2.9         1.3
66          3.1         1.4
67          3.0         1.5
68          2.7         1.0
69          2.2         1.5
70          2.5         1.1
71          3.2         1.8
72          2.8         1.3
73          2.5         1.5
74          2.8         1.2
75          2.9         1.3
76          3.0         1.4
77          2.8         1.4
78          3.0         1.7
79          2.9         1.5
80          2.6         1.0
81          2.4         1.1
82          2.4         1.0
83          2.7         1.2
84          2.7         1.6
85          3.0         1.5
86          3.4         1.6
87          3.1         1.5
88          2.3         1.3
89          3.0         1.3
90          2.5         1.3
91          2.6         1.2
92          3.0         1.4
93          2.6         1.2
94          2.3         1.0
95          2.7         1.3
96          3.0         1.2
97          2.9         1.3
98          2.9         1.3
99          2.5         1.1
100         2.8         1.3
101         3.3         2.5
102         2.7         1.9
103         3.0         2.1
104         2.9         1.8
105         3.0         2.2
106         3.0         2.1
107         2.5         1.7
108         2.9         1.8
109         2.5         1.8
110         3.6         2.5
111         3.2         2.0
112         2.7         1.9
113         3.0         2.1
114         2.5         2.0
115         2.8         2.4
116         3.2         2.3
117         3.0         1.8
118         3.8         2.2
119         2.6         2.3
120         2.2         1.5
121         3.2         2.3
122         2.8         2.0
123         2.8         2.0
124         2.7         1.8
125         3.3         2.1
126         3.2         1.8
127         2.8         1.8
128         3.0         1.8
129         2.8         2.1
130         3.0         1.6
131         2.8         1.9
132         3.8         2.0
133         2.8         2.2
134         2.8         1.5
135         2.6         1.4
136         3.0         2.3
137         3.4         2.4
138         3.1         1.8
139         3.0         1.8
140         3.1         2.1
141         3.1         2.4
142         3.1         2.3
143         2.7         1.9
144         3.2         2.3
145         3.3         2.5
146         3.0         2.3
147         2.5         1.9
148         3.0         2.0
149         3.4         2.3
150         3.0         1.8
select(iris, contains("al"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
4            4.6         3.1          1.5         0.2
5            5.0         3.6          1.4         0.2
6            5.4         3.9          1.7         0.4
7            4.6         3.4          1.4         0.3
8            5.0         3.4          1.5         0.2
9            4.4         2.9          1.4         0.2
10           4.9         3.1          1.5         0.1
11           5.4         3.7          1.5         0.2
12           4.8         3.4          1.6         0.2
13           4.8         3.0          1.4         0.1
14           4.3         3.0          1.1         0.1
15           5.8         4.0          1.2         0.2
16           5.7         4.4          1.5         0.4
17           5.4         3.9          1.3         0.4
18           5.1         3.5          1.4         0.3
19           5.7         3.8          1.7         0.3
20           5.1         3.8          1.5         0.3
21           5.4         3.4          1.7         0.2
22           5.1         3.7          1.5         0.4
23           4.6         3.6          1.0         0.2
24           5.1         3.3          1.7         0.5
25           4.8         3.4          1.9         0.2
26           5.0         3.0          1.6         0.2
27           5.0         3.4          1.6         0.4
28           5.2         3.5          1.5         0.2
29           5.2         3.4          1.4         0.2
30           4.7         3.2          1.6         0.2
31           4.8         3.1          1.6         0.2
32           5.4         3.4          1.5         0.4
33           5.2         4.1          1.5         0.1
34           5.5         4.2          1.4         0.2
35           4.9         3.1          1.5         0.2
36           5.0         3.2          1.2         0.2
37           5.5         3.5          1.3         0.2
38           4.9         3.6          1.4         0.1
39           4.4         3.0          1.3         0.2
40           5.1         3.4          1.5         0.2
41           5.0         3.5          1.3         0.3
42           4.5         2.3          1.3         0.3
43           4.4         3.2          1.3         0.2
44           5.0         3.5          1.6         0.6
45           5.1         3.8          1.9         0.4
46           4.8         3.0          1.4         0.3
47           5.1         3.8          1.6         0.2
48           4.6         3.2          1.4         0.2
49           5.3         3.7          1.5         0.2
50           5.0         3.3          1.4         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
54           5.5         2.3          4.0         1.3
55           6.5         2.8          4.6         1.5
56           5.7         2.8          4.5         1.3
57           6.3         3.3          4.7         1.6
58           4.9         2.4          3.3         1.0
59           6.6         2.9          4.6         1.3
60           5.2         2.7          3.9         1.4
61           5.0         2.0          3.5         1.0
62           5.9         3.0          4.2         1.5
63           6.0         2.2          4.0         1.0
64           6.1         2.9          4.7         1.4
65           5.6         2.9          3.6         1.3
66           6.7         3.1          4.4         1.4
67           5.6         3.0          4.5         1.5
68           5.8         2.7          4.1         1.0
69           6.2         2.2          4.5         1.5
70           5.6         2.5          3.9         1.1
71           5.9         3.2          4.8         1.8
72           6.1         2.8          4.0         1.3
73           6.3         2.5          4.9         1.5
74           6.1         2.8          4.7         1.2
75           6.4         2.9          4.3         1.3
76           6.6         3.0          4.4         1.4
77           6.8         2.8          4.8         1.4
78           6.7         3.0          5.0         1.7
79           6.0         2.9          4.5         1.5
80           5.7         2.6          3.5         1.0
81           5.5         2.4          3.8         1.1
82           5.5         2.4          3.7         1.0
83           5.8         2.7          3.9         1.2
84           6.0         2.7          5.1         1.6
85           5.4         3.0          4.5         1.5
86           6.0         3.4          4.5         1.6
87           6.7         3.1          4.7         1.5
88           6.3         2.3          4.4         1.3
89           5.6         3.0          4.1         1.3
90           5.5         2.5          4.0         1.3
91           5.5         2.6          4.4         1.2
92           6.1         3.0          4.6         1.4
93           5.8         2.6          4.0         1.2
94           5.0         2.3          3.3         1.0
95           5.6         2.7          4.2         1.3
96           5.7         3.0          4.2         1.2
97           5.7         2.9          4.2         1.3
98           6.2         2.9          4.3         1.3
99           5.1         2.5          3.0         1.1
100          5.7         2.8          4.1         1.3
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
104          6.3         2.9          5.6         1.8
105          6.5         3.0          5.8         2.2
106          7.6         3.0          6.6         2.1
107          4.9         2.5          4.5         1.7
108          7.3         2.9          6.3         1.8
109          6.7         2.5          5.8         1.8
110          7.2         3.6          6.1         2.5
111          6.5         3.2          5.1         2.0
112          6.4         2.7          5.3         1.9
113          6.8         3.0          5.5         2.1
114          5.7         2.5          5.0         2.0
115          5.8         2.8          5.1         2.4
116          6.4         3.2          5.3         2.3
117          6.5         3.0          5.5         1.8
118          7.7         3.8          6.7         2.2
119          7.7         2.6          6.9         2.3
120          6.0         2.2          5.0         1.5
121          6.9         3.2          5.7         2.3
122          5.6         2.8          4.9         2.0
123          7.7         2.8          6.7         2.0
124          6.3         2.7          4.9         1.8
125          6.7         3.3          5.7         2.1
126          7.2         3.2          6.0         1.8
127          6.2         2.8          4.8         1.8
128          6.1         3.0          4.9         1.8
129          6.4         2.8          5.6         2.1
130          7.2         3.0          5.8         1.6
131          7.4         2.8          6.1         1.9
132          7.9         3.8          6.4         2.0
133          6.4         2.8          5.6         2.2
134          6.3         2.8          5.1         1.5
135          6.1         2.6          5.6         1.4
136          7.7         3.0          6.1         2.3
137          6.3         3.4          5.6         2.4
138          6.4         3.1          5.5         1.8
139          6.0         3.0          4.8         1.8
140          6.9         3.1          5.4         2.1
141          6.7         3.1          5.6         2.4
142          6.9         3.1          5.1         2.3
143          5.8         2.7          5.1         1.9
144          6.8         3.2          5.9         2.3
145          6.7         3.3          5.7         2.5
146          6.7         3.0          5.2         2.3
147          6.3         2.5          5.0         1.9
148          6.5         3.0          5.2         2.0
149          6.2         3.4          5.4         2.3
150          5.9         3.0          5.1         1.8
select(olympics, all_of(c("name", "medal", "noc", "year")))
# A tibble: 271,116 × 4
   name                     medal noc    year
   <chr>                    <chr> <chr> <dbl>
 1 A Dijiang                <NA>  CHN    1992
 2 A Lamusi                 <NA>  CHN    2012
 3 Gunnar Nielsen Aaby      <NA>  DEN    1920
 4 Edgar Lindenau Aabye     Gold  DEN    1900
 5 Christine Jacoba Aaftink <NA>  NED    1988
 6 Christine Jacoba Aaftink <NA>  NED    1988
 7 Christine Jacoba Aaftink <NA>  NED    1992
 8 Christine Jacoba Aaftink <NA>  NED    1992
 9 Christine Jacoba Aaftink <NA>  NED    1994
10 Christine Jacoba Aaftink <NA>  NED    1994
# ℹ 271,106 more rows
select(olympics, any_of(c("name", "medal", "tururu", "nonexistentvariable")))
# A tibble: 271,116 × 2
   name                     medal
   <chr>                    <chr>
 1 A Dijiang                <NA> 
 2 A Lamusi                 <NA> 
 3 Gunnar Nielsen Aaby      <NA> 
 4 Edgar Lindenau Aabye     Gold 
 5 Christine Jacoba Aaftink <NA> 
 6 Christine Jacoba Aaftink <NA> 
 7 Christine Jacoba Aaftink <NA> 
 8 Christine Jacoba Aaftink <NA> 
 9 Christine Jacoba Aaftink <NA> 
10 Christine Jacoba Aaftink <NA> 
# ℹ 271,106 more rows

Transforming / Creating variables

mutate

mutate(
  olympics,
  height = height / 100
)
# A tibble: 271,116 × 15
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24   1.8      80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23   1.7      60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24  NA        NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34  NA        NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21   1.85     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21   1.85     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25   1.85     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25   1.85     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27   1.85     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27   1.85     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Transforming / Creating variables

mutate

mutate(
  olympics,
  bmi = weight / ((height / 100)^2)
)
# A tibble: 271,116 × 16
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, bmi <dbl>

Transforming / Creating variables

mutate

mutate(
  olympics,
  height = height / 100,
  bmi = weight / (height^2),
  imb = 1 / bmi
)
# A tibble: 271,116 × 17
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24   1.8      80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23   1.7      60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24  NA        NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34  NA        NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21   1.85     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21   1.85     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25   1.85     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25   1.85     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27   1.85     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27   1.85     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 5 more variables: sport <chr>, event <chr>, medal <chr>, bmi <dbl>,
#   imb <dbl>

Reducing / Summarising data

summarise

summarise(olympics, n = n(), average_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 2
       n average_height
   <int>          <dbl>
1 271116           175.

Reducing / Summarising data

summarise

Summary functions

  • min(x), max(x), quantile(x, p)

  • mean(x), median(x),

  • sd(x), var(x), IQR(x)

  • n(), n_distinct(x)

  • sum(x > 10), mean(x > 10)

Grouped operations

.by argument

summarise(
  olympics,
  n = n(),
  average_height = mean(height, na.rm = TRUE),
  .by = c(sex, season, sport)
)
# A tibble: 120 × 5
   sex   season sport                    n average_height
   <chr> <chr>  <chr>                <int>          <dbl>
 1 M     Summer Basketball            3280           195.
 2 M     Summer Judo                  2708           177.
 3 M     Summer Football              5733           177.
 4 M     Summer Tug-Of-War             170           182.
 5 F     Winter Speed Skating         2081           167.
 6 M     Winter Cross Country Skiing  5748           178.
 7 F     Summer Athletics            11666           169.
 8 M     Winter Ice Hockey            4702           181.
 9 M     Summer Swimming             13345           184.
10 M     Summer Badminton              717           180.
# ℹ 110 more rows

Grouped operations

group_by

olympics_grouped <- group_by(olympics, sex, season, sport)
olympics_grouped
# A tibble: 271,116 × 15
# Groups:   sex, season, sport [120]
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Grouped operations

group_by

summarise(
  olympics_grouped,
  n = n(),
  average_height = mean(height, na.rm = TRUE)
)
# A tibble: 120 × 5
# Groups:   sex, season [4]
   sex   season sport                n average_height
   <chr> <chr>  <chr>            <int>          <dbl>
 1 F     Summer Alpinism             1           NaN 
 2 F     Summer Archery           1015           167.
 3 F     Summer Art Competitions   377           160 
 4 F     Summer Athletics        11666           169.
 5 F     Summer Badminton          740           168.
 6 F     Summer Basketball        1256           182.
 7 F     Summer Beach Volleyball   276           179.
 8 F     Summer Boxing              72           169.
 9 F     Summer Canoeing          1380           170.
10 F     Summer Croquet              6           NaN 
# ℹ 110 more rows

pipes

pipes

Data pipelines (|>)

  • Often, we want to use several verbs (filter, arrange, group_by, summarise…)

  • Multiple operations are difficult to read, or require to create multiple intermediate objects:

nested

res_nested <- summarise(
  group_by(
    filter(
      olympics,
      medal == "Gold",
      year > 1996,
      team %in% c("Spain", "Portugal")
    ),
    sex, season, sport
  ),
  n = n(),
  average_height = mean(height, na.rm = TRUE)
)

steps

last_years <- filter(
  olympics,
  medal == "Gold",
  year > 1996,
  team %in% c("Spain", "Portugal")
)
last_years_grouped <- group_by(
  last_years, sex, season, sport
)
res_steps <- summarise(
  last_years_grouped,
  n = n(),
  average_height = mean(height, na.rm = TRUE)
)

pipes

Data pipelines (|>)

  • Often, we want to use several verbs (filter, arrange, group_by, summarise…)

  • Multiple operations are difficult to read, or require to create multiple intermediate objects:

pipes

res_pipe <- olympics |>
  filter(
    medal == "Gold",
    year > 1996,
    team %in% c("Spain", "Portugal")
  ) |>
  group_by(sex, season, sport) |>
  summarise(
    n = n(),
    average_height = mean(height, na.rm = TRUE)
  )

Other useful verbs

  • pull

  • case_when

  • arrange

  • bind_cols, bind_rows, left_join, inner_join and other joins (not explained here)

Pulling variables to vectors (pull)

olympics |>
  pull(sport) |>
  unique()
 [1] "Basketball"                "Judo"                     
 [3] "Football"                  "Tug-Of-War"               
 [5] "Speed Skating"             "Cross Country Skiing"     
 [7] "Athletics"                 "Ice Hockey"               
 [9] "Swimming"                  "Badminton"                
[11] "Sailing"                   "Biathlon"                 
[13] "Gymnastics"                "Art Competitions"         
[15] "Alpine Skiing"             "Handball"                 
[17] "Weightlifting"             "Wrestling"                
[19] "Luge"                      "Water Polo"               
[21] "Hockey"                    "Rowing"                   
[23] "Bobsleigh"                 "Fencing"                  
[25] "Equestrianism"             "Shooting"                 
[27] "Boxing"                    "Taekwondo"                
[29] "Cycling"                   "Diving"                   
[31] "Canoeing"                  "Tennis"                   
[33] "Modern Pentathlon"         "Figure Skating"           
[35] "Golf"                      "Softball"                 
[37] "Archery"                   "Volleyball"               
[39] "Synchronized Swimming"     "Table Tennis"             
[41] "Nordic Combined"           "Baseball"                 
[43] "Rhythmic Gymnastics"       "Freestyle Skiing"         
[45] "Rugby Sevens"              "Trampolining"             
[47] "Beach Volleyball"          "Triathlon"                
[49] "Ski Jumping"               "Curling"                  
[51] "Snowboarding"              "Rugby"                    
[53] "Short Track Speed Skating" "Skeleton"                 
[55] "Lacrosse"                  "Polo"                     
[57] "Cricket"                   "Racquets"                 
[59] "Motorboating"              "Military Ski Patrol"      
[61] "Croquet"                   "Jeu De Paume"             
[63] "Roque"                     "Alpinism"                 
[65] "Basque Pelota"             "Aeronautics"              

Conditional cases (case_when)

olympics |>
  mutate(
    age_class = case_when(
      age < 16 ~ "junior",
      age > 50 ~ "senior",
      .default = "elite"
    )
  )
# A tibble: 271,116 × 16
      id name     sex     age height weight team  noc   games  year season city 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, age_class <chr>

Arranging data (arrange)

olympics |>
  arrange(year, season, sport, medal)
# A tibble: 271,116 × 15
       id name    sex     age height weight team  noc   games  year season city 
    <dbl> <chr>   <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1  22700 "James… M        27    175     72 Unit… USA   1896…  1896 Summer Athi…
 2  25072 "Evang… M        NA     NA     NA Gree… GRE   1896…  1896 Summer Athi…
 3  40851 "Charl… M        23     NA     NA Grea… GBR   1896…  1896 Summer Athi…
 4  41160 "Dimit… M        21     NA     NA Gree… GRE   1896…  1896 Summer Athi…
 5  58546 "Gyula… M        24     NA     NA Hung… HUN   1896…  1896 Summer Athi…
 6  66362 "Franc… M        21    170     69 Unit… USA   1896…  1896 Summer Athi…
 7  68911 "Albin… M        21     NA     NA Fran… FRA   1896…  1896 Summer Athi…
 8  91250 "Georg… M        NA     NA     NA Gree… GRE   1896…  1896 Summer Athi…
 9  93606 "Ioann… M        NA     NA     NA Gree… GRE   1896…  1896 Summer Athi…
10 117571 "Alajo… M        24     NA     NA Hung… HUN   1896…  1896 Summer Athi…
# ℹ 271,106 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>