Data visualization with the tidyverse

So what exactly is in the tidyverse?

  • ggplot2: a system for creating graphics, based on the Grammar of Graphics

  • readr: a fast and friendly way to read rectangular data (csv, txt…)

  • tibble: a modern re-imagining of the data frame, keeping what time has proven to be effective and throwing out what has not

  • stringr: a cohesive set of functions designed to make working with strings as easy as possible

  • forcats: a suite of useful tools that solve common problems with factors

  • dplyr: a grammar of data manipulation, with a consistent set of verbs that solve the most common data manipulation challenges

  • tidyr: a set of functions that help you get to tidy data

  • purrr: enhances R’s functional programming (FP) toolkit with a complete and consistent set of tools for working with functions and vectors

Data

Let’s work with the same data as in the carpentry lesson.

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-08-06/olympics.csv') 

ggplot2

ggplot2

  • ggplot2 package
    • library(ggplot2)
  • Based on Leland Wilkinson’s “The Grammar of Graphics”
  • Takes care of plot formatting automatically (text, titles, margins, colors…)
    • It does a lot of things by default
    • But they can be changed as we want
  • Easy to use

ggplot2

Installing:

install.packages("ggplot2")
library(ggplot2)

ggplot2

ggplot(olympics) +
  geom_boxplot(aes(x = sport, y = age))

ggplot2

Based on Leland Wilkinson’s “The Grammar of Graphics”:

In summary, the grammar of graphics says that a statistical plot is a mapping from data to aesthetic attributes (position, colour, shape, size…) of geometric objects (points, lines, bars).
This mapping may also include statistical transformations (logs, smooths…) and specific coordinate systems (cartesian, polar…).

ggplot2

  • Main components of a ggplot plot
    • Data: a data.frame
    • aesthetics: how the data is mapped, and to which visual properties
    • geom: geometric objects mapped
    • facets: conditional panels
    • stats: statistical transformations (ablines, histograms…)
    • scales: mapping scales (color, sizes, axes)
    • coordinate system
    • themes: predefined and custom themes and modifications
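As a rough sketch of how these components fit together in one call, the plot below labels each piece in a comment. It uses the built-in mtcars data as a stand-in for the olympics data, and the name p_components is just an illustrative choice.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the olympics data here.
p_components <-
  ggplot(data = mtcars, aes(x = wt, y = mpg)) +      # data + aesthetics
  geom_point(aes(color = factor(cyl))) +             # geom
  stat_smooth(method = "lm", formula = y ~ x) +      # stat
  facet_grid(cols = vars(am)) +                      # facets
  scale_color_brewer(palette = "Dark2") +            # scale
  coord_cartesian() +                                # coordinate system
  theme_minimal()                                    # theme

p_components
```

Every line after the first is optional: ggplot2 falls back to default stats, scales, coordinates and theme when they are not given.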

Practical examples

ggplot2

Histogram

# age hist
olympics_histogram <-
  ggplot(data = olympics, aes(x = age)) +
  geom_histogram()

olympics_histogram

ggplot2

Scatter plot (points)

# body scatter
olympics_scatter <-
  ggplot(data = olympics, aes(x = weight, y = height)) +
  geom_point()

olympics_scatter

ggplot2

Boxplots

# age-sport boxplot
olympics_box <-
  ggplot(olympics) +
  geom_boxplot(aes(x = sport, y = age))

olympics_box

Is that all??

ggplot2

  • Main components of a ggplot plot
    • Data: a data.frame
    • aesthetics: how the data is mapped, and to which visual properties
    • geom: geometric objects mapped
    • facets: conditional panels
    • stats: statistical transformations (ablines, histograms…)
    • scales: mapping scales (color, sizes, axes)
    • coordinate system
    • themes: predefined and custom themes and modifications

aesthetics

Color, size, shapes and other aesthetics

We can set an aesthetic to a fixed value in the geom call, outside aes().

ggplot(data = olympics, aes(x = weight, y = height)) +
  geom_point(color = "blue", alpha = 0.3, shape = 5)

aesthetics

Color, size, shapes and other aesthetics

Or we can map aesthetics to variables in the data with aes().

ggplot(data = olympics, aes(x = weight, y = height)) +
  geom_point(aes(color = sport, shape = sex))
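The difference between setting and mapping is easy to mix up, so here is a small sketch of both, plus a common slip. It assumes the built-in mtcars data instead of olympics so that it runs without a download; the object names are illustrative.

```r
library(ggplot2)

# Setting: a constant outside aes() is applied literally to every point.
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

# Mapping: a column inside aes() is translated to colors by a scale,
# and a legend is drawn.
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Common slip: a constant inside aes() is treated as data with a single
# level, so the points get the default palette color, not blue.
p_slip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))

# layer_data() shows the color each point actually receives.
unique(layer_data(p_set)$colour)
unique(layer_data(p_slip)$colour)
```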

aesthetics

Common aesthetics:

  • position (x, y)
  • color (color, fill, alpha)
  • shape (shape, linetype)
  • size (size, linewidth)

library(dplyr)  # provides select() and filter()

olympics |>
  select(sport, age, sex) |>
  filter(sport %in% c("Basketball", "Luge")) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age)
  ) +
  geom_boxplot(
    aes(x = sport, y = age)
  )

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, color = medal)
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent'
  )

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent'
  )

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  )

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age, linetype = sport),
    fill = 'transparent', linewidth = 2,
    outliers = FALSE
  )

facets

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  )

facets

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  ) +
  facet_grid(cols = vars(sex))

facets

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 3
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  ) +
  facet_grid(cols = vars(sex), rows = vars(medal))

stats

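Statistical transformations can happen automatically when data are mapped to some geometries. In fact, we have already seen a couple of examples: the histogram (which bins and counts the data) and the boxplot (which computes the median and quartiles).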
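One way to see an automatic stat at work is to look at the data a layer actually draws. The sketch below assumes the built-in mtcars data and an arbitrary choice of 10 bins; layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# One row per bin with its computed count, not one row per car.
binned <- layer_data(p_hist, 1)
binned[, c("xmin", "xmax", "count")]

# The same plot written from the stat side: stat_bin() drawn with bars.
p_stat <- ggplot(mtcars, aes(x = mpg)) +
  stat_bin(geom = "bar", bins = 10)
```

Each geom has a default stat and each stat a default geom, which is why both spellings give the same bars.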

stats

But sometimes we need/want to add a transformation layer ourselves.

ggplot(olympics, aes(x = weight, y = height)) +
  geom_point(alpha = 0.1) +
  stat_smooth(method = "lm")

stats

But sometimes we need/want to add a transformation layer ourselves.

ggplot(olympics, aes(x = weight, y = height, color = sex)) +
  geom_point(alpha = 0.1, color = "black") +
  stat_smooth(method = "lm") +
  stat_ellipse(linewidth = 3)

scales

Scales apply to every mapped aesthetic: not only position (x, y…), but also colors, shapes, sizes…

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 8
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  ) +
  scale_colour_manual(values = c("darkorange2", "gold", "gray50")) +
  scale_shape_manual(values = c("♀", "♂"))
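Unnamed values in a manual scale are matched to the levels in their default (alphabetical) order, which is easy to get wrong. A possible safeguard is a named vector, sketched below on a small made-up data frame; it assumes the levels are spelled Gold, Silver, Bronze and F, M, as the plots above suggest.

```r
library(ggplot2)

# Toy data, invented for illustration only.
toy <- data.frame(
  medal = c("Gold", "Silver", "Bronze", "Gold"),
  sex   = c("F", "M", "F", "M"),
  age   = c(24, 27, 31, 22)
)

p_named <- ggplot(toy, aes(x = medal, y = age, color = medal, shape = sex)) +
  geom_point(size = 8) +
  # Names pin each value to a level, regardless of the vector order.
  scale_colour_manual(
    values = c(Gold = "gold", Silver = "gray50", Bronze = "darkorange2")
  ) +
  scale_shape_manual(values = c(F = "\u2640", M = "\u2642"))

p_named
```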

coords

ggplot(olympics, aes(x = weight, y = height, color = sex)) +
  geom_point(alpha = 0.1, color = "black") +
  stat_smooth(method = "lm") +
  stat_ellipse(linewidth = 3)

coords

ggplot(olympics, aes(x = weight, y = height, color = sex)) +
  geom_point(alpha = 0.1, color = "black") +
  stat_smooth(method = "lm") +
  stat_ellipse(linewidth = 3) +
  coord_fixed(ratio = 1)

coords

ggplot(olympics, aes(x = weight, y = height, color = sex)) +
  geom_point(alpha = 0.1, color = "black") +
  stat_smooth(method = "lm") +
  stat_ellipse(linewidth = 3) +
  coord_flip()
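A detail worth checking with coord_flip(): it swaps the axes at drawing time, after the stats have run, so the fitted line is still a regression of the original y on the original x. Swapping the variables inside aes() fits the reverse model instead. A minimal sketch, assuming the built-in mtcars data in place of olympics:

```r
library(ggplot2)

# Flipped coordinates: the model is still mpg ~ wt, only drawn sideways.
p_flip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x) +
  coord_flip()

# Swapped aesthetics: the model becomes wt ~ mpg.
p_swap <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# The smoother's x values show which variable acted as the predictor.
fit_flip <- layer_data(p_flip, 2)
fit_swap <- layer_data(p_swap, 2)
range(fit_flip$x)  # spans the range of wt
range(fit_swap$x)  # spans the range of mpg
```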

themes

ggplot2 ships with predefined themes:

olympics |>
  select(sport, age, sex, medal) |>
  filter(sport %in% c("Basketball", "Luge"), !is.na(medal)) |>
  ggplot() +
  geom_point(
    aes(x = sport, y = age, shape = sex, color = medal),
    alpha = 0.3, position = 'jitter', size = 8
  ) +
  geom_boxplot(
    aes(x = sport, y = age),
    fill = 'transparent',
    outliers = FALSE
  ) +
  scale_colour_manual(values = c("darkorange2", "gold", "gray50")) +
  scale_shape_manual(values = c("♀", "♂")) +
  theme_minimal()

themes

But every element of a theme can be modified with theme():

?theme
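As a small illustration of the kind of changes theme() allows, the sketch below tweaks three elements on top of a predefined theme. The particular choices (legend at the bottom, rotated x labels, no minor grid) and the name my_theme are arbitrary, and mtcars stands in for the olympics data.

```r
library(ggplot2)

# A reusable set of modifications; each argument targets one theme element.
my_theme <- theme(
  legend.position  = "bottom",                           # move the legend
  axis.text.x      = element_text(angle = 45, hjust = 1), # rotate x labels
  panel.grid.minor = element_blank()                     # drop minor grid lines
)

p_theme <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(am))) +
  geom_point() +
  theme_minimal() +  # predefined theme first
  my_theme           # then the custom overrides

p_theme
```

The order matters: adding a complete theme such as theme_minimal() after the custom theme() call would reset those elements to the predefined values.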