Parallel computation in R

2023-12-04

Introduction

This document explains the basic concepts of parallel computing in R, with code examples to illustrate them. The topics covered include:

  • What is parallel computing?
  • When can we use it?
  • A little introduction to loops and maps in R (for, lapply, map…)
  • Ways to use parallelization in your code (parallel, furrr…)
  • How to check if it is worth the hassle

What is parallel computing?

[Diagram: two CPUs (CPU #1 and CPU #2), each containing several cores]

Figure 1: Modern CPU and cores

First of all, we need to understand a little about CPUs (Central Processing Units) and cores. Modern computers (Figure 1) have multiple CPUs, and each CPU can have one or multiple cores. Each core is responsible for running individual processes.

Think of a simple algebraic operation, adding two numbers (1 + 1). In a nutshell, that operation is translated to machine code and a process is started in one core to perform the operation (Figure 2).

[Diagram: one core runs the '1 + 1' process while the remaining cores are available]

Figure 2: One core is performing the '1 + 1' operation. This leaves the other cores available to concurrently start other processes

So CPU cores can be used to run the same process with different data in parallel, speeding up long tasks. In theory, this can reduce the running time to $1/n_{cores}$ of the sequential time (for example, a 60-second task split across 4 cores would ideally take 15 seconds). In reality, other factors come into play (time spent transferring data to each process, time spent gathering the results from the different processes and joining them, time spent spawning new processes…), so the actual gain depends heavily on the type of operations and the data types involved.

Note

In fact, sometimes workflows are slower when parallelized, so we always need to check if we are really saving time and effort. See below for more information on this.

R is a single process software

R is designed to run in a single process on a single CPU. This goes back to the time when S (R's predecessor) and R were developed, when CPUs mostly had one core and multitasking and parallel computation were not yet widely available or were still immature technologies.

Given these limitations, parallel computing in R is not available out of the box. We need extra packages/libraries, and we can only use it in specific cases and with specific data types.

When can we use it?

You have probably been using parallel computing in R without knowing it. Many R base functions are calls to routines written in compiled languages (C, C++, Fortran…) that support multithreading. For example, matrix operations and other linear algebra routines (common operations when calculating regression coefficients with lm and other model functions) are calls to compiled methods that can be parallelized and use the CPU cores available in your system (Listing 1)

observations <- matrix(rnorm(1e6 * 100), 1e6, 100)
predictions <- rnorm(100)
system.time(outcome <- drop(observations %*% predictions) + rnorm(1e6))

Listing 1: Time consumed by matrix operations. When the matrix product is parallelized (for example, by a multithreaded linear algebra library), the user time (total CPU time) is bigger than the elapsed time (real time): the task consumes more CPU time across cores but less real time. If both values are similar, the operation ran on a single core.

   user  system elapsed 
  0.237   0.000   0.237 

Note

Some other R packages implement the methods we'll explain in the following sections and offer arguments for the user to choose if and how parallelization should be done. For example, the boot package offers parallelization options when bootstrapping model coefficients.

Working with embarrassingly parallel problems

Embarrassingly parallel problems are those where we can easily separate the problem into several parallel tasks 1. This kind of problem is very common in scientific and statistical fields. Think of the classic data analysis workflow shown below (Figure 3).

[Diagram: Process 1: Read file → Transform data → Model data → Visualize results; repeat after we finish with a file]

Figure 3: Classic data analysis workflow

In this process, we need to ingest the data, process it (clean, transform…), model it and finally visualize/store the results. Now imagine we have to repeat the same process for hundreds or thousands of data files (e.g., remote sensing images, genomic analyses, historical and projected climate analyses…). Instead of processing each file one after another (sequentially), we can divide the input (the names of the files to read) into chunks and send each chunk to a CPU process running in parallel, which can save a lot of time and effort (Figure 4).

[Diagram: the file names are split among Process 1, Process 2 and Process 3, each running Read file → Transform data → Model data → Visualize results]

Figure 4: Same data analysis workflow as before but running in parallel, each process in a different CPU core

This kind of embarrassingly parallel task is the one that benefits most from parallelization. A rough sketch of this pattern is shown below.
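
As a minimal sketch of the pattern in Figure 4 (the folder, file names, column names and the process_file helper are hypothetical, just for illustration), using one of the tools explained later (mclapply from the parallel package):

# libraries
library(parallel)

# hypothetical function covering the read -> transform -> model steps
process_file <- function(file_name) {
  file_data <- read.csv(file_name)            # read file
  clean_data <- na.omit(file_data)            # transform data
  model_res <- lm(y ~ x, data = clean_data)   # model data (assumes y and x columns)
  coefficients(model_res)                     # return results to visualize/store later
}

# hypothetical folder with hundreds of csv files
file_names <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)

# sequential version: one file after another
# results <- lapply(file_names, process_file)

# parallel version: the file names are split among 4 CPU processes
results <- mclapply(file_names, process_file, mc.cores = 4)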

A little introduction to loops and maps in R (for, lapply, map…)

Loops

We talked before about embarrassingly parallel problems: repetitive tasks that have little or no connection between each other beyond the origin of the inputs and can easily be separated into parallel tasks.
These tasks are usually the ones we think about when we talk about for loops. One example can be bootstrapping model coefficients (Listing 2). For example, we are interested in the relationship between petal length and Iris species (example adapted from the doParallel package vignette):

# libraries
library(dplyr)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")
# we measure the time for illustration purposes
system.time({
  for (index in 1:n_repetitions) {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    res_coefs[[index]] <- coefficients(model_res)
  }
})

Listing 2: Bootstrapping model coefficients in the iris dataset with a for loop

   user  system elapsed 
 14.834   0.000  14.854 

We can see the user time (CPU time) is roughly the same as the elapsed time (real time), as we should expect from a sequential for loop.

lapply

The same problem can be solved with lapply, but we need to encapsulate the logic of the for loop in a function (Listing 3):

# create the function that bootstraps the model coefficients once
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the lapply (we measure the time again for illustration purposes)
system.time(
  res_coefs <- lapply(1:n_repetitions, coef_function)
)

Listing 3: Bootstrapping model coefficients in the iris dataset with lapply

   user  system elapsed 
 15.707   0.000  15.727 

As we can see, the time is roughly the same as with the for loop, as we would expect.

map

If we use tidyverse packages, instead of lapply we can use the map function from the purrr package (Listing 4):

# libraries
library(purrr)

# create the function that bootstraps the model coefficients once
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the map (we measure the time again for illustration purposes)
system.time({
  res_coefs <- purrr::map(1:n_repetitions, .f = coef_function)
})

Listing 4: Bootstrapping model coefficients in the iris dataset with map

   user  system elapsed 
 13.966   0.000  13.981 

Again, the times are similar to the other sequential workflows.

Ways to use parallelization in your code (parallel, furrr…)

If we can use loops, lapply or map, then we can parallelize without any problem. In this section we will see the different options available to do it.

Preparations

Before we start, we need to know how many cores are available in our system. This can be done with parallel::detectCores(). On the system where this document was created, the available cores are:

library(parallel)
parallel::detectCores()

Listing 5: Number of cores available

[1] 24

Tip

In the following examples we will be using 4 cores, but if your system has fewer than that, please change it to a valid number.

foreach and doParallel

In a very similar way to a for loop, we can use the foreach and doParallel packages to build a loop that runs the repetitions in parallel (Listing 6):

# libraries
library(parallel)
library(foreach)
library(doParallel)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# set the number of cores to use, in this example 4
doParallel::registerDoParallel(cores = 4)

# foreach loop (for illustration purposes, we check the time used)
system.time({
  res_coefs <- foreach::foreach(index = 1:n_repetitions) %dopar% {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    coefficients(model_res)
  }
})

Listing 6: Bootstrapping model coefficients in the iris dataset in parallel with foreach

   user  system elapsed 
 20.601   0.252   6.204 

As we can see, the elapsed time has been reduced almost four-fold compared with processing the repetitions sequentially. We are really close to the ideal $1/4$ reduction in time we would expect from using 4 cores, but not quite, as starting the extra R processes, sending the data and retrieving the results takes some time. With bigger datasets, this communication overhead grows and so does the elapsed time.

mclapply

If we prefer the lapply syntax, we can use mclapply to run the same expression concurrently. mclapply belongs to the parallel package and works almost exactly like lapply (Listing 7). Note that mclapply relies on process forking, so on Windows it can only run sequentially (mc.cores = 1).

# create the function that bootstraps the model coefficients once
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the mclapply (we measure the time again for illustration purposes)
system.time({
  res_coefs <- mclapply(1:n_repetitions, coef_function, mc.cores = 4)
})

Listing 7: Bootstrapping model coefficients in the iris dataset in parallel with mclapply

   user  system elapsed 
 12.817   0.140   4.871 

Again we see the reduction in elapsed time with mclapply.

future_map

The furrr package offers parallelized versions of the purrr::map family of functions. We can use it to run the map example above in parallel (Listing 8):

# libraries
library(future)
library(furrr)

# create the function that bootstraps the model coefficients once
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# setting core options (the future.callr package must be installed)
future::plan(future.callr::callr, workers = 4)

# and now the map (we measure the time again for illustration purposes)
system.time({
  res_coefs <- furrr::future_map(1:n_repetitions, .f = coef_function)
})

Listing 8: Bootstrapping model coefficients in the iris dataset in parallel with furrr

   user  system elapsed 
 21.032   0.299   5.746 

This is the method with the worst parallel time (though still better than sequential). This is because future_map sets up a more complete environment in the parallelized processes, which takes more time. With larger datasets and more complex functions, future_map is a good option for parallelization.
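
Once the parallel part of the script is done, we can switch the future backend back to sequential processing (a minimal sketch, assuming the future package is already loaded as above):

# return to sequential processing when the parallel section is finished
future::plan(future::sequential)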

Tip

Calculating bootstrapped coefficients is a toy example for illustration purposes. There are R packages that bootstrap in more efficient ways, with parallelization already built in, like the boot package (sketched below).
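
As a hedged sketch (the coef_statistic function below is hypothetical and simply mirrors the iris example used throughout this document), boot can run the resampling in parallel through its parallel and ncpus arguments:

# libraries
library(boot)
library(dplyr)

# same data as in the rest of the examples
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# statistic function: boot passes the data and the resampled indices
coef_statistic <- function(data, indices) {
  model_res <- glm(
    data[indices, "Species"] ~ data[indices, "Petal.Length"],
    family = binomial
  )
  coefficients(model_res)
}

# "multicore" uses forked processes; on Windows, use parallel = "snow"
boot_res <- boot::boot(
  iris_data, coef_statistic, R = 1e4,
  parallel = "multicore", ncpus = 4
)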

How to check if parallelization is worth the hassle

We have been using system.time to check how long our code takes to run. This is one way of checking whether the benefit of parallelization outweighs the inconvenience of setting it up. Other ways include using benchmarking libraries, like bench or rbenchmark. Their documentation explains everything you need to know about timing code. A minimal comparison with bench is sketched below.
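
As a minimal sketch (assuming the coef_function and iris_data objects from the previous examples are already defined, and using fewer repetitions to keep the benchmark quick), bench can compare the sequential and parallel versions directly:

# libraries
library(bench)
library(parallel)

# compare a sequential and a parallel run of the same bootstrap
# (check = FALSE because the random resampling makes results differ between runs)
benchmark_res <- bench::mark(
  sequential = lapply(1:1000, coef_function),
  parallel = mclapply(1:1000, coef_function, mc.cores = 4),
  check = FALSE,
  iterations = 5
)
benchmark_res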

In any case, as a rule of thumb, parallelization is not worth it in the following scenarios:

  1. We are parallelizing fast operations: when the operations we want to parallelize are fast (math operations, indexing or assigning), the overhead of sending data to the cores and retrieving and joining the results is usually greater than the time spent performing the operations themselves (see the sketch after this list).

  2. We are parallelizing processes that involve sending a lot of data to each spawned parallel process. Think, for example, of working with a global raster at 5 km resolution. If we want to parallelize a process involving calculations with this kind of object, the overhead of sending the data to each process could be bigger than the computation itself.

  3. Related to the previous point, when performing memory-intensive processes. Take into account that each spawned parallel process needs as much memory as the sequential process would. Parallelizing on the same computer multiplies the memory needed by almost the number of cores used, so it is easy to run out of memory if we are not careful.
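
As a minimal sketch of the first scenario (exact timings will vary between systems), parallelizing a very fast operation like a square root is usually slower than running it sequentially because of the communication overhead:

# libraries
library(parallel)

# a very fast operation: the overhead of spawning processes and moving
# data usually outweighs the computation itself
system.time(res_sequential <- lapply(1:1e5, sqrt))
system.time(res_parallel <- mclapply(1:1e5, sqrt, mc.cores = 4))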


  1. Also known as perfectly parallel, delightfully parallel or pleasingly parallel problems, but those names don't have the same ring to them ↩︎