Parallel computation in R

2023-12-04


Introduction

This document will explain the basic concepts of parallel computing in R, with code examples to illustrate the concepts presented here. The topics covered include:

  • What is parallel computing?
  • When can we use it?
  • A little introduction to loops and maps in R (for, lapply, map…)
  • Ways to use parallelization in your code (parallel, furrr…)
  • How to check if it is worth the hassle

What is parallel computing?

[Diagram: two CPUs (CPU #1 and CPU #2), each with four cores]

Figure 1: Modern CPU and cores

First of all, we need to understand a little about CPUs (Central Processing Units) and cores. Modern computers (Figure 1) can have multiple CPUs, and each CPU can have one or more cores. Each core is responsible for running individual processes.

Think of a simple algebraic operation, adding two numbers (1 + 1). In a nutshell, that operation is translated to machine code and a process is started in one core to perform the operation (Figure 2).

[Diagram: one core of CPU #1 runs the '1 + 1' process while the other seven cores remain available]

Figure 2: One core is performing the ‘1 + 1’ operation. This leaves the other cores available to concurrently run other processes

So CPU cores can be used to run the same process with different data in parallel to speed up long tasks. In theory, this can reduce the execution time to $1/n_{cores}$ of the original, but in reality other factors come into play (time spent transferring data to each process, gathering and joining the results from the different processes, spawning new processes…), so the actual time gain depends heavily on the type of operations and the data types used.
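As a rough mental model (a simplification we add here, not an exact law), the parallel run time can be pictured as

$$T_{\text{parallel}} \approx \frac{T_{\text{sequential}}}{n_{\text{cores}}} + T_{\text{overhead}}$$

where $T_{\text{overhead}}$ gathers the time spent spawning processes, transferring data and collecting results. Parallelization only pays off while the first term shrinks more than the second one grows.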

Note

In fact, sometimes workflows are slower when parallelized, so we always need to check if we are really saving time and effort. See below for more information on this.

R is single-process software

R is designed to run as a single process, using one CPU core. This goes back to the time when S (R’s predecessor) and R were developed, when CPUs mostly had one core and multitasking and parallel computation were not yet widely available or the technologies were still immature.

Given the limitations explained above, parallel computing in R is not available out of the box. We need to use extra packages/libraries, and we can only use it in specific cases and with specific data types.

When can we use it?

You may have been using parallel computing in R without knowing it. Many R base functions are calls to routines written in compiled languages that support multithreading (C, C++, Fortran…). For example, matrix operations and other linear algebra functions (common operations when calculating regression coefficients with lm and other model functions) are calls to compiled routines that, when R is linked to a multithreaded BLAS library, are parallelized and use the CPU cores available in your system (Listing 1)

observations <- matrix(rnorm(1e6 * 100), 1e6, 100)
predictions <- rnorm(100)
system.time(outcome <- drop(observations %*% predictions) + rnorm(1e6))

Listing 1: Time consumed by matrix operations. When the matrix product is parallelized by a multithreaded BLAS, the user time (total CPU time across cores) is bigger than the elapsed time (real time); when both values are similar, as in this run, the operation effectively ran on a single core

   user  system elapsed 
  0.139   0.004   0.143 

Note

Some other R packages have implemented the methods we’ll explain in the following sections and offer arguments so the user can choose if and how parallelization should be done. For example, the boot package offers parallelization options when bootstrapping model coefficients (see the sketch below).
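As an illustration, here is a minimal sketch (ours, not taken from the boot documentation; the statistic name is made up for the example) of how those options look: the statistic receives the data and the bootstrap indices, and boot() spreads the R replicates over ncpus cores.

# minimal sketch of boot's built-in parallelization options
library(boot)

# the statistic function receives the data and the bootstrap indices
mean_petal <- function(data, indices) {
  mean(data$Petal.Length[indices])
}

boot_res <- boot::boot(
  iris, mean_petal, R = 1e4,
  parallel = "multicore", ncpus = 4  # use parallel = "snow" on Windows
)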

Working with embarrassingly parallel problems

Embarrassingly parallel problems are those where we can easily separate the problem into several parallel tasks 1. These kinds of problems are very common in scientific and statistical fields. Think of the classic data analysis workflow shown below (Figure 3).

[Diagram: a single process repeating Read file → Transform data → Model data → Visualize results for each file]

Figure 3: Classic data analysis workflow

In this process, we need to ingest the data, process it (clean, transform…), model it, and finally visualize/store the results. Now imagine we have to repeat the same process for hundreds or thousands of data files (e.g., remote sensing images, genomic analyses, historical and projected climate analyses…). Instead of processing each task one after another (sequentially), we can divide the input (the names of the files to read) into chunks and send each chunk to a CPU process running in parallel, which can save a lot of time and effort (Figure 4).

[Diagram: file names are distributed to three processes, each running Read file → Transform data → Model data → Visualize results in parallel]

Figure 4: Same data analysis workflow as before but running in parallel, each process in a different CPU core

These kinds of embarrassingly parallel tasks are the ones that benefit the most from parallelization.

A little introduction to loops and maps in R (for, lapply, map…)

Loops

We talked before about embarrassingly parallel problems: repetitive tasks that have little or no connection with each other beyond the origin of their inputs, and that can be easily separated into parallel tasks.
These tasks are usually the ones we think about when we talk about for loops. One example is bootstrapping model coefficients (Listing 2). Say we are interested in the relationship between petal length and Iris species (example adapted from the doParallel package vignette):

# libraries
library(dplyr)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")
# we measure the time for illustration purposes
system.time({
  for (index in 1:n_repetitions) {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    res_coefs[[index]] <- coefficients(model_res)
  }
})

Listing 2: Bootstrapping model coefficients in the iris dataset with a for loop

   user  system elapsed 
 11.097   0.016  11.120 

We can see the user time (CPU time) is roughly the same as the elapsed time (real time), as we should expect from a sequential for loop.

lapply

The same problem can be solved with lapply, but we need to encapsulate the logic of the for loop in a function (Listing 3):

# create the function that runs one bootstrap repetition
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the lapply (we monitor the time again for illustration purposes)
system.time(
  res_coefs <- lapply(1:n_repetitions, coef_function)
)

Listing 3: Bootstrapping model coefficients in the iris dataset with lapply

   user  system elapsed 
 11.131   0.004  11.142 

As we can see, the time is roughly the same as with the for loop, which is what we would expect.

map

If we are using tidyverse packages, instead of lapply we will use the map function from the purrr package (Listing 4):

# libraries
library(purrr)

coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the map (we monitor the time again for illustration purposes)
system.time({
  res_coefs <- purrr::map(1:n_repetitions, .f = coef_function)
})

Listing 4: Bootstrapping model coefficients in the iris dataset with map

   user  system elapsed 
 11.189   0.000  11.197 

Again, the times are similar to those of the other workflows.

Ways to use parallelization in your code (parallel, furrr…)

If we can use loops, lapply or map, then we can parallelize without any problem. In this section we will see the different options we have to do it.

Preparations

Before we start, we need to know how many cores are available in our system. This can be done with parallel::detectCores(). On the system where this document was created, the available cores are:

library(parallel)
parallel::detectCores()

Listing 5: Number of cores available

[1] 24

Tip

In the following examples we will be using 4 cores, but if your system has fewer than that, please change it to a valid number.
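If you want the examples to adapt automatically to the machine they run on, one common convention (an assumption on our part, not a rule from any package) is to cap the number of cores and always leave one free for the operating system:

# choose at most 4 cores, always leaving one core free
library(parallel)
n_cores <- max(1, min(4, parallel::detectCores() - 1))
n_cores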

foreach and doParallel

In a very similar way to a for loop, we can use the foreach and doParallel packages to build a loop that runs the repetitions in parallel (Listing 6):

# libraries
library(parallel)
library(foreach)
library(doParallel)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# set the number of cores to use, in this example 4
doParallel::registerDoParallel(cores = 4)

# foreach loop (for illustration purposes, we check the time used)
system.time({
  res_coefs <- foreach::foreach(index = 1:n_repetitions) %dopar% {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    coefficients(model_res)
  }
})

Listing 6: Bootstrapping model coefficients in the iris dataset in parallel with foreach

   user  system elapsed 
 12.268   0.375   3.754 

As we can see, the time is reduced almost fourfold compared with running the repetitions sequentially. We are really close to the ideal $1/4$ of the original time we would expect from using 4 cores, but not quite, because starting the extra R processes, sending the data and retrieving the results takes some time. With bigger datasets, the elapsed time increases because of this communication overhead.

mclapply

If we prefer the lapply syntax, we can use mclapply to run the same expression concurrently. mclapply belongs to the parallel package and works almost exactly the same as lapply (Listing 7). Note that mclapply relies on forking, which is not available on Windows (there, mc.cores must be 1):

# libraries
library(parallel)

# create the function that runs one bootstrap repetition
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the mclapply (we monitor the time again for illustration purposes)
system.time({
  res_coefs <- mclapply(1:n_repetitions, coef_function, mc.cores = 4)
})

Listing 7: Bootstrapping model coefficients in the iris dataset in parallel with mclapply

   user  system elapsed 
  8.590   0.268   2.981 

We again see the reduction in time with mclapply.
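For completeness, the parallel package also offers a cluster-based alternative, parLapply, which works on all operating systems (including Windows, where forking is not available). A minimal sketch, assuming the iris_data, coef_function and n_repetitions objects defined above:

# create a cluster of 4 worker processes
cl <- parallel::makeCluster(4)
# export the objects the workers need
parallel::clusterExport(cl, varlist = c("iris_data", "coef_function"))
# run the repetitions on the cluster
res_coefs <- parallel::parLapply(cl, 1:n_repetitions, coef_function)
# always stop the cluster when done
parallel::stopCluster(cl)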

future_map

The furrr package offers parallelized versions of the purrr::map family of functions. We can use it to run the map example above in parallel (Listing 8):

# libraries
library(future)
library(furrr)

coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# setting the parallelization plan (callr backend from the future.callr package), with 4 workers
future::plan(future.callr::callr, workers = 4)

# and now the future_map (we monitor the time again for illustration purposes)
system.time({
  res_coefs <- furrr::future_map(1:n_repetitions, .f = coef_function)
})

Listing 8: Bootstrapping model coefficients in the iris dataset in parallel with furrr

   user  system elapsed 
 13.950   0.566   3.786 

This is the method with the worst time among the parallel options (though still better than sequential). This is because future_map sets up a more complete environment in the parallelized processes, which takes more time. With larger datasets and more complex functions, future_map is a good option for parallelization.
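A small usage note (not shown in the original listings): once the parallel work is done, we can tell future to go back to sequential execution so later code does not keep the workers around.

# return to the default sequential plan
future::plan(future::sequential)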

Tip

Calculating bootstrapped coefficients is a toy example for illustration purposes. There are R packages, like boot, that can bootstrap in more efficient ways and already include parallelization options.

How to check if parallelization is worth the hassle

We have been using system.time to check the time our code takes to run. This is one way of checking whether the benefit of parallelization outweighs the inconvenience of setting it up. Other ways include using dedicated benchmarking libraries, like bench or rbenchmark. Their documentation explains everything you need to know about timing code. A quick sketch with bench is shown below.
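As a sketch (assuming the coef_function and iris_data objects defined in the previous sections are available), bench::mark can time the sequential and parallel versions side by side; check = FALSE is needed because each bootstrap run returns different random samples.

# compare sequential and parallel bootstrapping with bench
library(bench)
library(parallel)

timings <- bench::mark(
  sequential = lapply(1:1000, coef_function),
  parallel = mclapply(1:1000, coef_function, mc.cores = 4),
  check = FALSE, # results are random bootstrap samples, so don't compare them
  iterations = 5
)
timings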

In any case, as a rule of thumb, parallelization is not worth it in the following scenarios:

  1. We are parallelizing fast operations: when the operations we want to parallelize are fast (simple math operations, indexing or assigning), the overhead of sending data to the cores and retrieving and joining the results is usually greater than the time spent performing the operations themselves (see the short sketch after this list).

  2. We are parallelizing processes that involve sending a lot of data to each spawned parallel process. Think, for example, of working with a global raster at 5 km resolution. If we want to parallelize a process involving calculations with this kind of object, the overhead of sending the data could be bigger than the process itself.

  3. Related to the previous point, when performing memory-intensive processes. Take into account that each spawned parallel process needs as much memory as the sequential process. Parallelizing on the same computer multiplies the memory needed by almost the number of cores used, so it is easy to run out of memory if we are not careful.
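To illustrate the first scenario, here is a small sketch (ours, and the exact numbers will vary between machines): squaring numbers is so cheap that the parallel version usually ends up slower than the sequential one because of the overhead.

# fast operations: the parallel overhead dominates
library(parallel)

x <- 1:1e4
system.time(res_seq <- lapply(x, function(i) i^2)) # sequential
system.time(res_par <- mclapply(x, function(i) i^2, mc.cores = 4)) # parallel, usually slower here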


  1. Also known as perfectly parallel, delightfully parallel or pleasingly parallel problems, but those names don’t have the same ring to them ↩︎