Parallel computation in R
2023-12-04
Víctor Granda-García
Ecosystem Modelling Facility - CREAF
Miquel de Cáceres
Ecosystem Modelling Facility - CREAF
Introduction
This document explains the basic concepts of parallel computing in R, with code examples to illustrate them. The topics covered include:
- What is parallel computing?
- When can we use it?
- A little introduction to loops and maps in R (for, lapply, map…)
- Ways to use parallelization in your code (parallel, furrr…)
- How to check if it is worth the hassle
What is parallel computing?
Figure 1: Modern CPU and cores
First of all, we need to understand a little about CPUs (Central Processing Units) and cores. Modern computers (Figure 1) have multiple CPUs, and those can have one or multiple cores. Each core is responsible for running individual processes.
Think of a simple algebraic operation, adding two numbers (1 + 1). In a nutshell, that operation is translated to machine code and a process is started in one core to perform the operation (Figure 2).
Figure 2: One core is performing the ‘1 + 1’ operation. This leaves the other cores available to concurrently start other processes
So CPU cores can be used to run the same process with different data in parallel to speed up long tasks. In theory, this can reduce the time needed to $1/n_{cores}$ of the sequential time, but in reality other factors come into play (time spent transferring data to each process, time spent gathering results from the different processes and joining them, time spent spawning new processes…), so the actual gain depends heavily on the type of operations, the data types used…
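As an illustration with made-up numbers: a task that takes $12$ s sequentially and splits perfectly across $4$ cores would ideally take $12/4 = 3$ s; if spawning the workers and moving the data adds another $2$ s, the real elapsed time is $5$ s, a $2.4\times$ speed-up instead of the ideal $4\times$.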
Note
In fact, sometimes workflows are slower when parallelized, so we always need to check if we are really saving time and effort. See below for more information on this.
R is a single-process software
R is designed to run in only one CPU process. This dates back to the time when S (the predecessor of R) and R were developed, when CPUs mostly had a single core and multitasking and parallel computation were not yet widely available or mature.
Given these limitations, parallel computing in R is not available out-of-the-box. We need to use extra packages/libraries, and we can only use it in specific cases and with specific data types.
When can we use it?
You have been using parallel computing in R without knowing it. A lot of base R functions are calls to methods written in languages that support multitasking (C++, Rust…). For example, matrix operations and other linear algebra functions (common operations when calculating regression coefficients with lm and other model functions) are calls to C++ methods that are parallelized and use the CPU cores available in your system (Listing 1).
observations <- matrix(rnorm(1e6 * 100), 1e6, 100)
predictions <- rnorm(100)
system.time(outcome <- drop(observations %*% predictions) + rnorm(1e6))
Listing 1: Time consumed by matrix operations. When the task (matrix product) is parallelized, the user time (CPU time summed across cores) can be larger than the elapsed (real) time
user system elapsed
0.139 0.004 0.143
Note
Some other R packages have implemented the methods we’ll explain in the following sections and offer arguments for the user to choose if and how parallelization must be done. For example, the boot package offers parallelization options when bootstrapping model coefficients.
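As a minimal sketch of how this looks (the coef_statistic function below is ours, written for the same iris bootstrap used later in this document; parallel and ncpus are the arguments boot exposes for this):

library(boot)

# same iris subset used later in this document; droplevels() removes the empty
# "setosa" level so the response factor has exactly two levels
iris_data <- iris |>
  dplyr::filter(Species != "setosa") |>
  droplevels()

# statistic function required by boot(): fit the model on the resampled rows
# and return its coefficients
coef_statistic <- function(data, indices) {
  model_res <- glm(Species ~ Petal.Length, data = data[indices, ], family = binomial)
  coefficients(model_res)
}

# parallel = "multicore" and ncpus = 4 spread the replicates across 4 cores
# (use parallel = "snow" on Windows)
boot_res <- boot::boot(
  iris_data, coef_statistic, R = 1e4,
  parallel = "multicore", ncpus = 4
)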
Working with embarrassingly parallel problems
Embarrassingly parallel problems are those where we can easily separate the problem into several parallel tasks 1. This kind of problem is very common in scientific and statistical fields. Think of the classic data analysis workflow shown below (Figure 3).
Figure 3: Classic data analysis workflow
In this process, we need to ingest the data, process it (clean/transform/…), model it and finally visualize/store the results. Now imagine we have to repeat the same process for hundreds or thousands of data files (i.e., remote sensing images, genomic analyses, historical and projected climatic analyses…). Instead of processing each task one after another (sequentially), we can divide the input (the names of the files to read) into chunks and send each chunk to CPU processes that run in parallel, which can save a lot of time and effort (Figure 4).
Figure 4: Same data analysis workflow as before but running in parallel, each process in a different CPU core
These kinds of embarrassingly parallel tasks are the ones that benefit the most from parallelization.
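A minimal sketch of the idea, assuming a hypothetical process_file() function that performs the ingest/clean/model/store steps for a single file (the file names are also made up):

library(parallel)

# hypothetical input: the names of the files to process
data_files <- paste0("raster_", 1:100, ".tif")

# sequential: each file is processed one after another
# results <- lapply(data_files, process_file)

# parallel idea: split the inputs into one chunk per core (4 in this case)...
chunks <- parallel::splitIndices(length(data_files), ncl = 4)
# ...and send each chunk to a different CPU process, gathering the results at
# the end (the tools to do this are covered in the following sections)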
A little introduction to loops and maps in R (for, lapply, map…)
Loops
We talked before about embarrassingly parallel problems: repetitive tasks that have little or no connection between each other beyond the origin of the inputs, and that can be easily separated into parallel tasks.
These tasks are usually the ones we think about when we talk about for loops. One example can be bootstrapping model coefficients (Listing 2). For example, we are interested in the relationship between petal length and Iris species (example adapted from the doParallel package vignette):
# libraries
library(dplyr)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")
# we measure the time for illustration purposes
system.time({
  for (index in 1:n_repetitions) {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    res_coefs[[index]] <- coefficients(model_res)
  }
})
Listing 2: Bootstrapping model coefficients on the iris dataset with a for loop
user system elapsed
11.097 0.016 11.120
We can see the user time (CPU time) is roughly the same as the elapsed time (real time), as we should expect from a sequential for loop.
lapply
The same problem can be solved with lapply, but we need to encapsulate the logic of the for loop in a function (Listing 3):
# create the function for one bootstrap repetition
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the lapply (we measure the time again for illustration purposes)
system.time(
  res_coefs <- lapply(1:n_repetitions, coef_function)
)
Listing 3: Bootstrapping model coefficients on the iris dataset with lapply
user system elapsed
11.131 0.004 11.142
As we can see, the time is roughly the same as with the for loop, as we would expect.
map
If using tidyverse packages, instead of lapply we will use the map function from the purrr package (Listing 4):
# libraries
library(purrr)

coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the map (we measure the time again for illustration purposes)
system.time({
  res_coefs <- purrr::map(1:n_repetitions, .f = coef_function)
})
Listing 4: Bootstrapping model coefficients on the iris dataset with map
user system elapsed
11.189 0.000 11.197
Again, times are similar to the other workflows.
Ways to use parallelization in your code (parallel, furrr…)
If we can use loops, lapply or map, then we can parallelize without any problem. In this section we will see the different options we have to do it.
Preparations
Before we start, we need to know how many cores are available in our system. This can be done with parallel::detectCores(). On the system where this document was created, the available cores are:
library(parallel)
parallel::detectCores()
Listing 5: Number of cores available
[1] 24
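If we want to choose the number of workers programmatically, a common pattern (a sketch, not part of the listings below) is to leave at least one core free for the rest of the system:

# use all available cores but one, never going below 1
n_workers <- max(parallel::detectCores() - 1, 1)
n_workers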
Tip
In the following examples we will be using 4 cores, but if your system has fewer than that, please change it to a valid number.
foreach and doParallel
In a very similar way to a for loop, we can use the foreach and doParallel packages to build a loop that runs the repetitions in parallel (Listing 6):
# libraries
library(parallel)
library(foreach)
library(doParallel)

# data needed
n_repetitions <- 1e4
res_coefs <- list()
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# set the number of cores to use, in this example 4
doParallel::registerDoParallel(cores = 4)

# foreach loop (for illustration purposes, we check the time used)
system.time({
  res_coefs <- foreach::foreach(index = 1:n_repetitions) %dopar% {
    sample_individuals <- sample(85, 85, replace = TRUE)
    model_res <- glm(
      iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
      family = binomial
    )
    coefficients(model_res)
  }
})
Listing 6: Bootstrapping model coefficients on the iris dataset in parallel with foreach
user system elapsed
12.268 0.375 3.754
As we can see, the elapsed time has been reduced roughly three-fold when compared with processing the repetitions sequentially. We approach the ideal $1/4$ reduction we would expect from using 4 cores, but don’t quite reach it, as starting the extra R processes, sending the data and retrieving the results takes some time. With bigger datasets the elapsed time increases further because of this communication overhead.
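As a small housekeeping note (not part of the original listing): when registerDoParallel() has implicitly created a cluster of worker processes (as it does on Windows), it can be shut down once we are done with:

# stop the implicitly created cluster, if any
doParallel::stopImplicitCluster()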
mclapply
If we prefer the lapply syntax, we can use mclapply to run the same expression concurrently. mclapply belongs to the parallel package and works in the same way as lapply (Listing 7). Note that mclapply relies on forking the R process, so on Windows it cannot run in parallel (there, mc.cores must be 1):
# create the function for one bootstrap repetition
coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# and now the mclapply (we measure the time again for illustration purposes)
system.time({
  res_coefs <- mclapply(1:n_repetitions, coef_function, mc.cores = 4)
})
Listing 7: Bootstrapping model coefficients on the iris dataset in parallel with mclapply
user system elapsed
8.590 0.268 2.981
We see again the reduction in elapsed time with mclapply.
future_map
The furrr package offers parallelized versions of the purrr::map family of functions. We can use it to run the map example above in parallel (Listing 8):
# libraries
library(future)
library(furrr)

coef_function <- function(repetition) {
  sample_individuals <- sample(85, 85, replace = TRUE)
  model_res <- glm(
    iris_data[sample_individuals, "Species"] ~ iris_data[sample_individuals, "Petal.Length"],
    family = binomial
  )
  return(coefficients(model_res))
}
# number of repetitions
n_repetitions <- 1e4
# data
iris_data <- iris |>
  dplyr::filter(Species != "setosa")

# setting core options (4 parallel workers through callr)
future::plan(future.callr::callr, workers = 4)

# and now the map (we measure the time again for illustration purposes)
system.time({
  res_coefs <- furrr::future_map(1:n_repetitions, .f = coef_function)
})
Listing 8: Bootstrapping model coefficients on the iris dataset in parallel with furrr
user system elapsed
13.950 0.566 3.786
This is the method that returns the worst time running in parallel (but still better than sequential). This is because future_map sets up a more complete environment in the parallelized processes, which takes more time. With larger datasets and more complex functions, future_map is a good option for parallelization.
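One detail worth knowing: the plan set with future::plan() persists for the rest of the R session, so once the parallel work is done it is a good idea to switch back to sequential execution:

# go back to sequential processing once the parallel work is finished
future::plan(future::sequential)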
Tip
Calculating bootstrapped coefficients is a toy example for illustration purposes. There are R packages that can bootstrap in more efficient ways, with parallelization already built in, like the boot package.
How to check if parallelization is worth the hassle
We have been using system.time to check the time our code takes to run. This is one way of checking whether the benefit of parallelization outweighs the inconvenience of setting it up. Other ways include using benchmarking libraries, like bench or rbenchmark. Their documentation explains everything you need to know about timing code.
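As a minimal sketch (reusing coef_function, iris_data and n_repetitions from the listings above), bench::mark() can time the sequential and the parallel versions side by side; check = FALSE is needed here because the bootstrap resamples are random, so the results of the two expressions differ:

library(bench)
library(parallel)

# one iteration of each expression is enough here, as each run already takes
# several seconds
bench::mark(
  sequential = lapply(1:n_repetitions, coef_function),
  parallel = parallel::mclapply(1:n_repetitions, coef_function, mc.cores = 4),
  check = FALSE,
  iterations = 1
)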
In any case, as a rule of thumb, parallelization is not worth it in the following scenarios:
- We are parallelizing fast operations: when the operations we want to parallelize are fast (math operations, indexing or assigning), the overhead of sending data to the cores and retrieving and joining the results is usually greater than the time needed to perform the operation itself.
- We are parallelizing processes that involve sending a lot of data to each spawned parallel process. Think, for example, of working with a global raster at 5 km resolution. If we want to parallelize a process involving calculations on this kind of object, the overhead of sending the data could be bigger than the process itself.
- Related to the previous point, when performing memory-intensive processes. Take into account that each spawned parallel process needs as much memory as the sequential process. Parallelizing on the same computer multiplies the memory needed by almost the number of cores used, so it is easy to run out of memory if we are not careful.
1. Also known as perfectly parallel, delightfully parallel or pleasingly parallel problems, but those names don’t have the same ring to them ↩︎