8 Code Optimization

openstatsware Workshop: Good Software Engineering Practice for R Packages

April 18, 2024

Acknowledgments

This section is adapted from slides by Lukas A. Widmer and Michael Mayer, which they prepared and released under the CC BY 4.0 license; see their course Go fastR – how to make R code fast(er) and run it on high performance compute (HPC) clusters.

Thanks a lot Lukas and Michael!

Introduction

A Word of Wisdom

(Knuth 1974, 268)

Setting the right priorities

  1. The code needs to be correct, i.e. calculate correct results as expected. → Tests, Debugging
  2. If it is correct but too slow, find out which calculations are slow and why. → Profiling
  3. Once you have identified the slow parts, optimize those. → Code optimization
  4. If executing the code on your laptop is still too slow → consider running it on a high performance computing (HPC) cluster instead, see slides from Lukas and Michael

Tests: see previous section

(modified from Munroe (2007), No. 303)

Debugging: A few pointers

  1. Post-hoc traceback() after code fails, to show the call stack and see how the error came about.
  2. Set debug(myfun) for the problematic function myfun and then run the code to step through its execution. (Undo with undebug(myfun))
  3. Inject browser() calls into problematic code to add “breakpoints” for interactive execution and inspection.
  4. Setting options(error = recover) and then running the problematic code allows you to jump into the call stack and inspect. (Undo with options(error = NULL))
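
As a minimal sketch of these pointers, consider a hypothetical function myfun that fails on bad input. The function and its input are assumptions made for illustration. The calls that only make sense in a live R session are left as comments.

```r
# Hypothetical function that fails on non-numeric input.
myfun <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  sqrt(x)
}

# Capture the error message instead of aborting, so the script keeps running.
msg <- tryCatch(myfun("a"), error = function(e) conditionMessage(e))
print(msg)

# In an interactive session you would instead run myfun("a") and then:
# traceback()               # 1. show the call stack of the last error
# options(error = recover)  # 4. choose a frame to inspect on the next error
# options(error = NULL)     #    undo

# 2. Flag the function so that its next call is stepped through line by line.
debug(myfun)
flagged <- isdebugged(myfun)
undebug(myfun) # remove the flag again

# 3. Alternatively, put browser() as the first line of the body of myfun
#    to stop there on every call.
```

Calling debugonce(myfun) instead of debug(myfun) removes the flag automatically after the first call.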

Debugging: Some RStudio specifics

Let’s have a look at RStudio IDE specific details.

  • The “Debug” menu can be useful to explore the options.
  • Editor breakpoints: Click to the left of a line number to add one; it is shown as a red dot.
  • Click on “Source” to run the script and enter debug mode.
  • “Debug” > “On Error” > “Break in Code” again lets you jump into the code on error.
  • Debugging in R Markdown documents can be tricky: either proceed chunk by chunk, or try sink().

See Posit website for details.

Profiling

Profiling: Definition

In software engineering, profiling (“program profiling”, “software profiling”) is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.

(Wikipedia)

Profiling: Example code

f <- function() {
    profvis::pause(0.1)
    for (i in seq_len(3)) {
        g()
    }
}
g <- function() {
    profvis::pause(0.1)
}

Profiling: Let’s identify the bottlenecks!

In R there are a couple of basic functions for profiling:

  • system.time()
  • Rprof:
    1. start with Rprof()
    2. execute the code
    3. stop with Rprof(NULL)
    4. summarize with summaryRprof()

See e.g. Peng (2016), chapter 19, for details.
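
For a first rough measurement, system.time() is often enough. The sketch below times a hypothetical function slow_step, which is an assumption made for illustration and not part of the workshop code.

```r
# Hypothetical stand-in for a slow computation.
slow_step <- function() {
  Sys.sleep(0.2)                 # waiting: counts as elapsed time only
  sum(sqrt(seq_len(1e5)))        # computing: counts as user CPU time
}

# The assignment inside system.time() keeps the result of the timed call.
timing <- system.time(result <- slow_step())
print(timing)

# "elapsed" is the wall clock time, "user.self" the CPU time spent in R code.
elapsed <- timing[["elapsed"]]
```

Comparing the user and elapsed entries gives a first hint whether the code is busy computing or mostly waiting, e.g. for the disk or the network.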

Profiling: Classic Rprof output

Rprof()
f()
Rprof(NULL)
summaryRprof()
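
A self-contained variant of the same four steps could look as follows. It is only a sketch: busy_wait is a hypothetical function that burns CPU time so that the sampling profiler has something to record, and the temporary file and the sampling interval are arbitrary choices.

```r
# Hypothetical CPU-bound function that keeps computing for a given time.
busy_wait <- function(seconds) {
  end <- Sys.time() + seconds
  while (Sys.time() < end) {
    sqrt(runif(100))
  }
  invisible(NULL)
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01) # 1. start, sampling every 10 ms
busy_wait(0.5)                    # 2. execute the code
Rprof(NULL)                       # 3. stop

prof_summary <- summaryRprof(prof_file) # 4. summarize
print(head(prof_summary$by.total))      # time per function, including callees
print(prof_summary$sampling.time)       # total sampled time in seconds
unlink(prof_file)
```

The by.self table attributes time to the function that was actually running, while by.total also counts the time spent in the functions it called.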

Profiling: Getting visual

We can use the more modern profvis R package for visualizing where R spends its time during code execution.

library(profvis)
source("profexample.R") # Such that the flame graph includes code lines.
prof <- profvis({
    f()
})
print(prof)

Profiling: profvis output

Code optimization

Code optimization: Explore alternatives

  • If the slow function is from another package, search for a faster one
  • Runtime complexity (runtime as a function of data size) of different algorithms can be wildly different - some work well on small data but take forever on large data
  • A few examples:
    • For wrangling data frames, consider duckplyr and polars as alternatives to dplyr
    • To read and write CSV data, consider vroom instead of base R
    • To read and write objects from and to disk, consider qs instead of base R readRDS, saveRDS
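
To illustrate the point about runtime complexity, here is a minimal sketch with two ways of checking a vector for duplicates. Both function names are hypothetical. The pairwise version is written for illustration only, the second one simply wraps base R's anyDuplicated().

```r
# Quadratic: compares every pair of elements, so runtime grows with n^2.
has_dup_pairwise <- function(x) {
  n <- length(x)
  if (n < 2) {
    return(FALSE)
  }
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (x[i] == x[j]) {
        return(TRUE)
      }
    }
  }
  FALSE
}

# Much cheaper: anyDuplicated() runs in compiled code and avoids
# comparing all pairs.
has_dup_builtin <- function(x) anyDuplicated(x) > 0

x_small <- seq_len(200)
x_large <- seq_len(2000) # 10 times more data, no duplicates (worst case)

print(system.time(has_dup_pairwise(x_small)))
print(system.time(has_dup_pairwise(x_large))) # about 100 times more comparisons
print(system.time(has_dup_builtin(x_large)))
```

On small inputs both versions feel instant, which is why such differences often only show up once the data grows.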

Code optimization: DRY on data frames

  • Remember: DRY (Don’t Repeat Yourself)
  • Operations on data frames are expensive, i.e. creating, copying and subsetting them takes a lot of time.
  • Some examples:
    • Only create data frames if really necessary.
      • Bad: if(nrow(filter(x, condition)) > 0)
      • Good: if(any(condition))
    • Assemble a data frame only once, not iteratively.
    • When subsetting a data frame and working with a column, first extract the column and then subset.
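
The first and the last point can be sketched with simulated data. This sketch uses base R subsetting instead of filter() to stay self-contained; the data frame x and its columns group and value are assumptions made for illustration.

```r
set.seed(1)
x <- data.frame(
  group = sample(c("a", "b"), 1e5, replace = TRUE),
  value = rnorm(1e5)
)
condition <- x$value > 3

# Bad: builds a whole filtered data frame just to count its rows.
has_rows_slow <- nrow(x[condition, ]) > 0
# Good: works on the logical vector directly.
has_rows_fast <- any(condition)

# Bad: subsets all columns of the data frame, then extracts one of them.
mean_slow <- mean(x[x$group == "a", ]$value)
# Good: extracts the column first, then subsets the plain vector.
mean_fast <- mean(x$value[x$group == "a"])
```

Both variants return the same results, but the second one of each pair never allocates an intermediate data frame.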

Code optimization: Reuse, don’t copy

Avoid making unintended copies of objects.

# Bad:
result <- c()
for (i in 1:1000) result <- c(result, fun(i))
# Good:
result <- numeric(1000)
for (i in 1:1000) result[i] <- fun(i)
# Better:
result <- sapply(seq_len(1000), fun)

Other examples:

  • Create a data.frame once from complete column vectors, rather than growing it iteratively with rbind().
  • Subsetting a matrix or data.frame creates a copy, so do it deliberately or, better, work with single columns.
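
The data.frame point can be sketched in the same bad/good style as the vector example. Here fun is a hypothetical per-iteration computation and the column names id and value are assumptions.

```r
n <- 200
fun <- function(i) i^2 # hypothetical per-iteration computation

# Bad: rbind() copies the growing data frame in every iteration.
df_slow <- data.frame()
for (i in seq_len(n)) {
  df_slow <- rbind(df_slow, data.frame(id = i, value = fun(i)))
}

# Good: fill preallocated column vectors, then assemble the data frame once.
id <- integer(n)
value <- numeric(n)
for (i in seq_len(n)) {
  id[i] <- i
  value[i] <- fun(i)
}
df_fast <- data.frame(id = id, value = value)
```

Wrapping both variants in system.time() for increasing n shows how quickly the iterative version falls behind.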

Code optimization: Vectorize where possible

  • Merely avoiding for loops, e.g. by switching to apply-style functions, does not help much anymore.
  • However, using specialized vectorized functions (implemented in compiled code) helps:
    • Base R: rowSums(), colSums(), rowMeans(), colMeans()
    • matrixStats: many! anyMissing(), colQuantiles(), rowMaxs(), etc.
    • collapse: fmean(), TRA(), GRP(), …
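
As a minimal sketch with base R only, the following compares apply(), which calls the R function sum() once per row, with the specialized rowSums(). The matrix size is an arbitrary choice for illustration.

```r
set.seed(1)
m <- matrix(rnorm(1e6), nrow = 1000)

# Loop in disguise: apply() calls sum() once for each of the 1000 rows.
sums_apply <- apply(m, 1, sum)
# Vectorized: rowSums() processes the whole matrix in compiled code.
sums_vectorized <- rowSums(m)

time_apply <- system.time(apply(m, 1, sum))[["elapsed"]]
time_vectorized <- system.time(rowSums(m))[["elapsed"]]
print(c(apply = time_apply, rowSums = time_vectorized))
```

The results agree up to floating point tolerance, so the specialized function can usually be dropped in without further changes.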

Code optimization: Still too slow?

“R is a language optimized for human performance, not computer performance”

(Hadley Wickham, New York R Conference, 2018)

Code optimization: Shift to C++ with Rcpp

Writing and using C++ code in your R package is not easy.

But it is not too difficult with Rcpp by Eddelbuettel and François (2011).

Note: Adding C++ code to your package will in many cases increase the maintenance effort significantly.

References

Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.
Knuth, Donald E. 1974. “Structured Programming with Go to Statements.” ACM Computing Surveys (CSUR) 6 (4): 261–301. https://web.archive.org/web/20130731202547/http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf.
Munroe, Randall. 2007. “Xkcd.” 2007. http://xkcd.com/.
Peng, Roger D. 2016. R Programming for Data Science. Lulu.com. https://bookdown.org/rdpeng/rprogdatascience/.

Exercise

  • Read bootstrap.R and understand what is going on.
  • Run profbootstrap.R: where is most of the time spent?
  • In a second version of the function, impl_2, only update vectors inside the loop and then create a tibble once at the end.
  • In a third version impl_3 only subset the column instead of the whole data.frame. How much faster does it get?
  • In a fourth version impl_4 use the boot package.
  • Homework: Try to come up with a fifth version impl_5 that uses Rcpp. Was it worth the effort?
