openstatsware
Workshop: Good Software Engineering Practice for R Packages
April 18, 2024
This section is adapted from slides by Lukas A. Widmer and Michael Mayer, which they prepared and released under the CC BY 4.0 license, see their course Go fastR – how to make R code fast(er) and run it on high performance compute (HPC) clusters.
Thanks a lot Lukas and Michael!
traceback() after code fails, to show the call stack and see how the error came about.
debug(myfun) for the problematic function myfun, and then run the code to step through its execution.
Insert browser() calls into problematic code to add “breakpoints” for interactive execution and inspection.
options(error = recover) and then running the problematic code allows you to jump into the call stack and inspect. (Undo with options(error = NULL).)
Let’s have a look at RStudio IDE specific details.
sink()
See Posit website for details.
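The debugging helpers listed earlier can be tried on a small failing call chain. The following is a minimal sketch: inner_fun() and outer_fun() are hypothetical functions that exist only for illustration. Since traceback(), debug() and recover are meant for interactive sessions, the debug() part is guarded with interactive(). The non-interactive part records the call stack with sys.calls(), which is a stand-in for traceback() and not the same mechanism.

```r
# Hypothetical functions, used only to produce an error a few calls deep.
inner_fun <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}
outer_fun <- function(x) inner_fun(x) + 1

# Non-interactive stand-in for traceback(): record the call stack at the
# moment the error is signalled, then let tryCatch() handle the error.
captured_calls <- NULL
error_message <- tryCatch(
  withCallingHandlers(
    outer_fun("a"),
    error = function(e) captured_calls <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)
call_text <- vapply(
  captured_calls,
  function(cl) paste(deparse(cl), collapse = " "),
  character(1)
)
print(error_message)
print(call_text)

if (interactive()) {
  debug(outer_fun)    # the next call to outer_fun() is stepped through
  outer_fun(4)
  undebug(outer_fun)  # switch stepping off again
}
```

In an interactive session, calling outer_fun("a") directly and then traceback() should show a comparable chain of calls.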
In software engineering, profiling (“program profiling”, “software profiling”) is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.
In R there are a couple of basic functions for profiling:
system.time()
Rprof: Rprof() to start profiling, Rprof(NULL) to stop it, summaryRprof() to summarize the results
See e.g. Peng (2016), chapter 19, for details.
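As a rough illustration of these basic functions, the sketch below times and profiles a small workload. The function slow_means() and its sizes are assumptions made up for this example; they are chosen only so that the run lasts long enough for the sampling profiler to collect some samples.

```r
# Hypothetical workload, chosen only so that there is something to measure.
slow_means <- function(n = 1000, size = 1e4) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- mean(sort(rnorm(size)))
  }
  out
}

# system.time() reports user, system and elapsed time for one expression.
timing <- system.time(res <- slow_means())
print(timing)

# Rprof() samples the call stack at regular intervals and writes to a file.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                    # start profiling
res <- slow_means()
Rprof(NULL)                         # stop profiling
prof_summary <- summaryRprof(prof_file)
print(head(prof_summary$by.self))   # time spent in each function itself
```

The by.self table attributes time to the function that was actually running, while by.total also counts time spent in the functions it called.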
Rprof output
We can use the more modern profvis R package for visualizing where R spends its time during code execution.
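A minimal sketch of a profvis call might look as follows. The workload column_summaries() is hypothetical, and the call is guarded so that the code is skipped when profvis is not installed. The interactive flame graph is only shown in an interactive session.

```r
# Hypothetical workload: column summaries of a random matrix.
column_summaries <- function(n_row = 2e4, n_col = 200) {
  x <- matrix(rnorm(n_row * n_col), ncol = n_col)
  list(
    means = apply(x, 2, mean),
    sds = apply(x, 2, sd)
  )
}

profile <- NULL
if (requireNamespace("profvis", quietly = TRUE)) {
  # profvis() runs the expression under the sampling profiler and returns
  # an htmlwidget that can be printed to open the visualization.
  profile <- profvis::profvis(column_summaries())
  if (interactive()) print(profile)
}
```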
profvis output
Instead of if(nrow(filter(x, condition)) > 0) use if(any(condition)).
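The difference between the two conditions can be sketched with a small comparison. The data frame and threshold below are made up for illustration, and base R subsetting stands in for dplyr's filter() so that the example has no dependencies.

```r
set.seed(1)
x <- data.frame(id = seq_len(1e6), value = rnorm(1e6))
condition <- x$value > 4

# Slower: builds a filtered copy of the data just to count its rows.
# Base R subsetting stands in for dplyr::filter() to avoid a dependency.
time_subset <- system.time(
  found_by_subset <- nrow(x[condition, , drop = FALSE]) > 0
)

# Faster: any() only inspects the logical vector.
time_any <- system.time(found_by_any <- any(condition))

# Both approaches answer the same question.
stopifnot(identical(found_by_subset, found_by_any))
print(rbind(subset = time_subset, any = time_any)[, "elapsed"])
```

If condition may contain missing values, any(condition, na.rm = TRUE) avoids an NA result inside if().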
Avoid making unintended copies of objects.
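One common source of unintended copies is growing an object inside a loop. The sketch below contrasts this with allocating the result once; the function names grow() and preallocate() are hypothetical and the vector length is arbitrary.

```r
n <- 2e4

# Growing a vector with c() allocates a new, longer vector on every
# iteration and copies all previous elements into it.
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Allocating once and filling in place avoids those repeated copies.
preallocate <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

time_grow <- system.time(res_grow <- grow(n))
time_prealloc <- system.time(res_prealloc <- preallocate(n))

# Same result, different amount of copying.
stopifnot(identical(res_grow, res_prealloc))
print(rbind(grow = time_grow, preallocate = time_prealloc)[, "elapsed"])
```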
Other examples:
Create a data.frame once from complete column vectors, rather than growing it iteratively with rbind().
Subsetting a data.frame will create a copy, so use it consciously or better work with columns.
Base R: rowSums(), colSums(), rowMeans(), colMeans()
matrixStats: many! anyMissing(), colQuantiles(), rowMaxs(), etc.
collapse: fmean(), TRA(), GRP(), …
“R is a language optimized for human performance, not computer performance”
(Hadley Wickham, New York R Conference, 2018)
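To make the earlier point about optimized functions such as rowMeans() concrete, here is a small sketch comparing it with a generic apply() call. The matrix dimensions are arbitrary, and only base R is used, so the matrixStats and collapse functions are not shown.

```r
set.seed(1)
m <- matrix(rnorm(1e6), nrow = 1e4)

# Generic: apply() calls mean() once per row from R.
time_apply <- system.time(means_apply <- apply(m, 1, mean))

# Specialized: rowMeans() does the whole computation in compiled code.
time_rowmeans <- system.time(means_rowmeans <- rowMeans(m))

# Results agree up to floating point tolerance.
stopifnot(isTRUE(all.equal(means_apply, means_rowmeans)))
print(rbind(apply = time_apply, rowMeans = time_rowmeans)[, "elapsed"])
```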
Rcpp
Writing and using C++ code in your R package is not easy.
But it is not too difficult with Rcpp by Eddelbuettel and François (2011).
Start a new package with Rcpp::Rcpp.package.skeleton().
Use RcppArmadillo for linear algebra.
Note: Adding C++ code to your package will in many cases increase the maintenance effort significantly.
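For quick experiments outside a package, Rcpp::cppFunction() can compile a C++ function from a string. The sketch below is an illustration under assumptions: sum_cpp() is a hypothetical function, and the code is guarded so that it is skipped when Rcpp or a working C++ toolchain is not available.

```r
sum_cpp <- NULL
compiled <- FALSE

if (requireNamespace("Rcpp", quietly = TRUE)) {
  compiled <- tryCatch({
    # Compile the C++ source and define sum_cpp() in this environment.
    Rcpp::cppFunction("
      double sum_cpp(NumericVector x) {
        double total = 0.0;
        for (R_xlen_t i = 0; i < x.size(); ++i) {
          total += x[i];
        }
        return total;
      }
    ", env = environment())
    TRUE
  }, error = function(e) FALSE)  # e.g. no compiler installed
}

if (compiled) {
  x <- runif(1e6)
  # Compare against the built-in sum() up to floating point tolerance.
  print(all.equal(sum_cpp(x), sum(x)))
}
```

Inside a package, the C++ code would instead live in the src directory, which is where the additional maintenance effort mentioned above comes in.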
Read bootstrap.R and understand what is going on.
Profile bootstrap.R to see where most of the time is spent - where?
In impl_2, only update vectors inside the loop and then create a tibble once at the end.
In impl_3, only subset the column instead of the whole data.frame. How much faster does it get?
In impl_4, use the boot package.
Add impl_5 that uses Rcpp. Was it worth the effort?