3 An R Package Engineering Workflow

openstatsware Workshop: Good Software Engineering Practice for R Packages

April 18, 2024

Motivation

From an idea to a production-grade R package

Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.

The idea of creating an R package was born because you understood that copying and pasting R scripts is inefficient. On top of that, you want to share your helpful R functions with colleagues and the world…

Professional Workflow

Photo CC0 by ELEVATE on pexels.com

Typical work steps

  1. Idea
  2. Concept creation
  3. Validation planning
  4. Specification:
    1. User Requirements Spec (URS),
    2. Functional Spec (FS), and
    3. Software Design Spec (SDS)
  5. R package programming
  6. Documented verification
  7. Completion of formal validation
  8. R package release
  9. Use in production
  10. Maintenance

Workflow in Practice

Photo CC0 by Chevanon Photography on pexels.com

Frequently Used Workflow in Practice

  1. Idea
  2. R package programming
  3. Use in production
  4. Bug fixing
  5. Use in production
  6. Bug fixing + Documentation
  7. Use in production
  8. Bug fixing + Further development
  9. Use in production
  10. Bug fixing + …

Bad practice!

Why?

Why practice good engineering?

Cost distribution among software process activities

doi:10.14569/IJACSA.2020.0110375

Why practice good engineering?

Origin of errors in system development

Boehm, B. (1981). Software Engineering Economics. Prentice Hall.

Why practice good engineering?

  • Don’t waste time on maintenance
  • Be faster with release on CRAN
  • Don’t waste time with inefficient and buggy further development
  • Fulfill regulatory requirements
  • Save refactoring time when the Proof-of-Concept (PoC) becomes the release version
  • You don’t have to be shy any longer about inviting other developers to contribute to the package on GitHub

Why practice good engineering?

Invest time in

  • requirements analysis,
  • software design, and
  • architecture…

… but in many cases the workflow must be workable for a single developer or a small team.

Workable Workflow

Photo CC0 by Kateryna Babaieva on pexels.com

Suggestion for a Workable Workflow

  1. Idea
  2. Design docs
  3. R package programming
  4. Quality check (see Ensuring Quality)
  5. Publication (see Publication)
  6. Use in production

Example - Step 1: Idea

Let’s assume that you used some lines of code to create simulated data in multiple projects:

dat <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(
        rnorm(n = 50, mean = 8, sd = 12),
        rnorm(n = 50, mean = 14, sd = 11)
    )
)

Idea: put the code into a package

Example - Step 2: Design docs

  1. Describe the purpose and scope of the package
  2. Analyse and describe the requirements in clear and simple terms (“prose”)

Obligation level    Key word       Description
Duty                must, shall    “must have”
Desire              should         “nice to have”
Intention           may            “optional”

Example - Step 2: Design docs

Purpose and Scope

The R package simulatr shall enable the creation of reproducible fake data.

Package Requirements

simulatr shall provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of sample size per group, mean per group, and standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.

Example - Step 2: Design docs

Useful formats / tools for design docs:

UML Diagram

Example - Step 3: Packaging

R package programming

  1. Create basic package project (see R Packages)
  2. Copy and paste existing R scripts (one-off scripts, prototype functions) and refactor them if necessary
  3. Create R generic functions
  4. Document all functions

Example - Step 3: Packaging

One-off script as starting point:

sim.data <- function(n1, n2, m1, m2, s1, s2) {
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = m1, sd = s1),
            rnorm(n = n2, mean = m2, sd = s2)
        )
    )
}

Example - Step 3: Packaging

Refactored script:

getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

Almost all functions, arguments, and objects should be self-explanatory due to their names.
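The package requirements also call for reproducibility via an optional seed. The following is a minimal sketch of how that could look, not the workshop's implementation: the argument name `seed`, its `NULL` default, and the use of `set.seed()` are assumptions.

```r
# Sketch only: `seed` and its NULL default are assumptions, not taken from simulatr.
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NULL) {
    # Fix the state of the random number generator only if a seed was supplied
    if (!is.null(seed)) {
        set.seed(seed)
    }
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

# Two calls with the same seed produce identical data
a <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
b <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
identical(a, b) # TRUE
```

Note that `set.seed()` changes the global random number generator state of the session, which may or may not be acceptable for a given package design.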

Example - Step 3: Packaging

Define the result as a list with a class attribute:

getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2, 
         mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}

Example - Step 3: Packaging

The output is impractical, e.g., we need to scroll down:

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4)
x
$n1
[1] 50

$n2
[1] 50

$mean1
[1] 5

$mean2
[1] 7

$sd1
[1] 3

$sd2
[1] 4

$data
    group     values
1       1  8.7377399
2       1  3.7158320
3       1  5.0510873
4       1  9.9602539
5       1  3.0377111
6       1  3.3287450
7       1  6.4891500
8       1  4.1814122
9       1  6.0237598
10      1  5.7318349
11      1  6.0644403
12      1 -0.4014787
13      1  7.2180578
14      1  8.7394013
15      1  8.1871733
16      1  1.5536080
17      1  2.9286013
18      1  7.0697317
19      1  6.4942363
20      1  2.8592213
21      1  3.0478447
22      1  7.0600567
23      1  7.6363895
24      1  2.3539451
25      1  4.9531579
26      1 11.4358253
27      1  7.6782958
28      1  7.8405420
29      1  3.9701071
30      1  2.8004189
31      1  7.2409403
32      1 -1.7704723
33      1  8.9935125
34      1  8.3858408
35      1  3.2824340
36      1  2.4763773
37      1  7.0789594
38      1  5.9724666
39      1  6.3831444
40      1  2.1306980
41      1 11.9050772
42      1  2.2612274
43      1  4.2671333
44      1 10.1193642
45      1  4.5421769
46      1  4.2426930
47      1  4.6945184
48      1  8.2786148
49      1  4.5191565
50      1 10.9325219
51      2  1.7318377
52      2  7.1546960
53      2  5.0013782
54      2  6.0612668
55      2 11.0689645
56      2  5.2654874
57      2  5.8035644
58      2  2.1660880
59      2  9.7137738
60      2 -5.6748983
61      2  8.0545522
62      2  8.3104953
63      2  2.6285491
64      2  1.7044837
65      2  7.2034559
66      2  4.6326492
67      2 11.4546412
68      2 12.3218887
69      2  5.3016642
70      2  7.7287980
71      2  1.6880939
72      2  7.1062903
73      2  2.0586764
74      2  3.9175541
75      2  6.7981126
76      2 10.4545799
77      2  7.9916043
78      2  3.8025642
79      2  5.7432792
80      2  7.3815380
81      2  9.2708818
82      2 -0.7855882
83      2  4.3502792
84      2 10.0151469
85      2  9.1486546
86      2  6.5605364
87      2  4.4579934
88      2 11.3521622
89      2  9.5204248
90      2 10.7974266
91      2  5.1650756
92      2  9.7707656
93      2  7.4908737
94      2  3.2230284
95      2 15.9315367
96      2 12.5557396
97      2  3.4706958
98      2  8.9273949
99      2  9.5625799
100     2 11.8408796

attr(,"class")
[1] "SimulationResult"

Solution: implement generic function print

Example - Step 3: Packaging

Generic function print:

#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#'      mean2 = 7, sd1 = 3, sd2 = 4)
#' print(x)
#'
#' @export
print.SimulationResult <- function(x, ...) {
    args <- list(n1 = x$n1, n2 = x$n2,
        mean1 = x$mean1, mean2 = x$mean2, sd1 = x$sd1, sd2 = x$sd2)

    # show the arguments compactly and the data as a truncated tibble
    print(list(
        args = format(args),
        data = dplyr::tibble(x$data)
    ), ...)
    # print methods conventionally return their argument invisibly
    invisible(x)
}
x
$args
   n1    n2 mean1 mean2   sd1   sd2 
 "50"  "50"   "5"   "7"   "3"   "4" 

$data
# A tibble: 100 × 2
   group values
   <dbl>  <dbl>
 1     1   8.74
 2     1   3.72
 3     1   5.05
 4     1   9.96
 5     1   3.04
 6     1   3.33
 7     1   6.49
 8     1   4.18
 9     1   6.02
10     1   5.73
# ℹ 90 more rows

Exercise

Photo CC0 by Pixabay on pexels.com

Preparation

  1. Download the unfinished R package simulatr
  2. Extract the package zip file
  3. Open the project with RStudio
  4. Complete the tasks below

Tasks

Add assertions to improve the usability and user experience

Tip on assertions

Use the package checkmate to validate input arguments.

Example:

playWithAssertions <- function(n1) {
  checkmate::assertInt(n1, lower = 1)
}
playWithAssertions(-1)

Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
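Applied to the simulation function, the same idea could look as follows. This is a sketch under assumptions: which checks simulatr should actually enforce (for example whether a standard deviation of 0 is allowed) is not stated in the slides.

```r
# Sketch only: the chosen bounds are assumptions, not requirements from simulatr.
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    # Sample sizes must be single whole numbers >= 1
    checkmate::assertInt(n1, lower = 1)
    checkmate::assertInt(n2, lower = 1)
    # Means must be single finite numbers
    checkmate::assertNumber(mean1, finite = TRUE)
    checkmate::assertNumber(mean2, finite = TRUE)
    # Standard deviations must be single finite numbers >= 0
    checkmate::assertNumber(sd1, lower = 0, finite = TRUE)
    checkmate::assertNumber(sd2, lower = 0, finite = TRUE)

    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}
```

With these checks in place, a call such as `getSimulatedTwoArmMeans(n1 = -1, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4)` stops with an assertion error on `n1` before any data are simulated.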

Add three additional results:

  1. n total,
  2. creation time, and
  3. allocation ratio

Tip on creation time

Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
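One possible way to add the three results is sketched below. The element names `nTotal`, `allocationRatio`, and `creationTime` are hypothetical, and defining the allocation ratio as `n1 / n2` is an assumption.

```r
# Sketch only: element names and the definition of the allocation ratio are assumptions.
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$nTotal <- n1 + n2            # total sample size
    result$allocationRatio <- n1 / n2   # group 1 relative to group 2
    result$creationTime <- Sys.time()   # time stamp of the simulation
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}

x <- getSimulatedTwoArmMeans(n1 = 60, n2 = 30, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4)
x$nTotal                            # 90
x$allocationRatio                   # 2
format(x$creationTime, "%B %d, %Y") # formatted creation date
```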

Add an additional result: t.test result

Add an optional alternative argument and pass it through to t.test:

alternative = c("two.sided", "less", "greater")
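A minimal sketch of this task, assuming the test result is stored in a hypothetical element `tTest`. `match.arg()` selects the first choice as the default and rejects unknown values; `t.test()` is called with its defaults otherwise, which means a Welch test with unequal variances.

```r
# Sketch only: the element name `tTest` is an assumption.
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2,
        alternative = c("two.sided", "less", "greater")) {
    # Resolve the choice vector to a single valid value ("two.sided" by default)
    alternative <- match.arg(alternative)
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2,
        alternative = alternative)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # Two-sample t-test of values by group, passing the alternative through
    result$tTest <- stats::t.test(values ~ group, data = result$data,
        alternative = alternative)
    structure(result, class = "SimulationResult")
}

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7,
    sd1 = 3, sd2 = 4, alternative = "less")
x$tTest$alternative # "less"
```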

Implement the generic functions print and plot.

Tip on print

Use the print example function from above and extend it.

Tip on plot

Use R base plot or ggplot2 to create a grouped boxplot of the fake data.
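A base R variant of such a plot method might look like the sketch below. The axis labels and the hand-built example object are illustrative assumptions; in the package, the object would come from `getSimulatedTwoArmMeans()`.

```r
# Sketch only: axis labels are illustrative choices.
plot.SimulationResult <- function(x, ...) {
    # One box per group, based on the simulated data stored in the object
    graphics::boxplot(values ~ group, data = x$data,
        xlab = "Group", ylab = "Values", ...)
    invisible(x)
}

# Hand-built example object with the same structure as a SimulationResult
x <- structure(
    list(data = data.frame(
        group = c(rep(1, 50), rep(2, 50)),
        values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
```

Calling `plot(x)` then dispatches to `plot.SimulationResult` and draws the grouped boxplot on the active graphics device.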

Optional extra tasks:

  • Implement the generic functions summary and cat (cat is not generic in base R, so it first has to be defined as generic, as shown for kable below)

  • Implement the function kable known from the package knitr as generic. Tip: use

    kable <- function(x) UseMethod("kable")

    to define kable as generic
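The kable tip could be completed roughly as follows. This is a sketch, assuming that the default method simply delegates to `knitr::kable()` and that a `SimulationResult` is rendered via its `data` element; the `...` argument is added here so that options such as `digits` can be passed on.

```r
# Sketch only: the choice to render x$data is an assumption.
# Define kable as an S3 generic
kable <- function(x, ...) UseMethod("kable")

# Everything without a specific method falls back to knitr's implementation
kable.default <- function(x, ...) knitr::kable(x, ...)

# Method for simulation results: render the simulated data as a table
kable.SimulationResult <- function(x, ...) knitr::kable(x$data, ...)

# Hand-built example object with the same structure as a SimulationResult
x <- structure(
    list(data = data.frame(group = c(1, 1, 2, 2), values = c(4.2, 5.1, 6.8, 7.3))),
    class = "SimulationResult"
)
```

With knitr installed, `kable(x)` dispatches to `kable.SimulationResult`, while `kable(head(iris))` still behaves like `knitr::kable(head(iris))`.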

Optional extra task (if you are already familiar with Roxygen2):

Document your functions with Roxygen2
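As a starting point, the main function could be documented along these lines. The wording of the title, description, and return value is a suggestion rather than the documentation shipped with simulatr.

```r
#' @title
#' Simulate Two-Arm Means
#'
#' @description
#' Generates normally distributed random data for two independent groups.
#'
#' @param n1,n2 sample size of group 1 and group 2.
#' @param mean1,mean2 mean of group 1 and group 2.
#' @param sd1,sd2 standard deviation of group 1 and group 2.
#'
#' @return a \code{SimulationResult} object: a list with the arguments
#'   and the simulated \code{data}.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#'      mean2 = 7, sd1 = 3, sd2 = 4)
#'
#' @export
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}
```

Listing related arguments together, as in `@param n1,n2`, keeps the help page short when arguments come in pairs.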

References

  • Gillespie, C., & Lovelace, R. (2017). Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly UK Ltd.
  • Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions and Simulations (1st ed.). O’Reilly and Associates.
  • Rupp, C., & SOPHISTen, die. (2009). Requirements-Engineering und -Management: Professionelle, iterative Anforderungsanalyse für die Praxis (5th ed.). Carl Hanser Verlag GmbH & Co. KG.
  • Wickham, H. (2015). R Packages: Organize, Test, Document, and Share Your Code (1st ed.). O’Reilly and Associates.
  • Wickham, H. (2019). Advanced R, Second Edition. Taylor & Francis Ltd.
