Tutorial: Good Software Engineering Practice for R Packages
July 8, 2024
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package was born because you understood that copying and pasting R scripts is inefficient and, on top of that, you want to share your helpful R functions with colleagues and the world…
Extensive documentation, huge paperwork, lots of manual work, lots of signatures, …
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
| Obligation level | Key word | Description |
|---|---|---|
| Duty | must | “must have” |
| Desire | should | “nice to have” |
| Intention | may | “optional” |
Purpose and Scope
The R package simulatr is intended to enable the creation of reproducible fake data.
Package Requirements
simulatr must provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of the sample size per group, the mean per group, and the standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.
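As an illustration of the seed requirement, here is a minimal sketch of how an optional seed could make the simulated data reproducible. The function name `simulateTwoGroups` and the convention `seed = NA` for "no seed" are assumptions made for this example, not part of the package specification.

```r
# Hypothetical helper: `seed = NA` means "do not fix the random number stream"
simulateTwoGroups <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NA) {
    if (!is.na(seed)) {
        set.seed(seed) # same seed, same random numbers
    }
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

# Two calls with the same seed return identical data
a <- simulateTwoGroups(n1 = 5, n2 = 5, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
b <- simulateTwoGroups(n1 = 5, n2 = 5, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
identical(a, b)
```

Leaving the seed at its default keeps the usual random behavior, so the seed stays optional as the requirement demands.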
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
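The original scripts are not reproduced here. As a purely hypothetical illustration of what self-explanatory naming can look like, the sketch below contrasts a terse one-off style with a refactored style; all object names are invented for this example.

```r
# One-off style (hypothetical): terse names and magic numbers
x <- c(rnorm(50, 5, 3), rnorm(50, 7, 4))
g <- c(rep(1, 50), rep(2, 50))
d <- data.frame(g, x)

# Refactored style (hypothetical): named parameters and descriptive objects
n1 <- 50
n2 <- 50
mean1 <- 5
mean2 <- 7
sd1 <- 3
sd2 <- 4

simulatedData <- data.frame(
    group = c(rep(1, n1), rep(2, n2)),
    values = c(
        rnorm(n = n1, mean = mean1, sd = sd1),
        rnorm(n = n2, mean = mean2, sd = sd2)
    )
)
```

Both variants produce the same kind of data frame, but only the second one tells the reader what each number means.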
Define that the result is a list which is assigned a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 11.5693750
2 1 7.3404635
3 1 1.9951822
4 1 1.5486142
5 1 3.6997158
6 1 1.8197855
7 1 4.7974151
8 1 5.9363158
9 1 1.9576641
10 1 9.5840016
11 1 8.2897101
12 1 3.8041150
13 1 7.4080279
14 1 7.8556250
15 1 1.7393297
16 1 7.6601508
17 1 7.0192073
18 1 6.3546494
19 1 2.7823651
20 1 4.2246955
21 1 -3.6293900
22 1 4.7855269
23 1 -0.3131968
24 1 1.5188898
25 1 1.7574595
26 1 5.0273721
27 1 6.0304331
28 1 2.6196664
29 1 6.0885153
30 1 3.7927841
31 1 3.8770709
32 1 7.9345873
33 1 1.3859051
34 1 4.5973378
35 1 7.2393474
36 1 7.4264402
37 1 4.2208044
38 1 -1.5457377
39 1 7.9211873
40 1 8.7994999
41 1 9.6472600
42 1 1.1720068
43 1 5.3464461
44 1 4.6280664
45 1 11.0531705
46 1 1.6871314
47 1 4.1355045
48 1 2.2057299
49 1 4.9713388
50 1 10.4105115
51 2 2.5909838
52 2 13.5848519
53 2 -0.2860471
54 2 4.4231713
55 2 2.3596517
56 2 9.6541313
57 2 8.5516826
58 2 11.0188773
59 2 9.4833127
60 2 11.3461939
61 2 1.3714247
62 2 1.3233199
63 2 11.8880661
64 2 6.4357442
65 2 7.6893199
66 2 4.1248432
67 2 5.8043052
68 2 7.9004182
69 2 8.8014256
70 2 4.9447497
71 2 7.1251847
72 2 11.6735921
73 2 6.4680892
74 2 3.2783245
75 2 4.8847914
76 2 8.0258668
77 2 8.0327691
78 2 4.8022986
79 2 13.3150540
80 2 6.9746742
81 2 12.1579331
82 2 2.5758951
83 2 0.7907483
84 2 6.9307380
85 2 13.1737993
86 2 7.9174856
87 2 14.3368072
88 2 8.8623672
89 2 5.6714398
90 2 1.0337590
91 2 5.9304270
92 2 5.2571077
93 2 3.6875055
94 2 6.9289352
95 2 6.9911529
96 2 9.9797865
97 2 6.0139200
98 2 7.0502888
99 2 11.7914250
100 2 5.1818498
attr(,"class")
[1] "SimulationResult"
Solution: implement the generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 11.6
2 1 7.34
3 1 2.00
4 1 1.55
5 1 3.70
6 1 1.82
7 1 4.80
8 1 5.94
9 1 1.96
10 1 9.58
# ℹ 90 more rows
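The body of the print method is not shown above. A minimal sketch of what such a method could look like follows, under the assumption that the arguments are printed compactly and only the first rows of the data are shown. It uses base R `head()` as a stand-in for the tibble output above, and the labels in the output are invented for this example.

```r
# Hypothetical S3 print method for objects of class "SimulationResult"
print.SimulationResult <- function(x, ...) {
    # everything except the data is treated as a simulation argument
    args <- x[names(x) != "data"]
    cat("Simulation arguments:\n")
    print(unlist(args), ...)
    cat("\nSimulated data (first rows of ", nrow(x$data), "):\n", sep = "")
    print(head(x$data, n = 10), ...)
    # print methods conventionally return their argument invisibly
    invisible(x)
}

# Small hand-made object to try the method without running a simulation
x <- structure(
    list(
        n1 = 2, n2 = 2,
        data = data.frame(group = c(1, 1, 2, 2), values = c(0.1, 0.2, 0.3, 0.4))
    ),
    class = "SimulationResult"
)
print(x)
```

Because the method is named `print.SimulationResult`, calling `print(x)` or simply typing `x` at the console dispatches to it automatically.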
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
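The definition of `playWithAssertions` is not shown here. One way such a function might be written with checkmate is sketched below, assuming that `n1` has to be a single integer-like value of at least 1; the function body is an assumption for illustration.

```r
library(checkmate)

# Hypothetical definition: n1 must be a single integer-like value >= 1
playWithAssertions <- function(n1) {
    assertInt(n1, lower = 1)
    invisible(n1)
}

playWithAssertions(1) # passes silently

# An invalid value stops with an assertion error that names the argument;
# tryCatch() is used here only to show the message without aborting the script
message(tryCatch(
    playWithAssertions(-1),
    error = function(e) conditionMessage(e)
))
```

Placing the assertions at the top of a function means invalid input fails early with a readable message instead of causing an obscure error further down.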
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
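The three expressions from the tip differ in the type of value they return, which the following short sketch makes visible. The variable names are chosen for this example only.

```r
now <- Sys.time()                 # date-time of class POSIXct
today <- Sys.Date()               # calendar date of class Date
label <- format(now, "%B %d, %Y") # character string: month name, day, year

class(now)
class(today)
class(label)
```

A date-time or date object can still be used in calculations, whereas the formatted string is only suitable for display.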
Add an additional result: the t.test result
Add an optional alternative argument and pass it through to t.test:
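A possible way to do this is sketched below. The element name `tTest`, the `seed = NA` convention, and the use of `match.arg()` are assumptions for this example; `alternative` takes the same values that `t.test` accepts.

```r
# Sketch: simulation function extended by a t-test result
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2,
        seed = NA, alternative = c("two.sided", "less", "greater")) {
    # restrict `alternative` to the values supported by t.test
    alternative <- match.arg(alternative)
    if (!is.na(seed)) {
        set.seed(seed)
    }
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # pass the alternative hypothesis through to t.test
    result$tTest <- t.test(values ~ group, data = result$data,
        alternative = alternative)
    structure(result, class = "SimulationResult")
}

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7,
    sd1 = 3, sd2 = 4, seed = 123, alternative = "less")
x$tTest$alternative
```

Using `match.arg()` gives the argument a sensible default and rejects misspelled values before the simulation runs.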
Implement the generic functions print and plot.
Tip on print
Use the plot example function from above and extend it.
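The plot example referred to in the tip is not reproduced here. As a stand-in, the following sketch shows one way a plot method could be implemented with base graphics; the choice of a box plot and the default title are assumptions for this example.

```r
# Hypothetical S3 plot method: box plot of the simulated values per group
plot.SimulationResult <- function(x, ..., main = "Simulated data by group") {
    boxplot(values ~ group, data = x$data,
        main = main, xlab = "Group", ylab = "Values", ...)
    invisible(x)
}

# Small hand-made object to try the method
x <- structure(
    list(data = data.frame(
        group = c(rep(1, 5), rep(2, 5)),
        values = c(1.2, 3.4, 2.2, 4.1, 2.9, 5.0, 6.3, 4.8, 7.1, 5.5)
    )),
    class = "SimulationResult"
)
# plot(x) draws the box plot on the active graphics device
```

Further graphical parameters such as `col` can be passed through `...` to `boxplot`.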
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable, known from the package knitr, as a generic. Tip: kable must first be defined as a generic.
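Since `knitr::kable` is an ordinary function, one common approach is to define a new S3 generic with `UseMethod()` and delegate to knitr in the methods. The sketch below assumes this approach and assumes that the table should show the `data` element; it is not necessarily the solution intended by the task.

```r
# Define kable as an S3 generic
kable <- function(x, ...) UseMethod("kable")

# Default method: fall back to knitr::kable for all other objects
kable.default <- function(x, ...) knitr::kable(x, ...)

# Method for simulation results: render the simulated data as a table
kable.SimulationResult <- function(x, ...) knitr::kable(x$data, ...)

x <- structure(
    list(data = data.frame(group = c(1, 2), values = c(0.5, 1.5))),
    class = "SimulationResult"
)

# Only run the example if knitr is installed
if (requireNamespace("knitr", quietly = TRUE)) {
    print(kable(x))
}
```

The default method keeps `kable()` working as before for data frames and matrices once the generic masks the knitr function.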
Optional extra task:
Document your functions with roxygen2
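As a starting point, a roxygen2 header for the simulation function could look like the sketch below. The wording of the descriptions is an assumption for this example; the tags follow the same pattern as the print documentation shown earlier.

```r
#' @title
#' Simulate Two-Arm Means
#'
#' @description
#' Generates normally distributed random data for two independent groups.
#'
#' @param n1,n2 sample size of group 1 and group 2.
#' @param mean1,mean2 mean of group 1 and group 2.
#' @param sd1,sd2 standard deviation of group 1 and group 2.
#'
#' @return A list of class \code{SimulationResult} that contains the
#'   arguments and a data frame \code{data} with the columns
#'   \code{group} and \code{values}.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#'     mean2 = 7, sd1 = 3, sd2 = 4)
#' class(x)
#'
#' @export
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}
```

Running `roxygen2::roxygenise()` (or `devtools::document()`) in the package directory then generates the help files and the NAMESPACE entries from these comments.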
In the current version, changes were done by (later authors): Andrew Bean
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
The source files are hosted at github.com/RCONIS/user2024-tutorial-gswep, which is forked from and a subset of the original version at github.com/RCONIS/workshop-r-swe-zrh.
Important: To use this work you must provide the name of the creators (initial authors), a link to the material, a link to the license, and indicate if changes were made.