Cost distribution among software process activities
Short Course: Good Software Engineering Practice for R Packages
August 12, 2025
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package is born: you have realized that copying and pasting R scripts is inefficient, and you also want to share your helpful R functions with colleagues and the world…
Photo CC0 by ELEVATE on pexels.com
Extensive documentation, huge paperwork, lots of manual work, lots of signatures, …
Photo CC0 by Chevanon Photography on pexels.com
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Boehm, B. (1981). Software Engineering Economics. Prentice Hall.
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Photo CC0 by Kateryna Babaieva on pexels.com
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | must | “must have” |
Desire | should | “nice to have” |
Intention | may | “optional” |
Purpose and Scope
The R package simulatr is intended to enable the creation of reproducible fake data.
Package Requirements
simulatr must provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of the sample size, mean, and standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
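As a small illustration of this naming guideline, the following sketch contrasts a terse one-off snippet with a version that uses descriptive names. Both snippets and all object names in them are hypothetical and not taken from the course material.

```r
# One-off style: short, cryptic names (hypothetical example)
n <- 50
x <- c(rnorm(n, 5, 3), rnorm(n, 7, 4))
g <- rep(1:2, each = n)

# Refactored style: each name states what the object holds
sampleSizePerGroup <- 50
simulatedValues <- c(
    rnorm(n = sampleSizePerGroup, mean = 5, sd = 3),
    rnorm(n = sampleSizePerGroup, mean = 7, sd = 4)
)
groupLabels <- rep(1:2, each = sampleSizePerGroup)
simulatedData <- data.frame(group = groupLabels, values = simulatedValues)
```

The second version needs no comments to explain what `simulatedValues` or `groupLabels` contain.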
Define the result as a list and assign it a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
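The package requirements ask for reproducibility via an optional seed, which the function above does not yet provide. The following is a minimal sketch of one possible way to add it; the default `seed = NULL` and storing the seed in the result are assumptions, not part of the original code.

```r
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NULL) {
    # assumption: NULL means "leave the random number generator state untouched"
    if (!is.null(seed)) {
        set.seed(seed)
    }
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2,
        seed = seed
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}

# two calls with the same seed should yield identical data
first <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, seed = 123)
second <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, seed = 123)
identical(first$data, second$data)
```

Note that `set.seed()` changes the global random number generator state, which may or may not be desirable inside a package function.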
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 8.64218153
2 1 10.91226374
3 1 4.98990735
4 1 3.79359415
5 1 8.93235219
6 1 3.45334903
7 1 15.41808102
8 1 1.21796744
9 1 3.35649889
10 1 2.25498157
11 1 -0.37714311
12 1 3.60553906
13 1 6.13896903
14 1 -2.91886075
15 1 7.29617732
16 1 3.36024176
17 1 2.92418256
18 1 7.07479597
19 1 1.07617830
20 1 9.86282624
21 1 7.98214428
22 1 6.16850528
23 1 7.27313995
24 1 9.13085791
25 1 2.22992214
26 1 1.35226663
27 1 2.87387133
28 1 1.91652946
29 1 10.82354483
30 1 1.75103998
31 1 4.67699187
32 1 0.66234909
33 1 7.84656008
34 1 4.12680527
35 1 3.80270242
36 1 8.15205838
37 1 6.32891444
38 1 4.83506199
39 1 2.46015430
40 1 -1.89478462
41 1 12.35935634
42 1 6.86726023
43 1 3.78855406
44 1 5.04217363
45 1 0.08343361
46 1 7.20014144
47 1 6.18135414
48 1 -1.72764643
49 1 8.18389042
50 1 0.39340892
51 2 4.12868187
52 2 7.17012708
53 2 7.84531088
54 2 13.24874264
55 2 3.56378939
56 2 6.55268334
57 2 16.16230343
58 2 6.56316727
59 2 5.37347462
60 2 3.40636184
61 2 12.71588054
62 2 9.79670036
63 2 5.99278457
64 2 8.27316355
65 2 10.33232979
66 2 6.39699620
67 2 13.95158726
68 2 5.93473956
69 2 4.62064267
70 2 4.17495108
71 2 11.14784347
72 2 -1.20128354
73 2 1.09235579
74 2 0.04896084
75 2 8.69846556
76 2 6.04000300
77 2 5.34799958
78 2 16.78808504
79 2 2.33264842
80 2 10.87346752
81 2 12.52031005
82 2 6.59318294
83 2 -0.15999628
84 2 4.04766756
85 2 11.20859771
86 2 10.02240058
87 2 6.72913834
88 2 12.22252099
89 2 8.78946651
90 2 6.24318357
91 2 4.42832620
92 2 7.85946776
93 2 6.54342340
94 2 9.31896537
95 2 12.09707605
96 2 4.55376951
97 2 2.01371568
98 2 7.45718502
99 2 7.86063272
100 2 6.81068531
attr(,"class")
[1] "SimulationResult"
Solution: implement the generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 8.64
2 1 10.9
3 1 4.99
4 1 3.79
5 1 8.93
6 1 3.45
7 1 15.4
8 1 1.22
9 1 3.36
10 1 2.25
# ℹ 90 more rows
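The body of the print method that produced this output is not shown. The following is a minimal base-R sketch of a method that yields a similarly compact output; the original appears to convert the data to a tibble, whereas this sketch uses `head()` instead, and the limit of 10 rows is an assumption.

```r
print.SimulationResult <- function(x, ...) {
    # collect the scalar arguments in one named character vector
    argumentNames <- c("n1", "n2", "mean1", "mean2", "sd1", "sd2")
    args <- vapply(unclass(x)[argumentNames], as.character, character(1))
    # show only the first rows of the data (assumed limit: 10)
    shownRows <- min(10, nrow(x$data))
    print(list(args = args, data = head(x$data, n = shownRows)), ...)
    hiddenRows <- nrow(x$data) - shownRows
    if (hiddenRows > 0) {
        cat("#", hiddenRows, "more rows\n")
    }
    # print methods conventionally return their argument invisibly
    invisible(x)
}

# hand-built example object with the same structure as the simulation result
x <- structure(
    list(
        n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4,
        data = data.frame(group = rep(1:2, each = 50), values = rnorm(100))
    ),
    class = "SimulationResult"
)
print(x)
```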
Photo CC0 by Pixabay on pexels.com
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
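The code behind this error message is not shown. A minimal sketch of what such a function might look like follows; the body of `playWithAssertions` is an assumption, chosen so that a negative sample size is rejected by checkmate's `assertInt()`.

```r
library(checkmate)

# hypothetical helper, named after the function in the error message above
playWithAssertions <- function(n1) {
    # n1 must be a single integer-like value of at least 1
    assertInt(n1, lower = 1)
    invisible(n1)
}

playWithAssertions(10) # passes silently
# playWithAssertions(-1) would stop with an assertion error on 'n1'
```

Placing such checks at the top of a function lets invalid input fail early with a readable message instead of causing an obscure error further down.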
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
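The three expressions in this tip return different kinds of objects, which matters when deciding what to store in the result. A short sketch (the variable names are illustrative only):

```r
creationTimestamp <- Sys.time() # POSIXct date-time
creationLabel <- format(creationTimestamp, "%B %d, %Y") # formatted character string
creationDate <- Sys.Date() # Date without a time component

class(creationTimestamp)
class(creationLabel)
class(creationDate)
```

Storing the `POSIXct` value keeps the most information; a formatted string is convenient for display but harder to compute with.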
Add an additional result: t.test result
Add an optional alternative argument and pass it through to t.test:
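A possible way to approach this task is sketched here. The element name `tTest` and the use of `match.arg()` for validating `alternative` are assumptions; the three allowed values are those documented for `t.test()`.

```r
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2,
        alternative = c("two.sided", "less", "greater")) {
    # match.arg() selects "two.sided" by default and rejects unknown values
    alternative <- match.arg(alternative)
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2,
        alternative = alternative
    )
    valuesGroup1 <- rnorm(n = n1, mean = mean1, sd = sd1)
    valuesGroup2 <- rnorm(n = n2, mean = mean2, sd = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(valuesGroup1, valuesGroup2)
    )
    # hypothetical element name; stores the full "htest" object
    result$tTest <- t.test(
        x = valuesGroup1, y = valuesGroup2,
        alternative = alternative
    )
    structure(result, class = "SimulationResult")
}

oneSided <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, alternative = "less")
oneSided$tTest$alternative
```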
Implement the generic functions print and plot.
Tip on print
Use the print example function from above and extend it.
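For the plot part of this task, a minimal sketch of an S3 plot method could look as follows. The choice of a boxplot and the axis labels are assumptions; any base graphics or ggplot2 display of the two groups would serve the same purpose.

```r
plot.SimulationResult <- function(x, ...) {
    # one box per group; further arguments are passed on to boxplot()
    boxplot(
        values ~ group,
        data = x$data,
        xlab = "Group", ylab = "Simulated values",
        ...
    )
    invisible(x)
}

# hand-built example object with the same structure as the simulation result
x <- structure(
    list(data = data.frame(group = rep(1:2, each = 50), values = rnorm(100))),
    class = "SimulationResult"
)
plot(x, main = "Simulated two-arm data")
```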
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable known from the package knitr as generic. Tip: define kable itself as a generic function first.
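One way to turn `kable` into an S3 generic is sketched here. This is an assumption about how the task could be solved, not necessarily the code from the course; the `SimulationResult` method simply tabulates the `data` element.

```r
# make kable an S3 generic that dispatches on the class of x
kable <- function(x, ...) UseMethod("kable")

# fall back to knitr for every class without a dedicated method
kable.default <- function(x, ...) knitr::kable(x, ...)

# method for the simulation result: tabulate only the simulated data
kable.SimulationResult <- function(x, ...) knitr::kable(x$data, ...)
```

With these definitions, calling `kable()` on a data frame still reaches `knitr::kable()` through the default method, while a `SimulationResult` object is handled by its own method.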
Optional extra task:
Document your functions with roxygen2