First we need to load the necessary packages and set some default options for the MCMC sampling. We also set the theme for the plots to theme_bw with a base size of 12.
Show the code
library(bayesplot)library(brms)library(ggplot2)library(gt)library(here)library(janitor)library(jmpost)library(modelr)library(posterior)library(readxl)library(rstan)library(tidybayes)library(tidyverse)library(truncnorm)library(fuzzyjoin)library(sn)library(glue)if (require(cmdstanr)) {# If cmdstanr is available, instruct brms to use cmdstanr as backend# and cache all Stan binariesoptions(brms.backend ="cmdstanr",cmdstanr_write_stan_file_dir =here("_brms-cache") )dir.create(here("_brms-cache"), FALSE) # create cache directory if not yet available} else { rstan::rstan_options(auto_write =TRUE)}# MCMC optionsoptions(mc.cores =4)ITER <-1000# number of sampling iterations after warm upWARMUP <-2000# number of warm up iterationsCHAINS <-4BAYES.SEED <-878REFRESH <-500theme_set(theme_bw(base_size =12))
We also need a small function definition, which is still missing in brms:
id year sld
5394902984416364544 : 16 Min. :-0.1068 Min. : 2.00
7160320731596789760 : 16 1st Qu.: 0.1123 1st Qu.: 19.05
-4159496062492130816: 15 Median : 0.2519 Median : 32.00
-7900338178541499392: 15 Mean : 0.3995 Mean : 38.16
4878279915140903936 : 15 3rd Qu.: 0.5749 3rd Qu.: 51.00
9202154259772598272 : 15 Max. : 2.0753 Max. :228.00
(Other) :4034 NA's :27
arm
Docetaxel:1658
MPDL3280A:2468
Here we have 701 patients, and we know from Fig. 1 in the publication that this is the subset of patients with at least 3 tumor size measurements. We have guessed from the second Excel sheet with parameter estimates that arm 1 means docetaxel and arm 2 means MPDL3280A. This also seems reasonable given the larger number of measurements per patient:
Show the code
# make a tabletumor_data |>group_by(id) |>summarize(arm = arm[1], n =n()) |>group_by(arm) |>summarize(mean_n =mean(n))
# A tibble: 2 × 2
arm mean_n
<fct> <dbl>
1 Docetaxel 5.10
2 MPDL3280A 6.56
For the OS data, we will be using the Supplementary Table 8 from here. Again, we have copied the data set in this GitHub repository.
From the supplementary material, we know that:
PFS/OS: time in months
PFS.CNSR/OS.CNSR: 1 for censor, 0 for event
Therefore we will convert the time to years and the censoring to a logical variable:
# A tibble: 6 × 12
id arm ecog age race sex sld response pfs_time pfs_event os_time
<fct> <fct> <fct> <dbl> <fct> <fct> <dbl> <fct> <dbl> <lgl> <dbl>
1 318 Docet… 1 64 WHITE M 152 PD 0.107 TRUE 0.433
2 1088 Docet… 0 65 WHITE M 36 PD 0.194 TRUE 0.402
3 345 Docet… 1 75 WHITE M 92 <NA> 0.162 TRUE 0.162
4 588 Docet… 0 61 WHITE F 36 SD 1.02 TRUE 2.05
5 306 MPDL3… 1 53 WHITE F 86 PD 0.118 TRUE 0.498
6 718 Docet… 1 80 WHITE M 45.7 SD 0.712 TRUE 1.60
# ℹ 1 more variable: os_event <lgl>
Show the code
summary(os_data)
id arm ecog age race sex
1000 : 1 Docetaxel:425 0:315 Min. :33.00 ASIAN:180 F:330
1001 : 1 MPDL3280A:425 1:535 1st Qu.:57.00 OTHER: 72 M:520
1002 : 1 Median :64.00 WHITE:598
1003 : 1 Mean :63.23
1004 : 1 3rd Qu.:70.00
1005 : 1 Max. :85.00
(Other):844
sld response pfs_time pfs_event
Min. : 10.00 CR : 7 Min. :0.002738 Mode :logical
1st Qu.: 41.00 PR :108 1st Qu.:0.117728 FALSE:95
Median : 66.00 SD :327 Median :0.240931 TRUE :755
Mean : 76.69 PD :304 Mean :0.443174
3rd Qu.: 99.00 NE : 24 3rd Qu.:0.580424
Max. :316.00 NA's: 80 Max. :2.242300
NA's :1
os_time os_event
Min. :0.002738 Mode :logical
1st Qu.:0.364134 FALSE:281
Median :0.803559 TRUE :569
Mean :0.949841
3rd Qu.:1.620808
Max. :2.253251
Now we would like to join the two data sets, i.e. know which tumor growth data belongs to which baseline covariates and overall survival time and event indicator. Unfortunately we cannot use the id variable as it is not consistent between the two data sets, because both were anonymized. Instead, we will use the sld variable in combination with the treatment arm. We can also double check with the best overall response variable, because that is closely related to the tumor growth.
First, we will summarize the patient information int he tumor data set. We will calculate the baseline SLD, determine the time of the last tumor measurement and approximate the best overall response by looking at the change from baseline:
Now let’s try to fuzzy join this with the OS data set, based on the baseline SLD and the treatment arm. In addition, we know that the last time point in the tumor data needs to be before the OS time point. We will also check that the best overall response is the same in both data sets. There will be some errors here: For example, we don’t know whether there were new lesions leading to a progression assessment, or whether the patient had later better results. So we need to expect that the matching will not be perfect.
We see that we cannot match the two data sets exactly:
We have less patients to start with in the tumor size data set.
We have patients in the tumor size data set which match multiple patients in the OS data set based on the treatment arm and the baseline SLD, and also the last measurement time point cannot disambiguate them.
We have patients in the tumor size data set which do not match any patient in the OS data set.
However, for the purpose of this training we don’t need to be able to recover the true full data set. Therefore we will just subset the matched data set to a reasonable combination of the two data sets:
---title: "0. Setup and data preparation"author: - Francois Mercier - Daniel Sabanés Bovédate: last-modifiededitor_options: chunk_output_type: inlineformat: html: code-fold: show---## Setup{{< include _setup.qmd >}}## Data preparationWe prepare now the same dataset as before in the OS session.{{< include _load_data.qmd >}}