Data Distillery Dash Episode 02

The start of a prediction pipeline for psychological interventions.

Michael Mullarkey, PhD (https://mcmullarkey.github.io)
05-13-2021

What Is Data Distillery Dash?

I take a research topic at random[1] or a dataset I’ve never seen[2], wrangle the data, create a predictive model or two, and write up a blog post: all within 2 hours live on Twitch! All the words and code in this post were written during an ~2 hour livestream. If you want to come join the party, I plan on doing this on Thursdays from 3:30-5:30 EST here.

What Data Are We Using Today?

We’re going to predict subjective feelings of being ready to go back to work at the end of an online randomized clinical trial![3] I learned about these data from a paper by Dr. Nick Jacobson, and the data come from this paper.

This is mostly so I can practice setting up a predictive modeling pipeline that allows us to predict who would benefit more from one condition or another in a clinical trial. Even in situations like this one, where the control condition seems minimal, it can actually have some therapeutic benefit. In fact, sometimes people seem to respond better to what we’ve described as psychological placebos than to what we’ve described as active treatments![4]

Let’s start with what we need to do for any predictive modeling pipeline: data cleaning/munging! This is all the more important here because the data were stored in SPSS’s .sav format. That means we need read_sav from the haven package to read the data in, and to get a sense of what the variables actually are we can use the view_df function from the sjPlot package.

I also decided not to use any of the variables that weren’t collected at baseline in this particular modeling pipeline. Why? Right now I’m most interested in better matching people to treatments before they start. I would want to use the data over time if I were more interested in trying to tailor the treatment as it was being delivered. Even though it might not look like much, cleaning/munging the data (especially since I’d never seen it before!) took a lot of the time during this stream.

library(haven)
library(tidyverse)
library(tidymodels)
library(janitor)

data <- read_sav("online_rct.sav") %>% 
  clean_names()

# glimpse(data)

library(sjPlot)

view_df(data)
Data frame: data

  ID  Name                Label                                                          Values
   1  imputation          Imputation code                                                0 = original data
   2  gruppenart          Study group                                                    1 = CG, 2 = IG
   3  sibar1bt0           Age                                                            range: 25-59
   4  sibar1at0           Sex                                                            1 = female, 2 = male
   5  indikation          Medical indication                                             1 = psychosomatic, 2 = cardiology, 3 = orthopedic
   6  schulab             Education                                                      1 = 9 years, 2 = 10 years, 3 = 13 years
   7  behandlungsdauer    Time between intake and discharge (weeks)                      range: 2-8
   8  sibar5at1           Duration of sick leaves in the last 12 months (weeks)          range: 0.0-58.0
   9  arbeitsunf          Status of work ability at T1 (physicians’ rating)              0 = treatment not properly completed / deceased, 1 = able to work, 3 = unable to work, 4 = children’s rehab, 5 = housewife / househusband, 9 = assessment not required (old-age pensioner, relative)
  10  leistft1            Work capacity T1                                               1 = fully capable, 2 = limited to not at all capable
  11  rentet1             Application for pension T1                                     1 = applied or plans to apply, 2 = not applied and no plans
  12  sibar2t1            Do you consider your work ability permanently threatened by your health status?   1 = yes, 2 = no
  13  sibar5t1            When you consider your current health and work ability: do you believe that you will be working until retirement age?   1 = yes, 2 = no
  14  login               Login after discharge                                          0 = no, 1 = yes, 2 = not yet
  15  tn_blog             Participation in intervention (min. 1 blog)                    0 = no, 1 = yes
  16  blogs               Number of blogs                                                range: 0-13
  17  zufig1at2           Frequency: reading blogs (IG)                                  1 = daily, 2 = several times a week, 3 = seldom, 4 = never
  18  zufig1bt2           Frequency: reading therapeutic feedback (IG)                   same 1-4 frequency scale
  19  zufig1ct2           Frequency: using work sheets (IG)                              same 1-4 frequency scale
  20  zufig1dt2           Frequency: listening to relaxation techniques (IG)             same 1-4 frequency scale
  21  zufig1et2           Frequency: reading posts in the forum (IG)                     same 1-4 frequency scale
  22  zufig1ft2           Frequency: writing posts in the forum (IG)                     same 1-4 frequency scale
  23  zufig2at2           Helpful: writing blogs (IG)                                    1 = not at all ... 5 = very
  24  zufig2bt2           Helpful: reading therapeutic feedback (IG)                     same 1-5 helpfulness scale
  25  zufig2ct2           Helpful: using work sheets (IG)                                same 1-5 helpfulness scale
  26  zufig2dt2           Helpful: listening to relaxation techniques (IG)               same 1-5 helpfulness scale
  27  zufig2et2           Helpful: reading posts in the forum (IG)                       same 1-5 helpfulness scale
  28  zufig2ft2           Helpful: writing posts in the forum (IG)                       same 1-5 helpfulness scale
  29  zufig2gt2           Helpful: online aftercare altogether (IG)                      same 1-5 helpfulness scale
  30  zufig3t2            Helpful: alliance with online therapist (IG)                   same 1-5 helpfulness scale
  31  zufkg1at2           Frequency: stress management info (CG)                         same 1-4 frequency scale
  32  zufkg1bt2           Frequency: healthy diet info (CG)                              same 1-4 frequency scale
  33  zufkg1ct2           Frequency: sports / physical activity info (CG)                same 1-4 frequency scale
  34  zufkg1dt2           Frequency: sleep info (CG)                                     same 1-4 frequency scale
  35  zufkg1et2           Frequency: relaxation info (CG)                                same 1-4 frequency scale
  36  zufkg2at2           Helpful: stress management info (CG)                           same 1-5 helpfulness scale
  37  zufkg2bt2           Helpful: healthy diet info (CG)                                same 1-5 helpfulness scale
  38  zufkg2ct2           Helpful: sports / physical activity info (CG)                  same 1-5 helpfulness scale
  39  zufkg2dt2           Helpful: sleep info (CG)                                       same 1-5 helpfulness scale
  40  zufkg2et2           Helpful: relaxation info (CG)                                  same 1-5 helpfulness scale
  41  zufkg2ft2           Helpful: online aftercare altogether (CG)                      same 1-5 helpfulness scale
  42  zufges_t2_dichotom  Dichotomized helpfulness of online aftercare altogether        range: 1-2
  43  spe_sum_t0          Sumscore SPE T0                                                range: 0-3
  44  spe_sum_t1          Sumscore SPE T1                                                range: 0-3
  45  spe_sum_t2          Sumscore SPE T2                                                range: 0-3
  46  spe_sum_t3          Sumscore SPE T3                                                range: 0-3
  47  phq_dept0           Sumscore (0-27) depressive symptoms (PHQ-9) T0                 range: 0.0-27.0
  48  phq_dept1           Sumscore (0-27) depressive symptoms (PHQ-9) T1                 range: 0.0-27.0
  49  phq_dept2           Sumscore (0-27) depressive symptoms (PHQ-9) T2                 range: 0.0-27.0
  50  phq_dept3           Sumscore (0-27) depressive symptoms (PHQ-9) T3                 range: 0.0-27.0
  51  phq_som_t0          Sumscore (0-30) somatoform symptoms (PHQ-15) T0                range: 0.0-26.8
  52  phq_som_t1          Sumscore (0-30) somatoform symptoms (PHQ-15) T1                range: 0.0-26.0
  53  phq_som_t2          Sumscore (0-30) somatoform symptoms (PHQ-15) T2                range: 0.0-27.7
  54  phq_som_t3          Sumscore (0-30) somatoform symptoms (PHQ-15) T3                range: 0.0-26.5
  55  phq_stress_t0       Sumscore (0-20) PHQ-stress T0                                  range: 0.0-19.0
  56  phq_stress_t1       Sumscore (0-20) PHQ-stress T1                                  range: 0.0-20.0
  57  phq_stress_t2       Sumscore (0-20) PHQ-stress T2                                  range: 0.0-20.0
  58  phq_stress_t3       Sumscore (0-20) PHQ-stress T3                                  range: 0.0-20.0
  59  gad7t0              Sumscore (0-21) generalized anxiety (GAD-7) T0                 range: 0.0-21.0
  60  gad7t1              Sumscore (0-21) generalized anxiety (GAD-7) T1                 range: 0-21
  61  gad7t2              Sumscore (0-21) generalized anxiety (GAD-7) T2                 range: 0-21
  62  gad7t3              Sumscore (0-21) generalized anxiety (GAD-7) T3                 range: 0-21
  63  ksk12t0             SF-12 Physical Component Summary (T0)                          range: 12.0-67.0
  64  psk12t0             SF-12 Mental Component Summary (T0)                            range: 7.0-70.0
  65  ksk12t1             SF-12 Physical Component Summary (T1)                          range: 17.0-72.0
  66  psk12t1             SF-12 Mental Component Summary (T1)                            range: 10.0-78.0
  67  ksk12t2             SF-12 Physical Component Summary (T2)                          range: 5.0-80.0
  68  psk12t2             SF-12 Mental Component Summary (T2)                            range: 5.0-72.0
  69  ksk12t3             SF-12 Physical Component Summary (T3)                          range: 10.0-74.0
  70  psk12t3             SF-12 Mental Component Summary (T3)                            range: 9.0-79.0
  71  av_n1_t0            AVEM Subjective importance of work T0 (stanine values)         range: 1-9
  72  av_n2_t0            AVEM Work-related ambition T0 (stanine values)                 range: 1-9
  73  av_n3_t0            AVEM Willingness to work until exhausted T0 (stanine values)   range: 1-9
  74  av_n4_t0            AVEM Striving for perfection T0 (stanine values)               range: 1-9
  75  av_n5_t0            AVEM Distancing ability T0 (stanine values)                    range: 2-8
  76  av_n6_t0            AVEM Tendency to resignation T0 (stanine values)               range: 1-9
  77  av_n7_t0            AVEM Proactive problem solving T0 (stanine values)             range: 1-9
  78  av_n8_t0            AVEM Inner calm and balance T0 (stanine values)                range: 1-9
  79  av_n9_t0            AVEM Experience of success at work T0 (stanine values)         range: 1-9
  80  av_n10_t0           AVEM Satisfaction with life T0 (stanine values)                range: 1-8
  81  av_n11_t0           AVEM Experience of social support T0 (stanine values)          range: 1-8
  82  av_n1_t1            AVEM Subjective importance of work T1 (stanine values)         range: 1-9
  83  av_n2_t1            AVEM Work-related ambition T1 (stanine values)                 range: 1-9
  84  av_n3_t1            AVEM Willingness to work until exhausted T1 (stanine values)   range: 1-9
  85  av_n4_t1            AVEM Striving for perfection T1 (stanine values)               range: 1-9
  86  av_n5_t1            AVEM Distancing ability T1 (stanine values)                    range: 1-9
  87  av_n6_t1            AVEM Tendency to resignation T1 (stanine values)               range: 1-9
  88  av_n7_t1            AVEM Proactive problem solving T1 (stanine values)             range: 1-9
  89  av_n8_t1            AVEM Inner calm and balance T1 (stanine values)                range: 1-9
  90  av_n9_t1            AVEM Experience of success at work T1 (stanine values)         range: 1-9
  91  av_n10_t1           AVEM Satisfaction with life T1 (stanine values)                range: 1-9
  92  av_n11_t1           AVEM Experience of social support T1 (stanine values)          range: 1-8
  93  av_n1_t2            AVEM Subjective importance of work T2 (stanine values)         range: 1-9
  94  av_n2_t2            AVEM Work-related ambition T2 (stanine values)                 range: 1-9
  95  av_n3_t2            AVEM Willingness to work until exhausted T2 (stanine values)   range: 1-9
  96  av_n4_t2            AVEM Striving for perfection T2 (stanine values)               range: 1-9
  97  av_n5_t2            AVEM Distancing ability T2 (stanine values)                    range: 1-9
  98  av_n6_t2            AVEM Tendency to resignation T2 (stanine values)               range: 1-9
  99  av_n7_t2            AVEM Proactive problem solving T2 (stanine values)             range: 1-9
 100  av_n8_t2            AVEM Inner calm and balance T2 (stanine values)                range: 1-9
 101  av_n9_t2            AVEM Experience of success at work T2 (stanine values)         range: 1-9
 102  av_n10_t2           AVEM Satisfaction with life T2 (stanine values)                range: 1-9
 103  av_n11_t2           AVEM Experience of social support T2 (stanine values)          range: 1-9
 104  av_n1_t3            AVEM Subjective importance of work T3 (stanine values)         range: 1-9
 105  av_n2_t3            AVEM Work-related ambition T3 (stanine values)                 range: 1-9
 106  av_n3_t3            AVEM Willingness to work until exhausted T3 (stanine values)   range: 1-9
 107  av_n4_t3            AVEM Striving for perfection T3 (stanine values)               range: 1-9
 108  av_n5_t3            AVEM Distancing ability T3 (stanine values)                    range: 1-9
 109  av_n6_t3            AVEM Tendency to resignation T3 (stanine values)               range: 1-9
 110  av_n7_t3            AVEM Proactive problem solving T3 (stanine values)             range: 1-9
 111  av_n8_t3            AVEM Inner calm and balance T3 (stanine values)                range: 1-9
 112  av_n9_t3            AVEM Experience of success at work T3 (stanine values)         range: 1-9
 113  av_n10_t3           AVEM Satisfaction with life T3 (stanine values)                range: 1-8
 114  av_n11_t3           AVEM Experience of social support T3 (stanine values)          range: 1-9
 115  sibar_risk          SIBAR risk score (0-19, cutoff 8) T0                           range: 0.0-16.0
 116  sibar_risk_di       SIBAR risk score > 8, T0                                       0 = no, 1 = yes
 117  sibar_risk_t1       SIBAR risk score (0-19, cutoff 8) T1                           range: 0.0-16.0
 118  sibar_risk_di_t1    SIBAR risk score > 8, T1                                       0 = no, 1 = yes
 119  sibar_riskt2        SIBAR risk score (0-19, cutoff 8) T2                           range: 0.0-16.0
 120  sibar_riskdi_t2     SIBAR risk score > 8, T2                                       0 = no, 1 = yes
 121  sibar_riskt3        SIBAR risk score (0-19, cutoff 8) T3                           range: 0.0-17.0
 122  sibar_riskdi_t3     SIBAR risk score > 8, T3                                       0 = no, 1 = yes
data_init <- data %>%
  rename(cond = gruppenart) %>% 
  dplyr::select(-arbeitsunf,
                -ends_with("_t1"), -ends_with("_t2"),
                -ends_with("t1"), -ends_with("t2"),
                -ends_with("1"),-ends_with("2"),
                -ends_with("_t3"),
                -ends_with("t3"),
                -ends_with("3"),
                -login, -tn_blog,
                -blogs, -zufges_t2_dichotom) %>% 
  glimpse()
Rows: 6,952
Columns: 27
$ imputation       <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cond             <dbl+lbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, …
$ sibar1bt0        <dbl> 48, 47, 48, 47, 40, 31, 53, 43, 45, 54, 48,…
$ sibar1at0        <dbl+lbl> 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ indikation       <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 2, 1, …
$ schulab          <dbl+lbl> 1, 2, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 2, …
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ spe_sum_t0       <dbl> 3, 3, 3, 2, 3, 0, 1, 1, 0, 0, 2, 1, 2, 3, 0…
$ phq_dept0        <dbl> 20.00, 21.00, 20.25, 25.00, 23.00, 21.00, 1…
$ phq_som_t0       <dbl> 23.571429, 18.000000, 22.500000, 26.785714,…
$ phq_stress_t0    <dbl> 11.00000, 16.00000, 15.55556, 17.00000, 15.…
$ gad7t0           <dbl> 17, 15, 19, 21, 15, 13, 10, 13, 9, 12, 5, 6…
$ ksk12t0          <dbl> 32.39855, 31.59306, 24.46546, 50.51283, 39.…
$ psk12t0          <dbl> 23.66914, 29.21420, 30.79812, 13.90037, 23.…
$ av_n1_t0         <dbl> 4, 3, 5, 5, 2, 3, 4, 8, 4, 5, 5, 4, 5, 2, 2…
$ av_n2_t0         <dbl> 5, 5, 1, 8, 4, 2, 4, 9, 4, 7, 4, 5, 6, 3, 3…
$ av_n3_t0         <dbl> 8, 5, 9, 9, 8, 8, 7, 9, 4, 5, 3, 8, 6, 6, 4…
$ av_n4_t0         <dbl> 8, 6, 7, 9, 9, 3, 6, 9, 3, 8, 2, 3, 6, 5, 4…
$ av_n5_t0         <dbl> 3, 5, 3, 3, 3, 6, 5, 3, 5, 4, 4, 4, 4, 6, 4…
$ av_n6_t0         <dbl> 8, 8, 8, 8, 9, 6, 4, 7, 3, 4, 3, 2, 9, 7, 7…
$ av_n7_t0         <dbl> 7, 4, 2, 6, 1, 1, 5, 3, 5, 6, 1, 8, 1, 3, 1…
$ av_n8_t0         <dbl> 3, 3, 3, 1, 1, 3, 6, 4, 7, 4, 6, 5, 3, 1, 3…
$ av_n9_t0         <dbl> 5, 2, 3, 6, 1, 3, 3, 7, 5, 9, 3, 5, 1, 8, 3…
$ av_n10_t0        <dbl> 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 4, 5, 1, 4, 2…
$ av_n11_t0        <dbl> 2, 2, 1, 3, 2, 2, 3, 4, 6, 6, 6, 4, 3, 2, 3…
$ sibar_risk       <dbl> 10.0, 8.0, 13.0, 5.0, 11.0, 5.0, 6.0, 2.5, …
$ sibar_risk_di    <dbl+lbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
# Pull out the outcome: the SPE sum score at T3, the final follow-up

outcome <- data %>% 
  dplyr::select(spe_sum_t3) # %>% 
  # glimpse()

data_updt <- data_init %>% 
  bind_cols(outcome) %>% 
   mutate(across(
    c(sibar1at0:schulab, imputation,cond,sibar_risk_di),
    as.factor
  )) %>% 
  filter(imputation == "0" & !is.na(spe_sum_t3)) %>%
  mutate(id = row_number()) %>% 
  relocate(id, everything()) %>% 
  glimpse()
Rows: 445
Columns: 29
$ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ imputation       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ cond             <fct> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
$ sibar1bt0        <dbl> 47, 47, 40, 31, 53, 54, 48, 49, 40, 31, 35,…
$ sibar1at0        <fct> 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2…
$ indikation       <fct> 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1…
$ schulab          <fct> 2, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 1, 2…
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, 8, 7, 7, 7, 7, …
$ spe_sum_t0       <dbl> 3, 2, 3, 0, 1, 0, 2, 1, 3, 0, 1, 2, 1, 1, 1…
$ phq_dept0        <dbl> 21, 25, 23, 21, 11, 11, 10, 3, 6, 15, 13, 1…
$ phq_som_t0       <dbl> 18.000000, 26.785714, 19.000000, 13.000000,…
$ phq_stress_t0    <dbl> 16.000000, 17.000000, 15.000000, 15.000000,…
$ gad7t0           <dbl> 15, 21, 15, 13, 10, 12, 5, 6, 14, 11, 11, 2…
$ ksk12t0          <dbl> 31.59306, 50.51283, 39.60604, 52.61199, 36.…
$ psk12t0          <dbl> 29.21420, 13.90037, 23.51996, 15.58050, 27.…
$ av_n1_t0         <dbl> 3, 5, 2, 3, 4, 5, 5, 4, 2, 2, 3, 5, 4, 5, 4…
$ av_n2_t0         <dbl> 5, 8, 4, 2, 4, 7, 4, 5, 3, 3, 5, 4, 2, 5, 2…
$ av_n3_t0         <dbl> 5, 9, 8, 8, 7, 5, 3, 8, 6, 4, 8, 5, 8, 6, 3…
$ av_n4_t0         <dbl> 6, 9, 9, 3, 6, 8, 2, 3, 5, 4, 6, 5, 4, 8, 3…
$ av_n5_t0         <dbl> 5, 3, 3, 6, 5, 4, 4, 4, 6, 4, 3, 6, 4, 4, 7…
$ av_n6_t0         <dbl> 8, 8, 9, 6, 4, 4, 3, 2, 7, 7, 9, 6, 5, 9, 4…
$ av_n7_t0         <dbl> 4, 6, 1, 1, 5, 6, 1, 8, 3, 1, 1, 4, 1, 4, 4…
$ av_n8_t0         <dbl> 3, 1, 1, 3, 6, 4, 6, 5, 1, 3, 1, 6, 1, 4, 4…
$ av_n9_t0         <dbl> 2, 6, 1, 3, 3, 9, 3, 5, 8, 3, 1, 5, 1, 3, 3…
$ av_n10_t0        <dbl> 1, 2, 1, 1, 2, 1, 4, 5, 4, 2, 1, 4, 1, 2, 4…
$ av_n11_t0        <dbl> 2, 3, 2, 2, 3, 6, 6, 4, 2, 3, 2, 5, 1, 4, 4…
$ sibar_risk       <dbl> 8.0, 5.0, 11.0, 5.0, 6.0, 1.0, 3.0, 2.5, 9.…
$ sibar_risk_di    <fct> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
$ spe_sum_t3       <dbl> 3, 0, 3, 0, 0, 1, 0, 2, 2, 0, 0, 2, 2, 2, 2…
library(skimr)
# skim(data_updt)

For reasons I’ll go into more in a subsequent post, I want to create predictions separately for folks in the intervention group and folks in the control group. The short version of why: I want to be able to predict not just how well I would expect people to respond to the intervention they actually received, but how well the model thinks they would have done if they had received the other intervention. Building separate models by treatment condition isn’t the only way to do this, but it seems like the cleanest approach, and the one least likely to lead to data leakage, that I can think of right now.[5]

On top of that, I’ll need to test whether those hypothetical predictions actually generalize to held-out data. So we need to hold out at least some folks from both intervention groups to ultimately test whether our predictions about the “optimal intervention” are any good. If I could turn back time, I would just use the initial_split() function from the rsample package within tidymodels to create the test sets for both the intervention and control groups, primarily because I’m not 100% sure the approach below is reproducible and I’m not stratifying by the outcome.[6] A sketch of that alternative follows the splitting code below.

# Breaking out the data into the separate treatment groups

data_intv_init <- data_updt %>% 
  filter(cond == "2") %>% 
  as.data.frame() # %>% 
  #glimpse()

data_ctrl_init <- data_updt %>% 
  filter(cond == "1") %>% 
  as.data.frame() # %>% 
  #glimpse()

# Pulling out folks for test set later

test_intv <- data_intv_init %>% 
  slice_sample(n = 40)

data_intv <- data_intv_init %>% 
  anti_join(test_intv, by = "id") #%>% 
  #glimpse()

test_ctrl <- data_ctrl_init %>% 
  slice_sample(n = 40)

data_ctrl <- data_ctrl_init %>% 
  anti_join(test_ctrl, by = "id") #%>% 
  #glimpse()
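
For reference, here’s a minimal sketch of the initial_split() alternative mentioned above. It isn’t what ran during the stream: the seed, the 80/20 proportion, and the _alt object names are illustrative choices, but stratifying by the outcome and setting a seed would make the held-out sets reproducible.

# A reproducible, stratified alternative to the slice_sample() approach
# (rsample is already attached via tidymodels)

set.seed(33)
intv_split <- initial_split(data_intv_init, prop = 0.8, strata = spe_sum_t3)
data_intv_alt <- training(intv_split)
test_intv_alt <- testing(intv_split)

set.seed(33)
ctrl_split <- initial_split(data_ctrl_init, prop = 0.8, strata = spe_sum_t3)
data_ctrl_alt <- training(ctrl_split)
test_ctrl_alt <- testing(ctrl_split)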

Then I used the recipes package to preprocess the data! I wanted to remove imputation (which didn’t vary and wasn’t relevant), designate id as an id variable rather than a predictor, collapse rare factor levels into an “other” category, one-hot encode the categorical variables[7], exclude near-zero variance predictors, normalize all predictors, and impute any missing data in predictors using k-nearest neighbors.

I then set up a tidymodels workflow using an out-of-the-box xgboost model[8] and made my predictions for the intervention group first.

intv_rec <-
  recipe(spe_sum_t3 ~ ., data = data_intv) %>%
  step_rm(imputation) %>% 
  update_role(id, new_role = "id") %>% 
  step_other(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_nzv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_impute_knn(all_predictors())

# Testing the recipe to make sure it produces data

intv_rec %>% 
  prep(verbose = TRUE) %>% 
  bake(new_data = NULL)
oper 1 step rm [training] 
oper 2 step other [training] 
oper 3 step dummy [training] 
oper 4 step nzv [training] 
oper 5 step normalize [training] 
oper 6 step impute knn [training] 
The retained training set is ~ 0.06 Mb  in memory.
# A tibble: 173 x 33
      id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
   <int>     <dbl>            <dbl>      <dbl>     <dbl>      <dbl>
 1     2   -0.326             0.805     0.930     2.41        2.79 
 2     3   -1.38              1.14      1.94      2.08        1.33 
 3     6    0.724            -0.866    -1.08      0.0881     -0.725
 4     7   -0.176            -1.03      0.930    -0.0777      0.210
 5     8   -0.0260            3.15     -0.0756   -1.24       -0.538
 6     9   -1.38              2.31      1.94     -0.741      -0.164
 7    10   -2.72              2.31     -1.08      0.752      -0.538
 8    12   -2.87              2.31      0.930     0.420       0.383
 9    13   -0.0260            2.31     -0.0756    2.74        1.14 
10    14   -1.83              2.31     -0.0756    1.25        0.383
# … with 163 more rows, and 27 more variables: phq_stress_t0 <dbl>,
#   gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
#   av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
#   av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
#   av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
#   spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
#   indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
#   schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
#   sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>

# Create a model

xg_mod <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Creating a workflow

xg_wf <-
  workflow() %>% 
  add_recipe(intv_rec) %>% 
  add_model(xg_mod)
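
# Five-fold cross-validation, repeated five times, stratified by the outcome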
intv_folds <- vfold_cv(data_intv, v = 5, repeats = 5, strata = spe_sum_t3)

keep_pred <- control_resamples(save_pred = TRUE)

library(tictoc)
tic()
set.seed(33)
xg_rs <-
  xg_wf %>% 
  fit_resamples(intv_folds, control = keep_pred)
toc()
10.855 sec elapsed
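
Footnote 8 notwithstanding, here’s a rough sketch of what a tuned version could look like if there were time. The tune() placeholders and the grid size are illustrative assumptions rather than anything that ran on stream:

# Mark a few xgboost hyperparameters for tuning instead of using defaults
xg_mod_tuned <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Swap the tuned model into the existing workflow and search a small grid
# xg_tuned_rs <- xg_wf %>% 
#   update_model(xg_mod_tuned) %>% 
#   tune_grid(resamples = intv_folds, grid = 20)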

Overall we get predictions that aren’t terribly far off from the predictions of changes in anxiety and depression from Nick’s original paper (which used an ensemble model stacked on a bunch of base learners). We can’t make an apples-to-apples comparison since he and his co-author were predicting different outcomes, but at least I’m not wildly off from where they were.

xg_rs %>% 
  collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   0.967    25  0.0177 Preprocessor1_Model1
2 rsq     standard   0.193    25  0.0202 Preprocessor1_Model1
r2_intv <- xg_rs %>% 
  collect_metrics(summarize = TRUE) %>% 
  filter(str_detect(.metric, "rsq") == TRUE) %>% 
  dplyr::select(mean) %>% 
  deframe() 

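# Taking the square root of the mean R-squared gives a rough
# correlation-scale (r) summary of model fit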
sqrt(r2_intv)
[1] 0.4387755
intv_preds <- xg_rs %>% 
  collect_predictions(summarize = TRUE)

# In order to get predictions in the placebo group I have to fit a workflow first

# fit_workflow <- fit(xg_wf, data_ctrl)

I then do the same preprocessing and model fitting for the control group!

ctrl_rec <-
  recipe(spe_sum_t3 ~ ., data = data_ctrl) %>%
  step_rm(imputation) %>% 
  update_role(id, new_role = "id") %>% 
  step_other(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_nzv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_impute_knn(all_predictors())

# Testing the recipe to make sure it produces data

ctrl_rec %>% 
  prep(verbose = TRUE) %>% 
  bake(new_data = NULL)
oper 1 step rm [training] 
oper 2 step other [training] 
oper 3 step dummy [training] 
oper 4 step nzv [training] 
oper 5 step normalize [training] 
oper 6 step impute knn [training] 
The retained training set is ~ 0.06 Mb  in memory.
# A tibble: 192 x 33
      id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
   <int>     <dbl>            <dbl>      <dbl>     <dbl>      <dbl>
 1     4   -2.60               1.16     -1.26      1.88      0.313 
 2    11   -2.03               2.64     -0.199     0.481    -0.672 
 3    16   -0.191              1.72      0.863     0.830     1.10  
 4    19   -0.0487             1.72     -0.199     0.133    -0.0813
 5    20    0.802              1.72     -0.199     0.656     1.10  
 6    21    0.0931             1.72      0.863     0.656    -0.869 
 7    22   -1.61               1.72     -0.199     1.70      0.904 
 8    23   -1.61               1.72     -0.199    -0.565    -0.672 
 9    24    0.518              1.72      1.93      0.656     0.313 
10    25   -1.75               1.72      0.863     0.656    -0.278 
# … with 182 more rows, and 27 more variables: phq_stress_t0 <dbl>,
#   gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
#   av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
#   av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
#   av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
#   spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
#   indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
#   schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
#   sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>

# Create a model

xg_mod <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Creating a workflow

xg_wf_ctrl <-
  workflow() %>% 
  add_recipe(ctrl_rec) %>% 
  add_model(xg_mod)
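
# Same resampling scheme as before, now for the control group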
ctrl_folds <- vfold_cv(data_ctrl, v = 5, repeats = 5, strata = spe_sum_t3)

keep_pred <- control_resamples(save_pred = TRUE)

library(tictoc)
tic()
set.seed(33)
xg_rs_ctrl <-
  xg_wf_ctrl %>% 
  fit_resamples(ctrl_folds, control = keep_pred)
toc()
10.639 sec elapsed

Overall the predictions for the control group aren’t quite as good, but we’re still doing ok relative to the original predictions on other kinds of outcomes from Nick’s paper.

xg_rs_ctrl %>% 
  collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   1.08     25  0.0203 Preprocessor1_Model1
2 rsq     standard   0.164    25  0.0175 Preprocessor1_Model1
r2_ctrl <- xg_rs_ctrl %>% 
  collect_metrics(summarize = TRUE) %>% 
  filter(str_detect(.metric, "rsq") == TRUE) %>% 
  dplyr::select(mean) %>% 
  deframe() 

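# Again converting the mean R-squared to a correlation-scale summary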
sqrt(r2_ctrl)
[1] 0.4047978
ctrl_preds <- xg_rs_ctrl %>% 
  collect_predictions(summarize = TRUE)

What We’ll Get To in Part 2

In a non-Dash post, I’ll write part 2 of this analysis, where I finish out the modeling process.

Next, we’ll need to use these models to predict how well we think folks would do if they had ended up in the opposite intervention (since we already have predictions of how they would do in the intervention they actually got). Once we have those estimates, we’ll be able to see whether folks in a held-out sample who we predict would benefit more from one intervention than another actually do!
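
Here’s a minimal sketch of what that crossover step might look like, assuming the workflows and held-out sets defined above (a preview under those assumptions, not code that ran during the stream):

# Fit each condition's workflow on all of its training data
fit_intv <- fit(xg_wf, data_intv)
fit_ctrl <- fit(xg_wf_ctrl, data_ctrl)

# Score every held-out person with both models, giving each person a
# predicted outcome under the intervention and under the control condition
test_all <- bind_rows(test_intv, test_ctrl)

crossover_preds <- test_all %>% 
  mutate(
    pred_under_intv = predict(fit_intv, new_data = test_all)$.pred,
    pred_under_ctrl = predict(fit_ctrl, new_data = test_all)$.pred
  )

# Comparing the two prediction columns (given the direction of the SPE
# scale) would flag each person's predicted "optimal intervention"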


  1. Yikes

  2. Double yikes

  3. Or at least make a good start; I don’t think there’s any way I’ll finish in one 2-hour chunk

  4. Or at least respond just as well

  5. I’m looking forward to continuing to learn more about this area and might change my mind!

  6. With the 2-hour time limit I didn’t have time to re-write the code!

  7. Like dummy coding, but everything gets its own set of codes, plus it doesn’t sound vaguely ableist

  8. No time for tuning in Data Distillery Dash!