The start of a prediction pipeline for psychological interventions.
I take a research topic at random1 or a dataset I’ve never seen2, wrangle the data, create a predictive model or two, and write up a blog post: all within 2 hours, live on Twitch! All the words and code in this post were written during an ~2-hour livestream. If you want to come join the party, I plan on doing this on Thursdays from 3:30-5:30 EST here.
We’re going to predict subjective feelings of being ready to go back to work at the end of an online randomized clinical trial!3 I learned about this dataset from a paper by Dr. Nick Jacobson, and the data come from this paper.
This is mostly so I can practice setting up a predictive modeling pipeline that lets us predict who would benefit more from one condition or another in a clinical trial. Even in situations like this where the control condition seems minimal, it can actually have some therapeutic benefit. In fact, sometimes people seem to respond better to what we’ve described as psychological placebos than to what we’ve described as active treatments!4
Let’s start with what we need to do for any predictive modeling pipeline: data cleaning/munging! This is all the more important here because the data were stored in SPSS’s .sav format. That means we need read_sav() from the haven package to read the data in, and the view_df() function from the sjPlot package to get a sense of what the variables actually are.
I also decided not to use any of the variables that weren’t collected at baseline in this particular modeling pipeline. Why? Right now I’m most interested in better matching people to treatments before they start. I would want to use the data over time if I were more interested in trying to tailor the treatment as it was being delivered. Even though it might not look like much, cleaning/munging the data (especially since I’d never seen it before!) took a lot of the time during this stream.
library(haven)
library(tidyverse)
library(tidymodels)
library(janitor)
data <- read_sav("online_rct.sav") %>%
clean_names()
# glimpse(data)
library(sjPlot)
view_df(data)
ID | Name | Label | Values | Value Labels |
---|---|---|---|---|
1 | imputation | Imputation code | 0 | original data |
2 | gruppenart | Study group | 1, 2 | CG; IG |
3 | sibar1bt0 | Age | range: 25-59 | |
4 | sibar1at0 | Sex | 1, 2 | female; male |
5 | indikation | Medical Indication | 1, 2, 3 | psychosomatic; cardiology; orthopedic |
6 | schulab | Education | 1, 2, 3 | 9 years; 10 years; 13 years |
7 | behandlungsdauer | Time between intake and discharge (weeks) | range: 2-8 | |
8 | sibar5at1 | Duration of sick leaves in the last 12 months (weeks) | range: 0.0-58.0 | |
9 | arbeitsunf | Status of work ability at T1 (physicians’ rating) | 0, 1, 3, 4, 5, 9 | measure not properly completed / deceased; able to work; unable to work; child rehabilitation; homemaker; assessment not required (old-age pensioner, relative) |
10 | leistft1 | Work capacity T1 | 1, 2 | fully capable; limited to not at all capable |
11 | rentet1 | Application for pension T1 | 1, 2 | applied or plans to apply; not applied and no plans |
12 | sibar2t1 | Do you consider your work ability permanently threatened by your health status? | 1, 2 | yes; no |
13 | sibar5t1 | When you consider your current health and work ability: Do you believe that you will be working until retirement age? | 1, 2 | yes; no |
14 | login | Login after discharge | 0, 1, 2 | no; yes; not yet |
15 | tn_blog | Participation intervention (min. 1 Blog) | 0, 1 | no; yes |
16 | blogs | Number of blogs | range: 0-13 | |
17 | zufig1at2 | Frequency Reading blogs (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
18 | zufig1bt2 | Frequency Reading therapeutic feedback (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
19 | zufig1ct2 | Frequency Using work sheets (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
20 | zufig1dt2 | Frequency Listening to relaxation techniques (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
21 | zufig1et2 | Frequency Reading posts in the forum (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
22 | zufig1ft2 | Frequency Writing posts in the forum (IG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
23 | zufig2at2 | Helpful Writing blogs (IG) | 1, 5 | not at all; very |
24 | zufig2bt2 | Helpful Reading therapeutic feedback (IG) | 1, 5 | not at all; very |
25 | zufig2ct2 | Helpful Using work sheets (IG) | 1, 5 | not at all; very |
26 | zufig2dt2 | Helpful Listening to relaxation techniques (IG) | 1, 5 | not at all; very |
27 | zufig2et2 | Helpful Reading posts in the forum (IG) | 1, 5 | not at all; very |
28 | zufig2ft2 | Helpful Writing posts in the forum (IG) | 1, 5 | not at all; very |
29 | zufig2gt2 | Helpful Online aftercare altogether (IG) | 1, 5 | not at all; very |
30 | zufig3t2 | Helpful Alliance with online therapist (IG) | 1, 5 | not at all; very |
31 | zufkg1at2 | Frequency Stress management info (CG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
32 | zufkg1bt2 | Frequency Healthy diet info (CG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
33 | zufkg1ct2 | Frequency Sports/physical activity info (CG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
34 | zufkg1dt2 | Frequency Sleep info (CG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
35 | zufkg1et2 | Frequency Relaxation info (CG) | 1, 2, 3, 4 | daily; several times a week; seldom; never |
36 | zufkg2at2 | Helpful Stress management info (CG) | 1, 5 | not at all; very |
37 | zufkg2bt2 | Helpful Healthy diet info (CG) | 1, 5 | not at all; very |
38 | zufkg2ct2 | Helpful Sports/physical activity info (CG) | 1, 5 | not at all; very |
39 | zufkg2dt2 | Helpful Sleep info (CG) | 1, 5 | not at all; very |
40 | zufkg2et2 | Helpful Relaxation info (CG) | 1, 5 | not at all; very |
41 | zufkg2ft2 | Helpful Online aftercare altogether (CG) | 1, 5 | not at all; very |
42 | zufges_t2_dichotom | Dichotomised Helpfulness Online aftercare altogether | range: 1-2 | |
43 | spe_sum_t0 | Sumscore SPE T0 | range: 0-3 | |
44 | spe_sum_t1 | Sumscore SPE T1 | range: 0-3 | |
45 | spe_sum_t2 | Sumscore SPE T2 | range: 0-3 | |
46 | spe_sum_t3 | Sumscore SPE T3 | range: 0-3 | |
47 | phq_dept0 | Sumscore (0-27) depressive symptoms (PHQ-9) T0 | range: 0.0-27.0 | |
48 | phq_dept1 | Sumscore (0-27) depressive symptoms (PHQ-9) T1 | range: 0.0-27.0 | |
49 | phq_dept2 | Sumscore (0-27) depressive symptoms (PHQ-9) T2 | range: 0.0-27.0 | |
50 | phq_dept3 | Sumscore (0-27) depressive symptoms (PHQ-9) T3 | range: 0.0-27.0 | |
51 | phq_som_t0 | Sumscore (0-30) somatoform symptoms (PHQ-15) T0 | range: 0.0-26.8 | |
52 | phq_som_t1 | Sumscore (0-30) somatoform symptoms (PHQ-15) T1 | range: 0.0-26.0 | |
53 | phq_som_t2 | Sumscore (0-30) somatoform symptoms (PHQ-15) T2 | range: 0.0-27.7 | |
54 | phq_som_t3 | Sumscore (0-30) somatoform symptoms (PHQ-15) T3 | range: 0.0-26.5 | |
55 | phq_stress_t0 | Sumscore (0-20) PHQ-stress T0 | range: 0.0-19.0 | |
56 | phq_stress_t1 | Sumscore (0-20) PHQ-stress T1 | range: 0.0-20.0 | |
57 | phq_stress_t2 | Sumscore (0-20) PHQ-stress T2 | range: 0.0-20.0 | |
58 | phq_stress_t3 | Sumscore (0-20) PHQ-stress T3 | range: 0.0-20.0 | |
59 | gad7t0 | Sumscore (0-21) generalized anxiety (GAD-7) T0 | range: 0.0-21.0 | |
60 | gad7t1 | Sumscore (0-21) generalized anxiety (GAD-7) T1 | range: 0-21 | |
61 | gad7t2 | Sumscore (0-21) generalized anxiety (GAD-7) T2 | range: 0-21 | |
62 | gad7t3 | Sumscore (0-21) generalized anxiety (GAD-7) T3 | range: 0-21 | |
63 | ksk12t0 | SF-12 Physical Component Summary (T0) | range: 12.0-67.0 | |
64 | psk12t0 | SF-12 Mental Component Summary (T0) | range: 7.0-70.0 | |
65 | ksk12t1 | SF-12 Physical Component Summary (T1) | range: 17.0-72.0 | |
66 | psk12t1 | SF-12 Mental Component Summary (T1) | range: 10.0-78.0 | |
67 | ksk12t2 | SF-12 Physical Component Summary (T2) | range: 5.0-80.0 | |
68 | psk12t2 | SF-12 Mental Component Summary (T2) | range: 5.0-72.0 | |
69 | ksk12t3 | SF-12 Physical Component Summary (T3) | range: 10.0-74.0 | |
70 | psk12t3 | SF-12 Mental Component Summary (T3) | range: 9.0-79.0 | |
71 | av_n1_t0 | AVEM Subjective importance of work T0 (stanine values) | range: 1-9 | |
72 | av_n2_t0 | AVEM Work related ambition T0 (stanine values) | range: 1-9 | |
73 | av_n3_t0 | AVEM Willingness to work until exhausted T0 (stanine values) | range: 1-9 | |
74 | av_n4_t0 | AVEM Striving for perfection T0 (stanine values) | range: 1-9 | |
75 | av_n5_t0 | AVEM Distancing ability T0 (stanine values) | range: 2-8 | |
76 | av_n6_t0 | AVEM Tendency to resignation T0 (stanine values) | range: 1-9 | |
77 | av_n7_t0 | AVEM Proactive problem solving T0 (stanine values) | range: 1-9 | |
78 | av_n8_t0 | AVEM Inner calm and balance T0 (stanine values) | range: 1-9 | |
79 | av_n9_t0 | AVEM Experience of success at work T0 (stanine values) | range: 1-9 | |
80 | av_n10_t0 | AVEM Satisfaction with life T0 (stanine values) | range: 1-8 | |
81 | av_n11_t0 | AVEM Experience of social support T0 (stanine values) | range: 1-8 | |
82 | av_n1_t1 | AVEM Subjective importance of work T1 (stanine values) | range: 1-9 | |
83 | av_n2_t1 | AVEM Work related ambition T1 (stanine values) | range: 1-9 | |
84 | av_n3_t1 | AVEM Willingness to work until exhausted T1 (stanine values) | range: 1-9 | |
85 | av_n4_t1 | AVEM Striving for perfection T1 (stanine values) | range: 1-9 | |
86 | av_n5_t1 | AVEM Distancing ability T1 (stanine values) | range: 1-9 | |
87 | av_n6_t1 | AVEM Tendency to resignation T1 (stanine values) | range: 1-9 | |
88 | av_n7_t1 | AVEM Proactive problem solving T1 (stanine values) | range: 1-9 | |
89 | av_n8_t1 | AVEM Inner calm and balance T1 (stanine values) | range: 1-9 | |
90 | av_n9_t1 | AVEM Experience of success at work T1 (stanine values) | range: 1-9 | |
91 | av_n10_t1 | AVEM Satisfaction with life T1 (stanine values) | range: 1-9 | |
92 | av_n11_t1 | AVEM Experience of social support T1 (stanine values) | range: 1-8 | |
93 | av_n1_t2 | AVEM Subjective importance of work T2 (stanine values) | range: 1-9 | |
94 | av_n2_t2 | AVEM Work related ambition T2 (stanine values) | range: 1-9 | |
95 | av_n3_t2 | AVEM Willingness to work until exhausted T2 (stanine values) | range: 1-9 | |
96 | av_n4_t2 | AVEM Striving for perfection T2 (stanine values) | range: 1-9 | |
97 | av_n5_t2 | AVEM Distancing ability T2 (stanine values) | range: 1-9 | |
98 | av_n6_t2 | AVEM Tendency to resignation T2 (stanine values) | range: 1-9 | |
99 | av_n7_t2 | AVEM Proactive problem solving T2 (stanine values) | range: 1-9 | |
100 | av_n8_t2 | AVEM Inner calm and balance T2 (stanine values) | range: 1-9 | |
101 | av_n9_t2 | AVEM Experience of success at work T2 (stanine values) | range: 1-9 | |
102 | av_n10_t2 | AVEM Satisfaction with life T2 (stanine values) | range: 1-9 | |
103 | av_n11_t2 | AVEM Experience of social support T2 (stanine values) | range: 1-9 | |
104 | av_n1_t3 | AVEM Subjective importance of work T3 (stanine values) | range: 1-9 | |
105 | av_n2_t3 | AVEM Work related ambition T3 (stanine values) | range: 1-9 | |
106 | av_n3_t3 | AVEM Willingness to work until exhausted T3 (stanine values) | range: 1-9 | |
107 | av_n4_t3 | AVEM Striving for perfection T3 (stanine values) | range: 1-9 | |
108 | av_n5_t3 | AVEM Distancing ability T3 (stanine values) | range: 1-9 | |
109 | av_n6_t3 | AVEM Tendency to resignation T3 (stanine values) | range: 1-9 | |
110 | av_n7_t3 | AVEM Proactive problem solving T3 (stanine values) | range: 1-9 | |
111 | av_n8_t3 | AVEM Inner calm and balance T3 (stanine values) | range: 1-9 | |
112 | av_n9_t3 | AVEM Experience of success at work T3 (stanine values) | range: 1-9 | |
113 | av_n10_t3 | AVEM Satisfaction with life T3 (stanine values) | range: 1-8 | |
114 | av_n11_t3 | AVEM Experience of social support T3 (stanine values) | range: 1-9 | |
115 | sibar_risk | SIBAR Risk Score (0-19, Cutoff 8) T0 | range: 0.0-16.0 | |
116 | sibar_risk_di | SIBAR Risk Score > 8 T0 | 0, 1 | no; yes |
117 | sibar_risk_t1 | SIBAR Risk Score (0-19, Cutoff 8) T1 | range: 0.0-16.0 | |
118 | sibar_risk_di_t1 | SIBAR Risk Score > 8 T1 | 0, 1 | no; yes |
119 | sibar_riskt2 | SIBAR Risk Score (0-19, Cutoff 8) T2 | range: 0.0-16.0 | |
120 | sibar_riskdi_t2 | SIBAR Risk Score > 8 T2 | 0, 1 | no; yes |
121 | sibar_riskt3 | SIBAR Risk Score (0-19, Cutoff 8) T3 | range: 0.0-17.0 | |
122 | sibar_riskdi_t3 | SIBAR Risk Score > 8 T3 | 0, 1 | no; yes |
data_init <- data %>%
rename(cond = gruppenart) %>%
dplyr::select(-arbeitsunf,
-ends_with("_t1"), -ends_with("_t2"),
-ends_with("t1"), -ends_with("t2"),
-ends_with("1"),-ends_with("2"),
-ends_with("_t3"),
-ends_with("t3"),
-ends_with("3"),
-login, -tn_blog,
-blogs, -zufges_t2_dichotom) %>%
glimpse()
Rows: 6,952
Columns: 27
$ imputation <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cond <dbl+lbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, …
$ sibar1bt0 <dbl> 48, 47, 48, 47, 40, 31, 53, 43, 45, 54, 48,…
$ sibar1at0 <dbl+lbl> 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ indikation <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 2, 1, …
$ schulab <dbl+lbl> 1, 2, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 2, …
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ spe_sum_t0 <dbl> 3, 3, 3, 2, 3, 0, 1, 1, 0, 0, 2, 1, 2, 3, 0…
$ phq_dept0 <dbl> 20.00, 21.00, 20.25, 25.00, 23.00, 21.00, 1…
$ phq_som_t0 <dbl> 23.571429, 18.000000, 22.500000, 26.785714,…
$ phq_stress_t0 <dbl> 11.00000, 16.00000, 15.55556, 17.00000, 15.…
$ gad7t0 <dbl> 17, 15, 19, 21, 15, 13, 10, 13, 9, 12, 5, 6…
$ ksk12t0 <dbl> 32.39855, 31.59306, 24.46546, 50.51283, 39.…
$ psk12t0 <dbl> 23.66914, 29.21420, 30.79812, 13.90037, 23.…
$ av_n1_t0 <dbl> 4, 3, 5, 5, 2, 3, 4, 8, 4, 5, 5, 4, 5, 2, 2…
$ av_n2_t0 <dbl> 5, 5, 1, 8, 4, 2, 4, 9, 4, 7, 4, 5, 6, 3, 3…
$ av_n3_t0 <dbl> 8, 5, 9, 9, 8, 8, 7, 9, 4, 5, 3, 8, 6, 6, 4…
$ av_n4_t0 <dbl> 8, 6, 7, 9, 9, 3, 6, 9, 3, 8, 2, 3, 6, 5, 4…
$ av_n5_t0 <dbl> 3, 5, 3, 3, 3, 6, 5, 3, 5, 4, 4, 4, 4, 6, 4…
$ av_n6_t0 <dbl> 8, 8, 8, 8, 9, 6, 4, 7, 3, 4, 3, 2, 9, 7, 7…
$ av_n7_t0 <dbl> 7, 4, 2, 6, 1, 1, 5, 3, 5, 6, 1, 8, 1, 3, 1…
$ av_n8_t0 <dbl> 3, 3, 3, 1, 1, 3, 6, 4, 7, 4, 6, 5, 3, 1, 3…
$ av_n9_t0 <dbl> 5, 2, 3, 6, 1, 3, 3, 7, 5, 9, 3, 5, 1, 8, 3…
$ av_n10_t0 <dbl> 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 4, 5, 1, 4, 2…
$ av_n11_t0 <dbl> 2, 2, 1, 3, 2, 2, 3, 4, 6, 6, 6, 4, 3, 2, 3…
$ sibar_risk <dbl> 10.0, 8.0, 13.0, 5.0, 11.0, 5.0, 6.0, 2.5, …
$ sibar_risk_di <dbl+lbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
# Pull out the outcome: the SPE sum score at T3
outcome <- data %>%
dplyr::select(spe_sum_t3) # %>%
# glimpse()
data_updt <- data_init %>%
bind_cols(outcome) %>%
mutate(across(
c(sibar1at0:schulab, imputation,cond,sibar_risk_di),
as.factor
)) %>%
filter(imputation == "0" & !is.na(spe_sum_t3)) %>%
mutate(id = row_number()) %>%
relocate(id, everything()) %>%
glimpse()
Rows: 445
Columns: 29
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ imputation <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ cond <fct> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
$ sibar1bt0 <dbl> 47, 47, 40, 31, 53, 54, 48, 49, 40, 31, 35,…
$ sibar1at0 <fct> 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2…
$ indikation <fct> 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1…
$ schulab <fct> 2, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 1, 2…
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, 8, 7, 7, 7, 7, …
$ spe_sum_t0 <dbl> 3, 2, 3, 0, 1, 0, 2, 1, 3, 0, 1, 2, 1, 1, 1…
$ phq_dept0 <dbl> 21, 25, 23, 21, 11, 11, 10, 3, 6, 15, 13, 1…
$ phq_som_t0 <dbl> 18.000000, 26.785714, 19.000000, 13.000000,…
$ phq_stress_t0 <dbl> 16.000000, 17.000000, 15.000000, 15.000000,…
$ gad7t0 <dbl> 15, 21, 15, 13, 10, 12, 5, 6, 14, 11, 11, 2…
$ ksk12t0 <dbl> 31.59306, 50.51283, 39.60604, 52.61199, 36.…
$ psk12t0 <dbl> 29.21420, 13.90037, 23.51996, 15.58050, 27.…
$ av_n1_t0 <dbl> 3, 5, 2, 3, 4, 5, 5, 4, 2, 2, 3, 5, 4, 5, 4…
$ av_n2_t0 <dbl> 5, 8, 4, 2, 4, 7, 4, 5, 3, 3, 5, 4, 2, 5, 2…
$ av_n3_t0 <dbl> 5, 9, 8, 8, 7, 5, 3, 8, 6, 4, 8, 5, 8, 6, 3…
$ av_n4_t0 <dbl> 6, 9, 9, 3, 6, 8, 2, 3, 5, 4, 6, 5, 4, 8, 3…
$ av_n5_t0 <dbl> 5, 3, 3, 6, 5, 4, 4, 4, 6, 4, 3, 6, 4, 4, 7…
$ av_n6_t0 <dbl> 8, 8, 9, 6, 4, 4, 3, 2, 7, 7, 9, 6, 5, 9, 4…
$ av_n7_t0 <dbl> 4, 6, 1, 1, 5, 6, 1, 8, 3, 1, 1, 4, 1, 4, 4…
$ av_n8_t0 <dbl> 3, 1, 1, 3, 6, 4, 6, 5, 1, 3, 1, 6, 1, 4, 4…
$ av_n9_t0 <dbl> 2, 6, 1, 3, 3, 9, 3, 5, 8, 3, 1, 5, 1, 3, 3…
$ av_n10_t0 <dbl> 1, 2, 1, 1, 2, 1, 4, 5, 4, 2, 1, 4, 1, 2, 4…
$ av_n11_t0 <dbl> 2, 3, 2, 2, 3, 6, 6, 4, 2, 3, 2, 5, 1, 4, 4…
$ sibar_risk <dbl> 8.0, 5.0, 11.0, 5.0, 6.0, 1.0, 3.0, 2.5, 9.…
$ sibar_risk_di <fct> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
$ spe_sum_t3 <dbl> 3, 0, 3, 0, 0, 1, 0, 2, 2, 0, 0, 2, 2, 2, 2…
For reasons I’ll go into more in a subsequent post, I want to create predictions separately for folks in the intervention group and folks in the control group. The short version of why: I want to be able to predict not just how well I would expect people to respond to the intervention they actually received, but how well the model thinks they would have done if they had received the other intervention. Building separate models by treatment condition isn’t the only way to do this, but it’s the cleanest approach, and the one least likely to lead to data leakage, that I can think of right now.5
On top of that, I’ll need to test whether those hypothetical predictions actually generalize to held-out data. So we need to hold out at least some folks from both groups to ultimately test whether our predictions about the “optimal intervention” are any good. If I could turn back time, I would just use the initial_split() function from the rsample package within tidymodels to create the test sets for both the intervention and control groups, mainly because I’m not 100% sure my current approach is reproducible and I’m not stratifying by the outcome.6 A sketch of what that could have looked like is below.
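For reference, here is a minimal sketch of that alternative using rsample’s initial_split(), stratified on the outcome. This is not what I actually ran on stream; the proportion and seed are purely illustrative.
# Hypothetical alternative to the slice_sample()/anti_join() approach below:
# reproducible, stratified train/test splits within each condition.
set.seed(33)
intv_split <- data_updt %>%
  filter(cond == "2") %>%
  initial_split(prop = 0.8, strata = spe_sum_t3)
data_intv <- training(intv_split)
test_intv <- testing(intv_split)

ctrl_split <- data_updt %>%
  filter(cond == "1") %>%
  initial_split(prop = 0.8, strata = spe_sum_t3)
data_ctrl <- training(ctrl_split)
test_ctrl <- testing(ctrl_split)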
# Breaking out the predictions into the separate groups
data_intv_init <- data_updt %>%
filter(cond == "2") %>%
as.data.frame() # %>%
#glimpse()
data_ctrl_init <- data_updt %>%
filter(cond == "1") %>%
as.data.frame() # %>%
#glimpse()
# Pulling out folks for test set later
test_intv <- data_intv_init %>%
slice_sample(n = 40)
data_intv <- data_intv_init %>%
anti_join(test_intv, by = "id") #%>%
#glimpse()
test_ctrl <- data_ctrl_init %>%
slice_sample(n = 40)
data_ctrl <- data_ctrl_init %>%
anti_join(test_ctrl, by = "id") #%>%
#glimpse()
Then I used the recipes package to preprocess the data! I wanted to remove imputation (which didn’t vary and wasn’t relevant), designate id as an id variable rather than a predictor, collapse infrequent factor levels, one-hot encode the categorical variables7, exclude near-zero-variance predictors, normalize all predictors, and impute any missing data in the predictors using k-nearest neighbors.
I then set up a tidymodels workflow using an out-of-the-box xgboost model8 and made my predictions for the intervention group first.
intv_rec <-
recipe(spe_sum_t3 ~ ., data = data_intv) %>%
step_rm(imputation) %>%
update_role(id, new_role = "id") %>%
step_other(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_impute_knn(all_predictors())
# Testing the recipe to make sure it produces data
intv_rec %>%
prep(verbose = TRUE) %>%
bake(new_data = NULL) # %>%
oper 1 step rm [training]
oper 2 step other [training]
oper 3 step dummy [training]
oper 4 step nzv [training]
oper 5 step normalize [training]
oper 6 step impute knn [training]
The retained training set is ~ 0.06 Mb in memory.
# A tibble: 173 x 33
id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 -0.326 0.805 0.930 2.41 2.79
2 3 -1.38 1.14 1.94 2.08 1.33
3 6 0.724 -0.866 -1.08 0.0881 -0.725
4 7 -0.176 -1.03 0.930 -0.0777 0.210
5 8 -0.0260 3.15 -0.0756 -1.24 -0.538
6 9 -1.38 2.31 1.94 -0.741 -0.164
7 10 -2.72 2.31 -1.08 0.752 -0.538
8 12 -2.87 2.31 0.930 0.420 0.383
9 13 -0.0260 2.31 -0.0756 2.74 1.14
10 14 -1.83 2.31 -0.0756 1.25 0.383
# … with 163 more rows, and 27 more variables: phq_stress_t0 <dbl>,
# gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
# av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
# av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
# av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
# spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
# indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
# schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
# sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>
#skim()
# Create a model
xg_mod <- boost_tree() %>%
set_engine("xgboost") %>%
set_mode("regression")
# Creating a workflow
xg_wf <-
workflow() %>%
add_recipe(intv_rec) %>%
add_model(xg_mod)
# 5-fold cross-validation, repeated 5 times, stratified by the outcome
intv_folds <- vfold_cv(data_intv, v = 5, repeats = 5, strata = spe_sum_t3)
keep_pred <- control_resamples(save_pred = TRUE)
library(tictoc)
tic()
set.seed(33)
xg_rs <-
xg_wf %>%
fit_resamples(intv_folds, control = keep_pred)
toc()
10.855 sec elapsed
Overall we get predictions that aren’t terribly far off from the predictions of changes in anxiety and depression from Nick’s original paper (which used an ensemble model stacked on a bunch of base learners). We can’t make an apples-to-apples comparison since he and his co-author were predicting different outcomes, but at least I’m not wildly off from where they were.
xg_rs %>%
collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.967 25 0.0177 Preprocessor1_Model1
2 rsq standard 0.193 25 0.0202 Preprocessor1_Model1
r2_intv <- xg_rs %>%
collect_metrics(summarize = TRUE) %>%
filter(str_detect(.metric, "rsq") == TRUE) %>%
dplyr::select(mean) %>%
deframe()
sqrt(r2_intv)
[1] 0.4387755
intv_preds <- xg_rs %>%
collect_predictions(summarize = TRUE)
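As a quick sanity check (not something from the original stream), the square root of the resampled R² should roughly match the correlation between the averaged out-of-fold predictions and the observed outcome; the two won’t be identical because rsq is averaged across resamples rather than computed on the pooled predictions.
# Correlation between averaged out-of-fold predictions and the observed outcome;
# .pred and spe_sum_t3 are the columns returned by collect_predictions().
intv_preds %>%
  summarise(pred_obs_cor = cor(.pred, spe_sum_t3))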
# In order to get predictions in the placebo group I have to fit a workflow first
# fit_workflow <- fit(xg_wf, data_ctrl)
I then do the same preprocessing and model fitting for the control group!
ctrl_rec <-
recipe(spe_sum_t3 ~ ., data = data_ctrl) %>%
step_rm(imputation) %>%
update_role(id, new_role = "id") %>%
step_other(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_impute_knn(all_predictors())
# Testing the recipe to make sure it produces data
ctrl_rec %>%
prep(verbose = TRUE) %>%
bake(new_data = NULL) #%>%
oper 1 step rm [training]
oper 2 step other [training]
oper 3 step dummy [training]
oper 4 step nzv [training]
oper 5 step normalize [training]
oper 6 step impute knn [training]
The retained training set is ~ 0.06 Mb in memory.
# A tibble: 192 x 33
id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 -2.60 1.16 -1.26 1.88 0.313
2 11 -2.03 2.64 -0.199 0.481 -0.672
3 16 -0.191 1.72 0.863 0.830 1.10
4 19 -0.0487 1.72 -0.199 0.133 -0.0813
5 20 0.802 1.72 -0.199 0.656 1.10
6 21 0.0931 1.72 0.863 0.656 -0.869
7 22 -1.61 1.72 -0.199 1.70 0.904
8 23 -1.61 1.72 -0.199 -0.565 -0.672
9 24 0.518 1.72 1.93 0.656 0.313
10 25 -1.75 1.72 0.863 0.656 -0.278
# … with 182 more rows, and 27 more variables: phq_stress_t0 <dbl>,
# gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
# av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
# av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
# av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
# spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
# indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
# schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
# sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>
# skim()
# Create a model
xg_mod <- boost_tree() %>%
set_engine("xgboost") %>%
set_mode("regression")
# Creating a workflow
xg_wf_ctrl <-
workflow() %>%
add_recipe(ctrl_rec) %>%
add_model(xg_mod)
# Same resampling scheme for the control group
ctrl_folds <- vfold_cv(data_ctrl, v = 5, repeats = 5, strata = spe_sum_t3)
keep_pred <- control_resamples(save_pred = TRUE)
library(tictoc)
tic()
set.seed(33)
xg_rs_ctrl <-
xg_wf_ctrl %>%
fit_resamples(ctrl_folds, control = keep_pred)
toc()
10.639 sec elapsed
Overall the predictions for the control group aren’t quite as good, but we’re still doing ok relative to the original predictions on other kinds of outcomes from Nick’s paper.
xg_rs_ctrl %>%
collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.08 25 0.0203 Preprocessor1_Model1
2 rsq standard 0.164 25 0.0175 Preprocessor1_Model1
r2_ctrl <- xg_rs_ctrl %>%
collect_metrics(summarize = TRUE) %>%
filter(str_detect(.metric, "rsq") == TRUE) %>%
dplyr::select(mean) %>%
deframe()
sqrt(r2_ctrl)
[1] 0.4047978
ctrl_preds <- xg_rs_ctrl %>%
collect_predictions(summarize = TRUE)
In a non-Dash post, I’ll write a part 2 of this dataset/analysis where I finish out the modeling process.
So now, we need to use these models to predict how well we think folks would do if they had ended up in the opposite intervention (since we already have predictions of how they would do under the intervention they actually got). Once we have those estimates, we’ll be able to see whether the folks in the hold-out sample whom we predict would benefit more from one intervention than the other actually do! A rough sketch of that step is below.
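I haven’t run this part yet, so treat it as a sketch rather than the final part-2 code. It assumes the fitted workflows and column names below, and it assumes (for illustration only) that a lower spe_sum_t3 is the better outcome; flip the comparison if the scale runs the other way.
# Fit each condition's workflow on its own training data...
fit_intv <- fit(xg_wf, data_intv)
fit_ctrl <- fit(xg_wf_ctrl, data_ctrl)

# ...then predict every held-out person under both conditions
test_all <- bind_rows(test_intv, test_ctrl)

cf_preds <- test_all %>%
  bind_cols(
    predict(fit_intv, new_data = test_all) %>% rename(pred_intv = .pred),
    predict(fit_ctrl, new_data = test_all) %>% rename(pred_ctrl = .pred)
  ) %>%
  # assumption: lower predicted spe_sum_t3 = better expected outcome
  mutate(predicted_optimal = if_else(pred_intv < pred_ctrl, "IG", "CG"))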
Yikes↩︎
double yikes↩︎
Or at least make a good start; I don’t think there’s any way I’ll finish in one 2-hour chunk↩︎
Or at least respond just as well↩︎
I’m looking forward to continuing to learn more about this area and might change my mind!↩︎
With the 2 hour time limit I didn’t have time to re-write the code!↩︎
like dummy coding, but everything gets its own set of codes plus it doesn’t sound vaguely ableist↩︎
no time for tuning in Data Distillery Dash!↩︎