Data Distillery Dash Episode 02

The start of a prediction pipeline for psychological interventions.

Michael Mullarkey, PhD (https://mcmullarkey.github.io)
05-13-2021

What Is Data Distillery Dash?

I take a research topic at random[1] or a dataset I’ve never seen[2], wrangle the data, create a predictive model or two, and write up a blog post: all within 2 hours live on Twitch! All the words and code in this post were written during an ~2 hour livestream. If you want to come join the party, I plan on doing this on Thursdays from 3:30-5:30 EST here.

What Data Are We Using Today?

We’re going to predict subjective feelings of being ready to go back to work at the end of an online randomized clinical trial![3] I learned about these data from a paper by Dr. Nick Jacobson, and the data come from this paper.

This is mostly so I can practice setting up a predictive modeling pipeline that allows us to predict who would benefit more from one condition or another in a clinical trial. Even in situations like this one, where the control condition seems minimal, it can actually have some therapeutic benefit. In fact, sometimes people seem to respond better to what we’ve described as psychological placebos than to what we’ve described as active treatments![4]

Let’s start with what we need to do for any predictive modeling pipeline: data cleaning/munging! This is all the more important here because the data were stored in SPSS’s .sav format. That means we need read_sav from the haven package to read the data in, and to get a sense of what the variables actually are we can use the view_df function from the sjPlot package.

I also decided not to use any of the variables that weren’t collected at baseline in this particular modeling pipeline. Why? Right now I’m most interested in better matching people to treatments before they start. I would want to use the data over time if I were more interested in trying to tailor the treatment as it was being delivered. Even though it might not look like much, cleaning/munging the data (especially since I’d never seen it before!) took a lot of the time during this stream.

library(haven)
library(tidyverse)
library(tidymodels)
library(janitor)

data <- read_sav("online_rct.sav") %>% 
  clean_names()

# glimpse(data)

library(sjPlot)

view_df(data)
Data frame: data

  ID  Name                Label                                                          Values
   1  imputation          Imputation code                                                0 = original data
   2  gruppenart          Study group                                                    1 = CG, 2 = IG
   3  sibar1bt0           Age                                                            range: 25-59
   4  sibar1at0           Sex                                                            1 = female, 2 = male
   5  indikation          Medical indication                                             1 = psychosomatic, 2 = cardiology, 3 = orthopedic
   6  schulab             Education                                                      1 = 9 years, 2 = 10 years, 3 = 13 years
   7  behandlungsdauer    Time between intake and discharge (weeks)                      range: 2-8
   8  sibar5at1           Duration of sick leaves in the last 12 months (weeks)          range: 0.0-58.0
   9  arbeitsunf          Status of work ability at T1 (physicians’ rating)              0 = treatment not properly completed / deceased, 1 = able to work, 3 = unable to work, 4 = children’s rehab, 5 = housewife / househusband, 9 = assessment not required (old-age pensioner, relative)
  10  leistft1            Work capacity T1                                               1 = fully capable, 2 = limited to not at all capable
  11  rentet1             Application for pension T1                                     1 = applied or plans to apply, 2 = not applied and no plans
  12  sibar2t1            Do you consider your work ability permanently threatened by your health status?   1 = yes, 2 = no
  13  sibar5t1            When you consider your current health and work ability: do you believe that you will be working until retirement age?   1 = yes, 2 = no
  14  login               Login after discharge                                          0 = no, 1 = yes, 2 = not yet
  15  tn_blog             Participation in intervention (min. 1 blog)                    0 = no, 1 = yes
  16  blogs               Number of blogs                                                range: 0-13
  17  zufig1at2           Frequency: reading blogs (IG)                                  1 = daily, 2 = several times a week, 3 = seldom, 4 = never
  18  zufig1bt2           Frequency: reading therapeutic feedback (IG)                   same 1-4 frequency scale
  19  zufig1ct2           Frequency: using work sheets (IG)                              same 1-4 frequency scale
  20  zufig1dt2           Frequency: listening to relaxation techniques (IG)             same 1-4 frequency scale
  21  zufig1et2           Frequency: reading posts in the forum (IG)                     same 1-4 frequency scale
  22  zufig1ft2           Frequency: writing posts in the forum (IG)                     same 1-4 frequency scale
  23  zufig2at2           Helpful: writing blogs (IG)                                    1 = not at all ... 5 = very
  24  zufig2bt2           Helpful: reading therapeutic feedback (IG)                     same 1-5 helpfulness scale
  25  zufig2ct2           Helpful: using work sheets (IG)                                same 1-5 helpfulness scale
  26  zufig2dt2           Helpful: listening to relaxation techniques (IG)               same 1-5 helpfulness scale
  27  zufig2et2           Helpful: reading posts in the forum (IG)                       same 1-5 helpfulness scale
  28  zufig2ft2           Helpful: writing posts in the forum (IG)                       same 1-5 helpfulness scale
  29  zufig2gt2           Helpful: online aftercare altogether (IG)                      same 1-5 helpfulness scale
  30  zufig3t2            Helpful: alliance with online therapist (IG)                   same 1-5 helpfulness scale
  31  zufkg1at2           Frequency: stress management info (CG)                         same 1-4 frequency scale
  32  zufkg1bt2           Frequency: healthy diet info (CG)                              same 1-4 frequency scale
  33  zufkg1ct2           Frequency: sports / physical activity info (CG)                same 1-4 frequency scale
  34  zufkg1dt2           Frequency: sleep info (CG)                                     same 1-4 frequency scale
  35  zufkg1et2           Frequency: relaxation info (CG)                                same 1-4 frequency scale
  36  zufkg2at2           Helpful: stress management info (CG)                           same 1-5 helpfulness scale
  37  zufkg2bt2           Helpful: healthy diet info (CG)                                same 1-5 helpfulness scale
  38  zufkg2ct2           Helpful: sports / physical activity info (CG)                  same 1-5 helpfulness scale
  39  zufkg2dt2           Helpful: sleep info (CG)                                       same 1-5 helpfulness scale
  40  zufkg2et2           Helpful: relaxation info (CG)                                  same 1-5 helpfulness scale
  41  zufkg2ft2           Helpful: online aftercare altogether (CG)                      same 1-5 helpfulness scale
  42  zufges_t2_dichotom  Dichotomized helpfulness of online aftercare altogether        range: 1-2
  43  spe_sum_t0          Sumscore SPE T0                                                range: 0-3
  44  spe_sum_t1          Sumscore SPE T1                                                range: 0-3
  45  spe_sum_t2          Sumscore SPE T2                                                range: 0-3
  46  spe_sum_t3          Sumscore SPE T3                                                range: 0-3
  47  phq_dept0           Sumscore (0-27) depressive symptoms (PHQ-9) T0                 range: 0.0-27.0
  48  phq_dept1           Sumscore (0-27) depressive symptoms (PHQ-9) T1                 range: 0.0-27.0
  49  phq_dept2           Sumscore (0-27) depressive symptoms (PHQ-9) T2                 range: 0.0-27.0
  50  phq_dept3           Sumscore (0-27) depressive symptoms (PHQ-9) T3                 range: 0.0-27.0
  51  phq_som_t0          Sumscore (0-30) somatoform symptoms (PHQ-15) T0                range: 0.0-26.8
  52  phq_som_t1          Sumscore (0-30) somatoform symptoms (PHQ-15) T1                range: 0.0-26.0
  53  phq_som_t2          Sumscore (0-30) somatoform symptoms (PHQ-15) T2                range: 0.0-27.7
  54  phq_som_t3          Sumscore (0-30) somatoform symptoms (PHQ-15) T3                range: 0.0-26.5
  55  phq_stress_t0       Sumscore (0-20) PHQ-stress T0                                  range: 0.0-19.0
  56  phq_stress_t1       Sumscore (0-20) PHQ-stress T1                                  range: 0.0-20.0
  57  phq_stress_t2       Sumscore (0-20) PHQ-stress T2                                  range: 0.0-20.0
  58  phq_stress_t3       Sumscore (0-20) PHQ-stress T3                                  range: 0.0-20.0
  59  gad7t0              Sumscore (0-21) generalized anxiety (GAD-7) T0                 range: 0.0-21.0
  60  gad7t1              Sumscore (0-21) generalized anxiety (GAD-7) T1                 range: 0-21
  61  gad7t2              Sumscore (0-21) generalized anxiety (GAD-7) T2                 range: 0-21
  62  gad7t3              Sumscore (0-21) generalized anxiety (GAD-7) T3                 range: 0-21
  63  ksk12t0             SF-12 Physical Component Summary (T0)                          range: 12.0-67.0
  64  psk12t0             SF-12 Mental Component Summary (T0)                            range: 7.0-70.0
  65  ksk12t1             SF-12 Physical Component Summary (T1)                          range: 17.0-72.0
  66  psk12t1             SF-12 Mental Component Summary (T1)                            range: 10.0-78.0
  67  ksk12t2             SF-12 Physical Component Summary (T2)                          range: 5.0-80.0
  68  psk12t2             SF-12 Mental Component Summary (T2)                            range: 5.0-72.0
  69  ksk12t3             SF-12 Physical Component Summary (T3)                          range: 10.0-74.0
  70  psk12t3             SF-12 Mental Component Summary (T3)                            range: 9.0-79.0
  71  av_n1_t0            AVEM Subjective importance of work T0 (stanine values)         range: 1-9
  72  av_n2_t0            AVEM Work-related ambition T0 (stanine values)                 range: 1-9
  73  av_n3_t0            AVEM Willingness to work until exhausted T0 (stanine values)   range: 1-9
  74  av_n4_t0            AVEM Striving for perfection T0 (stanine values)               range: 1-9
  75  av_n5_t0            AVEM Distancing ability T0 (stanine values)                    range: 2-8
  76  av_n6_t0            AVEM Tendency to resignation T0 (stanine values)               range: 1-9
  77  av_n7_t0            AVEM Proactive problem solving T0 (stanine values)             range: 1-9
  78  av_n8_t0            AVEM Inner calm and balance T0 (stanine values)                range: 1-9
  79  av_n9_t0            AVEM Experience of success at work T0 (stanine values)         range: 1-9
  80  av_n10_t0           AVEM Satisfaction with life T0 (stanine values)                range: 1-8
  81  av_n11_t0           AVEM Experience of social support T0 (stanine values)          range: 1-8
  82  av_n1_t1            AVEM Subjective importance of work T1 (stanine values)         range: 1-9
  83  av_n2_t1            AVEM Work-related ambition T1 (stanine values)                 range: 1-9
  84  av_n3_t1            AVEM Willingness to work until exhausted T1 (stanine values)   range: 1-9
  85  av_n4_t1            AVEM Striving for perfection T1 (stanine values)               range: 1-9
  86  av_n5_t1            AVEM Distancing ability T1 (stanine values)                    range: 1-9
  87  av_n6_t1            AVEM Tendency to resignation T1 (stanine values)               range: 1-9
  88  av_n7_t1            AVEM Proactive problem solving T1 (stanine values)             range: 1-9
  89  av_n8_t1            AVEM Inner calm and balance T1 (stanine values)                range: 1-9
  90  av_n9_t1            AVEM Experience of success at work T1 (stanine values)         range: 1-9
  91  av_n10_t1           AVEM Satisfaction with life T1 (stanine values)                range: 1-9
  92  av_n11_t1           AVEM Experience of social support T1 (stanine values)          range: 1-8
  93  av_n1_t2            AVEM Subjective importance of work T2 (stanine values)         range: 1-9
  94  av_n2_t2            AVEM Work-related ambition T2 (stanine values)                 range: 1-9
  95  av_n3_t2            AVEM Willingness to work until exhausted T2 (stanine values)   range: 1-9
  96  av_n4_t2            AVEM Striving for perfection T2 (stanine values)               range: 1-9
  97  av_n5_t2            AVEM Distancing ability T2 (stanine values)                    range: 1-9
  98  av_n6_t2            AVEM Tendency to resignation T2 (stanine values)               range: 1-9
  99  av_n7_t2            AVEM Proactive problem solving T2 (stanine values)             range: 1-9
 100  av_n8_t2            AVEM Inner calm and balance T2 (stanine values)                range: 1-9
 101  av_n9_t2            AVEM Experience of success at work T2 (stanine values)         range: 1-9
 102  av_n10_t2           AVEM Satisfaction with life T2 (stanine values)                range: 1-9
 103  av_n11_t2           AVEM Experience of social support T2 (stanine values)          range: 1-9
 104  av_n1_t3            AVEM Subjective importance of work T3 (stanine values)         range: 1-9
 105  av_n2_t3            AVEM Work-related ambition T3 (stanine values)                 range: 1-9
 106  av_n3_t3            AVEM Willingness to work until exhausted T3 (stanine values)   range: 1-9
 107  av_n4_t3            AVEM Striving for perfection T3 (stanine values)               range: 1-9
 108  av_n5_t3            AVEM Distancing ability T3 (stanine values)                    range: 1-9
 109  av_n6_t3            AVEM Tendency to resignation T3 (stanine values)               range: 1-9
 110  av_n7_t3            AVEM Proactive problem solving T3 (stanine values)             range: 1-9
 111  av_n8_t3            AVEM Inner calm and balance T3 (stanine values)                range: 1-9
 112  av_n9_t3            AVEM Experience of success at work T3 (stanine values)         range: 1-9
 113  av_n10_t3           AVEM Satisfaction with life T3 (stanine values)                range: 1-8
 114  av_n11_t3           AVEM Experience of social support T3 (stanine values)          range: 1-9
 115  sibar_risk          SIBAR risk score (0-19, cutoff 8) T0                           range: 0.0-16.0
 116  sibar_risk_di       SIBAR risk score > 8, T0                                       0 = no, 1 = yes
 117  sibar_risk_t1       SIBAR risk score (0-19, cutoff 8) T1                           range: 0.0-16.0
 118  sibar_risk_di_t1    SIBAR risk score > 8, T1                                       0 = no, 1 = yes
 119  sibar_riskt2        SIBAR risk score (0-19, cutoff 8) T2                           range: 0.0-16.0
 120  sibar_riskdi_t2     SIBAR risk score > 8, T2                                       0 = no, 1 = yes
 121  sibar_riskt3        SIBAR risk score (0-19, cutoff 8) T3                           range: 0.0-17.0
 122  sibar_riskdi_t3     SIBAR risk score > 8, T3                                       0 = no, 1 = yes
data_init <- data %>%
  rename(cond = gruppenart) %>% 
  dplyr::select(-arbeitsunf,
                -ends_with("_t1"), -ends_with("_t2"),
                -ends_with("t1"), -ends_with("t2"),
                -ends_with("1"),-ends_with("2"),
                -ends_with("_t3"),
                -ends_with("t3"),
                -ends_with("3"),
                -login, -tn_blog,
                -blogs, -zufges_t2_dichotom) %>% 
  glimpse()
Rows: 6,952
Columns: 27
$ imputation       <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cond             <dbl+lbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, …
$ sibar1bt0        <dbl> 48, 47, 48, 47, 40, 31, 53, 43, 45, 54, 48,…
$ sibar1at0        <dbl+lbl> 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ indikation       <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 2, 1, …
$ schulab          <dbl+lbl> 1, 2, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 2, …
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ spe_sum_t0       <dbl> 3, 3, 3, 2, 3, 0, 1, 1, 0, 0, 2, 1, 2, 3, 0…
$ phq_dept0        <dbl> 20.00, 21.00, 20.25, 25.00, 23.00, 21.00, 1…
$ phq_som_t0       <dbl> 23.571429, 18.000000, 22.500000, 26.785714,…
$ phq_stress_t0    <dbl> 11.00000, 16.00000, 15.55556, 17.00000, 15.…
$ gad7t0           <dbl> 17, 15, 19, 21, 15, 13, 10, 13, 9, 12, 5, 6…
$ ksk12t0          <dbl> 32.39855, 31.59306, 24.46546, 50.51283, 39.…
$ psk12t0          <dbl> 23.66914, 29.21420, 30.79812, 13.90037, 23.…
$ av_n1_t0         <dbl> 4, 3, 5, 5, 2, 3, 4, 8, 4, 5, 5, 4, 5, 2, 2…
$ av_n2_t0         <dbl> 5, 5, 1, 8, 4, 2, 4, 9, 4, 7, 4, 5, 6, 3, 3…
$ av_n3_t0         <dbl> 8, 5, 9, 9, 8, 8, 7, 9, 4, 5, 3, 8, 6, 6, 4…
$ av_n4_t0         <dbl> 8, 6, 7, 9, 9, 3, 6, 9, 3, 8, 2, 3, 6, 5, 4…
$ av_n5_t0         <dbl> 3, 5, 3, 3, 3, 6, 5, 3, 5, 4, 4, 4, 4, 6, 4…
$ av_n6_t0         <dbl> 8, 8, 8, 8, 9, 6, 4, 7, 3, 4, 3, 2, 9, 7, 7…
$ av_n7_t0         <dbl> 7, 4, 2, 6, 1, 1, 5, 3, 5, 6, 1, 8, 1, 3, 1…
$ av_n8_t0         <dbl> 3, 3, 3, 1, 1, 3, 6, 4, 7, 4, 6, 5, 3, 1, 3…
$ av_n9_t0         <dbl> 5, 2, 3, 6, 1, 3, 3, 7, 5, 9, 3, 5, 1, 8, 3…
$ av_n10_t0        <dbl> 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 4, 5, 1, 4, 2…
$ av_n11_t0        <dbl> 2, 2, 1, 3, 2, 2, 3, 4, 6, 6, 6, 4, 3, 2, 3…
$ sibar_risk       <dbl> 10.0, 8.0, 13.0, 5.0, 11.0, 5.0, 6.0, 2.5, …
$ sibar_risk_di    <dbl+lbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
# Pull out the outcome: the SPE sum score at T3, the final follow-up

outcome <- data %>% 
  dplyr::select(spe_sum_t3) # %>% 
  # glimpse()

data_updt <- data_init %>% 
  bind_cols(outcome) %>% 
   mutate(across(
    c(sibar1at0:schulab, imputation,cond,sibar_risk_di),
    as.factor
  )) %>% 
  filter(imputation == "0" & !is.na(spe_sum_t3)) %>%
  mutate(id = row_number()) %>% 
  relocate(id, everything()) %>% 
  glimpse()
Rows: 445
Columns: 29
$ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ imputation       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ cond             <fct> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
$ sibar1bt0        <dbl> 47, 47, 40, 31, 53, 54, 48, 49, 40, 31, 35,…
$ sibar1at0        <fct> 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2…
$ indikation       <fct> 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1…
$ schulab          <fct> 2, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 1, 2…
$ behandlungsdauer <dbl> NA, NA, NA, NA, NA, NA, NA, 8, 7, 7, 7, 7, …
$ spe_sum_t0       <dbl> 3, 2, 3, 0, 1, 0, 2, 1, 3, 0, 1, 2, 1, 1, 1…
$ phq_dept0        <dbl> 21, 25, 23, 21, 11, 11, 10, 3, 6, 15, 13, 1…
$ phq_som_t0       <dbl> 18.000000, 26.785714, 19.000000, 13.000000,…
$ phq_stress_t0    <dbl> 16.000000, 17.000000, 15.000000, 15.000000,…
$ gad7t0           <dbl> 15, 21, 15, 13, 10, 12, 5, 6, 14, 11, 11, 2…
$ ksk12t0          <dbl> 31.59306, 50.51283, 39.60604, 52.61199, 36.…
$ psk12t0          <dbl> 29.21420, 13.90037, 23.51996, 15.58050, 27.…
$ av_n1_t0         <dbl> 3, 5, 2, 3, 4, 5, 5, 4, 2, 2, 3, 5, 4, 5, 4…
$ av_n2_t0         <dbl> 5, 8, 4, 2, 4, 7, 4, 5, 3, 3, 5, 4, 2, 5, 2…
$ av_n3_t0         <dbl> 5, 9, 8, 8, 7, 5, 3, 8, 6, 4, 8, 5, 8, 6, 3…
$ av_n4_t0         <dbl> 6, 9, 9, 3, 6, 8, 2, 3, 5, 4, 6, 5, 4, 8, 3…
$ av_n5_t0         <dbl> 5, 3, 3, 6, 5, 4, 4, 4, 6, 4, 3, 6, 4, 4, 7…
$ av_n6_t0         <dbl> 8, 8, 9, 6, 4, 4, 3, 2, 7, 7, 9, 6, 5, 9, 4…
$ av_n7_t0         <dbl> 4, 6, 1, 1, 5, 6, 1, 8, 3, 1, 1, 4, 1, 4, 4…
$ av_n8_t0         <dbl> 3, 1, 1, 3, 6, 4, 6, 5, 1, 3, 1, 6, 1, 4, 4…
$ av_n9_t0         <dbl> 2, 6, 1, 3, 3, 9, 3, 5, 8, 3, 1, 5, 1, 3, 3…
$ av_n10_t0        <dbl> 1, 2, 1, 1, 2, 1, 4, 5, 4, 2, 1, 4, 1, 2, 4…
$ av_n11_t0        <dbl> 2, 3, 2, 2, 3, 6, 6, 4, 2, 3, 2, 5, 1, 4, 4…
$ sibar_risk       <dbl> 8.0, 5.0, 11.0, 5.0, 6.0, 1.0, 3.0, 2.5, 9.…
$ sibar_risk_di    <fct> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
$ spe_sum_t3       <dbl> 3, 0, 3, 0, 0, 1, 0, 2, 2, 0, 0, 2, 2, 2, 2…
library(skimr)
# skim(data_updt)

For reasons I’ll go into more in a subsequent post, I want to create predictions separately for folks in the intervention group and folks in the control group. The short version of why: I want to be able to predict not just how well I would expect people to respond to the intervention they actually received, but how well the model thinks they would have done if they had received the other intervention. Building separate models by treatment condition isn’t the only way to do this, but it seems like the cleanest approach, and the one least likely to lead to data leakage, that I can think of right now.[5]

On top of that, I’ll need to test whether those hypothetical predictions actually generalize to held-out data. So we need to hold out at least some folks from both intervention groups to ultimately test whether our predictions about the “optimal intervention” are any good. If I could turn back time, I would just use the initial_split() function from the rsample package within tidymodels to create the test sets for both the intervention and control groups, primarily because I’m not 100% sure the approach below is reproducible and I’m not stratifying by the outcome.[6] A sketch of that alternative follows the splitting code below.

# Breaking out the data into the separate treatment groups

data_intv_init <- data_updt %>% 
  filter(cond == "2") %>% 
  as.data.frame() # %>% 
  #glimpse()

data_ctrl_init <- data_updt %>% 
  filter(cond == "1") %>% 
  as.data.frame() # %>% 
  #glimpse()

# Pulling out folks for test set later

test_intv <- data_intv_init %>% 
  slice_sample(n = 40)

data_intv <- data_intv_init %>% 
  anti_join(test_intv, by = "id") #%>% 
  #glimpse()

test_ctrl <- data_ctrl_init %>% 
  slice_sample(n = 40)

data_ctrl <- data_ctrl_init %>% 
  anti_join(test_ctrl, by = "id") #%>% 
  #glimpse()
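
For reference, here’s a minimal sketch of the initial_split() alternative mentioned above. It isn’t what ran during the stream: the seed, the 80/20 proportion, and the _alt object names are illustrative choices, but stratifying by the outcome and setting a seed would make the held-out sets reproducible.

# A reproducible, stratified alternative to the slice_sample() approach
# (rsample is already attached via tidymodels)

set.seed(33)
intv_split <- initial_split(data_intv_init, prop = 0.8, strata = spe_sum_t3)
data_intv_alt <- training(intv_split)
test_intv_alt <- testing(intv_split)

set.seed(33)
ctrl_split <- initial_split(data_ctrl_init, prop = 0.8, strata = spe_sum_t3)
data_ctrl_alt <- training(ctrl_split)
test_ctrl_alt <- testing(ctrl_split)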

Then I used the recipes package to preprocess the data! I wanted to remove imputation (which didn’t vary and wasn’t relevant), designate id as an id variable rather than a predictor, collapse rare factor levels into an “other” category, one-hot encode the categorical variables[7], exclude near-zero variance predictors, normalize all predictors, and impute any missing data in predictors using k-nearest neighbors.

I then set up a tidymodels workflow using an out-of-the-box xgboost model[8] and made my predictions for the intervention group first.

intv_rec <-
  recipe(spe_sum_t3 ~ ., data = data_intv) %>%
  step_rm(imputation) %>% 
  update_role(id, new_role = "id") %>% 
  step_other(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_nzv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_impute_knn(all_predictors())

# Testing the recipe to make sure it produces data

intv_rec %>% 
  prep(verbose = TRUE) %>% 
  bake(new_data = NULL)
oper 1 step rm [training] 
oper 2 step other [training] 
oper 3 step dummy [training] 
oper 4 step nzv [training] 
oper 5 step normalize [training] 
oper 6 step impute knn [training] 
The retained training set is ~ 0.06 Mb  in memory.
# A tibble: 173 x 33
      id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
   <int>     <dbl>            <dbl>      <dbl>     <dbl>      <dbl>
 1     2   -0.326             0.805     0.930     2.41        2.79 
 2     3   -1.38              1.14      1.94      2.08        1.33 
 3     6    0.724            -0.866    -1.08      0.0881     -0.725
 4     7   -0.176            -1.03      0.930    -0.0777      0.210
 5     8   -0.0260            3.15     -0.0756   -1.24       -0.538
 6     9   -1.38              2.31      1.94     -0.741      -0.164
 7    10   -2.72              2.31     -1.08      0.752      -0.538
 8    12   -2.87              2.31      0.930     0.420       0.383
 9    13   -0.0260            2.31     -0.0756    2.74        1.14 
10    14   -1.83              2.31     -0.0756    1.25        0.383
# … with 163 more rows, and 27 more variables: phq_stress_t0 <dbl>,
#   gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
#   av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
#   av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
#   av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
#   spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
#   indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
#   schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
#   sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>

# Create a model

xg_mod <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Creating a workflow

xg_wf <-
  workflow() %>% 
  add_recipe(intv_rec) %>% 
  add_model(xg_mod)
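
# Five-fold cross-validation, repeated five times, stratified by the outcome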
intv_folds <- vfold_cv(data_intv, v = 5, repeats = 5, strata = spe_sum_t3)

keep_pred <- control_resamples(save_pred = TRUE)

library(tictoc)
tic()
set.seed(33)
xg_rs <-
  xg_wf %>% 
  fit_resamples(intv_folds, control = keep_pred)
toc()
10.855 sec elapsed
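
Footnote 8 notwithstanding, here’s a rough sketch of what a tuned version could look like if there were time. The tune() placeholders and the grid size are illustrative assumptions rather than anything that ran on stream:

# Mark a few xgboost hyperparameters for tuning instead of using defaults
xg_mod_tuned <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Swap the tuned model into the existing workflow and search a small grid
# xg_tuned_rs <- xg_wf %>% 
#   update_model(xg_mod_tuned) %>% 
#   tune_grid(resamples = intv_folds, grid = 20)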

Overall we get predictions that aren’t terribly far off from the predictions of changes in anxiety and depression from Nick’s original paper (which used an ensemble model stacked on a bunch of base learners). We can’t make an apples-to-apples comparison since he and his co-author were predicting different outcomes, but at least I’m not wildly off from where they were.

xg_rs %>% 
  collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   0.967    25  0.0177 Preprocessor1_Model1
2 rsq     standard   0.193    25  0.0202 Preprocessor1_Model1
r2_intv <- xg_rs %>% 
  collect_metrics(summarize = TRUE) %>% 
  filter(str_detect(.metric, "rsq") == TRUE) %>% 
  dplyr::select(mean) %>% 
  deframe() 

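# Taking the square root of the mean R-squared gives a rough
# correlation-scale (r) summary of model fit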
sqrt(r2_intv)
[1] 0.4387755
intv_preds <- xg_rs %>% 
  collect_predictions(summarize = TRUE)

# In order to get predictions in the placebo group I have to fit a workflow first

# fit_workflow <- fit(xg_wf, data_ctrl)

I then do the same preprocessing and model fitting for the control group!

ctrl_rec <-
  recipe(spe_sum_t3 ~ ., data = data_ctrl) %>%
  step_rm(imputation) %>% 
  update_role(id, new_role = "id") %>% 
  step_other(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_nzv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_impute_knn(all_predictors())

# Testing the recipe to make sure it produces data

ctrl_rec %>% 
  prep(verbose = TRUE) %>% 
  bake(new_data = NULL)
oper 1 step rm [training] 
oper 2 step other [training] 
oper 3 step dummy [training] 
oper 4 step nzv [training] 
oper 5 step normalize [training] 
oper 6 step impute knn [training] 
The retained training set is ~ 0.06 Mb  in memory.
# A tibble: 192 x 33
      id sibar1bt0 behandlungsdauer spe_sum_t0 phq_dept0 phq_som_t0
   <int>     <dbl>            <dbl>      <dbl>     <dbl>      <dbl>
 1     4   -2.60               1.16     -1.26      1.88      0.313 
 2    11   -2.03               2.64     -0.199     0.481    -0.672 
 3    16   -0.191              1.72      0.863     0.830     1.10  
 4    19   -0.0487             1.72     -0.199     0.133    -0.0813
 5    20    0.802              1.72     -0.199     0.656     1.10  
 6    21    0.0931             1.72      0.863     0.656    -0.869 
 7    22   -1.61               1.72     -0.199     1.70      0.904 
 8    23   -1.61               1.72     -0.199    -0.565    -0.672 
 9    24    0.518              1.72      1.93      0.656     0.313 
10    25   -1.75               1.72      0.863     0.656    -0.278 
# … with 182 more rows, and 27 more variables: phq_stress_t0 <dbl>,
#   gad7t0 <dbl>, ksk12t0 <dbl>, psk12t0 <dbl>, av_n1_t0 <dbl>,
#   av_n2_t0 <dbl>, av_n3_t0 <dbl>, av_n4_t0 <dbl>, av_n5_t0 <dbl>,
#   av_n6_t0 <dbl>, av_n7_t0 <dbl>, av_n8_t0 <dbl>, av_n9_t0 <dbl>,
#   av_n10_t0 <dbl>, av_n11_t0 <dbl>, sibar_risk <dbl>,
#   spe_sum_t3 <dbl>, sibar1at0_X1 <dbl>, sibar1at0_X2 <dbl>,
#   indikation_X1 <dbl>, indikation_X2 <dbl>, indikation_X3 <dbl>,
#   schulab_X1 <dbl>, schulab_X2 <dbl>, schulab_X3 <dbl>,
#   sibar_risk_di_X0 <dbl>, sibar_risk_di_X1 <dbl>

# Create a model

xg_mod <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Creating a workflow

xg_wf_ctrl <-
  workflow() %>% 
  add_recipe(ctrl_rec) %>% 
  add_model(xg_mod)
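
# Same resampling scheme as before, now for the control group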
ctrl_folds <- vfold_cv(data_ctrl, v = 5, repeats = 5, strata = spe_sum_t3)

keep_pred <- control_resamples(save_pred = TRUE)

library(tictoc)
tic()
set.seed(33)
xg_rs_ctrl <-
  xg_wf_ctrl %>% 
  fit_resamples(ctrl_folds, control = keep_pred)
toc()
10.639 sec elapsed

Overall the predictions for the control group aren’t quite as good, but we’re still doing ok relative to the original predictions on other kinds of outcomes from Nick’s paper.

xg_rs_ctrl %>% 
  collect_metrics(summarize = TRUE)
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   1.08     25  0.0203 Preprocessor1_Model1
2 rsq     standard   0.164    25  0.0175 Preprocessor1_Model1
r2_ctrl <- xg_rs_ctrl %>% 
  collect_metrics(summarize = TRUE) %>% 
  filter(str_detect(.metric, "rsq") == TRUE) %>% 
  dplyr::select(mean) %>% 
  deframe() 

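# Again converting the mean R-squared to a correlation-scale summary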
sqrt(r2_ctrl)
[1] 0.4047978
ctrl_preds <- xg_rs_ctrl %>% 
  collect_predictions(summarize = TRUE)

What We’ll Get To in Part 2

In a non-Dash post, I’ll write part 2 of this analysis, where I finish out the modeling process.

Next, we’ll need to use these models to predict how well we think folks would do if they had ended up in the opposite intervention (since we already have predictions of how they would do in the intervention they actually got). Once we have those estimates, we’ll be able to see whether folks in a held-out sample who we predict would benefit more from one intervention than another actually do!
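
Here’s a minimal sketch of what that crossover step might look like, assuming the workflows and held-out sets defined above (a preview under those assumptions, not code that ran during the stream):

# Fit each condition's workflow on all of its training data
fit_intv <- fit(xg_wf, data_intv)
fit_ctrl <- fit(xg_wf_ctrl, data_ctrl)

# Score every held-out person with both models, giving each person a
# predicted outcome under the intervention and under the control condition
test_all <- bind_rows(test_intv, test_ctrl)

crossover_preds <- test_all %>% 
  mutate(
    pred_under_intv = predict(fit_intv, new_data = test_all)$.pred,
    pred_under_ctrl = predict(fit_ctrl, new_data = test_all)$.pred
  )

# Comparing the two prediction columns (given the direction of the SPE
# scale) would flag each person's predicted "optimal intervention"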


  1. Yikes

  2. Double yikes

  3. Or at least make a good start; I don’t think there’s any way I’ll finish in one 2-hour chunk

  4. Or at least respond just as well

  5. I’m looking forward to continuing to learn more about this area and might change my mind!

  6. With the 2-hour time limit I didn’t have time to re-write the code!

  7. Like dummy coding, but everything gets its own set of codes, plus it doesn’t sound vaguely ableist

  8. No time for tuning in Data Distillery Dash!