Interactive model decision trees in Stata
When I was doing statistical consulting work in a university’s medical department, my clients often wanted me to teach them about the analyses I was doing for them, rather than just completing the work and handing it over. In addition, they were usually operating under limited budgets so wanted to do at least some of the work themselves to save money.
This was a bit of a problem when I needed to use fairly high-level statistics to model their data. In order to allow clients to do this sort of modelling work on their own, I designed a decision tree in Stata that helped them evaluate a series of models to reach the most suitable one.
A brief background
This all started when I was asked to do some analyses for a client who was doing a Ph.D. in immunology, who came to me with a pretty complicated dataset. She was trying to find out whether two groups of patients with hepatitis C had different levels of biomarkers. She had measured the biomarkers at different time points within the same person, and also had a bunch of different biomarker outcomes she wanted to evaluate. Multilevel models (with random intercepts) were likely the most appropriate model for most of the outcomes, and the graphs demonstrated that time might have a curvilinear relationship with some of the outcomes. These sorts of statistics were beyond the knowledge of my client, but she had so many outcomes she would need to do a lot of this model selection work herself.
In order to allow her to complete this work, I handed the data over to her with the decision-tree function described below (after doing the initial data cleaning, visualisations and testing for some of the assumptions).
What does the function do?
The function presents the user with several decision points in building a model, with clear instructions on how to evaluate the output they have been given. It allows them to input each decision on the Stata command line using simple y/n answers. Finally, it presents them with the final model and a brief explanation of what this is.
A demonstration of the function
There are several components to the function:
The models that the user needs to evaluate. In the case of this decision-tree, one choice is between a random-intercepts multilevel model (with “subj” as the Level 2 variable) and a regression model with robust standard errors clustered by “subj”, and the other choice is between including a quadratic term for time or excluding it (don’t worry if you don’t fully understand what these are, the important part is understanding the function itself!). For example, this is one of the models:
xtmixed time i.group || subj: , var mle
The ability for the function to take user input using the “_request” command. This command assigns the user input to a local which the function can use. For example:
di "Enter a value below in the 'command' window (y/n)" _request(_answer1)
Conditional statements (if, else) that guide the client through the decision-making tree. These conditional statements evaluate the locals generated by the user input. For example:
if ("" == "y") {
xtmixed time time_sq i.group || subj: , var mle
}
To demonstrate the function, I will use an example dataset (as my client’s data contains pretty sensitive medical information!). This is a dataset I sourced from IDRE at UCLA, which is an incredible statistics and statistical programming resource that has particularly good documentation for Stata.
This dataset contains depression scores for 61 participants in two different treatment groups (27 in the placebo and 34 in the estrogen treatment). Each participant was assessed at baseline (pre) and at 6 timepoints thereafter.
The depression scores are currently in a wide format, with a variable for each timepoint. As we are doing regression analyses, we’ll need all the depression scores in a single variable, therefore we’ll convert the data to a long format.
use http://www.ats.ucla.edu/stat/stata/library/depress, clear
reshape long dep, i(subj)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 61 -> 366
Number of variables 9 -> 5
j variable (6 values) -> _j
xij variables:
dep1 dep2 ... dep6 -> dep
-----------------------------------------------------------------------------
rename _j time
drop pre
Let’s now visualise the data to get an idea of how the two groups’ scores behave over time:
twoway ///
(lowess dep time if group == 0, lcolor("169 204 6") lwidth(thick)) ///
(lowess dep time if group == 1, lcolor("84 102 3") lwidth(thick)), ///
title("Depression Scores Over Time") ///
ytitle("Depression scores", size(medsmall)) ///
xtitle("Time") ///
graphregion(color(white)) ///
legend(cols(2) ///
label(1 "Placebo") ///
label(2 "Estrogen"))
The relationship between time and depression seems linear for both groups, but I will keep the quadratic decision-making step in the function for now. Let’s generate the quadratic term for time:
gen time_sq = time ^ 2
Now we run the function below:
capture program drop model_decision
program model_decision
args outcome
xtmixed `outcome' time i.group || subj: , var mle
di _newline(1)
di "Step 1: Evaluate initial model." _newline(2) ///
"Look at the bottom of the output where it reads: 'LR test vs. linear regression...Prob >= chibar2 ='" _newline(1) ///
"Does 'Prob >= chibar2' have a value of < .05? " _newline(2) ///
"Enter a value below in the 'command' window (y/n)" _request(_answer1)
if ("`answer1'" == "y") {
xtmixed `outcome' time time_sq i.group || subj: , var mle
di _newline(2) ///
"Step 2: Now look at the row of the first table labelled 'time_sq'." _newline(1) ///
"Is this variable significant? " _newline(2) ///
"Enter a value below in the 'command' window (y/n)" _request(_answer2)
di _newline(2)
}
else if ("`answer1'" == "n") {
regress `outcome' time time_sq i.group, vce(cluster id)
di _newline(2) ///
"Step 2: Now look at the row of the table labelled 'time_sq'." _newline(1) ///
"Is this variable significant? " _newline(2) ///
"Enter a value below in the 'command' window (y/n)" _request(_answer3)
di _newline(2)
}
if ("`answer2'" == "y") {
di _newline(2) ///
"The final model for `outcome' is below. " ///
"This is a random-intercepts multilevel model with a quadratic term."
di _newline(1)
xtmixed `outcome' time time_sq i.group || subj: , var mle
}
else if ("`answer2'" == "n") {
di _newline(2) ///
"The final model for `outcome' is below. " ///
"This is a random-intercepts multilevel model."
di _newline(1)
xtmixed `outcome' time i.group || subj: , var mle
}
if ("`answer3'" == "y") {
di _newline(2) ///
"The final model for `outcome' is below. " ///
"This is a linear regression with a quadratic term which has robust standard errors clustered by ID."
di _newline(1)
regress `outcome' time time_sq i.group, vce(cluster id)
}
else if ("`answer3'" == "n") {
di _newline(2) ///
"The final model for `outcome' is below. " ///
"This is a linear regression which has robust standard errors clustered by ID."
di _newline(1)
regress `outcome' time time_sq i.group, vce(cluster id)
}
end
Once you have everything set up as above, it is simple! We just get the user to call the function:
model_decision dep
The output that the user sees is shown below, but I recommend you run the function yourself within Stata to see how the user will actually interact with the program.
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log likelihood = -836.98669
Iteration 1: log likelihood = -836.98669
Computing standard errors:
Mixed-effects ML regression Number of obs = 295
Group variable: subj Number of groups = 61
Obs per group: min = 1
avg = 4.8
max = 6
Wald chi2(2) = 122.49
Log likelihood = -836.98669 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
dep | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | -1.22265 .1169081 -10.46 0.000 -1.451785 -.993514
|
group |
estrogen | -3.780498 1.174419 -3.22 0.001 -6.082316 -1.478679
_cons | 17.97587 .9454841 19.01 0.000 16.12276 19.82899
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
subj: Identity |
var(_cons) | 17.42846 3.696099 11.50118 26.41044
-----------------------------+------------------------------------------------
var(Residual) | 11.19347 1.031967 9.343066 13.41035
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 152.15 Prob >= chibar2 = 0.0000
Step 1: Evaluate initial model.
Look at the bottom of the output where it reads: 'LR test vs. linear regression...Prob >= chibar2 ='
Does 'Prob >= chibar2' have a value of < .05?
Enter a value below in the 'command' window (y/n). y
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log likelihood = -836.40586
Iteration 1: log likelihood = -836.40586
Computing standard errors:
Mixed-effects ML regression Number of obs = 295
Group variable: subj Number of groups = 61
Obs per group: min = 1
avg = 4.8
max = 6
Wald chi2(3) = 124.14
Log likelihood = -836.40586 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
dep | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | -1.814816 .5610959 -3.23 0.001 -2.914544 -.7150883
time_sq | .0852873 .0790511 1.08 0.281 -.06965 .2402246
|
group |
estrogen | -3.747568 1.171837 -3.20 0.001 -6.044326 -1.450811
_cons | 18.69979 1.157469 16.16 0.000 16.43119 20.96839
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
subj: Identity |
var(_cons) | 17.33573 3.674356 11.44268 26.26375
-----------------------------+------------------------------------------------
var(Residual) | 11.15279 1.028017 9.309431 13.36114
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 152.44 Prob >= chibar2 = 0.0000
Step 2: Now look at the row of the first table labelled 'time_sq'.
Is this variable significant?
Enter a value below in the 'command' window (y/n). n
The final model for dep is below. This is a random-intercepts multilevel model.
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log likelihood = -836.98669
Iteration 1: log likelihood = -836.98669
Computing standard errors:
Mixed-effects ML regression Number of obs = 295
Group variable: subj Number of groups = 61
Obs per group: min = 1
avg = 4.8
max = 6
Wald chi2(2) = 122.49
Log likelihood = -836.98669 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
dep | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | -1.22265 .1169081 -10.46 0.000 -1.451785 -.993514
|
group |
estrogen | -3.780498 1.174419 -3.22 0.001 -6.082316 -1.478679
_cons | 17.97587 .9454841 19.01 0.000 16.12276 19.82899
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
subj: Identity |
var(_cons) | 17.42846 3.696099 11.50118 26.41044
-----------------------------+------------------------------------------------
var(Residual) | 11.19347 1.031967 9.343066 13.41035
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 152.15 Prob >= chibar2 = 0.0000
The take away message
I hope this program has given you some ideas for teaching clients or students about model building. The function can be easily adapted to any decision-making process where the user can easily choose between models on the basis of the output. It is also obviously not limited to two decision points - you can have as many as you can reasonably keep track of (although it will get confusing pretty fast as the number of decision points increases…). The code for this post is located in this gist on my Github page.
This post was written in Stata using MarkDoc, which was developed by the talented E. F. Haghish.