Interactive model decision trees in Stata


When I was doing statistical consulting work in a university’s medical department, my clients often wanted me to teach them about the analyses I was doing for them, rather than just completing the work and handing it over. They were also usually operating under limited budgets, so they wanted to do at least some of the work themselves to save money.

This was a bit of a problem when I needed to use fairly high-level statistics to model their data. In order to allow clients to do this sort of modelling work on their own, I designed a decision tree in Stata that helped them evaluate a series of models to reach the most suitable one.

A brief background

This all started when a client doing a Ph.D. in immunology came to me with a fairly complicated dataset. She was trying to find out whether two groups of patients with hepatitis C had different levels of various biomarkers. She had measured the biomarkers at several time points within the same person, and she had a number of different biomarker outcomes to evaluate. Multilevel models (with random intercepts) were likely the most appropriate approach for most of the outcomes, and the graphs suggested that time might have a curvilinear relationship with some of them. These sorts of statistics were beyond my client’s knowledge, but she had so many outcomes that she would need to do much of the model-selection work herself.

In order to allow her to complete this work, I handed the data over to her with the decision-tree function described below (after doing the initial data cleaning, visualisations and testing for some of the assumptions).

What does the function do?

The function presents the user with several decision points in building a model, with clear instructions on how to evaluate the output they have been given. It allows them to input each decision on the Stata command line using simple y/n answers. Finally, it presents them with the final model and a brief explanation of what this is.
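
Stripped down, the whole pattern is just a prompt, a captured answer, and a branch. Here is a minimal sketch of that skeleton (the program name, prompt text, and placeholder commands are mine, not part of the real function):

```stata
* Minimal sketch: prompt the user, capture a y/n answer, branch on it
capture program drop yn_sketch
program yn_sketch
    di "Fit the more complex model? (y/n)" _request(_choice)
    if ("`choice'" == "y") {
        di "... fit model A here ..."
    }
    else if ("`choice'" == "n") {
        di "... fit model B here ..."
    }
end
```

The real function later in this post follows exactly this shape, with actual models in place of the display placeholders.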

A demonstration of the function

There are several components to the function:

The models that the user needs to evaluate. In this decision tree, one choice is between a random-intercepts multilevel model (with “subj” as the level-2 variable) and a regression model with robust standard errors clustered by “subj”; the other choice is between including or excluding a quadratic term for time. (Don’t worry if you don’t fully understand what these are; the important part is understanding the function itself!) For example, this is one of the models:

xtmixed `outcome' time i.group || subj: , var mle

The ability for the function to take user input using the “_request” option of the “display” command. This pauses execution and stores whatever the user types in a macro; because the name passed to _request() begins with an underscore, the input ends up in the local macro “answer1”, which the function can then test as `answer1'. For example:

di "Enter a value below in the 'command' window (y/n)" _request(_answer1)

Conditional statements (if, else) that guide the client through the decision-making tree. These conditional statements evaluate the locals generated by the user input. For example:

if ("`answer1'" == "y") {
    xtmixed `outcome' time time_sq i.group || subj: , var mle
}

To demonstrate the function, I will use an example dataset (as my client’s data contains pretty sensitive medical information!). This is a dataset I sourced from IDRE at UCLA, which is an incredible statistics and statistical programming resource that has particularly good documentation for Stata.

This dataset contains depression scores for 61 participants in two different treatment groups (27 in the placebo and 34 in the estrogen treatment). Each participant was assessed at baseline (pre) and at 6 timepoints thereafter.

The depression scores are currently in a wide format, with a variable for each timepoint. As we are doing regression analyses, we’ll need all the depression scores in a single variable, so we’ll convert the data to a long format.

  use http://www.ats.ucla.edu/stat/stata/library/depress, clear
  reshape long dep, i(subj)
  (note: j = 1 2 3 4 5 6)

  Data                               wide   ->   long
  -----------------------------------------------------------------------------
  Number of obs.                       61   ->     366
  Number of variables                   9   ->       5
  j variable (6 values)                     ->   _j
  xij variables:
                       dep1 dep2 ... dep6   ->   dep
  -----------------------------------------------------------------------------

  rename _j time
  drop pre

Let’s now visualise the data to get an idea of how the two groups’ scores behave over time:

  twoway /// 
    (lowess dep time if group == 0, lcolor("169 204 6") lwidth(thick)) /// 
    (lowess dep time if group == 1, lcolor("84 102 3") lwidth(thick)), /// 
    title("Depression Scores Over Time") /// 
    ytitle("Depression scores", size(medsmall)) ///
    xtitle("Time") /// 
    graphregion(color(white)) /// 
    legend(cols(2) ///
        label(1 "Placebo") /// 
        label(2 "Estrogen"))

[Figure: lowess-smoothed depression scores over time, by treatment group]

The relationship between time and depression seems linear for both groups, but I will keep the quadratic decision-making step in the function for now. Let’s generate the quadratic term for time:

  gen time_sq = time ^ 2
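
(As an aside, Stata’s factor-variable notation can build the quadratic term on the fly, so the same model could be written without a separate time_sq variable. This is just an alternative sketch; the function below sticks with an explicit time_sq so the user can read its labelled row in the output:)

```stata
* Alternative: factor-variable notation constructs the quadratic term itself
xtmixed dep c.time##c.time i.group || subj: , var mle
```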

Now we run the function below:

capture program drop model_decision
program model_decision
    args outcome

    xtmixed `outcome' time i.group || subj: , var mle

    di _newline(1)
    di "Step 1: Evaluate initial model." _newline(2) ///
        "Look at the bottom of the output where it reads: 'LR test vs. linear regression...Prob >= chibar2 ='" _newline(1) ///
        "Does 'Prob >= chibar2' have a value of < .05? " _newline(2) /// 
        "Enter a value below in the 'command' window (y/n)" _request(_answer1)

    if ("`answer1'" == "y") {

        xtmixed `outcome' time time_sq i.group || subj: , var mle

        di _newline(2) ///
            "Step 2: Now look at the row of the first table labelled 'time_sq'." _newline(1) ///
            "Is this variable significant? "  _newline(2) ///
            "Enter a value below in the 'command' window (y/n)" _request(_answer2)
            di _newline(2)

        }

    else if ("`answer1'" == "n") {

        regress `outcome' time time_sq i.group, vce(cluster subj)

        di _newline(2) ///   
            "Step 2: Now look at the row of the table labelled 'time_sq'." _newline(1) ///
            "Is this variable significant? " _newline(2) ///
            "Enter a value below in the 'command' window (y/n)" _request(_answer3)
            di _newline(2)

        }

    if ("`answer2'" == "y") {
        di _newline(2) ///
        "The final model for `outcome' is below. " ///
        "This is a random-intercepts multilevel model with a quadratic term."
        di _newline(1)

        xtmixed `outcome' time time_sq i.group || subj: , var mle

        }

    else if ("`answer2'" == "n") {
        di _newline(2) ///
        "The final model for `outcome' is below. " ///
        "This is a random-intercepts multilevel model."
        di _newline(1)

        xtmixed `outcome' time i.group || subj: , var mle

        }

    if ("`answer3'" == "y") {
        di _newline(2) ///
        "The final model for `outcome' is below. " ///
        "This is a linear regression with a quadratic term, with robust standard errors clustered by subj."
        di _newline(1)

        regress `outcome' time time_sq i.group, vce(cluster subj)

        }

    else if ("`answer3'" == "n") {
        di _newline(2) ///
        "The final model for `outcome' is below. " ///
        "This is a linear regression with robust standard errors clustered by subj."
        di _newline(1)

        regress `outcome' time i.group, vce(cluster subj)

        }
end

Once you have everything set up as above, it is simple! We just get the user to call the function:

model_decision dep

The output that the user sees is shown below, but I recommend you run the function yourself within Stata to see how the user will actually interact with the program.

  Performing EM optimization:

  Performing gradient-based optimization:

  Iteration 0:   log likelihood = -836.98669  
  Iteration 1:   log likelihood = -836.98669

  Computing standard errors:

  Mixed-effects ML regression                     Number of obs      =       295
  Group variable: subj                            Number of groups   =        61

                                                  Obs per group: min =         1
                                                                 avg =       4.8
                                                                 max =         6


                                                  Wald chi2(2)       =    122.49
  Log likelihood = -836.98669                     Prob > chi2        =    0.0000

  ------------------------------------------------------------------------------
           dep |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
          time |   -1.22265   .1169081   -10.46   0.000    -1.451785    -.993514
               |
         group |
     estrogen  |  -3.780498   1.174419    -3.22   0.001    -6.082316   -1.478679
         _cons |   17.97587   .9454841    19.01   0.000     16.12276    19.82899
  ------------------------------------------------------------------------------

  ------------------------------------------------------------------------------
    Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
  -----------------------------+------------------------------------------------
  subj: Identity               |
                    var(_cons) |   17.42846   3.696099      11.50118    26.41044
  -----------------------------+------------------------------------------------
                 var(Residual) |   11.19347   1.031967      9.343066    13.41035
  ------------------------------------------------------------------------------
  LR test vs. linear regression: chibar2(01) =   152.15 Prob >= chibar2 = 0.0000


  Step 1: Evaluate initial model.

  Look at the bottom of the output where it reads: 'LR test vs. linear regression...Prob >= chibar2 ='
  Does 'Prob >= chibar2' have a value of < .05?

  Enter a value below in the 'command' window (y/n). y

  Performing EM optimization:

  Performing gradient-based optimization:

  Iteration 0:   log likelihood = -836.40586  
  Iteration 1:   log likelihood = -836.40586

  Computing standard errors:

  Mixed-effects ML regression                     Number of obs      =       295
  Group variable: subj                            Number of groups   =        61

                                                  Obs per group: min =         1
                                                                 avg =       4.8
                                                                 max =         6


                                                  Wald chi2(3)       =    124.14
  Log likelihood = -836.40586                     Prob > chi2        =    0.0000

  ------------------------------------------------------------------------------
           dep |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
          time |  -1.814816   .5610959    -3.23   0.001    -2.914544   -.7150883
       time_sq |   .0852873   .0790511     1.08   0.281      -.06965    .2402246
               |
         group |
     estrogen  |  -3.747568   1.171837    -3.20   0.001    -6.044326   -1.450811
         _cons |   18.69979   1.157469    16.16   0.000     16.43119    20.96839
  ------------------------------------------------------------------------------

  ------------------------------------------------------------------------------
    Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
  -----------------------------+------------------------------------------------
  subj: Identity               |
                    var(_cons) |   17.33573   3.674356      11.44268    26.26375
  -----------------------------+------------------------------------------------
                 var(Residual) |   11.15279   1.028017      9.309431    13.36114
  ------------------------------------------------------------------------------
  LR test vs. linear regression: chibar2(01) =   152.44 Prob >= chibar2 = 0.0000


  Step 2: Now look at the row of the first table labelled 'time_sq'.
  Is this variable significant?

  Enter a value below in the 'command' window (y/n). n

  The final model for dep is below. This is a random-intercepts multilevel model.

  Performing EM optimization:

  Performing gradient-based optimization:

  Iteration 0:   log likelihood = -836.98669  
  Iteration 1:   log likelihood = -836.98669

  Computing standard errors:

  Mixed-effects ML regression                     Number of obs      =       295
  Group variable: subj                            Number of groups   =        61

                                                  Obs per group: min =         1
                                                                 avg =       4.8
                                                                 max =         6


                                                  Wald chi2(2)       =    122.49
  Log likelihood = -836.98669                     Prob > chi2        =    0.0000

  ------------------------------------------------------------------------------
           dep |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
          time |   -1.22265   .1169081   -10.46   0.000    -1.451785    -.993514
               |
         group |
     estrogen  |  -3.780498   1.174419    -3.22   0.001    -6.082316   -1.478679
         _cons |   17.97587   .9454841    19.01   0.000     16.12276    19.82899
  ------------------------------------------------------------------------------

  ------------------------------------------------------------------------------
    Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
  -----------------------------+------------------------------------------------
  subj: Identity               |
                    var(_cons) |   17.42846   3.696099      11.50118    26.41044
  -----------------------------+------------------------------------------------
                 var(Residual) |   11.19347   1.031967      9.343066    13.41035
  ------------------------------------------------------------------------------
  LR test vs. linear regression: chibar2(01) =   152.15 Prob >= chibar2 = 0.0000

The take-away message

I hope this program has given you some ideas for teaching clients or students about model building. The function can be easily adapted to any decision-making process where the user can choose between models on the basis of the output. It is also not limited to two decision points: you can have as many as you can reasonably keep track of (although it will get confusing pretty fast as the number of decision points increases!). The code for this post is located in this gist on my GitHub page.
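
One easy refinement if you do adapt it: guard against typos by re-prompting until the user actually types y or n. A sketch of such a check (the prompt text here is illustrative):

```stata
* Re-prompt until the answer is exactly "y" or "n"
local answer ""
while ("`answer'" != "y" & "`answer'" != "n") {
    di "Enter y or n in the 'command' window" _request(_answer)
}
```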

This post was written in Stata using MarkDoc, which was developed by the talented E. F. Haghish.