{smcl}
{* 30aug2005}{...}
{hline}
help for {hi:ice}, {hi:uvis}{right:(SJ5-4: st0067_2; SJ5-2: st0067_1; SJ4-3: st0067)}
{hline}

{title:Multiple imputation by the MICE system of chained equations}

{p 8 17 2}
{cmd:ice}
{it:mainvarlist}
{cmd:using} {it:filename}[{cmd:.dta}]
{ifin}
{weight}
[{cmd:,}
{cmdab:bo:ot}[{cmd:(}{it:varlist}{cmd:)}]
{cmd:cc(}{it:varlist}{cmd:)}
{cmdab:cm:d(}{it:cmdlist}{cmd:)}
{cmdab:cy:cles(}{it:#}{cmd:)}
{cmdab:dry:run}
{cmd:eq(}{it:eqlist}{cmd:)}
{cmdab:g:enmiss(}{it:string}{cmd:)}
{cmdab:i:d(}{it:string}{cmd:)}
{cmd:m(}{it:#}{cmd:)}
{cmdab:ma:tch}[{cmd:(}{it:varlist}{cmd:)}]
{cmdab:nocons:tant}
{cmdab:nosh:oweq}
{cmd:on(}{it:varlist}{cmd:)}
{cmdab:pass:ive(}{it:passivelist}{cmd:)}
{cmdab:sub:stitute(}{it:sublist}{cmd:)}
{cmd:replace}
{cmdab:se:ed(}{it:#}{cmd:)}
{cmdab:tr:ace(}{it:filename}{cmd:)}]

{p 8 17 2}
{cmd:uvis}
{it:regression_cmd}
{it:yvar}
{it:xvarlist}
{ifin}
{weight}
{cmd:,}
{cmdab:g:en(}{it:newvarname}{cmd:)}
[{cmdab:bo:ot}
{cmdab:ma:tch}
{cmdab:nocons:tant}
{cmd:replace}
{cmdab:se:ed(}{it:#}{cmd:)}]

{p 4 4 2}
where

{p 8 8 2}
{it:regression_cmd} may be
{helpb logistic},
{helpb logit},
{helpb mlogit},
{helpb ologit},
or
{helpb regress}.

{p 4 4 2}
All weight types supported by {it:regression_cmd} are allowed; see {help weight}.


{title:Description}

{p 4 4 2}
{cmd:ice} imputes missing values
in {it:mainvarlist} by using switching regression, an iterative multivariable
regression technique. The abbreviation MICE means multiple imputation by
chained equations and was apparently coined by Stef van Buuren. {cmd:ice}
implements MICE for Stata. Sets of imputed and nonimputed variables are
stored to a new file called {it:filename}. Any number of complete imputations
may be created.

{p 4 4 2}
{cmd:uvis} (univariate imputation sampling) imputes missing values in the
single variable {it:yvar} based on multiple regression on {it:xvarlist}.
{cmd:uvis} is called repeatedly by {cmd:ice} in a regression switching mode to
perform multivariate imputation.

{p 4 4 2}
The missing observations are assumed to be missing at random (MAR) or
missing completely at random (MCAR), according to the jargon. See, for
example, van Buuren et al. (1999) for an explanation of these concepts.

{p 4 4 2}
Please note that {cmd:ice} and {cmd:uvis} require Stata 8 or later.
There have been incompatibility issues with Stata 7 and earlier.


{title:Options for ice}

{p 4 8 2}
{cmd:boot}[{cmd:(}{it:varlist}{cmd:)}] instructs that each member of
{it:varlist}, a subset of {it:mainvarlist}, be imputed with the {cmd:boot}
option of {cmd:uvis} activated. If {cmd:(}{it:varlist}{cmd:)} is omitted,
all members of {it:mainvarlist} with missing observations are imputed using
the {cmd:boot} option of {cmd:uvis}.

{p 4 8 2}
{cmd:cc(}{it:varlist}{cmd:)} prevents imputation of missing data in
{it:mainvarlist} for cases in which any member of {it:varlist} has a missing
value. "cc" signifies "complete case". Note that members of {it:varlist} are
used for imputation if they appear in {it:mainvarlist}, but not otherwise. Use
of this option is equivalent to entering {cmd:if}
{cmd:~missing(}{it:var1}{cmd:) &} {cmd:~missing(}{it:var2}{cmd:)} ..., where
{it:var1}, {it:var2}, ... denote the members of {it:varlist}.

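{p 8 8 2}
As a brief illustration of the equivalence described for {cmd:cc()}, and
assuming hypothetical variables x1-x5 and an output file called imputed, the
following two commands should restrict imputation to the same observations
(the second adds {cmd:replace} only because the file already exists):

{p 8 8 2}
{cmd:. ice x1 x2 x3 using imputed, m(5) cc(x4 x5)}{break}
{cmd:. ice x1 x2 x3 using imputed if ~missing(x4) & ~missing(x5), m(5) replace}
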
{p 4 8 2}
{cmd:cmd(}{it:cmdlist}{cmd:)} defines the regression commands to be used for
each variable in {it:mainvarlist}, when it becomes the dependent variable in
the switching regression procedure used by {cmd:uvis} (see {hi:Remarks}). The
first item in {it:cmdlist} may be a command, such as {cmd:regress}, or may have
the syntax {it:varlist}{cmd::}{it:cmd}, specifying that command {it:cmd}
applies to all the variables in {it:varlist}. Subsequent items in
{it:cmdlist} must follow the latter syntax, and items must be separated
by commas.

{p 8 8 2}
The default {it:cmd} for a variable is {cmd:logit} when there are two distinct
values, {cmd:mlogit} when there are 3-5 and {cmd:regress} otherwise.

{p 8 18 2} Example: {cmd:cmd(regress)} specifies that all variables are
to be imputed by {cmd:regress}, overriding the defaults.

{p 8 18 2} Example: {cmd:cmd(x1 x2:logit, x3:regress)} specifies that
{cmd:x1} and {cmd:x2} are to be imputed by {cmd:logit}, {cmd:x3} by
{cmd:regress} and all others by their default choices.

{p 4 8 2}
{cmd:cycles(}{it:#}{cmd:)} determines the number of cycles of regression
switching to be carried out. The default is {cmd:cycles(10)}.

{p 4 8 2}
{cmd:dryrun} does a "dry run"; that is, {cmd:ice}
reports the prediction equations it has constructed from the various
inputs. No imputation is done, and no files are created. It is not
mandatory to specify an output file with {cmd:using} for a dry run.
Sometimes the prediction equation set-up needs to be carefully
checked before running what may be a lengthy imputation process.

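{p 8 8 2}
For instance, with hypothetical variables x1, x2 and x3, a minimal check of
the prediction equations before a long run might look like this (no
{cmd:using} file is given, which the description of {cmd:dryrun} permits):

{p 8 8 2}
{cmd:. ice x1 x2 x3, dryrun}
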
{p 4 8 2}
{cmd:eq(}{it:eqlist}{cmd:)} allows one to define customized prediction
equations for any subset of variables in {it:mainvarlist}. The option,
particularly when used with {cmd:passive()}, allows
great flexibility in the possible imputation schemes. The
syntax of {it:eqlist} is {it:varname1}{cmd::}{it:varlist1}
[{cmd:,}{it:varname2}{cmd::}{it:varlist2} ...], where each
{it:varname#} (or {it:varlist#})
is a member (or subset) of {it:mainvarlist}. It is your responsibility to ensure
that each equation is sensible. {cmd:ice} places no restrictions
except to check that all variables mentioned are indeed in
{it:mainvarlist} and that an equation is not defined
for a variable specified to be passively imputed
(see the {cmd:passive()} option). Note that {cmd:eq()} takes
precedence over all default definitions and assumptions about
the way a given variable in {it:mainvarlist} will be imputed.
The default, if the {cmd:passive()} and {cmd:substitute()}
options are not invoked, is that each
variable in {it:mainvarlist} with any missing data is imputed from all
the other variables in {it:mainvarlist}.

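{p 8 8 2}
A minimal sketch, assuming hypothetical variables y, x1, x2 and x3: the
command below asks that x1 be imputed from x2 and x3 only, leaves the other
variables to their default equations, and uses {cmd:dryrun} to display the
resulting equations without imputing anything:

{p 8 8 2}
{cmd:. ice y x1 x2 x3, eq(x1:x2 x3) dryrun}
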
{p 4 8 2}
{cmd:genmiss(}{it:string}{cmd:)} creates an indicator variable for the
missingness of data in any variable in {it:mainvarlist} for which at least one
value has been imputed. The indicator variable is set to missing for
observations excluded by {cmd:if}, {cmd:in}, etc. The indicator variable for
{it:xvar} is named {it:string}{it:xvar}.

{p 4 8 2}
{cmd:id(}{it:string}{cmd:)} creates a variable called {it:string} containing
the original sort order of the data. The default {it:string} is {cmd:_i}.

{p 4 8 2}
{cmd:m(}{it:#}{cmd:)} defines {it:#} as the number of imputations required
(minimum 1, no upper limit). The default is {cmd:m(1)}.

{p 4 8 2}
{cmd:match}[{cmd:(}{it:varlist}{cmd:)}] instructs that each member of
{it:varlist} be imputed with the {cmd:match} option of {cmd:uvis}.
This provides prediction matching for each member of {it:varlist}.
If {cmd:(}{it:varlist}{cmd:)} is omitted then all relevant variables are
imputed with the {cmd:match} option of {cmd:uvis}. The default, if
{cmd:match()} is not specified, is to draw from the posterior
predictive distribution of each variable requiring imputation.

{p 4 8 2}
{cmd:noconstant} suppresses the regression constant in all regressions.

{p 4 8 2}
{cmd:noshoweq} suppresses the presentation of the prediction equations.

{p 4 8 2}
{cmd:on(}{it:varlist}{cmd:)} changes the operation of {cmd:ice} in a major
way. With this option, {cmd:uvis} imputes each member of {it:mainvarlist}
univariately on {it:varlist}. This provides a convenient way of producing
multiple imputations when imputation for each variable in {it:mainvarlist} is
to be done univariately on a set of complete predictors.

{p 4 8 2}
{cmd:passive(}{it:passivelist}{cmd:)} allows the use of "passive" imputation
of variables that depend on other variables, some of which are imputed.
The syntax of {it:passivelist} is {it:varname}{cmd::}{it:exp}
[{cmd:\}{it:varname}{cmd::}{it:exp} ...]. Notice the requirement to use
"\" as a separator between items in {it:passivelist}, rather than the usual comma;
the reason is that a comma may be a valid part of an expression.
The option is most easily explained by example. Suppose x1 is a categorical variable
with 3 levels, and that two dummy variables x1a, x1b have been created by the commands

{p 8 8 2}
{cmd:. generate byte x1a=(x1==2)}{break}
{cmd:. generate byte x1b=(x1==3)}

{p 8 8 2}
Now suppose that x1 is to be imputed by the {cmd:mlogit} command and is
to be treated as the two dummy variables x1a and x1b when predicting other
variables. Use of {cmd:mlogit} is achieved by the option
{cmd:cmd(x1:mlogit)}. When x1 is imputed, we want x1a and x1b to be updated
with new values which depend on the imputed values of x1. This may be
achieved by specifying {cmd:passive(x1a:x1==2 \ x1b:x1==3)}. It is necessary
also to remove x1 from the list of predictors when variables other than x1 are
being imputed, and this is done by using the {cmd:substitute()} option; in the
present example, you would specify {cmd:substitute(x1:x1a x1b)}.

{p 8 8 2}
Note that although in this example x1a will take the (possibly
unintended) value of 0 when x1 is missing, {cmd:ice} is careful to
ensure that x1a (and x1b) inherit the missingness of x1 and are
passively imputed following active imputation of missing values
of x1. If this were not done, incorrect results could occur. The
responsibility of the user is to create x1a and x1b before running
{cmd:ice} such that their missing values are identical
to those of x1.

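{p 8 8 2}
One way to create dummy variables whose missing values match those of x1,
sketched here under the assumption that x1 is coded 1, 2, 3 and that x2, x3
and the file name imputed are placeholders, is to add an {cmd:if} qualifier
so that the dummies are missing wherever x1 is missing:

{p 8 8 2}
{cmd:. generate byte x1a=(x1==2) if ~missing(x1)}{break}
{cmd:. generate byte x1b=(x1==3) if ~missing(x1)}{break}
{cmd:. ice x1 x1a x1b x2 x3 using imputed, m(5) cmd(x1:mlogit) passive(x1a:x1==2 \ x1b:x1==3) substitute(x1:x1a x1b)}
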
{p 8 8 2}
A second example is multiplicative interactions between variables, for
example, between x1 and x2 (e.g., x12=x1*x2); this could be entered as
{cmd:passive(x12:x1*x2)}. It would cause the interaction term
x12 to be omitted when either x1 or x2 was being imputed, since it would
make no sense to impute x1 from its interaction with x2.
{cmd:substitute()} is not needed here.

{p 8 8 2}
It should be stressed that variables to be imputed passively must already
exist and must be included in {it:mainvarlist}; otherwise, they will not be
recognized.

{p 4 8 2}
{cmd:substitute(}{it:sublist}{cmd:)} is typically used with the
{cmd:passive()} option to represent multilevel categorical variables
as dummy variables in models for predicting other variables. See
{cmd:passive()} for more details. The syntax of {it:sublist} is
{it:varname}{cmd::}{it:dummyvarlist}
[{cmd:,}{it:varname}{cmd::}{it:dummyvarlist} ...], where {it:varname} is the
name of a variable to be substituted and {it:dummyvarlist} is the list of
dummy variables representing it.

{p 4 8 2}
{cmd:replace} permits {it:filename} to be overwritten with new data.

{p 4 8 2}
{cmd:seed(}{it:#}{cmd:)} sets the random-number seed to {it:#}.
To reproduce a set of imputations, the same random-number seed should be used.
The default is {cmd:seed(0)}, meaning no seed is set by the program.

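{p 8 8 2}
As an illustration (variable and file names are hypothetical), running the
same command twice with the same nonzero seed, on unchanged data in memory,
should write the same imputations to both files:

{p 8 8 2}
{cmd:. ice x1 x2 x3 using imp_a, m(5) seed(101)}{break}
{cmd:. ice x1 x2 x3 using imp_b, m(5) seed(101)}
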
{p 4 8 2}
{cmd:trace(}{it:filename}{cmd:)} monitors the convergence of the imputation
algorithm. For each original variable with missing values, the mean of the
imputed values is stored as a variable in {it:filename}, together
with the cycle number at which that
mean was calculated. The results are stored only for the final imputation.
For diagnostic purposes, it is sensible to run {cmd:trace()}
with {cmd:m(1)} and many cycles, such as {cmd:cycles(100)}.
When the run is complete, it is helpful to load {it:filename}
into memory and plot the mean for each imputed
variable against the cycle number. If necessary, smoothing may be applied
to clarify any apparent pattern. Convergence is judged to have occurred
when the pattern of the imputed means is random.
The number of cycles needed for convergence is usually obvious from the appearance
of the plot.

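{p 8 8 2}
A sketch of this diagnostic workflow follows. The file name tracefile is
arbitrary, and the names of the variables stored in it are not documented
here, so {cmd:describe} is used to look them up; x1_mean and cycle in the
last line are placeholders to be replaced by the names actually found.

{p 8 8 2}
{cmd:. ice x1 x2 x3 using imputed, m(1) cycles(100) trace(tracefile)}{break}
{cmd:. use tracefile, clear}{break}
{cmd:. describe}{break}
{cmd:. twoway line x1_mean cycle}
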
{title:Options for uvis}

{p 4 8 2}
{cmd:gen(}{it:newvarname}{cmd:)} is not optional. {it:newvarname} contains original
(nonmissing) and imputed (originally missing) values of {it:yvar}.

{p 4 8 2}
{cmd:boot} invokes a bootstrap method for creating imputed values (see {hi:Remarks}).

{p 4 8 2}
{cmd:match} creates imputations by prediction matching. The default is to draw
imputations at random from the posterior distribution of the missing values of
{it:yvar}, conditional on the observed values and the members of
{it:xvarlist}. See {hi:Remarks} for further details.

{p 4 8 2}
{cmd:noconstant} suppresses the regression constant in all regressions.

{p 4 8 2}
{cmd:replace} permits {it:newvarname} (see {cmd:gen(}{it:newvarname}{cmd:)})
to be overwritten with new data. {cmd:replace} may not be abbreviated.

{p 4 8 2}
{cmd:seed(}{it:#}{cmd:)} sets the random-number seed to {it:#}.
See {hi:Remarks} for comments on how to ensure reproducible imputations
by using the {cmd:seed()} option.
The default is {cmd:seed(0)}, meaning no seed is set by the program.


{title:Remarks}

{p 4 4 2}
{cmd:uvis} imputes {it:yvar} from {it:xvarlist} according to the following
algorithm (see van Buuren et al. (1999, section 3.2) for further technical
details):

{p 8 12 2}
1. Estimate the vector of coefficients (beta) and the residual variance
by regressing the nonmissing values of {it:yvar} on the current "completed"
version of {it:xvarlist}. Predict the fitted values {it:etaobs} at the
nonmissing observations of {it:yvar}.

{p 8 12 2}
2. Draw at random a value (sigma_star) from the posterior distribution of the
residual standard deviation.

{p 8 12 2}
3. Draw at random a value (beta_star) from the posterior distribution of beta,
allowing, through sigma_star, for uncertainty in beta.

{p 8 12 2}
4. Use beta_star to predict the fitted values {it:etamis}
at the missing observations of {it:yvar}.

{p 8 12 2}
5. The imputed values are predicted directly from beta_star, sigma_star and
the covariates. When imputation is by linear regression ({cmd:regress}
command), this step assumes that {it:yvar} is Normally distributed, given the
covariates. For other types of imputation, samples are drawn from the
appropriate distribution.

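{p 4 4 2}
Steps 1-5 can be sketched by hand for the {cmd:regress} case using standard
Stata commands. This is an illustration only and is not the code that
{cmd:uvis} itself runs: it assumes a variable y with missing values, two
complete covariates x1 and x2, the usual Normal-theory posterior draws
(a scaled inverse chi-squared draw for the residual variance and a
multivariate Normal draw for beta), and illustrative names rmse, df,
sigma_star, bstar, eta and ym.

{p 8 8 2}
{cmd:. regress y x1 x2}{break}
{cmd:. scalar rmse = e(rmse)}{break}
{cmd:. scalar df = e(df_r)}{break}
{cmd:. matrix b = e(b)}{break}
{cmd:. matrix V = e(V)}{break}
{cmd:. matrix C = cholesky(V)}{break}
{cmd:. scalar sigma_star = rmse*sqrt(df/invchi2(df, uniform()))}{break}
{cmd:. matrix z = (invnorm(uniform()) \ invnorm(uniform()) \ invnorm(uniform()))}{break}
{cmd:. matrix bstar = b + (sigma_star/rmse)*(C*z)'}{break}
{cmd:. matrix colnames bstar = x1 x2 _cons}{break}
{cmd:. matrix score double eta = bstar}{break}
{cmd:. generate double ym = cond(missing(y), eta + sigma_star*invnorm(uniform()), y)}

{p 4 4 2}
Here z has three elements because the model has two covariates and a
constant; the rescaling by sigma_star/rmse carries the uncertainty in the
residual standard deviation through to the draw of beta_star.
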
{p 4 4 2}
With the {cmd:match} option, step 5 is replaced by the following.
For each missing observation of {it:yvar} with prediction {it:etamis},
find the nonmissing observation of {it:yvar} whose prediction
({it:etaobs}) on observed data is closest to {it:etamis}. This closest
nonmissing observation is used to impute the missing value of {it:yvar}.

{p 4 4 2}
The default draw method is not robust to departures from Normality and
may produce implausible imputations. For example, if the original distribution
is skew and positive-valued, the imputed distribution will not necessarily
have the appropriate amount of skewness, nor will all the imputed values
necessarily be positive. Log transformation of positive variables may greatly
improve the appropriateness of the imputations.

{p 4 4 2}
The alternative {cmd:match} method is recommended only for continuous variables
when the Normality assumption is clearly untenable, even approximately.
It is not necessary, nor is it recommended, for binary, ordered categorical or
nominal variables. {cmd:match} may work well when the distribution of a
continuous variable is very non-Normal, but it may sometimes result in biased
imputations.

{p 4 4 2}
With the {cmd:boot} option, steps 2-4 are replaced by a bootstrap estimation of
beta_star; beta_star
is estimated by regressing {it:yvar} on {it:xvarlist} after taking a bootstrap sample
of the nonmissing observations. This has the advantage of robustness since the
distribution of beta is no longer assumed to be multivariate Normal.

{p 4 4 2}
Note that {cmd:uvis} will not impute observations for which a value
of a variable in {it:xvarlist} is missing. However, all original
(missing or nonmissing) observations of {it:yvar} will be copied into
{it:newvarname} in such cases. This is a change from the first release of
{cmd:uvis} (with {cmd:mvis}). Previously, {it:newvarname} would be set to
missing whenever a value of a variable in {it:xvarlist} was missing,
irrespective of the value of {it:yvar}.

{p 4 4 2}
Missing data for ordered (or unordered) categorical covariates should
be imputed by using the {cmd:ologit} (or {cmd:mlogit}) command. In these cases,
prediction matching is done on the scale of the mean absolute difference
in the predicted class probabilities, preceded by logit transformation.

{p 4 4 2}
{cmd:ice} carries out multivariate imputation in {it:mainvarlist} using
regression switching (van Buuren et al. 1999) as follows:

{p 8 12 2}
1. Ignore any observations for which {it:mainvarlist} has only missing values,
or if the {cmd:cc(}{it:varlist}{cmd:)} option has been specified, for
which any member of {it:varlist} has a missing value.

{p 8 12 2}
2. For each variable in {it:mainvarlist} with any missing data, randomly order
that variable and replicate the observed values across the missing cases.
This step initializes the iterative procedure by ensuring that no relevant
values are missing.

{p 8 12 2}
3. For each variable in {it:mainvarlist} in turn, impute missing values by
applying {cmd:uvis} with the remaining variables as covariates.

{p 8 12 2}
4. Repeat step 3 {cmd:cycles()} times, replacing the imputed values with updated
values at the end of each cycle.

{p 4 4 2}
A single imputation sample is created for each variable with any relevant
missing values.

{p 4 4 2}
Van Buuren recommends {cmd:cycles(20)} but goes on to say that 10 or even 5
iterations are probably sufficient. We have chosen a compromise default of 10.

{p 4 4 2}
"Multiple imputation" (MI) implies the creation and analysis of several
imputed datasets. To do this, one would run {cmd:ice} with {it:m} set
to a suitable number, for example 5. To obtain final estimates
of the parameters of interest and their standard errors,
one would fit a model in
each imputation and carry out the appropriate post-MI averaging procedure
on the results from the {it:m} separate imputations. A suitable
estimation tool for this purpose is {helpb micombine}.

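{p 4 4 2}
As a sketch of this workflow (y, x1, x2 and the file name imputed are
hypothetical, and the {cmd:micombine} call assumes that command is installed
and takes a regression command followed by a variable list; see
{helpb micombine} for its actual syntax):

{p 8 8 2}
{cmd:. ice y x1 x2 using imputed, m(5)}{break}
{cmd:. use imputed, clear}{break}
{cmd:. micombine regress y x1 x2}
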
{title:Handling categorical variables}

{p 4 4 2}
Binary variables present no difficulty: by default, in the MICE
procedure, when such a variable is the response, it is
predicted from other variables by using logistic regression;
when it is a covariate, it is modeled in the only way possible,
effectively as a single dummy variable. Categorical variables with 3 or
more levels may in principle be treated in different ways.
By default, in {cmd:ice} variables with 3-5 levels are modeled
using multinomial logistic regression ({cmd:mlogit} command) when
the response, and as a single linear term when a covariate. The
same behavior occurs with the ordered logistic model ({cmd:ologit}
command), requested via the {cmd:cmd()} option. The use of dummy variables
instead of a single linear term may be imposed as described under
the {cmd:passive()} option. The requisite dummy variables
must be created before {cmd:ice} is invoked. Variables with 6 or
more levels are treated as ordered and continuous, but again
different choices may be imposed by use of the {cmd:cmd()},
{cmd:passive()} and {cmd:substitute()} options.

{p 4 4 2}
You should be aware that
unless the dataset is large, use of the {cmd:mlogit} command may produce
unstable estimates if the number of levels is too large, and
may compromise the accuracy of the imputations. It is hard to
predict when this will occur.

{p 4 4 2}
Note that due to a peculiarity of the way the {cmd:mlogit} command works,
variables with score labels cause problems to {cmd:ice}
and {cmd:uvis} when missing data are imputed using {cmd:mlogit}.
Score labels for such variables are removed in the file of imputed
data. See also the related comment on {hi:Postestimation prediction} in
{helpb micombine}.


{title:Further notes}

{p 4 4 2}
{cmd:ice} determines the order of imputing variables in the round
of chained equations according to the amount of missing data.
Variables with the least missingness are imputed first.

{p 4 4 2}
An important application of MI is to investigate possible models, for example
prognostic models, in which selection of influential variables is required
(Clark and Altman 2003). For example, the stability of the final model across
the imputation samples is of interest. This area of inquiry is in its infancy.

{p 4 4 2}
In survival analysis, it is recommended to include the censoring indicator
and the log of the survival time in the variables to be used for imputation.
Van Buuren et al. (1999) give a detailed discussion of the different types
of covariate that can be included in the imputation model and discuss the
important issue of how to deal with variables which are missing completely at
random (MCAR), missing at random (MAR), and missing not at random (MNAR).

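{p 4 4 2}
A minimal sketch of the survival-analysis recommendation, assuming a
survival time t, a censoring indicator d, covariates x1-x3 and an output
file called imputed (all names hypothetical):

{p 8 8 2}
{cmd:. generate lnt = ln(t)}{break}
{cmd:. ice lnt d x1 x2 x3 using imputed, m(5)}
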
{p 4 4 2}
See also Van Buuren's web site http://www.multiple-imputation.com for further
information and software sources.


{title:Examples}

{p 4 10 2}
{cmd:. uvis regress y x1 x2 x3, gen(ym)}

{p 4 10 2}
{cmd:. ice x1 x2 x3 using imputed, m(5)}

{p 4 10 2}
{cmd:. ice x1 x2 x3 using imputed, m(5) cycles(20) cc(x4 x5)}

{p 4 10 2}
{cmd:. ice x1-x5 using imputed, m(10) boot match(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit) id(pid) seed(101) genmiss(m_)}

{p 4 10 2}
{cmd:. ice x1 x1a x1b x2 x3 x23 using imputed, m(5) cmd(x1:ologit) passive(x1a:x1==2 \ x1b:x1==3 \ x23:x2*x3) substitute(x1:x1a x1b)}

{p 4 10 2}
{cmd:. ice y1 y2 y3 x1 x2 x3 x4 using imputed, m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)}


{title:Acknowledgement}

{p 4 4 2}
I am grateful to Gillian Raab for pointing out certain issues with the prediction
matching approach, particularly that it is only useful with continuous variables.
As a result, the default imputation method has been
changed from matching to drawing from the predictive distribution. Gillian also
suggested imputing the variables in reverse order of the amount of missingness,
and selecting the imputed value at random from the set determined by the available
matching predictions. Both suggestions have been implemented in this software update.


{title:Author}

{p 4 4 2}
Patrick Royston, MRC Clinical Trials Unit, London.{break}
patrick.royston@ctu.mrc.ac.uk


{title:References}

{p 4 8 2}
van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of
missing blood pressure covariates in survival analysis.
{it:Statistics in Medicine} 18: 681-694.
Also see http://www.multiple-imputation.com.

{p 4 8 2}
Carlin J. B., N. Li, P. Greenwood and C. Coffey. 2003. Tools for analyzing
multiple imputed datasets. {it:Stata Journal} 3(3): 226-244.

{p 4 8 2}
Clark T. G. and D. G. Altman. 2003. Developing a prognostic model
in the presence of missing data: an ovarian cancer case-study.
{it:Journal of Clinical Epidemiology} 56: 28-37.

{p 4 8 2}
Royston P. 2004. Multiple imputation of missing values.
{it:Stata Journal} 4(3): 227-241.


{title:Also see}

{psee}
Online: {helpb mijoin}, {helpb micombine}, {helpb mitools}, and related programs,
if installed
{p_end}