{smcl}
{* 28nov2005}{...}
{hline}
help for {hi:micombine}{right:(SJ5-4: st0067_2; SJ5-2: st0067_1; SJ4-3: st0067)}
{hline}
{title:Estimation of regression models with multiply imputed samples}
{p 8 18 2}
{cmd:micombine}
{{it:supported_regression_cmd} | {it:other_regression_cmd}}
[{it:yvar}]
[{it:covarlist}]
[{it:other_stuff}]
{ifin}
{weight}
[{cmd:,}
{cmd:br}
{cmdab:nocons:tant}
{cmdab:det:ail}
{cmdab:ef:orm}[{cmd:(}{it:string}{cmd:)}]
{cmdab:g:enxb(}{it:newvarname}{cmd:)}
{cmdab:imp:id(}{it:varname}{cmd:)}
{cmd:lrr}
{cmdab:nowar:ning}
{cmdab:obs:id(}{it:varname}{cmd:)}
{it:regression_cmd_options}]
{p 4 4 2}
where
{p 8 8 2}
{it:supported_regression_cmd}s are
{helpb clogit},
{helpb cnreg},
{helpb glm},
{helpb logistic},
{helpb logit},
{helpb mlogit},
{helpb ologit},
{helpb oprobit},
{helpb poisson},
{helpb probit},
{helpb qreg},
{helpb regress},
{helpb rreg},
{helpb stcox},
{helpb streg},
or
{helpb xtgee}, and {it:other_regression_cmd} is any other Stata regression command
(see Remarks).
{p 4 4 2}
{cmd:micombine} shares a subset of the features of all {help estcom:estimation commands};
see {it:Remarks}.
{p 4 4 2}
All weight types supported by {it:regression_cmd} are allowed; see
{help weight}.
{title:Description}
{p 4 4 2}
{cmd:micombine} estimates the parameters of a regression model whose
type is determined by {it:supported_regression_cmd} or {it:other_regression_cmd}.
Parameter estimates are combined
across several replicates obtained previously by multiple imputation,
e.g. by using {helpb ice} to create a file of imputed data.
See {it:Remarks} for a brief account of how {cmd:micombine} combines
the estimates and obtains standard errors.
{title:Options}
{p 4 8 2}
{cmd:br} calculates degrees of freedom and tests of significance for each predictor
according to the formulae (3)-(5) of Barnard & Rubin (1999).
After estimation, the required degrees of freedom are stored in a matrix
(column vector) {cmd:e(nutilde)}. Note that if {cmd:test}
is used after {cmd:micombine} for significance testing of regression
coefficients, such tests assume that the degrees of freedom are
equal to the number of observations minus the number of parameters
estimated, not those given in {cmd:e(nutilde)}.
{p 4 8 2}
{cmd:noconstant} suppresses the regression constant in all regressions.
{p 4 8 2}
{cmd:detail} gives details of the regression model for each imputation.
{p 4 8 2}
{cmd:eform}[{cmd:(}{it:string}{cmd:)}] specifies that the exponentiated form
of the coefficients be output and that the constant not be reported.
The exponentiated coefficients are labeled {cmd:exp(b)}, unless
the optional {it:string} is used.
{p 4 8 2}
{cmd:genxb(}{it:newvarname}{cmd:)} creates {it:newvarname} to hold the
linear predictor from each regression model, averaged over all the
imputations.
{p 4 8 2}
{cmd:impid(}{it:varname}{cmd:)} specifies that {it:varname} is the variable
identifying the imputations. The number of imputations is determined as
the number of unique values of {it:varname}. All observations for which
{it:varname} takes the value zero are ignored in the analysis.
The default {it:varname} is {cmd:_j}.
{p 4 8 2}
{cmd:lrr} specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the
variance-covariance matrix of the regression coefficients be used.
{p 4 8 2}
{cmd:nowarning} suppresses the warning message about the use
of {it:other_regression_cmd}s (see {it:Remarks}).
{p 4 8 2}
{cmd:obsid(}{it:varname}{cmd:)} is provided to allow {cmd:micombine} to analyze
datasets created by programs other than {cmd:ice}. {it:varname} specifies the name
of a variable holding the "observation ID", i.e. the sequence number of each
observation in a given imputation. The number of observations should
be identical between imputations, as should the order of the observations.
{it:varname} should run 1,...,N for imputation 1, 1,...,N for imputation 2, and
so on. {cmd:ice} automatically stores the information with the data, so this
option is not required. The default {it:varname} is {cmd:_i}.
{p 4 8 2}
{it:regression_cmd_options} may be any of the options appropriate to
{it:regression_cmd}.
{title:Remarks}
{p 4 4 2}
Details of statistical inference from multiply imputed datasets are nicely
described in a recent Stata Journal article by John Carlin and colleagues
(Carlin et al. 2003). Here, with due acknowledgment to John, I give an edited
version of section 2 of his article.
{p 4 4 2}
A simple method of combining estimates from several models was derived by
Rubin (1987). Suppose initially that primary interest lies in estimating a
scalar quantity, Q. Here, Q is a regression coefficient, for example, the log
hazard ratio in a proportional hazards model. Suppose that we have imputed m
complete datasets using an appropriate model. In each dataset, standard
complete-data methods are used to obtain an estimate of Q with an associated
standard error. Let Q(k) and U(k) denote the point estimate and variance
respectively from the kth (k = 1, 2, ... , m) dataset. The point estimate Q^
of Q from multiple imputation is simply the arithmetic mean of Q(1),...,Q(m).
{p 4 4 2}
Obtaining a valid standard error for the estimate Q^ requires combining
information on within-imputation and between-imputation variation. The latter
is important in reflecting uncertainty due to variability between imputation
samples. First, a within-imputation variance component, W, is obtained as the
mean of the complete-data variance estimates, U(1),...,U(m). Second, a
between-imputation variance component, B, is calculated as the sum of squares
of Q(1),...,Q(m) about Q^, divided by m-1. The (total) variance T of Q^ is
given by
{p 8 12 2}
T = W + B * (1 + 1/m)
{p 4 4 2}
Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately
as Student's t on nu degrees of freedom, where
{p 8 12 2}
nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2
{p 4 4 2}
The (1 + 1/m) term in these expressions indicates that it is not necessary to
create a large number of imputed datasets, particularly when B is much smaller
than W. The condition will be satisfied unless there is much missing data and
the parameter estimates within each dataset are very precise.
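{p 4 4 2}
As a purely illustrative worked example (the numbers are invented and do not
come from any dataset), suppose that m = 5 imputations give
Q(k) = 0.50, 0.54, 0.46, 0.52, 0.48 and U(k) = 0.010 for every k. Then{p_end}
{p 8 12 2}
Q^ = 0.50{break}
W = 0.010{break}
B = (0 + 0.0016 + 0.0016 + 0.0004 + 0.0004)/4 = 0.0010{break}
T = 0.010 + 0.0010 * (1 + 1/5) = 0.0112{break}
nu = (5 - 1) * (1 + 0.010/(0.0010 * (1 + 1/5)))^2 = 348.4 (approximately){p_end}
{p 4 4 2}
so the standard error of Q^ is sqrt(0.0112), or about 0.106. The arithmetic
for T and nu can be checked with {cmd:display}:{p_end}
{p 4 8 2}{cmd:. * illustrative values only: W = 0.010, B = 0.0010, m = 5}{p_end}
{p 4 8 2}{cmd:. display 0.010 + 0.0010*(1 + 1/5)}{p_end}
{p 4 8 2}{cmd:. display (5 - 1)*(1 + 0.010/(0.0010*(1 + 1/5)))^2}{p_end}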
{title:Available regression commands}
{p 4 4 2}
{cmd:micombine} has been tested with the commands listed under
{it:supported_regression_cmd} at the beginning of this help file.
{cmd:micombine} {it:may} work satisfactorily with {it:other_regression_cmd}s,
but this cannot be guaranteed. This facility is provided so that the
researcher familiar with a particular Stata command has a fighting chance of
obtaining correct MI estimates and standard errors. HOWEVER, THE AUTHOR
DISCLAIMS ALL RESPONSIBILITY FOR THE CORRECTNESS OF RESULTS ARISING FROM USE
OF AN {it:other_regression_cmd}. Note that {it:other_stuff} in the syntax
diagram is code that may be required by some {it:other_regression_cmd}s, for
example {cmd:ivreg} wants {cmd:(}{it:varlist2}{cmd: = }{it:varlist_iv}{cmd:)}.
{cmd:micombine} parses for the occurrence of an opening parenthesis. There may
be other syntaxes that are not accommodated by this approach; if so, please
contact the author with details.
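{p 4 4 2}
For illustration only, an {cmd:ivreg} call through {cmd:micombine} might look
like the following sketch, in which {cmd:x2} is treated as endogenous and
{cmd:z1} and {cmd:z2} are hypothetical instruments. {cmd:ivreg} is an
{it:other_regression_cmd}, so the caveats above apply.{p_end}
{p 4 8 2}{cmd:. * z1 and z2 are hypothetical instrument names}{p_end}
{p 4 8 2}{cmd:. micombine ivreg y x1 (x2 = z1 z2)}{p_end}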
{title:Postestimation prediction}
{p 4 4 2}
The {cmd:predict} command {it:may} work as you expect after {cmd:micombine},
but this feature should be regarded as under development and should be
treated with caution. {cmd:micombine} stores the quantities needed by
{cmd:predict} at the last execution of the regression command, that is at the
final imputation, but prediction following some regression commands has
non-standard features that are hard to emulate accurately.
Known issues are as follows:
{p 8 12 2}
1. After {cmd:micombine mlogit}: {cmd:predict} may require that the outcome
levels are known as 0, 1, 2, ... , so it may be necessary to drop the
value label for the outcome variable, if such a label is defined.
This is KNOWN to be a problem using {cmd:mfx} following {cmd:micombine mlogit}.
For example, {cmd:mfx compute, predict(outcome(0))} will work only if
the lowest level of the outcome is 0, and is not labeled.
{p 8 12 2}
2. After {cmd:micombine} with a restricted sample (i.e. when {cmd:if},
{cmd:in}, or zero weights exclude some observations, or when some members
of {it:covarlist} still have missing values), the function
{cmd:e(sample)} is defined as you would expect it to be
only for the final imputation. In all earlier imputations it
is zero. Although not necessarily convenient for use of
{cmd:e(sample)} in data analysis, the behavior is correct for the
purposes of {cmd:predict}, since the relevant sample size and
estimation sample are properties of (any) one imputation,
but not of the complete assembly of imputations.
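{p 4 4 2}
A hedged sketch of the workaround for issue 1 follows. The outcome name
{cmd:y} and the covariates are placeholders, and {cmd:y} is assumed to be
coded 0, 1, 2, ... . Detaching the label before fitting the model is an
assumption about ordering, not a documented requirement.{p_end}
{p 4 8 2}{cmd:. * detach any value label from the (placeholder) outcome y}{p_end}
{p 4 8 2}{cmd:. label values y}{p_end}
{p 4 8 2}{cmd:. micombine mlogit y x1 x2 x3}{p_end}
{p 4 8 2}{cmd:. mfx compute, predict(outcome(0))}{p_end}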
{title:Examples}
{p 4 8 2}{cmd:. ice y x1 x2 x3 using imp, m(10) genmiss(m_)}{p_end}
{p 4 8 2}{cmd:. use imp, clear}{p_end}
{p 4 8 2}{cmd:. micombine regress y x1 x2 x3}{p_end}
{p 4 8 2}{cmd:. stset time, failure(cens)}{p_end}
{p 4 8 2}{cmd:. micombine stcox x1 x2 x3, genxb(index)}{p_end}
{p 4 8 2}{cmd:. test x2==1}{p_end}
{p 4 8 2}{cmd:. testparm x1 x2}{p_end}
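{p 4 4 2}
The following further sketches illustrate the {cmd:br}, {cmd:eform()},
{cmd:impid()}, and {cmd:obsid()} options. They assume the same variable names
as above; {cmd:d} (a binary outcome), {cmd:imp}, and {cmd:id} are hypothetical
variable names, not names created by {cmd:ice}.{p_end}
{p 4 8 2}{cmd:. * Barnard-Rubin degrees of freedom, stored in e(nutilde)}{p_end}
{p 4 8 2}{cmd:. micombine regress y x1 x2 x3, br}{p_end}
{p 4 8 2}{cmd:. matrix list e(nutilde)}{p_end}
{p 4 8 2}{cmd:. * exponentiated coefficients labeled OR (d is hypothetical)}{p_end}
{p 4 8 2}{cmd:. micombine logit d x1 x2 x3, eform(OR)}{p_end}
{p 4 8 2}{cmd:. * imputed data from another program (imp and id are hypothetical)}{p_end}
{p 4 8 2}{cmd:. micombine regress y x1 x2 x3, impid(imp) obsid(id)}{p_end}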
{title:Author}
{p 4 4 2}
Patrick Royston, MRC Clinical Trials Unit, London.
patrick.royston@ctu.mrc.ac.uk
{title:References}
{p 4 8 2}
Barnard, J. and D. B. Rubin. 1999. Small-sample degrees of freedom with
multiple imputation. {it:Biometrika} 86: 948-955.
{p 4 8 2}
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003.
Tools for analyzing multiple imputed datasets. {it:Stata Journal} 3(3): 226-244.
{p 4 8 2}
Li, K., T. Raghunathan, and D. Rubin. 1991. Large sample significance levels
from multiply-imputed data using moment-based statistics and an F reference
distribution. {it:Journal of the American Statistical Association} 86: 1065-1073.
{p 4 8 2}
Rubin, D. 1987. {it:Multiple Imputation for Nonresponse in Surveys}. New York:
Wiley.
{p 4 8 2}
Schafer, J. 1997. {it:Analysis of Incomplete Multivariate Data}. London:
Chapman & Hall.
{p 4 8 2}
van Buuren, S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of
missing blood pressure covariates in survival analysis.
{it:Statistics in Medicine} 18: 681-694.
(Also see http://www.multiple-imputation.com.)
{title:Also see}
{psee}
Online: {helpb ice}
{p_end}