{smcl}
{* 15 Aug 2007}{...}
{cmd:help hotdeck}
{hline}

{title:Title}

{hi:Impute missing values using the hotdeck method}

{title:Syntax}

{p 8 27}
{cmdab:hotdeck}
[{it:varlist}] [{cmd:using} {it:filename}] [{hi:if}{it: exp}] [{hi:in}{it: range}]
,
[
{cmdab:by}{cmd:(}{it:varlist}{cmd:)}
{cmdab:store}
{cmdab:imp:ute}{cmd:(}{it:#}{cmd:)}
{cmdab:noise}
{cmdab:keep}{cmd:(}{it:varlist}{cmd:)}
{cmdab:com:mand}{cmd:(}{it:command}{cmd:)}
{cmdab:parms}{cmd:(}{it:varlist}{cmd:)}
{cmdab:seed}{cmd:(}{it:#}{cmd:)}
{cmdab:infiles}{cmd:(}{it:filename filename ...}{cmd:)}
]

{p}

{title:Description}

{pstd}
{hi:Hotdeck} tabulates the missing data patterns within the {help varlist}.
A row of data with a missing value in any of the variables in the {hi:varlist}
is defined as a `missing line' of data; similarly, a `complete line' is one where all the
variables in the {hi:varlist} contain data. The {hi:hotdeck} procedure
replaces the {hi:varlist} variables in the `missing lines' with the
corresponding values from the `complete lines'.
Because missing data are imputed stochastically rather than deterministically,
{hi:hotdeck} should be used several times within a multiple imputation sequence.
The {hi:nmiss} missing
lines in each stratum of the data described by the `by' option are replaced
by lines sampled from the {hi:nobs} complete lines in the same stratum. The
approximate Bayesian bootstrap method of Rubin and Schenker (1986) is used:
first, a bootstrap sample of {hi:nobs} lines is drawn with replacement from
the complete lines; then the {hi:nmiss} replacement lines are drawn at random
(again with replacement) from this bootstrap sample.
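
{pstd}
The two-stage sampling described above can be sketched with official Stata commands.
This is only an illustration for a single stratum, not the code that {hi:hotdeck} itself runs;
the {hi:auto} dataset, the variable {hi:rep78} and the seed value are assumptions chosen for the example.

{inp:sysuse auto, clear}
{inp:* nmiss: number of missing lines}
{inp:count if missing(rep78)}
{inp:local nmiss = r(N)}
{inp:* keep only the nobs complete lines}
{inp:keep if !missing(rep78)}
{inp:set seed 12345}
{inp:* stage 1: bootstrap sample of nobs lines, drawn with replacement}
{inp:bsample}
{inp:* stage 2: draw nmiss donor lines, with replacement, from that bootstrap sample}
{inp:bsample `nmiss'}
{inp:list rep78}

{pstd}
The values listed at the end are the donor values that would stand in for the missing lines in one imputation.
Repeating the steps with a different seed gives a different set of donors, which is why several imputations are needed.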

{pstd}
A major assumption of the hotdeck procedure is
that the missing data are either missing completely at random (MCAR) or
missing at random (MAR), with the probability that a line is missing
varying only with respect to the categorical
variables specified in the `by' option.

{pstd}
If a dataset contains many variables with missing values, then
many of the rows of data may contain at
least one missing value, leaving few complete lines to sample from. The {hi:hotdeck} procedure does not work
well in such circumstances.
There are more
elaborate methods that replace {bf:only} the missing values, rather than the whole row,
with imputed values.
These multivariate multiple imputation methods are discussed by Schafer (1997).

{pstd}
A critical point is that all variables used in the analysis should be included in
the variable list. This is particularly true for variables that have missing data!
Variables that predict missingness should be included in the
`by' option so that missing data are imputed within strata.

{title:Latest Version}

{pstd}
The latest version is always kept on the SSC website. To install the latest version, click
on the following link:

{phang}
{stata ssc install hotdeck, replace}

{title:Options}

{phang}
{cmdab:using} specifies the root of the imputed datasets' filenames. The default is
"imp", so the datasets are saved as imp1.dta, imp2.dta, ....

{phang}
{cmdab:by}{cmd:(}{it:varlist}{cmd:)} specifies categorical variables defining strata within which
the imputation is to be carried out. Missing values are replaced by complete values only from within the same
stratum. If a stratum contains no complete records, then no data are imputed in that stratum, which leads
to wrong answers. Make sure there is a reasonable number of complete records per stratum.
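
{pstd}
One way to check this before imputing is to count the complete records in each stratum.
The sketch below assumes the variables {hi:y}, {hi:sex} and {hi:age} from the examples in this help file;
{hi:complete} and {hi:ncomplete} are hypothetical helper variables introduced only for illustration.

{inp:* flag lines with no missing value in the variable to be imputed}
{inp:generate byte complete = !missing(y)}
{inp:* number of complete records within each sex/age stratum}
{inp:bysort sex age: egen ncomplete = total(complete)}
{inp:* lines that sit in a stratum with no complete record to borrow from}
{inp:count if ncomplete == 0}
{inp:* distribution of stratum sizes, to spot strata with very few donors}
{inp:tabulate ncomplete}

{pstd}
A nonzero count, or many lines in strata with only one or two complete records, suggests using fewer or coarser `by' variables.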

{phang}
{cmdab:store} specifies whether the imputed datasets are saved to disk.

{phang}
{cmdab:imp:ute}{cmd:(}{it:#}{cmd:)} specifies the number of imputed datasets to generate. The number
needed varies according to the percentage missing and the type of data, but
generally 5 is sufficient.

{phang}
{cmdab:noise} specifies whether the individual analyses, from the {hi:command()} option,
are displayed.

{phang}
{cmdab:keep}{cmd:(}{it:varlist}{cmd:)} specifies the variables saved in the imputed datasets
in addition to the imputed variables and the by list. By default the imputed
variables and the by list are always saved.

{phang}
{cmdab:com:mand}{cmd:(}{it:command}{cmd:)} specifies the analysis performed on every imputed dataset.

{phang}
{cmdab:parms}{cmd:(}{it:varlist}{cmd:)} specifies the parameters of interest from the
analysis. If the {hi:command} is a regression command, then the parameter list can
include a subset of the variables specified in the regression command. The
final output consists of the combined estimates of these parameters.
For non-standard "regression" commands, the {hi:parms()} option
looks at the estimation matrix e(b) and requires the column names to identify
the coefficients of interest.
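
{pstd}
The column names that {hi:parms()} expects can be listed with Stata's {hi:colfullnames} macro function after fitting the model once.
A minimal sketch, assuming the {hi:auto} dataset shipped with Stata purely for illustration:

{inp:sysuse auto, clear}
{inp:regress price mpg weight}
{inp:* full column names of e(b), including any equation prefix}
{inp:local names : colfullnames e(b)}
{inp:display "`names'"}

{pstd}
For multiple equation models the displayed names carry an equation prefix, and that full name is what should be passed to {hi:parms()}.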

{phang}
{cmdab:seed}{cmd:(}{it:#}{cmd:)} specifies the random number generator seed. When using the {hi:seed} option,
the hotdeck command must be used in the correct way. The key point is that ALL variables in the analysis command
must be in the variable list; this ensures that the correlations between the variables are maintained after
imputation.
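
{pstd}
For instance, adding a seed to the logistic regression example from this help file should make the imputations repeatable from run to run.
The seed value 12345 is an arbitrary assumption; note that both {hi:y} and {hi:x} from the analysis command appear in the variable list.

{inp:hotdeck y x, by(sex age) command(logit y x) parms(x _cons) impute(5) seed(12345)}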

{phang}
{cmdab:infiles}{cmd:(}{it:filename filename ...}{cmd:)} specifies a list of files in which missing
values have already been replaced by imputed values. This is convenient when the user has
several imputed datasets and wants to analyse them and combine the results.


{title:Examples}

Impute values for y in sex/age groups.

{inp:hotdeck y, by(sex age)}

To additionally store two imputed datasets as {hi:imp1.dta} and {hi:imp2.dta}.

{inp:hotdeck y using imp, store by(sex age) impute(2)}

{p 0 0}
Hotdeck can also use the stored imputed data files {hi:imp1.dta} and {hi:imp2.dta}
and carry out the combined analysis. This analysis is displayed for the coefficient
of {hi:x} and the constant term {hi:_cons}.

{inp:hotdeck y using imp, command(logit y x) parms(x _cons) infiles(imp1 imp2)}

{p 0 0}
Do not save imputed datasets to disk but carry out a logistic regression on the imputed
datasets and display the coefficients for {hi:x} and the constant term {hi:_cons} of the model.

{inp:hotdeck y x, by(sex age) command(logit y x) parms(x _cons) impute(5)}


{title:Example - Multiple Equation Model}

{p 0 0}
Multiple equation models require more complicated {hi:parms()} statements.
The example used can be applied to all multiple equation models. The only complication
is that the names of the coefficients are different.

First fit the following command:

{inp:xtreg kgh f1, mle}

Then inspect the matrix of coefficients:

{inp:mat list e(b)}

e(b)[1,4]
           kgh:        kgh:    sigma_u:    sigma_e:
            f1       _cons       _cons       _cons
y1  -1.6751401   77.792948           0   16.730843

The following command then carries out the imputation and analysis for the single parameter {hi:kgh:f1}.

{inp:hotdeck kgh, by(ethn) command(xtreg kgh f1, mle) parms(kgh:f1) impute(5)}

{title:Example - mlogit}

Use this web dataset for Stata release 9.

{stata "use http://www.stata-press.com/data/r9/sysdsn3.dta"}

The simple model without handling missing data:

{stata mlogit insure male}

{p 0 0}
The estimated coefficients are put automatically by Stata into the matrix e(b); note that the column
headings are the parameter names that {hi:hotdeck} uses. So you cannot use the simple syntax
of just {hi:parms(male)}, because this refers to two parameters.

{stata mat list e(b)}

{p 0 0}
The following syntax handles the missing data using {hi:hotdeck} imputation.

{stata "hotdeck insure male, command(mlogit insure male) parms(Prepaid:male) impute(5)"}

{p 0 0}
{hi:NOTE}: hotdeck will fail when using mlogit with spaces in the category labels. This is due
to a limitation of Stata's matrix commands.

{title:Author}

{p}
Adrian Mander, MRC Human Nutrition Research, Cambridge, UK.

Email {browse "mailto:adrian.mander@mrc-hnr.cam.ac.uk":adrian.mander@mrc-hnr.cam.ac.uk}

{title:See Also}
Related commands

HELP FILES{col 16}Installation status{col 38}SSC installation links{col 64}Description

{help whotdeck}{col 16}(if installed){col 38}({stata ssc install whotdeck}){col 64}Weighted version of Hotdeck