JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2022, VOL. 117, NO. 538, 983–995: Theory and Methods
https://doi.org/10.1080/01621459.2020.1840383
Robust Post-Matching Inference
Alberto Abadie^a and Jann Spiess^b
^a Department of Economics, MIT, Cambridge, MA; ^b Graduate School of Business, Stanford University, Stanford, CA
ABSTRACT
Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and
control groups in observational studies. As a preprocessing step before regression, matching reduces the
dependence on parametric modeling assumptions. In current empirical practice, however, the matching
step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show
that ignoring the matching step results in asymptotically valid standard errors if matching is done without
replacement and the regression model is correctly specified relative to the population regression function
of the outcome variable on the treatment variable and all the covariates used for matching. However,
standard errors that ignore the matching step are not valid if matching is conducted with replacement or,
more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover,
correct specification of the regression model is not required for consistent estimation of treatment effects
with matched data. We show that two easily implementable alternatives produce approximations to the
distribution of the post-matching estimator that are robust to misspecification. A simulation study and an
empirical example demonstrate the empirical relevance of our results. Supplementary materials for this
article are available online.
ARTICLE HISTORY
Received January 2019
Accepted August 2020
KEYWORDS
Matching; Robust estimation; Treatment effects
1. Introduction
Matching methods are widely used to create balance between
treatment and control groups in observational studies. Oftentimes,
matching is followed by a simple comparison of means
between treated and nontreated (Cochran 1953; Rubin 1973;
Dehejia and Wahba 1999). In other instances, however, matching
is used in combination with regression or with other estimation
methods more complex than a simple comparison of
means. The combination of matching in a first step with a
second-step regression estimator brings together parametric
and nonparametric estimation strategies and, as demonstrated
in Ho et al. (2007), reduces the dependence of regression estimates
on modeling decisions. Moreover, matching followed by
regression allows the estimation of elaborate models, such as
those that include interaction effects and other parameters that
go beyond average treatment effects.
In this article, we develop valid standard error estimates
for regression after matching. The large sample properties of
average treatment effect estimators that employ a simple comparison
of mean outcomes between treated and nontreated after
matching on covariates are well understood (see, e.g., Abadie
and Imbens 2006). However, studies that employ regression
models after matching usually ignore the matching step when
performing inference on post-matching regression coefficients.
We show that this practice is not generally valid if the second
step regression is misspecified in a sense we make precise
below. We propose two easily implementable and
robust-to-misspecification approaches to the estimation of the standard
errors of regression coefficient estimators in matched samples
(with matching done without replacement). First, we show that
standard errors that are clustered at the level of the matched
sets are valid under misspecification. Second, we show that a
nonparametric block bootstrap that resamples matched pairs or
matched sets, as opposed to resampling individual observations,
also yields valid inference under misspecification. Furthermore,
we show that standard errors that ignore the matching step
can either underestimate or overestimate the variation of
post-matching estimates. The procedures that we propose in this
article are straightforward to implement with standard statistical
software.
CONTACT Alberto Abadie, Department of Economics, MIT, Cambridge, MA 02142.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.
We will consider the following setup. Let W be a binary
random variable representing exposure to the treatment or condition
of interest (e.g., smoking), so W = 1 for the treated,
and W = 0 for the nontreated. Y is a random variable representing
the outcome of interest (e.g., forced expiratory volume)
and X is a vector of covariates (e.g., gender or age). We will
study the problem of estimating how the treatment affects the
outcomes of the individuals in the treated population (i.e., those
with W = 1). In particular, we will analyze the properties
of a two-step (first matching, then regression) estimator often
used in empirical practice. This estimation strategy starts with
an unmatched sample, $\mathcal{S}$, from which treated units and their
matches are extracted to create a matched sample, $\mathcal{S}^*$. Matching
is done without replacement and on the basis of the values of
X. Then, using data for the matched sample only, the researcher
runs a regression of Y on Z, where Z is a vector of functions
of W and X (e.g., individual variables plus interactions). We
aim to obtain valid inferential methods for the coefficients of
this regression, possibly under misspecification. To be precise,
by misspecification we mean that there is no version of the
conditional expectation of Y given W and X that follows the
functional form employed in the second-step estimator. For
example, as explained below, a difference in means between
treated and nontreated in the second step would be misspecified
if the conditional expectation of Y given X and W depends
on X. To simplify the exposition, here we have described a
setting where Z depends only on the treatment, W, and on the
covariates used in the matching stage, X. Our general framework
in Section 2 allows Z to depend on other covariates not in X.
© 2020 American Statistical Association
The intuition behind the results in this article is that, if Y
depends on X, then matching on X creates dependence between
the outcomes of treated units and their matches. This dependence
is absorbed by the second-step regression function as
long as the regression function is correctly specified relative
to the population regression of Y on W and X. However, if
the second-step regression is misspecified relative to the population
regression of Y on W and X, dependence between
treated units and matches remains in the regression residuals.
Ignoring this dependence produces biased inference. Clustered
standard errors and analogous block bootstrap procedures take
into account the dependence between the outcomes of treated
units and their matches, restoring valid inference.
A special case of our setup is that of the standard matching
estimator for the average treatment effect on the treated, which
is given by the regression coefficient on treatment W in a regression
of Y on Z = (1, W)'. However, the framework allows for
richer analysis, such as the analysis of linear interaction effects
of the treatment with covariates, Z = (1, W, WX', X')'.
To illustrate the implications of our results, consider the
simple case when Z = (1, W)'. As we mentioned previously,
for Z = (1, W)' the sample regression coefficient on W corresponds
to the simple matching estimator often employed in
applied studies, which is based on a post-matching comparison
of means between treated and nontreated. Under well-known
conditions this estimator is consistent for the average effect of
the treatment on the treated (see, e.g., Abadie and Imbens 2012),
irrespective of the true form of the expectation of Y given W and
X. Notice, however, that even in this simple scenario, our results
imply that regression standard errors that ignore the matching
step are not valid in general. Although the expectation of Y given
W is linear because W is binary, a linear regression of Y on
Z = (1, W)' will be misspecified relative to the regression of Y
on W and X, unless Y is mean-independent of X given W over
a set of probability one.
The rest of the article is organized as follows. Section 2 starts
with a detailed description of the setup of our investigation.
We then characterize the parameters estimated by the two-step
procedure described above. We show that these parameters are
equal to the regression coefficients in a regression of Y on Z in
a population for which the distribution of matching covariates
X in the control group has been modified to coincide with that
of the treated. Under selection on observables (that is, if treatment
is as good as random conditional on X), post-matching
regression estimands are equal to the population regression
coefficients in an experiment where the treatment is randomly
assigned in a population that has the same distribution of X as
the treated. We next establish consistency with respect to this
vector of parameters, show asymptotic normality, and describe
the asymptotic variance of the post-matching estimator. In
Section 3, we discuss different ways of constructing standard
errors. Based on the results of Section 2, we show that standard
errors that ignore the matching step are not generally valid
if the regression model is misspecified in the sense indicated
above, while clustered standard errors or an analogous block
bootstrap procedure yield valid inference. Section 4 presents
simulation evidence, which confirms our theoretical results.
Section 5 applies our results to the analysis of the effect of
smoking on pulmonary function. In this application, matching
before regression and the use of the robust standard errors
proposed in this article substantially affect empirical findings.
Section 6 concludes.
The appendix contains the proofs of our main results. A
supplementary appendix contains proofs of intermediate results
and two extensions. In particular, the standard errors derived in
this article are valid for unconditional inference. Alternatively,
one could perform inference conditional on the values of the
regressors, X and W, in the sample. Notice that, in this case, the
first step matches are fixed. We discuss this alternative setting
in the supplementary appendix, where we show that, for the
conditional case, the usual regression standard errors are not
generally valid, but valid standard errors can be calculated using
the formulas in Abadie, Imbens, and Zheng (2014). Also, for
concreteness and following the vast majority of applied practice,
in the main text of this article we restrict our analysis to linear
regression after matching. In the supplementary appendix, we
provide an extension of our result to general M-estimation after
matching.
2. Post-Matching Inference
In this section, we discuss the asymptotic distribution of the least
squares estimator obtained from a linear regression of Y on Z
after matching on observables, X.
2.1. Post-Matching Least Squares
Consider a standard binary treatment setting along the lines of
Rubin (1974) with potential outcomes Y(1) and Y(0), of which
we only observe Y = Y(W) for treatment W ∈ {0, 1}. Let S be
a set of observed covariates.
We will assume that the data consist of random samples of
treated and nontreated. This assumption could be easily relaxed,
and we adopt it only to simplify the discussion.
Assumption 1 (Random sampling). $\mathcal{S} = \{(Y_i, W_i, S_i)\}_{i=1}^{N}$ is a
pooled sample obtained from $N_1$ and $N_0$ independent draws
from the population distribution of (Y, S) for the treated (W = 1)
and nontreated (W = 0), respectively, so $N = N_0 + N_1$.
Consider an (m × 1) vector of covariates $X = f(S) \in \mathcal{X} \subseteq \mathbb{R}^m$,
and let $\mathcal{S}^* \subseteq \mathcal{S}$ be the matched sample generated by matching
without replacement each treated unit to M nontreated units
on the basis of their X-values. We will denote by $\mathcal{J}(i)$ the set of
nontreated units matched to treated unit i. For simplicity, in our
notation we omit the dependence of $\mathcal{J}(i)$ on N and M. Often, for
matching without replacement, the sets $\mathcal{J}(i)$ form the collection
of nonoverlapping subsets of $\{j : W_j = 0\}$, each of cardinality
M, that minimizes the sum of the matching discrepancies,
\[
\sum_{i=1}^{N} W_i \sum_{j \in \mathcal{J}(i)} d(X_i, X_j), \tag{1}
\]
where $d : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ is a metric. More generally, our
conditions do not require a matching scheme that directly minimizes
(1), as long as Assumption 3 and the Lipschitz conditions
in Assumption 4 and Proposition 3 hold for some metric, d(·, ·),
under the adopted matching scheme.
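As a concrete illustration of matching without replacement, the following is a minimal sketch of a greedy nearest-neighbor scheme on a scalar covariate with d(x, x') = |x - x'|. It is an assumption made for illustration, not the authors' procedure: a greedy pass does not in general minimize the sum of discrepancies in (1), and the function names `greedy_match` and `total_discrepancy` are hypothetical.

```python
def greedy_match(x_treated, x_control, n_matches=1):
    """Greedy nearest-neighbor matching without replacement on a scalar covariate.

    Returns one list of control indices per treated unit, in input order.
    This is a heuristic: it need not minimize the total discrepancy.
    """
    if n_matches * len(x_treated) > len(x_control):
        raise ValueError("need at least M * N1 nontreated units")
    available = set(range(len(x_control)))
    matches = []
    for x_i in x_treated:
        # Take the M closest controls that are still unused (ties broken by index).
        closest = sorted(available, key=lambda j: (abs(x_i - x_control[j]), j))
        closest = closest[:n_matches]
        available.difference_update(closest)
        matches.append(closest)
    return matches


def total_discrepancy(x_treated, x_control, matches):
    """Sum of |X_i - X_j| over treated units i and their matched controls j."""
    return sum(
        abs(x_i - x_control[j])
        for x_i, js in zip(x_treated, matches)
        for j in js
    )
```

An optimal assignment algorithm could replace the greedy pass; the results in the text only require that the resulting discrepancies satisfy the stated assumptions.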
The matched sample, $\mathcal{S}^* = \bigcup_{W_i = 1} (\{i\} \cup \mathcal{J}(i))$, has size
$n = (M + 1)N_1$. We use a double subscript notation to refer to the
observations in the matched sample. For instance, $Y_{n1}, \ldots, Y_{nn}$
refers to the values of the outcome variable for the units in $\mathcal{S}^*$,
with analogous notation for other variables. Within the matched
sample, observations will be rearranged so that the first $N_1$
observations are the treated units.
Let Z = g(W, S) be a (k × 1) vector of functions of (W, S),
and let $\hat{\beta}$ be the vector of sample regression coefficients obtained
from regressing Y on Z in the matched sample,
\[
\hat{\beta} = \arg\min_{b \in \mathbb{R}^k} \frac{1}{n} \sum_{i=1}^{n} (Y_{ni} - Z_{ni}' b)^2
= \left( \frac{1}{n} \sum_{i=1}^{n} Z_{ni} Z_{ni}' \right)^{-1} \frac{1}{n} \sum_{i=1}^{n} Z_{ni} Y_{ni}. \tag{2}
\]
In Section 2.3, we will introduce a set of assumptions under
which $\hat{\beta}$ exists with probability approaching one.
As we mentioned above, when Z = (1, W)' the regression
coefficient on W in the matched sample is given by
\[
\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{n} W_{ni} Y_{ni} - \frac{1}{M N_1} \sum_{i=1}^{n} (1 - W_{ni}) Y_{ni}
= \frac{1}{N_1} \sum_{i=1}^{N} W_i \left( Y_i - \frac{1}{M} \sum_{j \in \mathcal{J}(i)} Y_j \right),
\]
which is the usual matching estimator for the average effect of
the treatment on the treated.
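The equality between the two expressions can be checked numerically. The sketch below is illustrative only: the function names and the dictionary layout for the matched sets (treated index mapped to a list of matched control indices) are assumptions, not notation from the article.

```python
def matching_estimator(y, matches):
    """Average over treated units of Y_i minus the mean outcome of its M matches."""
    gaps = [
        y[i] - sum(y[j] for j in js) / len(js)
        for i, js in matches.items()
    ]
    return sum(gaps) / len(gaps)


def ols_coefficient_on_w(y, w, matches):
    """Slope from regressing Y on (1, W) using only units in the matched sample."""
    units = list(matches) + [j for js in matches.values() for j in js]
    ys = [y[u] for u in units]
    ws = [w[u] for u in units]
    y_bar = sum(ys) / len(ys)
    w_bar = sum(ws) / len(ws)
    cov = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(ws, ys))
    var = sum((wi - w_bar) ** 2 for wi in ws)
    return cov / var
```

With a single binary regressor the OLS slope is the difference between the treated mean and the matched-control mean, which is why the two functions agree whenever every treated unit has the same number of matches.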
2.2. Characterization of the Estimand
Before we study the sampling distribution of $\hat{\beta}$, we first characterize
its population counterpart, which we will denote by
β. That is, our first task is to obtain a precise description of
the nature of the parameters estimated by $\hat{\beta}$. Although post-matching
regressions are often used in empirical practice, to
the best of our knowledge, the precise nature of post-matching
estimands has not been previously derived.
The goal of matching is to change the distribution of the
covariates in the sample of nontreated units, so that it reproduces
the distribution of the covariates among the treated. To
do so, it is necessary that the support of the matching variables,
X, for the treated is inside the support for the nontreated.
Assumption 2 (Support condition). Let $\mathcal{X}_1 = \mathrm{supp}(X|W = 1)$
and $\mathcal{X}_0 = \mathrm{supp}(X|W = 0)$, then $\mathcal{X}_1 \subseteq \mathcal{X}_0$.
We now describe the population distribution targeted by the
matched sample, $\mathcal{S}^*$. Let P(·|W = 1) and P(·|W = 0) be the
matching source distributions of (Y, S) from where the treated
and nontreated samples in $\mathcal{S}$ are, respectively, drawn, and let
E[·|W = 1] and E[·|W = 0] be the corresponding expectation
operators. For given P(·|W = 1) and P(·|W = 0) and a given
number of matches, M, we define a matching target distribution,
$P^*$, over the triple (Y, S, W), as follows:
\[
P^*(W = 1) = \frac{1}{1 + M},
\]
and for each measurable set, A,
\[
P^*((Y, S) \in A \mid W = 1) = P((Y, S) \in A \mid W = 1),
\]
and
\[
P^*((Y, S) \in A \mid W = 0) = E[P((Y, S) \in A \mid W = 0, X) \mid W = 1].
\]
That is, in the matching target distribution: (i) treatment is
assigned in the same proportion as in the matched sample; (ii)
the distribution of (Y, S) among the treated is the same as in the
matching source; (iii) the distribution of (Y, S) among the non-
treated is generated by integrating the conditional distribution
of (Y, S) given X and W = 0 over the distribution of X given
W = 1, in the matching source. As a result, under the matching
target distribution, the distribution of X given W = 0 coincides
with the distribution of X given W = 1.
Under regularity conditions stated below, estimation on
the matched sample, $\mathcal{S}^*$, asymptotically recovers parameters
of the matching target distribution, $P^*$, in which the treated
and nontreated have the same distribution of X, but possibly
different outcome and covariate distributions conditional on
X. As a result, comparisons of outcomes between treated and
nontreated in the matched sample, $\mathcal{S}^*$, produce the controlled
contrasts of the Oaxaca–Blinder decomposition (Blinder 1973;
Oaxaca 1973; DiNardo, Fortin, and Lemieux 1996). More generally,
under regularity conditions, regression coefficients of Y
on Z in the matched sample, $\mathcal{S}^*$, asymptotically recover the
analogous regression coefficients in the target population:
\[
\beta = \arg\min_{b \in \mathbb{R}^k} E^*[(Y - Z'b)^2] = (E^*[ZZ'])^{-1} E^*[ZY], \tag{3}
\]
where $E^*$ denotes expectation under $P^*$.
Matching methods are often motivated by a selection-on-observables
assumption, that is, by the assumption that treatment
assignment is as good as random conditional on observed
covariates. To formalize the assumption of selection on observables
and its implications in our framework, consider source
populations expressed this time in terms of potential outcomes
and covariates, Q(·|W = 1) and Q(·|W = 0), which represent
the distributions of (Y(1), Y(0), S) given W = 1 and W = 0,
respectively. These distributions are defined in such a way that
P(·|W = 1) and P(·|W = 0) can be obtained by integrating
out Y(0) from Q(·|W = 1) and Y(1) from Q(·|W = 0),
respectively. For given Q(·|W = 1) and Q(·|W = 0), selection
on observables means
\[
(Y(1), Y(0), S) \mid X, W = 1 \;\sim\; (Y(1), Y(0), S) \mid X, W = 0
\]
almost surely with respect to the distribution of X|W = 1. That
is, the joint distribution of covariates and potential outcomes
is independent of treatment assignment conditional on the
matching variables. Because in this article, we focus on causal
parameters defined for a population with distribution of the
matching variables equal to X|W = 1, for our purposes it is
enough that the selection-on-observables assumption holds for
the distribution of (Y(0), S) only,
\[
(Y(0), S) \mid X, W = 1 \;\sim\; (Y(0), S) \mid X, W = 0. \tag{4}
\]
Proposition 1 (Estimand under selection on observables). Suppose
that Assumption 2 holds and that β, as defined in Equation
(3), exists. Then if selection on observables, as defined in Equation
(4), holds, the coefficients β are the same as the population
coefficients that would be obtained from a regression of Y on Z
in a setting where:
1. (Y(1), Y(0), S) has distribution Q(·|W = 1),
2. treatment is randomly assigned with probability 1/(M + 1).
This result formalizes the notion that matching under selection
on observables allows researchers to reproduce an experimental
setting under which average treatment effects can be
easily evaluated through a least squares regression of Y on Z.
The results in this article, however, apply to the general estimand
β in Equation (3), regardless of the validity of the selection-on-
observables assumption.
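To make the estimand concrete, consider the special case Z = (1, W)'. The short derivation below is a sketch that uses only the definition of the matching target distribution and Equation (4); $E^*$ denotes expectation under the target distribution, and $\beta_W$ is a label introduced here for the coefficient on W.

```latex
\[
\beta_W = E^*[Y \mid W = 1] - E^*[Y \mid W = 0]
        = E[Y \mid W = 1] - E\big[ E[Y \mid X, W = 0] \,\big|\, W = 1 \big].
\]
% Under (4), E[Y | X, W = 0] = E[Y(0) | X, W = 0] = E[Y(0) | X, W = 1]
% almost surely with respect to the distribution of X | W = 1, so
\[
\beta_W = E[Y(1) \mid W = 1] - E\big[ E[Y(0) \mid X, W = 1] \,\big|\, W = 1 \big]
        = E[Y(1) - Y(0) \mid W = 1].
\]
```

Without selection on observables, the first line still holds, and $\beta_W$ is a covariate-adjusted contrast rather than an average treatment effect on the treated.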
2.3. Consistency and Asymptotic Normality
In this section, we will establish large sample properties of $\hat{\beta}$,
as $N_1, N_0 \to \infty$ with $N_0 \geq M N_1$. Throughout this article, we
will assume that the sum of matching discrepancies vanishes
quickly enough to allow asymptotic unbiasedness and root-n
consistency:
Assumption 3 (Matching discrepancies).
\[
\frac{1}{\sqrt{N_1}} \sum_{i=1}^{N} W_i \sum_{j \in \mathcal{J}(i)} d(X_i, X_j) \stackrel{p}{\longrightarrow} 0.
\]
Abadie and Imbens (2012) derived primitive conditions for
Assumption 3, which require $N_1 = O(N_0^{1/r})$ for some r greater
than the number of covariates in X (excluding those that take on
a finite number of values). This condition highlights the importance
of obtaining matches from a large reservoir of untreated
units, especially when the dimensionality of X is large. Of
course, in concrete empirical settings, the adequacy of matching
should not rely on asymptotic results. Instead, the quality of
the matches needs to be evaluated for each particular sample.
Abadie and Imbens (2011) and Imbens and Rubin (2015) discussed
measures of the discrepancy between the distributions
of the covariates of treated and nontreated. For example, the
normalized difference in Abadie and Imbens (2011) is
$(m_1 - m_0)/\sqrt{(s_1^2 + s_0^2)/2}$, where $m_w$ and $s_w^2$ are the mean and
variance of a covariate (typically, products and/or powers of
the components of X) for the units with W = w in the matched
sample.
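The normalized difference is simple to compute for any single covariate. The sketch below assumes that sample variances with the usual n - 1 denominator are used, which the text does not specify; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def normalized_difference(x_treated, x_control):
    """(m1 - m0) / sqrt((s1^2 + s0^2) / 2) for one covariate in the matched sample."""
    m1, m0 = mean(x_treated), mean(x_control)
    # statistics.variance uses the n - 1 denominator (an assumption made here).
    s1_sq, s0_sq = variance(x_treated), variance(x_control)
    return (m1 - m0) / sqrt((s1_sq + s0_sq) / 2)
```

Applying the function to each matching covariate, and to products or powers of covariates, gives a quick per-sample check of balance after matching.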
For any real matrix A, let $\|A\| = \sqrt{\mathrm{tr}(A'A)}$ be the Euclidean
norm of A. The next assumption collects regularity conditions
on the conditional moments of (Y, Z) given (X, W).
Assumption 4 (Well-behavedness of conditional expectations).
For w = 0, 1, and some δ > 0,
\[
E[\|Z\|^4 \mid W = w, X = x] \quad \text{and} \quad
E[\|Z(Y - Z'\beta)\|^{2+\delta} \mid W = w, X = x]
\]
are uniformly bounded on $\mathcal{X}_w$. Furthermore,
\[
E[ZZ' \mid X = x, W = 0], \quad E[ZY \mid X = x, W = 0]
\quad \text{and} \quad \mathrm{var}(Z(Y - Z'\beta) \mid X = x, W = 0)
\]
are componentwise Lipschitz in x with respect to d(·, ·).
To ensure the existence of $\hat{\beta}$ with probability approaching
one as n → ∞, we assume invertibility of the Hessian,
$H = E^*(ZZ')$. Notice that
\[
H = \frac{E\big[ E[ZZ' \mid X, W = 1] + M E[ZZ' \mid X, W = 0] \,\big|\, W = 1 \big]}{1 + M}. \tag{5}
\]
Assumption 5 (Linear independence of regressors). H is invert-
ible.
The next proposition establishes the asymptotic distribution
of $\hat{\beta}$.
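Proposition 2 (Asymptotic distribution of the post-matching estimator).
Under Assumptions 1–5,
\[
\sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\longrightarrow} N(0, H^{-1} J H^{-1}),
\]
where
\[
J = \frac{\mathrm{var}\big( E[Z(Y - Z'\beta) \mid X, W = 1] + M E[Z(Y - Z'\beta) \mid X, W = 0] \,\big|\, W = 1 \big)}{1 + M}
  + \frac{E\big[ \mathrm{var}(Z(Y - Z'\beta) \mid X, W = 1) + M \mathrm{var}(Z(Y - Z'\beta) \mid X, W = 0) \,\big|\, W = 1 \big]}{1 + M}
\]
and H is as defined in Equation (5).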
All proofs are in the appendix.
3. Post-Matching Standard Errors
In the previous section, we established that
\[
\sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\longrightarrow} N(0, H^{-1} J H^{-1})
\]
for the post-matching estimator obtained from a regression of
Y on Z within the matched sample $\mathcal{S}^*$. In this section, our goal
is to estimate the asymptotic variance, $H^{-1} J H^{-1}$.
3.1. Standard Errors Ignoring the Matching Step
Ho et al. (2007) argued that matching can be seen as a preprocessing
step, prior to estimation, so the matching step can be
ignored in the calculation of standard errors. Here, we consider
commonly applied sandwich standard error estimates for iid
data (Eicker 1967; Huber 1967; White 1980a, 1980b, 1982). In an
iid setting, sandwich standard errors are valid in large samples
even if the regression is misspecified relative to the conditional
expectation of Y given Z, in which case the population regression
parameters are the coefficients of an $L_2$ approximation
to the conditional expectation. As we will show, however, the
assumption of iid data does not apply in matched samples.
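Sandwich standard errors can be computed as the square root
of the main diagonal of the matrix $\hat{H}^{-1} \hat{J}_s \hat{H}^{-1}/n$, where
\[
\hat{H} = \frac{1}{n} \sum_{i=1}^{n} Z_{ni} Z_{ni}' \tag{6}
\]
and
\[
\hat{J}_s = \frac{1}{n} \sum_{i=1}^{n} Z_{ni} (Y_{ni} - Z_{ni}' \hat{\beta})^2 Z_{ni}'. \tag{7}
\]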
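Equations (6) and (7) translate directly into a few lines of matrix code. The following sketch uses NumPy; the function name `sandwich_variance` and its return layout are assumptions made for illustration, and no finite-sample correction is applied.

```python
import numpy as np


def sandwich_variance(Z, y):
    """OLS of y on Z with the sandwich estimate of n * var(beta_hat).

    Returns (beta_hat, V, se), where V = H^{-1} J_s H^{-1} and
    se = sqrt(diag(V) / n). The matching step is ignored here.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    n = Z.shape[0]
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ beta_hat
    H_hat = Z.T @ Z / n                              # Equation (6)
    J_s_hat = (Z * resid[:, None] ** 2).T @ Z / n    # Equation (7)
    H_inv = np.linalg.inv(H_hat)
    V = H_inv @ J_s_hat @ H_inv
    return beta_hat, V, np.sqrt(np.diag(V) / n)
```

For Z = (1, W)' with as many matched controls as treated units, the entry of V for the coefficient on W reduces to four times the average squared residual.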
The following proposition derives the probability limit of $\hat{J}_s$ with
data from a matched sample.
Proposition 3 (Convergence of $\hat{J}_s$). Suppose that Assumptions 1–5
hold. Assume also that $E[Z(Y - Z'\beta)^2 Z' \mid X = x, W = 0]$ is
Lipschitz on $\mathcal{X}_0$ and $E[Y^4 \mid X = x, W = w]$ is uniformly bounded
on $\mathcal{X}_w$ for all w = 0, 1. Then, $\hat{J}_s \stackrel{p}{\longrightarrow} J_s$, where
\[
J_s = \frac{E\big[ E[Z(Y - Z'\beta)^2 Z' \mid X, W = 1] + M E[Z(Y - Z'\beta)^2 Z' \mid X, W = 0] \,\big|\, W = 1 \big]}{1 + M}.
\]
Notice that $J_s = E^*[Z(Y - Z'\beta)^2 Z']$. That is, $J_s$ is equal to the
inner matrix of the sandwich asymptotic variance when data are
iid with distribution $P^*$. However, since the matched sample $\mathcal{S}^*$
is not an iid sample from $P^*$, $\hat{J}_s$ is not generally consistent for J.
The difference between the limit of the sandwich variance estimator,
$H^{-1} J_s H^{-1}$, and the actual asymptotic variance, $H^{-1} J H^{-1}$, is given
by $H^{-1} \Delta H^{-1}$ with $\Delta = J_s - J$, where
\[
\Delta = \frac{-M E\big[ \Gamma_0(X) \Gamma_1(X)' + \Gamma_1(X) \Gamma_0(X)' \,\big|\, W = 1 \big]
 - (M - 1) M E\big[ \Gamma_0(X) \Gamma_0(X)' \,\big|\, W = 1 \big]}{M + 1}, \tag{8}
\]
and
\[
\Gamma_w(x) = E[Z(Y - Z'\beta) \mid X = x, W = w],
\]
for w = 0, 1.
Therefore, bias in the estimation of the variance may arise
when $\Gamma_0(X) \neq 0$. The following example provides a simple
instance of this bias.
Example 1 (Inconsistency of sandwich standard errors). Assume
the sample outcomes are drawn from
\[
Y = \tau W + X + \varepsilon, \tag{9}
\]
where X is a scalar random variable with $\mathrm{var}(X \mid W = 1) = \sigma_X^2$,
and ε has mean zero, variance $\sigma_\varepsilon^2$, and is independent of
W and X. Consider the case where we match the values of X
for $N_1$ treated units to $N_1$ untreated units (M = 1) without
replacement. Let j(i) be the index of the untreated observation
that serves as a match for treated observation i. For simplicity,
suppose that X is discrete and all matches are perfect, $X_i = X_{j(i)}$
for every treated unit i, so we can ignore potential biases generated
by matching discrepancies. Within the matched sample,
$\mathcal{S}^*$, we run a linear regression of Y on Z = (1, W)' to obtain the
regression coefficient on W,
\[
\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{N} W_i (Y_i - Y_{j(i)}). \tag{10}
\]
$\hat{\tau}$ is the usual matching estimator for the average effect of the
treatment on the treated. Notice that, in the previous expression,
$Y_i - Y_{j(i)} = \tau + \varepsilon_i - \varepsilon_{j(i)}$, with variance $2\sigma_\varepsilon^2$. Variation in X
is taken care of through matching. Therefore, all variation in $\hat{\tau}$
comes through the error term, ε. Because $n = 2N_1$, it follows
that
\[
n \, \mathrm{var}(\hat{\tau}) = 4\sigma_\varepsilon^2.
\]
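Consider now the residuals of the ordinary least squares (OLS)
regression of $Y_{ni}$ on a constant and $W_{ni}$ in the matched sample:
\[
\hat{\varepsilon}_{ni} = Y_{ni} - \hat{\mu} - \hat{\tau} W_{ni} \approx X_{ni} + \varepsilon_{ni},
\]
where $\hat{\mu}$ is the intercept of the sample regression line and the
approximation takes X to be centered, E[X|W = 1] = 0. For this
simple case, the sandwich variance estimator for $\hat{\tau}$ is
\[
n \, \widehat{\mathrm{var}}(\hat{\tau}) = \frac{4}{n} \sum_{i=1}^{n} \hat{\varepsilon}_{ni}^2 \stackrel{p}{\longrightarrow} 4\sigma_X^2 + 4\sigma_\varepsilon^2.
\]
That is, in this example, the sandwich variance estimator overestimates
the variance of $\hat{\tau}$ because it does not take into account
the dependence generated by matching between the regression
residuals of the treated units and their matches.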
Sections 3.2 and 3.3 discuss variance estimators that adjust
for the matching step by taking into account the dependence of
regression errors between treated units and their matches. For
matching with M = 1 and a second-step regression of Y on a
constant and W, the clustered variance estimator of Section 3.2
becomes
\[
n \, \widehat{\mathrm{var}}(\hat{\tau}) = \frac{2}{n} \sum_{i=1}^{n} (\hat{\varepsilon}_i - \hat{\varepsilon}_{j(i)})^2 \stackrel{p}{\longrightarrow} 4\sigma_\varepsilon^2,
\]
restoring valid inference.
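As a worked check, the gap between the two variance formulas in Example 1 can be recovered from the conditional means of the regression scores. The sketch below assumes M = 1, perfect matches, and X centered so that E[X|W = 1] = 0. It writes $\Gamma_w(X) = E[Z(Y - Z'\beta) \mid X, W = w]$ and takes $\Delta$ to mean $J_s - J$, that is, the sandwich limit minus the actual inner matrix; both labels are stated here as assumptions about notation.

```latex
\[
H = E^*[ZZ'] = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1/2 \end{pmatrix},
\qquad
H^{-1} = \begin{pmatrix} 2 & -2 \\ -2 & 4 \end{pmatrix},
\]
\[
\Gamma_1(X) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} X,
\qquad
\Gamma_0(X) = \begin{pmatrix} 1 \\ 0 \end{pmatrix} X,
\]
\[
\Delta = J_s - J
= -\tfrac{1}{2} E\big[ \Gamma_0(X)\Gamma_1(X)' + \Gamma_1(X)\Gamma_0(X)' \,\big|\, W = 1 \big]
= -\sigma_X^2 \begin{pmatrix} 1 & 1/2 \\ 1/2 & 0 \end{pmatrix},
\]
\[
\big( H^{-1} \Delta H^{-1} \big)_{22}
= -\sigma_X^2 \big[ (-2)^2 \cdot 1 + 2 \cdot (-2)(4) \cdot \tfrac{1}{2} + 4^2 \cdot 0 \big]
= 4\sigma_X^2 .
\]
```

The result, $4\sigma_X^2$, is exactly the amount by which the sandwich limit $4\sigma_X^2 + 4\sigma_\varepsilon^2$ exceeds the actual value $4\sigma_\varepsilon^2$.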
The next example shows that ignoring the matching step may
result in underestimation of the variance.
Example 2 (Underestimation of the variance). In the same setting
as Example 1, assume that data are generated by
\[
Y = \tau W + X - 2WX + \varepsilon. \tag{11}
\]
The post-matching estimator of τ from a regression of Y on
(1, W)' is $\hat{\tau}$ as in Equation (10). In this case, if all matches are
perfect, so $X_i = X_{j(i)}$, we obtain $Y_i - Y_{j(i)} = \tau - 2X_i + \varepsilon_i - \varepsilon_{j(i)}$.
Therefore,
\[
n \, \mathrm{var}(\hat{\tau}) = 8\sigma_X^2 + 4\sigma_\varepsilon^2.
\]
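Least squares regression residuals are
\[
\hat{\varepsilon}_{ni} = Y_{ni} - \hat{\mu} - \hat{\tau} W_{ni} \approx X_{ni} - 2W_{ni} X_{ni} + \varepsilon_{ni}
= \begin{cases}
-X_{ni} + \varepsilon_{ni} & \text{if } W_{ni} = 1, \\
\phantom{-}X_{ni} + \varepsilon_{ni} & \text{if } W_{ni} = 0,
\end{cases}
\]
implying
\[
n \, \widehat{\mathrm{var}}(\hat{\tau}) = \frac{4}{n} \sum_{i=1}^{n} \hat{\varepsilon}_{ni}^2 \stackrel{p}{\longrightarrow} 4\sigma_X^2 + 4\sigma_\varepsilon^2,
\]
for the conventional sandwich variance estimator. Again, the
sandwich variance estimator does not take into account dependencies
between sample units induced by matching. In this
example, matching on X induces a negative correlation between
the regression residuals of the treated units and their matches.
As a result, the sandwich variance estimator underestimates the
variance of $\hat{\tau}$. Once again, the clustered variance estimator of
Section 3.2 takes into account the correlation between regression
errors induced by matching, and produces valid inference,
\[
n \, \widehat{\mathrm{var}}(\hat{\tau}) = \frac{2}{n} \sum_{i=1}^{n} (\hat{\varepsilon}_i - \hat{\varepsilon}_{j(i)})^2 \stackrel{p}{\longrightarrow} 8\sigma_X^2 + 4\sigma_\varepsilon^2.
\]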
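The limits in Examples 1 and 2 can be reproduced with a small simulation. The sketch below is illustrative and rests on stated assumptions: it draws the matched sample directly, with X taking the values plus or minus sigma_x with equal probability (so X is discrete and centered) and Gaussian errors; the function names and the `interaction` argument (0 for Example 1, -2 for Example 2) are hypothetical. The clustered estimate sums over the N_1 pairs with weight 4/n, which equals the (W, W) element of the clustered sandwich for this design and counts each pair once.

```python
import random


def simulate_matched_pairs(n_pairs, tau, interaction, sigma_x, sigma_eps, seed=0):
    """Perfectly matched pairs (M = 1) from Y = tau*W + X + interaction*W*X + eps."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        x = sigma_x * rng.choice((-1.0, 1.0))  # shared by a treated unit and its match
        y_treated = tau + x + interaction * x + rng.gauss(0.0, sigma_eps)
        y_control = x + rng.gauss(0.0, sigma_eps)
        pairs.append((y_treated, y_control))
    return pairs


def variance_estimates(pairs):
    """Return (tau_hat, sandwich, clustered); the last two estimate n * var(tau_hat)."""
    n_pairs = len(pairs)
    n = 2 * n_pairs
    mu_hat = sum(yc for _, yc in pairs) / n_pairs           # intercept: control mean
    tau_hat = sum(yt - yc for yt, yc in pairs) / n_pairs    # Equation (10)
    sandwich = 0.0
    clustered = 0.0
    for yt, yc in pairs:
        e_t = yt - mu_hat - tau_hat   # residual of the treated unit
        e_c = yc - mu_hat             # residual of its matched control
        sandwich += 4.0 * (e_t ** 2 + e_c ** 2) / n
        clustered += 4.0 * (e_t - e_c) ** 2 / n
    return tau_hat, sandwich, clustered
```

With sigma_x = sigma_eps = 1, the sandwich estimate is close to 8 in both designs, while the clustered estimate is close to 4 in Example 1 and close to 12 in Example 2, in line with the limits derived above.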
Sandwich standard errors would be valid in Examples 1 and 2
if the specifications for the post-matching regressions included
the terms containing X in Equations (9) and (11), respectively.
Indeed, sandwich standard errors are generally valid if the
regression is correctly specified in a specific sense defined in the
following result.
Proposition 4 (Validity of sandwich standard errors under correct
specification). Assume that the post-matching regression,
\[
Y = Z'\beta + \varepsilon,
\]
is correctly specified with respect to the conditional distribution
of Y given (Z, X, W), that is, E[ε|Z, X, W] = 0. Then, under the
assumptions of Proposition 3, $J_s = J$ and the sandwich variance
estimator, $\hat{H}^{-1} \hat{J}_s \hat{H}^{-1}$, is consistent for the asymptotic variance
of $\sqrt{n}(\hat{\beta} - \beta)$.
Notice, however, that correct specification is precisely the
condition under which matching would not be required to
obtain a consistent estimator of β, since direct estimation without
matching would be valid. Moreover, a correct specification
(in the sense defined above) of the post-matching regression
is not required for consistent estimation of causal parameters.
For example, under regularity conditions, a simple difference in
means between the treated and a matched sample of untreated
units is consistent for the average effect of the treatment on the
treated. Consistent estimators of the variance exist for the simple
difference in means in matched samples. These variance
estimators are different from the sandwich variance estimator,
and do not rely on correct specification of the post-matching
regression (see Abadie and Imbens 2012).
Finally, Equation (8) implies that the conditions of Proposition 4
can be slightly weakened to require only that the regression
function is correctly specified among the nontreated, in
the sense that E[ε|Z, X, W = 0] = 0. This is because for
the estimators studied in this article, matching affects only the
distribution of the covariates for the nontreated. In addition,
for the special case M = 1, it is sufficient that the regression
function is correctly specified among the treated, in the sense
that E[ε|Z, X, W = 1] = 0.
3.2. Match-Level Clustered Standard Errors
We have shown that sandwich standard errors are not generally
valid for the post-matching least squares estimator. In this section,
we will demonstrate that, when matching is done without
replacement, clustered standard errors (Liang and Zeger 1986;
Arellano 1987) can be employed to obtain valid estimates of the
standard deviation of post-matching regression coefficients. In
particular, we will consider standard errors clustered at the level
of the matched sets.
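Consider an estimator of the asymptotic variance of $\hat{\beta}$ given
by $\hat{H}^{-1} \hat{J} \hat{H}^{-1}$, where $\hat{H}$ is as in Equation (6) and $\hat{J}$ is given by the
clustered variance formula applied to the matched sets,
\[
\hat{J} = \frac{1}{n} \sum_{i=1}^{n} W_i
\left( Z_i (Y_i - Z_i' \hat{\beta}) + \sum_{j \in \mathcal{J}(i)} Z_j (Y_j - Z_j' \hat{\beta}) \right)
\left( Z_i (Y_i - Z_i' \hat{\beta}) + \sum_{j \in \mathcal{J}(i)} Z_j (Y_j - Z_j' \hat{\beta}) \right)'.
\]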
Clustered standard errors can be readily implemented using
standard statistical software. The next result shows that match-level
clustered standard errors are valid in large samples for the
post-matching estimator (provided matching is done without
replacement).
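Proposition 5 (Validity of clustered standard errors). Under the
assumptions of Proposition 3, we obtain that $\hat{J} \stackrel{p}{\longrightarrow} J$.
In particular, the clustered estimator of the variance is consistent,
that is,
\[
\hat{H}^{-1} \hat{J} \hat{H}^{-1} - n \, \mathrm{var}(\hat{\beta}) \stackrel{p}{\longrightarrow} 0.
\]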
The intuition behind this result is that matching on covariates
makes regression errors statistically dependent among units in
the same matched sets, $\{i\} \cup \mathcal{J}(i)$, $i = 1, \ldots, N_1$. Standard errors
clustered at the level of the matched set take this dependency
into account.
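A minimal NumPy sketch of the match-level clustered estimator follows. It assumes that each observation carries an identifier for its matched set (the argument `set_ids`, a name introduced here), sums the regression scores within each set, and averages their outer products. No finite-sample correction is applied, so the output may differ slightly from packaged cluster-robust routines that rescale by default.

```python
import numpy as np


def clustered_variance(Z, y, set_ids):
    """OLS of y on Z with the matched-set clustered estimate of n * var(beta_hat)."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = Z.shape
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    scores = Z * (y - Z @ beta_hat)[:, None]     # row i is Z_i * residual_i
    H_hat = Z.T @ Z / n
    # Sum the scores within each matched set, then average the outer products.
    _, inverse = np.unique(np.asarray(set_ids).ravel(), return_inverse=True)
    set_scores = np.zeros((inverse.max() + 1, k))
    np.add.at(set_scores, inverse, scores)
    J_hat = set_scores.T @ set_scores / n
    H_inv = np.linalg.inv(H_hat)
    return beta_hat, H_inv @ J_hat @ H_inv
```

Standard errors are the square roots of the diagonal of the returned matrix divided by n. For M = 1 and Z = (1, W)', the entry for the coefficient on W equals 4/n times the sum over pairs of the squared difference between the treated and control residuals.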
3.3. Matched Bootstrap
Proposition 5 shows that clustered standard errors are valid for
the asymptotic variance of the post-matching estimator. In this
section, we show that a clustered version of the nonparametric
bootstrap (Efron 1979) is also valid. This version of the bootstrap
relies on resampling of matched sets instead of individual
observations.
Recall that we reordered the observations in our sample,
so that the first $N_1$ observations are the treated. Consider the
nonparametric bootstrap that samples treated units together
with their M match partners from $\mathcal{S}^*$ to obtain
\[
\hat{\beta}^* = \left( \frac{1}{n} \sum_{i=1}^{n} V_{ni} Z_{ni} Z_{ni}' \right)^{-1} \frac{1}{n} \sum_{i=1}^{n} V_{ni} Z_{ni} Y_{ni},
\]
where $(V_{n1}, \ldots, V_{nN_1})$ has a multinomial distribution with
parameters $(N_1, (1/N_1, \ldots, 1/N_1))$, and $V_{nj} = V_{ni}$ if $j > N_1$
and $j \in \mathcal{J}(i)$. In this bootstrap procedure, $N_1$ units are
drawn at random with replacement from the $N_1$ treated sample
units. Untreated units are drawn along with their treated match.
Effectively, the matched bootstrap samples matched sets of one
treated unit and M untreated units. The next proposition shows
validity of the matched bootstrap.
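The resampling scheme can be sketched with multinomial weights, so that no data need to be copied. The code below is an illustration under stated assumptions: `set_ids` (a hypothetical argument) labels the matched set of each observation, and `numpy.linalg.lstsq` is used instead of an explicit inverse so that a draw with a singular weighted Hessian returns a minimum-norm solution rather than raising an error. That choice is a convenience of this sketch, not part of the procedure described in the text.

```python
import numpy as np


def matched_bootstrap(Z, y, set_ids, n_boot=500, seed=0):
    """Bootstrap draws of the OLS coefficients, resampling whole matched sets."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    _, inverse = np.unique(np.asarray(set_ids).ravel(), return_inverse=True)
    n_sets = inverse.max() + 1
    draws = np.empty((n_boot, Z.shape[1]))
    for b in range(n_boot):
        # Multinomial counts for the matched sets; each unit inherits its set's count.
        counts = rng.multinomial(n_sets, np.full(n_sets, 1.0 / n_sets))
        v = counts[inverse].astype(float)
        weighted = Z * v[:, None]
        draws[b] = np.linalg.lstsq(weighted.T @ Z, weighted.T @ y, rcond=None)[0]
    return draws
```

Bootstrap standard errors are the column-wise standard deviations of the returned array, and percentile intervals can be read off the same draws.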
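Proposition 6 (Validity of the matched bootstrap). Under the
assumptions of Proposition 5, we have that
\[
\sup_{r \in \mathbb{R}^k} \left| P\big( \sqrt{n}(\hat{\beta}^* - \hat{\beta}) \leq r \,\big|\, \mathcal{S} \big)
 - P\big( N(0, H^{-1} J H^{-1}) \leq r \big) \right| \stackrel{p}{\longrightarrow} 0.
\]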
Proposition 6 shows that the bootstrap distribution provides
an asymptotically valid approximation of the limiting
distribution of the post-matching estimator, but that does not
necessarily imply that the associated bootstrap variance is an
asymptotically valid estimate of the variance of the estimator.
The formal analysis of the bootstrap variance is complicated
by the fact that, in forming the bootstrap estimate $\hat\beta^*$, the
empirical analog
\[ \hat{H}^* = \frac{1}{n} \sum_{i=1}^{n} V_{ni} Z_{ni} Z_{ni}' \]
of the Hessian $H$ for a given bootstrap draw may be ill-conditioned
or noninvertible. In fact, because the bootstrap may sample the same
matched set $N_1$ times, noninvertibility of the Hessian may happen
with positive probability for any sample size. To circumvent this
issue, we fix constants $c > 0$ and $\alpha \in (0, 1/2)$ and consider
the alternative bootstrap estimator
\[
\tilde\beta^* =
\begin{cases}
\hat\beta^* & \text{if } \|\hat{H}^* - \hat{H}\| \le c/n^{\alpha}, \\
\hat\beta & \text{otherwise.}
\end{cases}
\]
That is, $\tilde\beta^*$ is equal to $\hat\beta^*$ whenever the bootstrap
Hessian, $\hat{H}^*$, is close to the matched sample Hessian, $\hat{H}$.
Otherwise, $\tilde\beta^*$ is equal to the post-matching estimator,
$\hat\beta$. As the sample size grows, $\tilde\beta^*$ is equal to
$\hat\beta^*$ with probability approaching one.
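The guarded estimator can be sketched for a single bootstrap draw as below. This is a hedged illustration: the function name, the per-unit weight vector `V`, the default values of `c` and `alpha`, and the use of the spectral norm are all assumptions made here, since the text only requires some fixed $c > 0$ and $\alpha \in (0, 1/2)$.

```python
import numpy as np

def guarded_bootstrap_draw(Z, Y, V, beta_hat, c=1.0, alpha=0.25):
    """Return the bootstrap estimate if the bootstrap Hessian is close
    to the matched-sample Hessian, and beta_hat otherwise."""
    Z = np.asarray(Z, dtype=float)
    Y = np.asarray(Y, dtype=float)
    V = np.asarray(V, dtype=float)
    n = Z.shape[0]

    H_hat = Z.T @ Z / n
    ZV = Z * V[:, None]
    H_star = ZV.T @ Z / n

    # Assumption: closeness is measured in the spectral norm.
    if np.linalg.norm(H_star - H_hat, ord=2) <= c / n**alpha:
        return np.linalg.solve(H_star, ZV.T @ Y / n)
    return np.asarray(beta_hat, dtype=float)
```

With all weights equal to one, the two Hessians coincide and the function returns the OLS coefficients; with weights concentrated on a few sets and a small `c`, it falls back to `beta_hat`.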
Proposition 7 (Validity of bootstrap standard errors). Under the
assumptions of Proposition 5 and $E[\|Z\|^8 \mid W = w, X = x]$
uniformly bounded on $\mathcal{X}_w$, the bootstrap distribution given by
$\tilde\beta^*$ is valid in the sense of Proposition 6, and yields a valid
estimate of the asymptotic variance of $\hat\beta$, that is,
\[ n\,\mathrm{var}(\tilde\beta^* \mid \mathcal{S}) \xrightarrow{p} H^{-1} J H^{-1} \]
as $n \to \infty$.
The use of $\tilde\beta^*$ in Proposition 7 is a formal device to make the
outcome of each bootstrap iteration well-defined. For practical
purposes, however, bootstrap standard errors based on $\hat\beta^*$ will
perform well unless the bootstrap Hessians are ill-conditioned.
Bootstrap standard errors based on $\hat\beta^*$ perform very well in our
simulations of Section 4.
It is useful to relate the results in this section, which pertain to
matching without replacement, to previous results for matching
with replacement. In particular, for matching with replacement
Abadie and Imbens (2008) showed that the nonparametric bootstrap
fails to consistently estimate the standard error of a simple
matching estimator. The consistency results that we obtain in
this section are for matching without replacement, and do not
directly extend to matching with replacement. The reason is that
matching with replacement creates dependencies in the data
that are not preserved by resampling matched sets.
4. Simulations
In this section, we study the performance of the post-matching
standard error estimators from Section 3 in a simulation exercise
using two data generating processes (DGPs).
4.1. DGP1: Robustness to Misspecification
Let $U(a, b)$ be the uniform distribution on $[a, b]$. We generate
data according to
\[ Y = WX + 5X^2 + \varepsilon, \]
where $X \mid W = 1 \sim U(-1, 1)$, $X \mid W = 0 \sim U(-1, 2)$, and
$\varepsilon \sim N(0, 1)$. We sample $N_1 = 50$ treated and $N_0 = 200$
nontreated units. We first match treated and untreated units on the
covariates, $X$, without replacement and with $M = 1$ match per
treated unit. We consider the following post-matching regression
specifications.
Specification 1:
\[ Y = \alpha + \tau_0 W + \tau_1 WX + \beta_1 X + \varepsilon. \]
Specification 2:
\[ Y = \alpha + \tau_0 W + \tau_1 WX + \beta_1 X + \beta_2 X^2 + \varepsilon. \]
Specification 2 is correct relative to the conditional expectation
$E[Y \mid X, W]$, while specification 1 is not. Regression estimands
can always be seen as $L_2$ approximations to $E[Y \mid W, X]$,
regardless of the specification adopted for estimation (see, e.g.,
White 1980b). For our simulation results, we will focus on estimators
of $\tau_0$ and $\tau_1$, the regression coefficients on terms involving
$W$. For the DGP and the two specifications adopted for this
simulation, it can be shown that $\tau_0 = 0$ and $\tau_1 = 1$ under the
matching target distribution.
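A single draw from this design can be sketched as follows. The matching step here is a simple greedy nearest-neighbor rule, which is an assumption made for illustration: the text only states that matching is on $X$, without replacement, with $M = 1$. All function names are hypothetical.

```python
import numpy as np

def simulate_dgp1(n_treated=50, n_control=200, seed=0):
    """Draw one sample from DGP1: Y = W*X + 5*X^2 + eps."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, n_treated)   # X | W = 1 ~ U(-1, 1)
    x0 = rng.uniform(-1.0, 2.0, n_control)   # X | W = 0 ~ U(-1, 2)
    X = np.concatenate([x1, x0])
    W = np.concatenate([np.ones(n_treated), np.zeros(n_control)])
    Y = W * X + 5.0 * X**2 + rng.normal(size=X.size)
    return Y, W, X

def greedy_match(X, W):
    """Match each treated unit to its closest unused control (M = 1).

    Assumption: greedy matching in sample order, not optimal matching.
    """
    available = list(np.flatnonzero(W == 0))
    pairs = []
    for i in np.flatnonzero(W == 1):
        dist = np.abs(X[available] - X[i])
        j = available.pop(int(np.argmin(dist)))  # control is used only once
        pairs.append((i, j))
    return pairs

def fit_spec1(Y, W, X):
    """OLS for specification 1; returns (alpha, tau0, tau1, beta1)."""
    Z = np.column_stack([np.ones_like(X), W, W * X, X])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef
```

For example, `Y, W, X = simulate_dgp1()`, then `idx = np.array(greedy_match(X, W)).ravel()` selects the matched sample, and `fit_spec1(Y[idx], W[idx], X[idx])` gives the post-matching estimates for one Monte Carlo iteration.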
Table 1 reports the results of the simulation exercise. In
a regression that uses the full sample without matching, the
estimates of $\tau_0$ and $\tau_1$ are biased under misspecification
(specification 1), while they are valid under correct specification
(specification 2). After matching, both specifications yield valid
estimates for $\tau_0$ and $\tau_1$. However, sandwich standard error
estimates are inflated under misspecification, while average clustered
and matched bootstrap standard errors (with 1000 bootstrap draws)
closely approximate the standard deviation of $\hat\tau_0$ and
$\hat\tau_1$. Under correct specification (specification 2), all
standard error estimates perform well.

Table 1. Monte Carlo results for DGP1 (10,000 iterations).

(a) Target parameter: coefficient $\tau_0 = 0$ on $W$

                 Full sample        Post-matching      Average standard error
Specification    Mean     Std.      Mean     Std.      Sandwich  Cluster  Bootstrap
1               -0.85    0.404      0.00    0.204      0.359     0.197    0.199
2                0.00    0.165      0.00    0.204      0.196     0.196    0.199

(b) Target parameter: coefficient $\tau_1 = 1$ on the interaction $WX$

                 Full sample        Post-matching      Average standard error
Specification    Mean     Std.      Mean     Std.      Sandwich  Cluster  Bootstrap
1               -4.00    0.646      0.99    0.358      0.728     0.340    0.348
2                1.00    0.286      1.00    0.356      0.337     0.338    0.346

Note: "Mean" and "Std." are the mean and standard deviation of the
estimates of the target parameter across iterations.
4.2. DGP2: High Treatment-Effect Heterogeneity
In the simulation in the previous section, sandwich standard
errors overestimate the variation of the post-matching estimator
under misspecification. In this section, we present an example in
which sandwich standard errors are too small. We generate data
according to
\[ Y = WX + 20WX^2 - 10X^2 + \varepsilon \]
with $\varepsilon \sim N(0, 1)$ as above. For this DGP2, the conditional
treatment effect is nonlinear with
\[ E[Y \mid W = 1, X] - E[Y \mid W = 0, X] = X + 20X^2. \]
Sample sizes, matching settings, and regression specifications
are as in DGP1. Notice that both regression specifications are
incorrect relative to $E[Y \mid X, W]$, as they do not capture nonlinear
conditional treatment effects. As in Section 4.1, regression
coefficients represent the parameters of an $L_2$ approximation
to $E[Y \mid W, X]$ over the distribution of $(W, X)$ in Proposition 1.
Direct calculations yield $\tau_0 = 6.67$ and $\tau_1 = 1$ for both
specifications in the matching target distribution.
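The direct calculation can be sketched for specification 1 as follows. The sketch assumes $M = 1$ and that, under the matching target distribution, $X \sim U(-1, 1)$ in both treatment arms with equal weight; the symbols $a_w$ and $s_w$ for the intercept and slope of the best linear approximation in arm $w$ are introduced here for illustration only. Because specification 1 is fully interacted in $W$, each arm is fitted separately, and with $E[X] = E[X^3] = 0$ and $E[X^2] = 1/3$,

\begin{align*}
E[Y \mid W = 1, X] &= X + 10X^2: & s_1 &= \frac{\mathrm{cov}(X, X + 10X^2)}{\mathrm{var}(X)} = 1, & a_1 &= 10\,E[X^2] = \tfrac{10}{3}, \\
E[Y \mid W = 0, X] &= -10X^2: & s_0 &= \frac{\mathrm{cov}(X, -10X^2)}{\mathrm{var}(X)} = 0, & a_0 &= -10\,E[X^2] = -\tfrac{10}{3}.
\end{align*}

Hence $\tau_0 = a_1 - a_0 = 20/3 \approx 6.67$ and $\tau_1 = s_1 - s_0 = 1$. Under specification 2 the common coefficient on $X^2$ shifts $a_1$ and $a_0$ by the same amount, so the difference $a_1 - a_0$ is unchanged.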
Table 2 presents the results of the simulation exercise for
DGP2. The large heterogeneity in conditional treatment effects
is not captured by either regression specification, and sandwich
standard errors that ignore the matching step underestimate
the variation of the post-matching estimator. In contrast, the
average clustered and matched bootstrap (with 1000 bootstrap
draws) standard errors proposed in this article closely reflect the
variability of the post-matching estimators.

Table 2. Monte Carlo results for DGP2 (10,000 iterations).

(a) Target parameter: coefficient $\tau_0 = 6.67$ on $W$

                 Full sample        Post-matching      Average standard error
Specification    Mean     Std.      Mean     Std.      Sandwich  Cluster  Bootstrap
1                8.25    0.754      6.55    0.883      0.630     0.869    0.897
2                6.70    0.857      6.55    0.883      0.630     0.869    0.897

(b) Target parameter: coefficient $\tau_1 = 1$ on the interaction $WX$

                 Full sample        Post-matching      Average standard error
Specification    Mean     Std.      Mean     Std.      Sandwich  Cluster  Bootstrap
1               11.00    1.209      1.01    1.950      1.330     1.848    1.932
2                1.90    1.877      1.01    1.950      1.330     1.848    1.933

5. Application
This section reports the results of an empirical application where
we look at the effect of smoking on the pulmonary function of
youths. The application is based on data originally collected in
Boston, Massachusetts, by Tager et al. (1979, 1983), and
subsequently described and analyzed in Rosner (1995) and Kahn
(2005). The sample contains 654 youth, $N_1 = 65$ who have ever
smoked regularly ($W = 1$) and $N_0 = 589$ who never smoked
regularly ($W = 0$). The outcome of interest is the subjects'
forced expiratory volume ($Y$), ranging from 0.791 to 5.793 liters
per second ($\ell$/sec). In addition, we use data on age ($X_1$, ranging
from 3 to 19 with the youngest ever-smoker aged 9) and gender
($X_2$, with $X_2 = 1$ for males and $X_2 = 0$ for females).
The use of matching to study the causal effect of smoking is
motivated by the likely confounding effects of age and gender.
For instance, while the causal effect of smoking on respiratory
volume is expected to be negative, older children are more likely
to smoke and have a larger respiratory volume, which induces a
positive association between smoking and respiratory volume.
We first match every smoker in the sample to a nonsmoker
($M = 1$), without replacement, based on age ($X_1$) and gender
($X_2$). Within the resulting matched sample of 65 smokers and
65 nonsmokers, we run linear regressions with the following
specifications:
Specification 1:
\[ Y = \alpha + \tau_0 W + \varepsilon. \]
Specification 2:
\[ Y = \alpha + \tau_0 W + \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \]
Specification 3:
\begin{align*}
Y = \alpha &+ \tau_0 W + \tau_1 W (X_1 - E[X_1]) + \tau_2 W (X_2 - E[X_2]) \\
&+ \beta_1 (X_1 - E[X_1]) + \beta_2 (X_2 - E[X_2]) + \varepsilon.
\end{align*}
The first specification yields the matching estimator for the
average treatment effect $\tau_0$ as the regression coefficient on $W$,
while the second adds linear controls in $X_1$ and $X_2$. The third
specification also includes interaction terms of smoking with
age and gender.
Table 3 reports regression estimates of $\tau_0$, $\tau_1$, and $\tau_2$
along with standard errors (regression coefficients on terms not
involving $W$ are omitted from Table 3 for brevity). Estimates for
the first specification demonstrate the problem of confounding
in this application. Without controlling for age and gender, there
is a positive correlation between smoking and forced expiratory
function. After matching on age and gender, the sign of the
regression coefficient on smoking becomes negative. In this
specification, the clustered standard error for the post-matching
estimate is considerably smaller than the corresponding sandwich
standard error.

Table 3. OLS and post-matching estimates for the smoking dataset.
Dependent variable: forced expiratory volume

                   Smoker                      Smoker x age                Smoker x male
                   Coef.   Std. error          Coef.   Std. error          Coef.   Std. error
                           Sandwich  Cluster           Sandwich  Cluster           Sandwich  Cluster
Specification 1:
  OLS              0.711   0.099
  Post-matching   -0.066   0.132     0.095
Specification 2:
  OLS             -0.154   0.104
  Post-matching   -0.077   0.104     0.096
Specification 3:
  OLS              0.495   0.187               0.182   0.036               0.461   0.193
  Post-matching    0.077   0.102     0.093    -0.092   0.054     0.038    -0.021   0.249     0.212
Specification 2 includes linear controls for age and gender.
The sign and magnitude of the least squares estimate of the
coefficient on the smoker variable change substantially between
specifications 1 and 2, while the magnitude of the post-matching
estimate stays roughly constant. This result illustrates the higher
robustness across specifications of the post-matching estimator
relative to least squares on the unmatched sample (Ho et al.
2007). When specification 2 is adopted for regression, the sign
of the coefficient on the smoker variable is not affected by
matching. Also, for this specification, clustered and sandwich
standard errors are similar. Both findings are consistent with
the adopted regression specification moving closer toward the
correct specification of $E[Y \mid W, X_1, X_2]$.
In specification 3, which includes interactions between the
smoker variable and age and gender, the use of matching and
the use of robust standard errors matter for the substantive
results of the analysis. First, notice that the coefficient on the
interaction of gender with treatment is large, significant and
positive without matching, suggesting that the effect of smoking
is more severe for girls than for boys. After matching,
the sign changes, and the estimated coefficient is small and
insignificant. This suggests that the large interaction finding
with OLS for this coefficient is caused by misspecification.
Second, in the post-matching regression we find a negative
estimate for the interaction of treatment with age. With sandwich
standard errors, this effect is not significant (at the 5%
level). The robust standard errors proposed in this article are
smaller and result in a rejection of the null hypothesis of a
zero interaction coefficient between smoker and age (at the
5% level).
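The last comparison can be checked with a short calculation. As a sketch, read the post-matching Smoker-by-age entries of Table 3 as a coefficient of magnitude 0.092 with sandwich standard error 0.054 and clustered standard error 0.038, and assume a two-sided test with the normal critical value 1.96 (the reference distribution is an assumption made here). Then

\[
|t_{\text{sandwich}}| = \frac{0.092}{0.054} \approx 1.70 < 1.96,
\qquad
|t_{\text{cluster}}| = \frac{0.092}{0.038} \approx 2.42 > 1.96,
\]

so the null of a zero interaction coefficient is rejected at the 5% level only with the clustered standard error.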
6. Conclusion
This article establishes valid inference for regression on a sample
matched without replacement. Standard errors that ignore the
matching step are not generally valid if the regression
specification is incorrect relative to the expectation of the
outcome conditional on the treatment and the matching covariates.
However, using a correct specification relative to $E[Y \mid W, X]$
is not necessary to consistently estimate treatment parameters
after matching. For example, under selection on observables,
simple differences in means in a matched sample can be used
to estimate average treatment effects.
We propose two alternatives (standard errors clustered at
the matched set level and an analogous block bootstrap) that
are robust to misspecification and easily implementable with
standard statistical software. A simulation study and an
empirical example demonstrate the usefulness of our results.
To conclude, we outline potential extensions of our results.
First, in this article, we discuss only matching without
replacement, and the results do not directly carry over to matching
with replacement as in Abadie and Imbens (2006). Matching
with replacement (i.e., allowing nontreated units to be
used as a match more than once) creates additional dependencies
between matched sets that are not reflected in sandwich
standard errors or in the robust standard errors proposed
in this article. While the negative result about post-matching
standard errors extends to matching with replacement
(standard errors that ignore the matching step are not
generally valid when matching is done with replacement, see
Abadie and Imbens 2006), the positive results we describe do
not directly apply: Even when the linear regression is
correctly specified, sandwich standard errors do not correctly
capture the variance of the post-matching estimates, since
the overlap between matched sets is not accounted for.
Clustered standard errors, as well as the analogous block bootstrap
that samples treated units with all their matching partners,
do not provide an immediate solution since one untreated
unit may now be part of multiple such clusters or bootstrap
groups.
In addition, our analysis applies to the case when matching
is done directly on the covariates, avoiding substantial
complications created by the presence of nuisance parameters in
the matching step when matching is done on the estimated
propensity score (see Rosenbaum and Rubin 1983; Abadie and
Imbens 2016). Finally, our analysis assumes that the quality
of matches is good enough for matching discrepancies not to
bias the asymptotic distribution of the post-matching regression
estimator. Post-matching regression adjustments may, in practice,
help eliminate the bias as in the bias-corrected matching
estimator in Abadie and Imbens (2011). These are angles that we
do not explore in this article and interesting avenues for future
research.
Appendix: Proofs
Preliminary Lemmas A.1 and A.2 and Propositions A.1–A.3 are in a
supplementary appendix.
Proof of Proposition 1. Let $E_{Q(\cdot|W=1)}$ and $E_{Q(\cdot|W=0)}$ be
expectation operators for $Q(\cdot \mid W = 1)$ and $Q(\cdot \mid W = 0)$.
Notice first that for any measurable function $q$,
\[ E_{Q(\cdot|W=1)}[q(Y(1), S)] = E[q(Y, S) \mid W = 1]. \tag{A.1} \]
The result holds also replacing $W = 1$ with $W = 0$, and after
conditioning on $X$. In particular,
\[ E_{Q(\cdot|W=0)}[q(Y(0), S) \mid X] = E[q(Y, S) \mid X, W = 0]. \tag{A.2} \]
The regression coefficient in the population defined by (a) and (b) is
the minimizer of
\[
\frac{1}{M + 1} E_{Q(\cdot|W=1)}[(Y(1) - g(1, S)'b)^2]
+ \frac{M}{M + 1} E_{Q(\cdot|W=1)}[(Y(0) - g(0, S)'b)^2].
\]
Notice that
\begin{align*}
E_{Q(\cdot|W=1)}[(Y(1) - g(1, S)'b)^2] &= E[(Y - g(1, S)'b)^2 \mid W = 1] \\
&= E^*[(Y - Z'b)^2 \mid W = 1],
\end{align*}
where the first equality follows from Equation (A.1) and the second
equality follows from the definitions of $P^*(\cdot \mid W = 1)$ and $Z$.
Similarly,
\begin{align*}
E_{Q(\cdot|W=1)}[(Y(0) - g(0, S)'b)^2]
&= E_{Q(\cdot|W=1)}[E_{Q(\cdot|W=1)}[(Y(0) - g(0, S)'b)^2 \mid X]] \\
&= E_{Q(\cdot|W=1)}[E_{Q(\cdot|W=0)}[(Y(0) - g(0, S)'b)^2 \mid X]] \\
&= E[E[(Y - g(W, S)'b)^2 \mid X, W = 0] \mid W = 1] \\
&= E^*[(Y - Z'b)^2 \mid W = 0].
\end{align*}
In the last equation, the first equality follows from the law of iterated
expectations, the second equality follows from selection on
observables, the third equality follows from (A.2) and (A.1), and the
last equality follows from the definition of $P^*(\cdot \mid W = 0)$.
Therefore,
\begin{align*}
&\frac{1}{M + 1} E_{Q(\cdot|W=1)}[(Y(1) - g(1, S)'b)^2]
+ \frac{M}{M + 1} E_{Q(\cdot|W=1)}[(Y(0) - g(0, S)'b)^2] \\
&\quad = \frac{1}{M + 1} E^*[(Y - Z'b)^2 \mid W = 1]
+ \frac{M}{M + 1} E^*[(Y - Z'b)^2 \mid W = 0]
= E^*[(Y - Z'b)^2],
\end{align*}
which implies the result of the proposition.
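Proof of Proposition 2. This proof is based on two lemmas in the
supplementary appendix about the asymptotic distribution of averages in
matched samples based on a martingale representation of matching
estimators similar to Abadie and Imbens (2012). Lemma A.1 establishes
convergence in probability, while Lemma A.2 deals with root-$n$
consistency and asymptotic normality. By Lemma A.1,
\[ \hat{H} = \frac{1}{n} \sum_{i \in \mathcal{S}^*} Z_i Z_i' \xrightarrow{p} H. \]
By Lemma A.2,
\[
\hat{H} \sqrt{n} (\hat\beta - \beta)
= \sqrt{n}\, \frac{1}{n} \sum_{i \in \mathcal{S}^*} (Z_i Y_i - Z_i Z_i' \beta)
\xrightarrow{d} N(0, J),
\]
where we note that $E[ZY - ZZ'\beta \mid W = 0, X = x]$ is Lipschitz. Hence,
\[
\sqrt{n} (\hat\beta - \beta)
= \underbrace{\hat{H}^{-1}}_{\xrightarrow{p} H^{-1}}
\underbrace{\sqrt{n}\, \frac{1}{n} \sum_{i \in \mathcal{S}^*} (Z_i Y_i - Z_i Z_i' \beta)}_{\xrightarrow{d} N(0, J)}
\xrightarrow{d} N(0, H^{-1} J H^{-1}).
\]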
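Proof of Proposition 3. We have that
\begin{align*}
\hat{J}_s &= \frac{1}{n} \sum_{i \in \mathcal{S}^*} Z_i (Y_i - Z_i' \hat\beta)^2 Z_i' \\
&= \frac{1}{n} \sum_{i \in \mathcal{S}^*} Z_i (Y_i - Z_i' \beta)^2 Z_i'
+ \frac{1}{n} \sum_{i \in \mathcal{S}^*} Z_i \left( (Y_i - Z_i' \hat\beta)^2 - (Y_i - Z_i' \beta)^2 \right) Z_i'.
\end{align*}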
Notice that
1
n
iS
Z
i
(Y
i
Z
i
β)
2
(Y
i
Z
i
β)
2
Z
i
= (
β β)
1
n
iS
Z
i
(Z
i
Z
i
)Z
i
(
β + β) 2
1
n
iS
Z
i
(Z
i
Z
i
)Y
i
.
By assumption, the functions
E[Z
4
|X = x, W = w] and E[|Y|
4
|X = x, W = w]
are uniformly bounded on X
w
,forw = 0, 1. By Hölders inequality,
E
1
n
iS
Z
i
Z
i
Z
i
Z
i
and E
1
n
iS
Z
i
Z
i
Z
i
Y
i
are thus nite. Then, for (0, 1/2),byMarkovsinequality,weobtain
1
n
iS
Z
i
((Y
i
Z
i
β)
2
(Y
i
Z
i
β)
2
)Z
i
= n
1/2
(
β β)
#
iS
Z
i
(Z
i
Z
i
)Z
i
/n
n
1/2
(
β + β)
2
#
iS
Z
i
(Z
i
Z
i
)Y
i
/n
n
1/2
p
0.
As a result,
\[
\hat{J}_s = \frac{1}{n} \sum_{i \in \mathcal{S}^*} Z_i (Y_i - Z_i' \beta)^2 Z_i' + o_p(1),
\]
and the claim follows from Lemma A.1 in the supplementary appendix,
which deals with consistency of averages in matched samples.
Proof of Proposition 4. Under correct specification, we find that
W
(X) = E[Z(Y Z
β)|W, X]=E[Zε|W, X]
= E[E[Zε|Z, W, X]|W, X]=E[ZE[ε|Z, W, X]

=0
|W, X]=0.
Proof of Proposition 5. First, note that
J =
1
n
W
i
=1
Z
i
(Y
i
Z
i
β) +
#
jJ (i)
Z
j
(Y
j
Z
j
β)
Z
i
(Y
i
Z
i
β) +
#
jJ (i)
Z
j
(Y
j
Z
j
β)
+o
P
(1),
where we replace
β by β analogous to the proof of Proposition 3.Write
G = Z(Y Z
β)
w
(x) = E[Z(Y Z
β)|W = w, X = x].
Note that
0
(x) is Lipschitz on X ,andthatG
i
has uniformly bounded
fourth moments. We decompose
J =
1
n
W
i
=1
G
i
+
#
jJ (i)
G
j

G
i
+
#
jJ (i)
G
j
+ o
P
(1)
=
1
n
W
i
=1
(
1
(X
i
) + M
0
(X
i
)
)(
1
(X
i
) + M
0
(X
i
)
)
+
1
n
iS
G
i
W
i
(X
i
)

G
i
W
i
(X
i
)
+
1
n
W
i
=1
=
J (i)∪{i}
G
W
(X
)
G
W
(X
)
+
1
n
W
i
=1
(
1
(X
i
) + M
0
(X
i
)
)
G
i
1
(X
i
) +
#
jJ (i)
(G
j
0
(X
j
))
+
G
i
1
(X
i
) +
#
jJ (i)
(G
j
0
(X
i
))
1
(X
i
) + M
0
(X
j
)
+ o
P
(1).
Here, the o
P
terms absorb the deviation due to using
β instead of β,as
well as the matching discrepancies in the conditional expectations. The
rst sum is iid with
1
n
W
i
=1
(
1
(X
i
) + M
0
(X
i
)
)(
1
(X
i
) + M
0
(X
i
)
)
p
E
(
1
(X) + M
0
(X))(
1
(X) + M
0
(X))
|W = 1
1 + M
=
var(
E[·|W=1]=0

1
(X) + M
0
(X) |W = 1)
1 + M
,
while the second is a martingale with
1
n
iS
G
i
W
i
(X
i
)

G
i
W
i
(X
i
)
p
E[var(Z(Y Z
β)|W = 1, X)
+Mvar(Z(Y Z
β)|W = 0, X)|W = 1]
1 + M
by Lemma A.1 in the supplementary appendix, which establishes con-
sistency of averages in matched samples. Under appropriate reordering
of the individual increments, all other sums can be represented as aver-
ages of mean-zero martingale increments. Since the second moments of
the increments are uniformly bounded, they vanish asymptotically.
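Proof of Proposition 6. In this proof, we invoke Proposition A.2 in the
supplementary appendix, which establishes a general result on the
validity of the matched bootstrap for averages within matched samples.
Write
\[ \hat{H}^* = \frac{1}{n} \sum_{i \in \mathcal{S}^*} V_{ni} Z_{ni} Z_{ni}'. \]
Note first that
\begin{align*}
H^{-1} \sqrt{n} \left( \hat{H}^* (\hat\beta^* - \beta) - \hat{H} (\hat\beta - \beta) \right)
&= H^{-1} \sqrt{n}\, \frac{1}{n} \sum_{i=1}^{n} (V_{ni} - 1) Z_{ni} (Y_{ni} - Z_{ni}' \beta) \\
&\xrightarrow{d} N(0, H^{-1} J H^{-1}),
\end{align*}
conditional on $\mathcal{S}$, by Proposition A.2. Now,
\begin{align*}
\sqrt{n} (\hat\beta^* - \hat\beta)
&= (\hat{H}^*)^{-1} H \left( H^{-1} \sqrt{n} \left( \hat{H}^* (\hat\beta^* - \beta) - \hat{H}^* (\hat\beta - \beta) \right) \right) \\
&= \underbrace{(\hat{H}^*)^{-1} H}_{\xrightarrow{p} I}
\left( H^{-1} \sqrt{n} \left( \hat{H}^* (\hat\beta^* - \beta) - \hat{H} (\hat\beta - \beta) \right) \right)
+ \underbrace{\left( (\hat{H}^*)^{-1} \hat{H} - I \right)}_{\xrightarrow{p} O} \sqrt{n} (\hat\beta - \beta) \\
&\xrightarrow{d} N(0, H^{-1} J H^{-1}),
\end{align*}
conditional on $\mathcal{S}$, where we have used that
$\hat{H}^* - \hat{H} \xrightarrow{p} O$ conditional on $\mathcal{S}$.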
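Proof of Proposition 7. First,
\[
P(\tilde\beta^* = \hat\beta^* \mid \mathcal{S})
\ge P\left( \|\hat{H}^* - \hat{H}\| \le \frac{c}{n^{\alpha}} \,\Big|\, \mathcal{S} \right)
\xrightarrow{p} 1
\]
as $n \to \infty$. Indeed, since $Z$ has bounded conditional eighth
moments, we also have that $E[\|ZZ'\|^4 \mid W = w, X = x]$ is uniformly
bounded in $\mathcal{X}_w$. It follows with Proposition A.2 in the
supplementary appendix, which establishes the validity of the matched
bootstrap, that
\[
\sup_{r \in \mathbb{R}^{(\dim Z)^2}} \left|
P\left( \sqrt{n}\, \mathrm{vec}(\hat{H}^* - \hat{H}) \le r \mid \mathcal{S} \right)
- P\left( N(0, \Sigma_H) \le r \right) \right| \xrightarrow{p} 0
\]
as $n \to \infty$ and thus in particular
$P(n^{\alpha} \|\hat{H}^* - \hat{H}\| \le c \mid \mathcal{S}) \xrightarrow{p} 1$
for all $\alpha \in (0, 1/2)$, $c > 0$.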
Second, since for
˜
A B = A B generally
|P(A) P(
˜
A)|≤|P(A B) P(
˜
A B)|

=0
+|P(A B
c
) P(
˜
A B
c
)|

P(B
c
)
1 P(B),
for (r) = P
N (0, H
1
JH
1
) r
we have specically that
sup
rR
k
P
n(
˜
β
β) r
S
(r)
sup
rR
k
P
n(
β
β) r
S
(r)
+
P
n(
β
β) r
S
P
n(
˜
β
β) r
S

1P(
˜
β
=
β
|S)
sup
rR
k
P
n(
β
β) r
S
(r)

p
0
+1 P(
˜
β
=
β
|S)

p
0
p
0.
This shows that this alternative bootstrap is valid in the sense of
Proposition 6.
Third, for the bootstrap variance, we nd
β
β =
H
1
1
n
iS
V
ni
Z
ni
Y
ni
H
β
=
H
1
1
n
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
=
H
1
1
n
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)

=
+
H
1
H
1
1
n
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)

=
R
.
Since
1
n
#
iS
Z
ni
(Y
ni
Z
ni
β) = 0andthusnvar
1
n
#
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
S
=
J,
nvar
S
=
H
1
nvar
1
n
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
S
H
1
=
H
1
J
H
1
p
H
1
JH
1
,
which is a valid estimate of the asymptotic variance of
β.However,the
remainder term
R
generally does not have a bounded second moment
since
H
is badly conditioned for some bootstrap draws.
To show that
˜
β
yields valid standard errors, we collect a number
of preliminary results. Consider the random variables
and
˜
=
1
n
α
H
H≤c
.
n
converges in distribution to N (0, ) with
= H
1
JH
1
, conditional on S, by Proposition A.2. Since P(
˜
=
|S)
p
1,thesameholdstruefor
n
˜
by the above argument.
Also, we have established that
E
n
S
= 0, var
n
S
p
and thus E[n
2
|S]
p
tr().SinceE[n
˜
2
|S]≤
E[n
2
|S],andn
˜
2
and n
2
havethesameweaklimit
(with expectation tr()) by the continuous mapping theorem,
E[n
˜
2
|S]
p
tr() by Proposition A.3 in the supplementary
appendix. Consequently,
E[n
2
|S]−E[n
˜
2
|S]=P(n
α
H
H
> c|S) E[n
2
|n
α
H
H > c, S]
p
0. (A.3)
Next, note that for conformable random variables A, B if
var(A|S)
p
, E[B
2
|S]
p
0thenvar(A + B|S)
p
.
Indeed,
|(var(A + B|S) var(A|S))
ij
|=|cov(A
i
, B
j
|S)
+ cov(A
j
, B
i
|S) + cov(B
i
, B
j
|S)|
$
var(A
i
|S)
var(B
j
|S) +
var(A
j
|S)
$
var(B
i
|S)
+
$
var(B
i
|S)
var(B
j
|S)
p
0.
Hence, setting A =
n
and B =
n(
˜
β
β
),toestablishthe
desired result var(
n(
˜
β
β)|S)
p
H
1
JH
1
it suces to show that
E
n
˜
β
β
2
S
p
0 (A.4)
as n →∞.
Toward establishing (A.4), note rst that whenever n
α
H
H≤c
then also
(
H
)
1
H
1
=(
H
)
1
(
H
H
)
H
1
≤(
H
)
1

H
H

H
1
λ
1
min
(
H
1
min
(
H)
H
H
dim(Z),
where
λ
min
(
H
) = λ
min
(
H +
H
H) = min
x=1
x
(
H +
H
H)x
min
x=1
x
Hx + min
x=1
x
(
H
H)x
λ
min
(
H) −
H
H
and thus
(
H
)
1
H
1
min
(
H) −
H
H)
1
λ
1
min
(
H)
H
H dim(Z)
min
(
H) cn
α
)
1
λ
1
min
(
H) cn
α
dim(Z). (A.5)
If follows that
E
%
n
˜
β
β
2
S
&
= P(n
α
H
H≤c|S) E[n
=
β

˜
β
β
2
|n
α
H
H≤c, S]
+ P(n
α
H
H > c|S) E[n
˜
β

=
β
β
2
|n
α
H
H > c, S]
= P(n
α
H
H≤c|S)
E[n
≤(
H
)
1
H
1
2
1
n
#
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
2

R
2
|
n
α
H
H≤c, S]
+ P(n
α
H
H > c|S) E[n
2
|n
α
H
H > c, S]
(A.5)
min
(
H)

p
λ
min
(H)>0
cn
α
)
1
λ
1
min
(
H) cn
α
dim(Z)
P(n
α
H
H≤c|S) E[n
1/2
#
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
2
|n
α
H
H≤c, S]

E[
1
n
#
iS
V
ni
Z
ni
(Y
ni
Z
ni
β)
2
|S]=tr(
J)
p
tr(J)
+ P(n
α
H
H > c|S) E[n
2
|n
α
H
H > c, S]

(A.3)
p
0
p
0.
Hence, var(
n(
˜
β
β)|S) and var(
n
|S) have the same proba-
bility limit H
1
JH
1
, which is also the asymptotic variance of
β.
Supplementary Materials
The supplementary appendix contains proofs of intermediate results and
extensions.
Acknowledgments
We thank Gary King, seminar participants at Harvard, and the editor
(Hongyu Zhao) and referees for helpful comments, and Jaume Vives for
expert research assistance.
Funding
Financial support by the NSF through grant SES 0961707 is gratefully
acknowledged.
References
Abadie, A., and Imbens, G. (2006), "Large Sample Properties of Matching Estimators for Average Treatment Effects," Econometrica, 74, 235–267. [983,991]
Abadie, A., and Imbens, G. (2008), "On the Failure of the Bootstrap for Matching Estimators," Econometrica, 76, 1537–1557. [989]
Abadie, A., and Imbens, G. (2011), "Bias-Corrected Matching Estimators for Average Treatment Effects," Journal of Business & Economic Statistics, 29, 1–11. [986,991]
Abadie, A., and Imbens, G. (2012), "A Martingale Representation for Matching Estimators," Journal of the American Statistical Association, 107, 833–843. [984,986,988,992]
Abadie, A., and Imbens, G. (2016), "Matching on the Estimated Propensity Score," Econometrica, 84, 781–807. [991]
Abadie, A., Imbens, G. W., and Zheng, F. (2014), "Inference for Misspecified Models With Fixed Regressors," Journal of the American Statistical Association, 109, 1601–1614. [984]
Arellano, M. (1987), "Computing Robust Standard Errors for Within-Groups Estimators," Oxford Bulletin of Economics and Statistics, 49, 431–434. [988]
Blinder, A. S. (1973), "Wage Discrimination: Reduced Form and Structural Estimates," Journal of Human Resources, 8, 436–455. [985]
Cochran, W. G. (1953), "Matching in Analytical Studies," American Journal of Public Health and the Nations Health, 43, 684–691. [983]
Dehejia, R. H., and Wahba, S. (1999), "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs," Journal of the American Statistical Association, 94, 1053–1062. [983]
DiNardo, J., Fortin, N., and Lemieux, T. (1996), "Labor Market Institutions and the Distribution of Wages, 1973–1992: A Semiparametric Approach," Econometrica, 64, 1001–1044. [985]
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1–26. [988]
Eicker, F. (1967), "Limit Theorems for Regressions With Unequal and Dependent Errors," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 59–82. [987]
Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007), "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference," Political Analysis, 15, 199–236. [983,987,991]
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 221–233. [987]
Imbens, G. W., and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge: Cambridge University Press. [986]
Kahn, M. (2005), "An Exhalent Problem for Teaching Statistics," The Journal of Statistical Education, 13. [990]
Liang, K.-Y., and Zeger, S. L. (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73, 13–22. [988]
Oaxaca, R. (1973), "Male-Female Wage Differentials in Urban Labor Markets," International Economic Review, 14, 693–709. [985]
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55. [991]
Rosner, B. (1995), Fundamentals of Biostatistics, Belmont, CA: Duxbury Press. [990]
Rubin, D. B. (1973), "Matching to Remove Bias in Observational Studies," Biometrics, 29, 159–183. [983]
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology, 66, 688. [984]
Tager, I. B., Weiss, S. T., Muñoz, A., Rosner, B., and Speizer, F. E. (1983), "Longitudinal Study of the Effects of Maternal Smoking on Pulmonary Function in Children," New England Journal of Medicine, 309, 699–703. [990]
Tager, I. B., Weiss, S. T., Rosner, B., and Speizer, F. E. (1979), "Effect of Parental Cigarette Smoking on the Pulmonary Function of Children," American Journal of Epidemiology, 110, 15–26. [990]
White, H. (1980a), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817–838. [987]
White, H. (1980b), "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, 21, 149–170. [987,989]
White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1–25. [987]