Robust Post-Matching Inference

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION

2022, VOL. 117, NO. 538, 983–995: Theory and Methods

https://doi.org/10.1080/01621459.2020.1840383

Alberto Abadie

and Jann Spiess

Department of Economics, MIT, Cambridge, MA;

Graduate School of Business, Stanford University, Stanford, CA

ABSTRACT

Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and

control groups in observational studies. As a preprocessing step before regression, matching reduces the

dependence on parametric modeling assumptions. In current empirical practice, however, the matching

step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show
that ignoring the matching step results in asymptotically valid standard errors if matching is done without
replacement and the regression model is correctly specified relative to the population regression function
of the outcome variable on the treatment variable and all the covariates used for matching. However,
standard errors that ignore the matching step are not valid if matching is conducted with replacement or,
more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover,
correct specification of the regression model is not required for consistent estimation of treatment effects
with matched data. We show that two easily implementable alternatives produce approximations to the
distribution of the post-matching estimator that are robust to misspecification. A simulation study and an

empirical example demonstrate the empirical relevance of our results. Supplementary materials for this

article are available online.

ARTICLE HISTORY

Received January 2019

Accepted August 2020

KEYWORDS

Matching; Robust estimation;

Treatment eects

1. Introduction

Matching methods are widely used to create balance between

treatment and control groups in observational studies. Oen-

times, matching is followed by a simple comparison of means

between treated and nontreated (Cochran 1953;Rubin1973;

Dehejia and Wahba 1999). In other instances, however, match-

ing is used in combination with regression or with other esti-

mation methods more complex than a simple comparison of

means. The combination of matching in a first step with a

second-step regression estimator brings together parametric

and nonparametric estimation strategies and, as demonstrated

in Ho et al. (2007), reduces the dependence of regression esti-

mates on modeling decisions. Moreover, matching followed by

regression allows the estimation of elaborate models, such as

those that include interaction effects and other parameters that
go beyond average treatment effects.

In this article, we develop valid standard error estimates

for regression aer matching. The large sample properties of

average treatment eect estimators that employ a simple com-

parison of mean outcomes between treated and nontreated aer

matching on covariates are well understood (see, e.g., Abadie

and Imbens 2006). However, studies that employ regression

models aer matching usually ignore the matching step when

performing inference on post-matching regression coecients.

We show that this practice is not generally valid if the sec-

ond step regression is misspecied in a sense we make precise

below. We propose two easily implementable and robust-to-

misspecication approaches to the estimation of the standard

errors of regression coecient estimators in matched samples

CONTACT Alberto Abadie [email protected] Department of Economics, MIT, Cambridge, MA 02142.

Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.

(with matching done without replacement). First, we show that

standard errors that are clustered at the level of the matched

sets are valid under misspecification. Second, we show that a nonparametric block bootstrap that resamples matched pairs or matched sets, as opposed to resampling individual observations, also yields valid inference under misspecification. Furthermore, we show that standard errors that ignore the matching step can either underestimate or overestimate the variation of post-matching estimates. The procedures that we propose in this article are straightforward to implement with standard statistical software.

We will consider the following setup. Let W be a binary

random variable representing exposure to the treatment or condition of interest (e.g., smoking), so W = 1 for the treated, and W = 0 for the nontreated. Y is a random variable representing the outcome of interest (e.g., forced expiratory volume) and X is a vector of covariates (e.g., gender or age). We will study the problem of estimating how the treatment affects the outcomes of the individuals in the treated population (i.e., those with W = 1). In particular, we will analyze the properties of a two-step (first matching, then regression) estimator often used in empirical practice. This estimation strategy starts with an unmatched sample, $\mathcal{S}$, from which treated units and their matches are extracted to create a matched sample, $\mathcal{S}^*$. Matching is done without replacement and on the basis of the values of X. Then, using data for the matched sample only, the researcher runs a regression of Y on Z, where Z is a vector of functions of W and X (e.g., individual variables plus interactions). We aim to obtain valid inferential methods for the coefficients of this regression, possibly under misspecification. To be precise,


by “misspecication” we mean that there is no version of the

conditional expectation of Y given W and X that follows the

functional form employed in the second-step estimator. For

example, as explained below, a dierence in means between

treated and nontreated in the second step would be “misspeci-

ed” if the conditional expectation of Y given X and W depends

on X. To simplify the exposition, here we have described a

setting where Z depends only on the treatment, W,andonthe

covariates used in the matching stage, X. Our general framework

in Section 2 allows Z to depend on other covariates not in X.

The intuition behind the results in this article is that, if Y

depends on X, then matching on X creates dependence between the outcomes of treated units and their matches. This dependence is absorbed by the second-step regression function as long as the regression function is correctly specified relative to the population regression of Y on W and X. However, if the second-step regression is misspecified relative to the population regression of Y on W and X, dependence between

treated units and matches remains in the regression residuals.

Ignoring this dependence produces biased inference. Clustered

standard errors and analogous block bootstrap procedures take

into account the dependence between the outcomes of treated

units and their matches, restoring valid inference.

A special case of our setup is that of the standard matching

estimator for the average treatment eect on the treated, which

is given by the regression coecient on treatment W in a regres-

sion of Y on Z = (1, W)



. However, the framework allows for

richer analysis, such as the analysis of linear interaction eects

of the treatment with covariates, Z = (1, W, WX



, X



)



To illustrate the implications of our results, consider the

simple case when $Z = (1, W)'$. As we mentioned previously, for $Z = (1, W)'$ the sample regression coefficient on W corresponds to the simple matching estimator often employed in applied studies, which is based on a post-matching comparison of means between treated and nontreated. Under well-known conditions this estimator is consistent for the average effect of the treatment on the treated (see, e.g., Abadie and Imbens 2012), irrespective of the true form of the expectation of Y given W and X. Notice, however, that even in this simple scenario, our results imply that regression standard errors that ignore the matching step are not valid in general. Although the expectation of Y given W is linear because W is binary, a linear regression of Y on $Z = (1, W)'$ will be misspecified relative to the regression of Y on W and X, unless Y is mean-independent of X given W over a set of probability one.

The rest of the article is organized as follows. Section 2 starts

with a detailed description of the setup of our investigation.

We then characterize the parameters estimated by the two-step

procedure described above. We show that these parameters are

equal to the regression coecients in a regression of Y on Z in

a population for which the distribution of matching covariates

X in the control group has been modied to coincide with that

of the treated. Under selection on observables—that is, if treat-

ment is as good as random conditional on X—post-matching

regression estimands are equal to the population regression

coecients in an experiment where the treatment is randomly

assigned in a population that has the same distribution of X as

the treated. We next establish consistency with respect to this

vector of parameters, show asymptotic normality, and describe

the asymptotic variance of the post-matching estimator. In

Section 3, we discuss dierent ways of constructing standard

errors.BasedontheresultsofSection 2,weshowthatstandard

errors that ignore the matching step are not generally valid

if the regression model is misspecied in the sense indicated

above, while clustered standard errors or an analogous block

bootstrap procedure yield valid inference. Section 4 presents

simulation evidence, which conrms our theoretical results.

Section 5 applies our results to the analysis of the eect of

smoking on pulmonary function. In this application, matching

before regression and the use of the robust standard errors

proposed in this article substantially aect empirical ndings.

Section 6 concludes.

The appendix contains the proofs of our main results. A

supplementary appendix contains proofs of intermediate results

and two extensions. In particular, the standard errors derived in

this article are valid for unconditional inference. Alternatively,

one could perform inference conditional on the values of the

regressors, X and W, in the sample. Notice that, in this case, the

rst step matches are xed. We discuss this alternative setting

in the supplementary appendix, where we show that, for the

conditional case, the usual regression standard errors are not

generally valid, but valid standard errors can be calculated using

the formulas in Abadie, Imbens, and Zheng (2014). Also, for

concreteness and following the vast majority of applied practice,

in the main text of this article we restrict our analysis to linear

regression aer matching. In the supplementary appendix, we

provide an extension of our result to general M-estimation aer

matching.

2. Post-Matching Inference

In this section, we discuss the asymptotic distribution of the least squares estimator obtained from a linear regression of Y on Z after matching on observables, X.

2.1. Post-Matching Least Squares

Consider a standard binary treatment setting along the lines of

Rubin (1974) with potential outcomes Y(1) and Y(0), of which
we only observe Y = Y(W) for treatment W ∈ {0, 1}. Let S be

a set of observed covariates.

We will assume that the data consist of random samples of

treated and nontreated. This assumption could be easily relaxed,

and we adopt it only to simplify the discussion.

Assumption 1 (Random sampling). $\mathcal{S} = \{(Y_i, W_i, S_i)\}_{i=1}^{N}$ is a pooled sample obtained from $N_1$ and $N_0$ independent draws from the population distribution of (Y, S) for the treated (W = 1) and nontreated (W = 0), respectively, so $N = N_0 + N_1$.

Consider an (m × 1) vector of covariates $X = f(S) \in \mathcal{X} \subseteq \mathbb{R}^m$, and let $\mathcal{S}^* \subseteq \mathcal{S}$ be the matched sample generated by matching without replacement each treated unit to M nontreated units on the basis of their X-values. We will denote by $\mathcal{J}(i)$ the set of nontreated units matched to treated unit i. For simplicity, in our notation we omit the dependence of $\mathcal{J}(i)$ on N and M. Often, for matching without replacement, the sets $\mathcal{J}(i)$ form the collection of nonoverlapping subsets of $\{j : W_j = 0\}$, each of cardinality M, that minimizes the sum of the matching discrepancies,
$$\sum_{i=1}^{N} W_i \sum_{j \in \mathcal{J}(i)} d(X_i, X_j), \tag{1}$$

where d : X × X →[0, ∞) is a metric. More generally, our

conditions do not require a matching scheme that directly min-

imizes (1), as long as Assumption 3 and the Lipschitz conditions

in Assumption 4 and Proposition 3 hold for some metric, d(·, ·),

under the adopted matching scheme.

The matched sample, $\mathcal{S}^* = \bigcup_{i:\,W_i = 1} (\{i\} \cup \mathcal{J}(i))$, has size $n = (M + 1)N_1$. We use a double subscript notation to refer to the observations in the matched sample. For instance, $Y_{n1}, \ldots, Y_{nn}$ refers to the values of the outcome variable for the units in $\mathcal{S}^*$, with analogous notation for other variables. Within the matched sample, observations will be rearranged so that the first $N_1$ observations are the treated units.

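Let Z = g(W, S) be a (k × 1) vector of functions of (W, S), and let $\hat\beta$ be the vector of sample regression coefficients obtained from regressing Y on Z in the matched sample,
$$\hat\beta = \operatorname*{argmin}_{b \in \mathbb{R}^k} \frac{1}{n}\sum_{i=1}^{n} (Y_{ni} - Z_{ni}'b)^2 = \Big(\frac{1}{n}\sum_{i=1}^{n} Z_{ni}Z_{ni}'\Big)^{-1} \frac{1}{n}\sum_{i=1}^{n} Z_{ni}Y_{ni}. \tag{2}$$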

In Section 2.3, we will introduce a set of assumptions under which $\hat\beta$ exists with probability approaching one.

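As we mentioned above, when $Z = (1, W)'$ the regression coefficient on W in the matched sample is given by
$$\hat\tau = \frac{1}{N_1}\sum_{i=1}^{n} W_{ni}Y_{ni} - \frac{1}{M N_1}\sum_{i=1}^{n} (1 - W_{ni})Y_{ni} = \frac{1}{N_1}\sum_{i=1}^{N_1}\Big(Y_{ni} - \frac{1}{M}\sum_{j \in \mathcal{J}(i)} Y_{nj}\Big),$$
which is the usual matching estimator for the average effect of the treatment on the treated.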
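A minimal sketch of the two-step procedure for the special case $Z = (1, W)'$ with a scalar matching covariate may help fix ideas. It uses greedy nearest-neighbor matching, a simplification assumed here for brevity that need not minimize the total discrepancy in (1). The function names `greedy_match` and `att_matching_estimate` are illustrative and do not come from the article.

```python
def greedy_match(x_treated, x_control, m=1):
    """Greedy nearest-neighbor matching without replacement on a scalar covariate.

    Returns a list whose i-th entry holds the indices (into x_control) of the
    m controls matched to treated unit i. Greedy matching is an assumption of
    this sketch; it is not guaranteed to minimize the sum of discrepancies.
    """
    if m * len(x_treated) > len(x_control):
        raise ValueError("need at least m controls per treated unit")
    available = set(range(len(x_control)))
    matches = []
    for xt in x_treated:
        # Pick the m closest controls that have not been used yet.
        chosen = sorted(available, key=lambda j: abs(x_control[j] - xt))[:m]
        available.difference_update(chosen)
        matches.append(chosen)
    return matches


def att_matching_estimate(y_treated, y_control, matches):
    """Average over treated units of Y_i minus the mean outcome of its matches."""
    diffs = [
        yt - sum(y_control[j] for j in js) / len(js)
        for yt, js in zip(y_treated, matches)
    ]
    return sum(diffs) / len(diffs)
```

Because every matched set contains the same number of controls, the value returned by `att_matching_estimate` coincides with the coefficient on W in a least squares regression of Y on a constant and W in the matched sample.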

2.2. Characterization of the Estimand

Before we study the sampling distribution of $\hat\beta$, we first characterize its population counterpart, which we will denote by β. That is, our first task is to obtain a precise description of the nature of the parameters estimated by $\hat\beta$. Although post-matching regressions are often used in empirical practice, to

the best of our knowledge, the precise nature of post-matching

estimands has not been previously derived.

The goal of matching is to change the distribution of the covariates in the sample of nontreated units, so that it reproduces the distribution of the covariates among the treated. To

do so, it is necessary that the support of the matching variables,

X, for the treated is inside the support for the nontreated.

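Assumption 2 (Support condition). Let $\mathcal{X}_1 = \operatorname{supp}(X \mid W = 1)$ and $\mathcal{X}_0 = \operatorname{supp}(X \mid W = 0)$, then $\mathcal{X}_1 \subseteq \mathcal{X}_0$.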

We now describe the population distribution targeted by the

matched sample, $\mathcal{S}^*$. Let P(·|W = 1) and P(·|W = 0) be the matching source distributions of (Y, S) from where the treated and nontreated samples in $\mathcal{S}$ are, respectively, drawn, and let E[·|W = 1] and E[·|W = 0] be the corresponding expectation operators. For given P(·|W = 1) and P(·|W = 0) and a given number of matches, M, we define a matching target distribution, $P^*$, over the triple (Y, S, W), as follows:
$$P^*(W = 1) = \frac{1}{1 + M},$$
and for each measurable set, A,
$$P^*((Y, S) \in A \mid W = 1) = P((Y, S) \in A \mid W = 1),$$
and
$$P^*((Y, S) \in A \mid W = 0) = E\big[P((Y, S) \in A \mid W = 0, X) \mid W = 1\big].$$

That is, in the matching target distribution: (i) treatment is

assigned in the same proportion as in the matched sample; (ii)

the distribution of (Y, S) among the treated is the same as in the

matching source; (iii) the distribution of (Y, S) among the non-

treated is generated by integrating the conditional distribution

of (Y, S) given X and W = 0 over the distribution of X given

W = 1, in the matching source. As a result, under the matching

target distribution, the distribution of X given W = 0 coincides

with the distribution of X given W = 1.

Under regularity conditions stated below, estimation on the matched sample, $\mathcal{S}^*$, asymptotically recovers parameters of the matching target distribution, $P^*$, in which the treated and nontreated have the same distribution of X, but possibly different outcome and covariate distributions conditional on X. As a result, comparisons of outcomes between treated and nontreated in the matched sample, $\mathcal{S}^*$, produce the controlled contrasts of the Oaxaca–Blinder decomposition (Blinder 1973; Oaxaca 1973; DiNardo, Fortin, and Lemieux 1996). More generally, under regularity conditions, regression coefficients of Y on Z in the matched sample, $\mathcal{S}^*$, asymptotically recover the analogous regression coefficients in the target population:
$$\beta = \operatorname*{argmin}_{b \in \mathbb{R}^k} E^*[(Y - Z'b)^2] = (E^*[ZZ'])^{-1} E^*[ZY]. \tag{3}$$

Matching methods are oen motivated by a selection-on-

observables assumption, that is, by the assumption that treat-

ment assignment is as good as random conditional on observed

covariates. To formalize the assumption of selection on observ-

ables and its implications in our framework, consider source

populations expressed this time in terms of potential outcomes

and covariates, Q(·|W = 1) and Q(·|W = 0), which represent

the distributions of (Y(1), Y(0), S) given W = 1andW = 0,

respectively. These distributions are dened in such a way that

P(·|W = 1) and P(·|W = 0) can be obtained by integrating

out Y(0) from Q(·|W = 1) and Y(1) from Q(·|W = 0),

respectively. For given Q(·|W = 1) and Q

(·|W = 0),selection

on observables means

(Y(1), Y(0), S)|X, W = 1 ∼ (Y(1), Y(0), S)|X, W = 0


almost surely with respect to the distribution of X|W = 1. That

is, the joint distribution of covariates and potential outcomes

is independent of treatment assignment conditional on the

matching variables. Because in this article, we focus on causal

parameters dened for a population with distribution of the

matching variables equal to X|W = 1, for our purposes it is

enough that the selection-on-observables assumption holds for

the distribution of (Y(0), S) only,

(Y(0), S)|X, W = 1 ∼ (Y(0), S)|X, W = 0. (4)

Proposition 1 (Estimand under selection on observables). Suppose that Assumption 2 holds and that β, as defined in Equation (3), exists. Then if selection on observables, as defined in Equation (4), holds, the coefficients β are the same as the population coefficients that would be obtained from a regression of Y on Z in a setting where:
1. (Y(1), Y(0), S) has distribution Q(·|W = 1);
2. treatment is randomly assigned with probability 1/(M + 1).

This result formalizes the notion that matching under selection on observables allows researchers to reproduce an experimental setting under which average treatment effects can be

easily evaluated through a least squares regression of Y on Z.

The results in this article, however, apply to the general estimand

β in Equation (3), regardless of the validity of the selection-on-

observables assumption.

2.3. Consistency and Asymptotic Normality

In this section, we will establish large sample properties of $\hat\beta$, as $N_0, N_1 \to \infty$ with $N_0 \ge M N_1$. Throughout this article, we

will assume that the sum of matching discrepancies vanishes

quickly enough to allow asymptotic unbiasedness and root-n

consistency:

Assumption 3 (Matching discrepancies).
$$\frac{1}{\sqrt{N_1}}\sum_{i=1}^{N_1}\sum_{j \in \mathcal{J}(i)} d(X_{ni}, X_{nj}) \xrightarrow{p} 0.$$

Abadie and Imbens (2012) derived primitive conditions for Assumption 3, which require $N_1 = O(N_0^{1/r})$ for some r greater than the number of covariates in X (excluding those that take on a finite number of values). This condition highlights the importance of obtaining matches from a large reservoir of untreated units, especially when the dimensionality of X is large. Of course, in concrete empirical settings, the adequacy of matching should not rely on asymptotic results. Instead, the quality of the matches needs to be evaluated for each particular sample. Abadie and Imbens (2011) and Imbens and Rubin (2015) discussed measures of the discrepancy between the distributions of the covariates of treated and nontreated. For example, the normalized difference in Abadie and Imbens (2011) is $(m_1 - m_0)/\sqrt{(s_1^2 + s_0^2)/2}$, where $m_w$ and $s_w$ are the means and standard deviations of a covariate (typically, products and powers of the components of X) for the units with W = w in the matched sample.
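The normalized difference is simple to compute for any covariate in the matched sample. The following is an illustrative sketch; the function name and the use of sample (n − 1) variances are assumptions made here, not details taken from the article.

```python
from math import sqrt
from statistics import mean, variance


def normalized_difference(treated, control):
    """Normalized difference (m1 - m0) / sqrt((s1^2 + s0^2) / 2) for one covariate.

    `treated` and `control` hold the covariate values of the units with
    W = 1 and W = 0 in the matched sample. Sample variances (n - 1
    denominator) are used, which is an assumption of this sketch.
    """
    m1, m0 = mean(treated), mean(control)
    s1_sq, s0_sq = variance(treated), variance(control)
    return (m1 - m0) / sqrt((s1_sq + s0_sq) / 2)
```

Values close to zero indicate that the covariate is well balanced between treated and matched nontreated units; the same function can be applied to squares or products of covariates.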

For any real matrix A, let $\|A\| = \sqrt{\operatorname{tr}(A'A)}$ be the Euclidean

norm of A. The next assumption collects regularity conditions

on the conditional moments of (Y, Z) given (X, W).

Assumption 4 (Well-behavedness of conditional expectations).

For w = 0, 1, and some δ>0,

E[Z

|W = w, X = x] and

E[Z(Y − Z



β)

2+δ

|W = w, X = x]

areuniformlyboundedonX

.Furthermore,

E[ZZ



|X = x, W = 0], E[ZY|X = x, W = 0]

and var(Z(Y − Z



β)|X = x, W = 0)

are componentwise Lipschitz in x with respect to d(·, ·).

To ensure the existence of $\hat\beta$ with probability approaching one as $n \to \infty$, we assume invertibility of the Hessian, $H = E^*[ZZ']$. Notice that
$$H = \frac{E\big[E[ZZ' \mid X, W = 1] + M\,E[ZZ' \mid X, W = 0] \,\big|\, W = 1\big]}{1 + M}. \tag{5}$$

Assumption 5 (Linear independence of regressors). H is invert-

ible.

The next proposition establishes the asymptotic distribution of $\hat\beta$.

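Proposition 2 (Asymptotic distribution of the post-matching estimator). Under Assumptions 1–5,
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, H^{-1} J H^{-1}),$$
where
$$J = \frac{\operatorname{var}\big(E[Z(Y - Z'\beta) \mid X, W = 1] + M\,E[Z(Y - Z'\beta) \mid X, W = 0] \,\big|\, W = 1\big)}{1 + M} + \frac{E\big[\operatorname{var}(Z(Y - Z'\beta) \mid X, W = 1) + M \operatorname{var}(Z(Y - Z'\beta) \mid X, W = 0) \,\big|\, W = 1\big]}{1 + M}$$
and H is as defined in Equation (5).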

All proofs are in the appendix.

3. Post-Matching Standard Errors

In the previous section, we established that
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, H^{-1} J H^{-1})$$
for the post-matching estimator obtained from a regression of Y on Z within the matched sample $\mathcal{S}^*$. In this section, our goal is to estimate the asymptotic variance, $H^{-1} J H^{-1}$.


3.1. Standard Errors Ignoring the Matching Step

Ho et al. (2007) argued that matching can be seen as a preprocessing step, prior to estimation, so the matching step can be ignored in the calculation of standard errors. Here, we consider commonly applied “sandwich” standard error estimates for iid data (Eicker 1967; Huber 1967; White 1980a, 1980b, 1982). In an iid setting, sandwich standard errors are valid in large samples even if the regression is misspecified relative to the conditional expectation of Y given Z, in which case the population regression parameters are the coefficients of an $L_2$ approximation

to the conditional expectation. As we will show, however, the

assumption of iid data does not apply in matched samples.

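Sandwich standard errors can be computed as the square root of the main diagonal of the matrix $\hat H^{-1} \hat J_r \hat H^{-1}/n$, where
$$\hat H = \frac{1}{n}\sum_{i=1}^{n} Z_{ni}Z_{ni}' \tag{6}$$
and
$$\hat J_r = \frac{1}{n}\sum_{i=1}^{n} Z_{ni}(Y_{ni} - Z_{ni}'\hat\beta)^2 Z_{ni}'. \tag{7}$$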

The following proposition derives the probability limit of $\hat J_r$ with data from a matched sample.

Proposition 3 (Convergence of $\hat J_r$). Suppose that Assumptions 1–5 hold. Assume also that $E[Z(Y - Z'\beta)^2 Z' \mid X = x, W = 0]$ is Lipschitz on $\mathcal{X}_0$

and E[Y

|X = x, W = w] is uniformly bounded on $\mathcal{X}_w$ for all w = 0, 1. Then, $\hat J_r \xrightarrow{p} J_r$, where
$$J_r = \frac{E\big[E[Z(Y - Z'\beta)^2 Z' \mid X, W = 1] + M\,E[Z(Y - Z'\beta)^2 Z' \mid X, W = 0] \,\big|\, W = 1\big]}{1 + M}.$$

Notice that $J_r = E^*[Z(Y - Z'\beta)^2 Z']$. That is, $J_r$ is equal to the inner matrix of the sandwich asymptotic variance when data are iid with distribution $P^*$. However, since the matched sample $\mathcal{S}^*$ is not an iid sample from $P^*$, $\hat J_r$ is not generally consistent for J. The difference between the limit of the sandwich variance estimator, $H^{-1} J_r H^{-1}$, and the actual asymptotic variance, $H^{-1} J H^{-1}$, is given by $H^{-1} \Delta H^{-1}$, where
$$\Delta = \frac{-M\,E[\Gamma_1(X)\Gamma_0(X)' + \Gamma_0(X)\Gamma_1(X)' \mid W = 1] - (M - 1)M\,E[\Gamma_0(X)\Gamma_0(X)' \mid W = 1]}{M + 1}, \tag{8}$$
and
$$\Gamma_w(x) = E[Z(Y - Z'\beta) \mid X = x, W = w]$$
for w = 0, 1.

Therefore, bias in the estimation of the variance may arise when $\Gamma_0(X) \neq 0$. The following example provides a simple instance of this bias.
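Before turning to the example, the expression in Equation (8) can be checked directly from the limits in Propositions 2 and 3. The lines below are a reconstruction under the stated assumptions and not a substitute for the proofs in the appendix. They write $\Gamma_w(X) = E[Z(Y - Z'\beta) \mid X, W = w]$, use $V_w(X) = \operatorname{var}(Z(Y - Z'\beta) \mid X, W = w)$ as a shorthand introduced here, denote by $J_r$ the probability limit of the sandwich inner matrix in Proposition 3, and take $\Delta = J_r - J$.

```latex
\begin{align*}
% first-order condition of the population least squares problem (3)
0 &= (1+M)\,E^*[Z(Y - Z'\beta)] = E[\Gamma_1(X) + M\,\Gamma_0(X) \mid W = 1], \\
% sandwich limit: conditional second moments equal variance plus outer product of means
(1+M)\,J_r &= E[V_1 + \Gamma_1\Gamma_1' + M\,(V_0 + \Gamma_0\Gamma_0') \mid W = 1], \\
% actual limit: the term inside var(.) has mean zero by the first line
(1+M)\,J &= E[(\Gamma_1 + M\Gamma_0)(\Gamma_1 + M\Gamma_0)' \mid W = 1] + E[V_1 + M\,V_0 \mid W = 1], \\
(1+M)\,(J_r - J) &= -M\,E[\Gamma_1\Gamma_0' + \Gamma_0\Gamma_1' \mid W = 1] - (M-1)M\,E[\Gamma_0\Gamma_0' \mid W = 1].
\end{align*}
```

In particular, the difference vanishes when $\Gamma_0(X) = 0$ and, for M = 1, also when $\Gamma_1(X) = 0$.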

Example 1 (Inconsistency of sandwich standard errors). Assume the sample outcomes are drawn from
$$Y = \tau W + X + \varepsilon, \tag{9}$$
where X is a scalar random variable with $\operatorname{var}(X \mid W = 1) = \sigma_X^2$, and ε has mean zero, variance $\sigma^2$, and is independent of W and X. Consider the case where we match the values of X for $N_1$ treated units to $N_0$ untreated units (M = 1) without replacement. Let j(i) be the index of the untreated observation that serves as a match for treated observation i. For simplicity, suppose that X is discrete and all matches are perfect, $X_{ni} = X_{nj(i)}$ for every treated unit i, so we can ignore potential biases generated by matching discrepancies. Within the matched sample, $\mathcal{S}^*$, we run a linear regression of Y on $Z = (1, W)'$ to obtain the regression coefficient on W,
$$\hat\tau = \frac{1}{N_1}\sum_{i=1}^{N_1} (Y_{ni} - Y_{nj(i)}). \tag{10}$$
$\hat\tau$ is the usual matching estimator for the average effect of the treatment on the treated. Notice that, in the previous expression, $Y_{ni} - Y_{nj(i)} = \tau + \varepsilon_{ni} - \varepsilon_{nj(i)}$, with variance $2\sigma^2$. Variation in X is taken care of through matching. Therefore, all variation in $\hat\tau$ comes through the error term, ε. Because $n = 2N_1$, it follows that
$$n \operatorname{var}(\hat\tau) = 4\sigma^2.$$
Consider now the residuals of the ordinary least squares (OLS) regression of $Y_{ni}$ on a constant and $W_{ni}$ in the matched sample:
$$\hat\varepsilon_{ni} = Y_{ni} - \hat\mu - \hat\tau W_{ni} \approx X_{ni} + \varepsilon_{ni},$$
where $\hat\mu$ is the intercept of the sample regression line. For this simple case, the sandwich variance estimator for $\hat\tau$ is
$$n\,\widehat{\operatorname{var}}(\hat\tau) = \frac{4}{n}\sum_{i=1}^{n} \hat\varepsilon_{ni}^2 \approx 4\sigma_X^2 + 4\sigma^2.$$
That is, in this example, the sandwich variance estimator overestimates the variance of $\hat\tau$ because it does not take into account the dependence generated by matching between the regression residuals of the treated units and their matches.

Sections 3.2 and 3.3 discuss variance estimators that adjust for the matching step by taking into account the dependence of regression errors between treated units and their matches. For matching with M = 1 and a second-step regression of Y on a constant and W, the clustered variance estimator of Section 3.2 becomes
$$n\,\widehat{\operatorname{var}}_{\mathrm{cl}}(\hat\tau) = \frac{4}{n}\sum_{i=1}^{N_1} (\hat\varepsilon_{ni} - \hat\varepsilon_{nj(i)})^2 \approx 4\sigma^2,$$
restoring valid inference.

The next example shows that ignoring the matching step may

result in underestimation of the variance.

Example 2 (Underestimation of the variance). In the same setting as Example 1, assume that data are generated by
$$Y = \tau W + X - 2WX + \varepsilon. \tag{11}$$
The post-matching estimator of τ from a regression of Y on $(1, W)'$ is $\hat\tau$ as in Equation (10). In this case, if all matches are perfect, so $X_{ni} = X_{nj(i)}$, we obtain $Y_{ni} - Y_{nj(i)} = \tau - 2X_{ni} + \varepsilon_{ni} - \varepsilon_{nj(i)}$. Therefore,
$$n \operatorname{var}(\hat\tau) = 8\sigma_X^2 + 4\sigma^2.$$
Least squares regression residuals are
$$\hat\varepsilon_{ni} = Y_{ni} - \hat\mu - \hat\tau W_{ni} \approx X_{ni} - 2W_{ni}X_{ni} + \varepsilon_{ni} = \begin{cases} -X_{ni} + \varepsilon_{ni} & \text{if } W_{ni} = 1, \\ X_{ni} + \varepsilon_{ni} & \text{if } W_{ni} = 0, \end{cases}$$
implying
$$n\,\widehat{\operatorname{var}}(\hat\tau) = \frac{4}{n}\sum_{i=1}^{n} \hat\varepsilon_{ni}^2 \approx 4\sigma_X^2 + 4\sigma^2$$
for the conventional sandwich variance estimator. Again, the sandwich variance estimator does not take into account dependencies between sample units induced by matching. In this example, matching on X induces a negative correlation between the regression residuals of the treated units and their matches. As a result, the sandwich variance estimator underestimates the variance of $\hat\tau$. Once again, the clustered variance estimator of Section 3.2 takes into account the correlation between regression errors induced by matching, and produces valid inference,
$$n\,\widehat{\operatorname{var}}_{\mathrm{cl}}(\hat\tau) = \frac{4}{n}\sum_{i=1}^{N_1} (\hat\varepsilon_{ni} - \hat\varepsilon_{nj(i)})^2 \approx 8\sigma_X^2 + 4\sigma^2.$$
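The two examples are easy to reproduce numerically. The sketch below is an illustration under assumptions made here rather than the article's simulation design: X takes the values −1 and 1 with equal probability (so its variance is 1 and its mean is 0), ε is standard normal, matches are perfect, and M = 1. The function name `simulate_example` and its parameters are hypothetical.

```python
import random


def simulate_example(n_pairs, tau, interaction, sigma=1.0, seed=0):
    """Draw perfectly matched pairs and return (tau_hat, n*sandwich, n*clustered).

    interaction = 0 gives Y = tau*W + X + eps (Example 1);
    interaction = -2 gives Y = tau*W + X - 2*W*X + eps (Example 2).
    """
    rng = random.Random(seed)
    y1, y0 = [], []
    for _ in range(n_pairs):
        x = rng.choice([-1.0, 1.0])  # shared by the treated unit and its match
        y1.append(tau + x + interaction * x + rng.gauss(0.0, sigma))  # W = 1
        y0.append(x + rng.gauss(0.0, sigma))                          # W = 0
    n = 2 * n_pairs
    mean1, mean0 = sum(y1) / n_pairs, sum(y0) / n_pairs
    # OLS residuals from regressing Y on (1, W): deviations from group means.
    e1 = [y - mean1 for y in y1]
    e0 = [y - mean0 for y in y0]
    # n times the variance estimates for the coefficient on W when M = 1.
    sandwich = 4.0 / n * (sum(e * e for e in e1) + sum(e * e for e in e0))
    clustered = 4.0 / n * sum((a - b) ** 2 for a, b in zip(e1, e0))
    return mean1 - mean0, sandwich, clustered
```

With the unit variances assumed here, the sandwich quantity should settle near 8 in both designs, while the clustered quantity should settle near 4 for Example 1 and near 12 for Example 2, in line with the approximations displayed in the examples.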

Sandwich standard errors would be valid in Examples 1 and 2 if the specifications for the post-matching regressions included the terms containing X in Equations (9) and (11), respectively. Indeed, sandwich standard errors are generally valid if the regression is correctly specified in a specific sense defined in the following result.

Proposition 4 (Validity of sandwich standard errors under correct specification). Assume that the post-matching regression,
$$Y = Z'\beta + \varepsilon,$$
is correctly specified with respect to the conditional distribution of Y given (Z, X, W), that is, E[ε|Z, X, W] = 0. Then, under the assumptions of Proposition 3, $J_r = J$ and the sandwich variance estimator, $\hat H^{-1} \hat J_r \hat H^{-1}$, is consistent for the asymptotic variance of $\sqrt{n}(\hat\beta - \beta)$.

Notice, however, that correct specification is precisely the condition under which matching would not be required to obtain a consistent estimator of β, since direct estimation without matching would be valid. Moreover, a correct specification (in the sense defined above) of the post-matching regression is not required for consistent estimation of causal parameters. For example, under regularity conditions, a simple difference in means between the treated and a matched sample of untreated units is consistent for the average effect of the treatment on the treated. Consistent estimators of the variance exist for the simple difference in means in a matched sample. These variance estimators are different from the sandwich variance estimator, and do not rely on correct specification of the post-matching regression (see Abadie and Imbens 2012).

Finally, Equation (8) implies that the conditions of Proposition 4 can be slightly weakened to require only that the regression function is correctly specified among the nontreated, in the sense that E[ε|Z, X, W = 0] = 0. This is because for the estimators studied in this article, matching affects only the distribution of the covariates for the nontreated. In addition, for the special case M = 1, it is sufficient that the regression function is correctly specified among the treated, in the sense that E[ε|Z, X, W = 1] = 0.

3.2. Match-Level Clustered Standard Errors

We have shown that sandwich standard errors are not generally

valid for the post-matching least squares estimator. In this section, we will demonstrate that, when matching is done without replacement, clustered standard errors (Liang and Zeger 1986; Arellano 1987) can be employed to obtain valid estimates of the standard deviation of post-matching regression coefficients. In

particular, we will consider standard errors clustered at the level

of the matched sets.

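Consider an estimator of the asymptotic variance of $\hat\beta$ given by $\hat H^{-1} \hat J \hat H^{-1}$, where $\hat H$ is as in Equation (6) and $\hat J$ is given by the clustered variance formula applied to the matched sets,
$$\hat J = \frac{1}{n}\sum_{i=1}^{N_1} \Big(Z_{ni}(Y_{ni} - Z_{ni}'\hat\beta) + \sum_{j \in \mathcal{J}(i)} Z_{nj}(Y_{nj} - Z_{nj}'\hat\beta)\Big) \Big(Z_{ni}(Y_{ni} - Z_{ni}'\hat\beta) + \sum_{j \in \mathcal{J}(i)} Z_{nj}(Y_{nj} - Z_{nj}'\hat\beta)\Big)'.$$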

Clustered standard errors can be readily implemented using

standard statistical soware. The next result shows that match-

level clustered standard errors are valid in large samples for the

post-matching estimator (provided matching is done without

replacement).

Proposition 5 (Validity of clustered standard errors). Under the assumptions of Proposition 3, we obtain that $\hat J \xrightarrow{p} J$. In particular, the clustered estimator of the variance is consistent, that is,
$$\hat H^{-1} \hat J \hat H^{-1} - n \operatorname{var}(\hat\beta) \xrightarrow{p} 0.$$

The intuition behind this result is that matching on covariates

makes regression errors statistically dependent among units in

the same matched sets, $\{i\} \cup \mathcal{J}(i)$, $i = 1, \ldots, N_1$. Standard errors

clustered at the level of the matched set take this dependency

into account.
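For intuition, the clustered formula can be specialized to M = 1 and $Z = (1, W)'$, the case used in Examples 1 and 2. The following is a worked sketch, with $\hat\varepsilon_{ni}$ denoting least squares residuals in the matched sample and j(i) the match of treated unit i, and assuming that the inner matrix is normalized by 1/n as in Equation (7).

```latex
\begin{align*}
\hat H &= \frac{1}{n}\sum_{i=1}^{n} Z_{ni}Z_{ni}'
        = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1/2 \end{pmatrix},
\qquad
\hat H^{-1} = \begin{pmatrix} 2 & -2 \\ -2 & 4 \end{pmatrix}, \\
% score of matched pair i: the treated unit plus its single match
s_i &= Z_{ni}\hat\varepsilon_{ni} + Z_{nj(i)}\hat\varepsilon_{nj(i)}
     = \begin{pmatrix} \hat\varepsilon_{ni} + \hat\varepsilon_{nj(i)} \\ \hat\varepsilon_{ni} \end{pmatrix},
\qquad
(-2,\; 4)\, s_i = 2\,(\hat\varepsilon_{ni} - \hat\varepsilon_{nj(i)}), \\
\big[\hat H^{-1}\hat J\hat H^{-1}\big]_{22}
    &= \frac{1}{n}\sum_{i=1}^{N_1} \big((-2,\;4)\,s_i\big)^2
     = \frac{4}{n}\sum_{i=1}^{N_1} (\hat\varepsilon_{ni} - \hat\varepsilon_{nj(i)})^2 .
\end{align*}
```

With perfect matches in Example 1, the residual difference is approximately $\varepsilon_{ni} - \varepsilon_{nj(i)}$, with variance $2\sigma^2$, so the last expression is close to $(4/n) N_1 \cdot 2\sigma^2 = 4\sigma^2$.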

3.3. Matched Bootstrap

Proposition 5 shows that clustered standard errors are valid for

the asymptotic variance of the post-matching estimator. In this

section, we show that a clustered version of the nonparametric

bootstrap (Efron 1979) is also valid. This version of the bootstrap relies on resampling of matched sets instead of individual observations.


Recall that we reordered the observations in our sample, so that the first $N_1$ observations are the treated. Consider the nonparametric bootstrap that samples treated units together with their M match partners from $\mathcal{S}^*$ to obtain
$$\hat\beta^* = \Big(\frac{1}{n}\sum_{i=1}^{n} V_{ni} Z_{ni}Z_{ni}'\Big)^{-1} \frac{1}{n}\sum_{i=1}^{n} V_{ni} Z_{ni}Y_{ni},$$
where $(V_{n1}, \ldots, V_{nN_1})$ has a multinomial distribution with parameters $(N_1, (1/N_1, \ldots, 1/N_1))$, and $V_{nj} = V_{ni}$ if $j > N_1$ and $j \in \mathcal{J}(i)$. In this bootstrap procedure, $N_1$ units are drawn at random with replacement from the $N_1$ treated sample units. Untreated units are drawn along with their treated match. Effectively, the matched bootstrap samples matched sets of one

treated unit and M untreated units. The next proposition shows

validity of the matched bootstrap.
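The resampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `matched_bootstrap_se`, the `cluster` labels identifying matched sets, and the use of `numpy.linalg.lstsq` (which returns a minimum-norm solution when a draw happens to be singular) are assumptions made for the example.

```python
import numpy as np


def matched_bootstrap_se(Z, y, cluster, n_boot=1000, seed=0):
    """Bootstrap standard errors from resampling whole matched sets."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)

    _, set_index = np.unique(np.asarray(cluster), return_inverse=True)
    n_sets = set_index.max() + 1
    probs = np.full(n_sets, 1.0 / n_sets)

    draws = np.empty((n_boot, Z.shape[1]))
    for b in range(n_boot):
        V_sets = rng.multinomial(n_sets, probs)  # one weight per matched set
        V = V_sets[set_index]                    # every unit inherits its set's weight
        root = np.sqrt(V)
        # Weighted least squares with weights V, written as OLS on rescaled data.
        coef = np.linalg.lstsq(Z * root[:, None], y * root, rcond=None)[0]
        draws[b] = coef
    return draws.std(axis=0, ddof=1), draws
```

Because weights are drawn at the level of the matched set, a treated unit and its matched controls always enter a bootstrap sample together, which is the feature that distinguishes this scheme from resampling individual observations.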

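Proposition 6 (Validity of the matched bootstrap). Under the assumptions of Proposition 5, we have that

$$\sup_{r\in\mathbb R^{\dim(Z)}}\Big| P\big(\sqrt{n}(\hat\beta^* - \hat\beta)\le r \,\big|\, S\big) - P\big(N(0, H^{-1}JH^{-1})\le r\big)\Big| \xrightarrow{p} 0.$$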

Proposition 6 shows that the bootstrap distribution provides an asymptotically valid approximation of the limiting distribution of the post-matching estimator, but that does not necessarily imply that the associated bootstrap variance is an asymptotically valid estimate of the variance of the estimator. The formal analysis of the bootstrap variance is complicated by the fact that, in forming the bootstrap estimate $\hat\beta^*$, the empirical analog

$$\hat H^* = \frac{1}{n}\sum_{i\in S^*} V_i Z_i Z_i'$$

of the Hessian $H$ for a given bootstrap draw may be ill-conditioned or noninvertible. In fact, because the bootstrap may sample the same matched set $N_1$ times, noninvertibility of the Hessian may happen with positive probability for any sample size. To circumvent this issue, we fix constants $c > 0$ and $\alpha \in (0, 1/2)$ and consider the alternative bootstrap estimator

$$\check\beta^* = \begin{cases}\hat\beta^* & \text{if } \|\hat H^* - \hat H\| \le c/n^{\alpha},\\ \hat\beta & \text{otherwise.}\end{cases}$$

That is, $\check\beta^*$ is equal to $\hat\beta^*$ whenever the bootstrap Hessian, $\hat H^*$, is close to the matched sample Hessian, $\hat H$. Otherwise, $\check\beta^*$ is equal to the post-matching estimator, $\hat\beta$. As the sample size grows, $\check\beta^*$ is equal to $\hat\beta^*$ with probability approaching one.
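The switching rule is simple to express in code. The sketch below is illustrative only: the function name, the default values `c=1.0` and `alpha=0.25`, and the choice of the Frobenius norm are assumptions made here, since the text fixes only $c > 0$ and $\alpha \in (0, 1/2)$.

```python
import numpy as np


def guarded_bootstrap_estimate(beta_hat, beta_star, H_hat, H_star, n,
                               c=1.0, alpha=0.25):
    """Return beta_star if ||H_star - H_hat|| <= c / n**alpha, else beta_hat."""
    if not 0.0 < alpha < 0.5 or c <= 0:
        raise ValueError("need c > 0 and alpha in (0, 1/2)")
    # Frobenius norm of the gap between bootstrap and matched-sample Hessians.
    gap = np.linalg.norm(np.asarray(H_star) - np.asarray(H_hat))
    return beta_star if gap <= c / n**alpha else beta_hat
```

In a bootstrap loop, this function would be called once per draw, so that draws with an ill-conditioned Hessian fall back to the post-matching estimate instead of producing an extreme coefficient vector.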

Proposition 7 (Validity of bootstrap standard errors). Under the assumptions of Proposition 5 and $E[\|Z\|^8 \mid W = w, X = x]$ uniformly bounded on $\mathcal X$, the bootstrap distribution given by $\check\beta^*$ is valid in the sense of Proposition 6, and yields a valid estimate of the asymptotic variance of $\hat\beta$, that is,

$$n\,\mathrm{var}(\check\beta^* \mid S) \xrightarrow{p} H^{-1}JH^{-1}$$

as $n \to \infty$.

The use of $\check\beta^*$ in Proposition 7 is a formal device to make the outcome of each bootstrap iteration well-defined. For practical purposes, however, bootstrap standard errors based on $\hat\beta^*$ will perform well unless the bootstrap Hessians are ill-conditioned. Bootstrap standard errors based on $\hat\beta^*$ perform very well in our simulations of Section 4.

It is useful to relate the results in this section, which pertain to

matching without replacement, to previous results for matching

with replacement. In particular, for matching with replacement

Abadie and Imbens (2008) showed that the nonparametric boot-

strap fails to consistently estimate the standard error of a simple

matching estimator. The consistency results that we obtain in

this section are for matching without replacement, and do not

directly extend to matching with replacement. The reason is that

matching with replacement creates dependencies in the data

that are not preserved by resampling matched sets.

4. Simulations

In this section, we study the performance of the post-matching

standard error estimators from Section 3 in a simulation exercise

using two data generating processes (DGPs).

4.1. DGP1: Robustness to Misspecification

Let $U(a, b)$ be the uniform distribution on $[a, b]$. We generate data according to

$$Y = WX + 5X^2 + \varepsilon,$$

where $X \mid W = 1 \sim U(-1, 1)$, $X \mid W = 0 \sim U(-1, 2)$, and $\varepsilon \sim N(0, 1)$. We sample $N_1 = 50$ treated and $N_0 = 200$ nontreated units. We first match treated and untreated units on the covariates, $X$, without replacement and with $M = 1$ match per treated unit. We consider the following post-matching regression specifications.

Specification 1:

$$Y = \alpha + \tau_0 W + \tau_1 WX + \beta_1 X + \varepsilon.$$

Specification 2:

$$Y = \alpha + \tau_0 W + \tau_1 WX + \beta_1 X + \beta_2 X^2 + \varepsilon.$$

Specification 2 is correct relative to the conditional expectation $E[Y \mid X, W]$, while specification 1 is not. Regression estimands can always be seen as $L_2$ approximations to $E[Y \mid W, X]$, regardless of the specification adopted for estimation (see, e.g., White 1980b). For our simulation results, we will focus on estimators of $\tau_0$ and $\tau_1$, the regression coefficients on terms involving $W$. For the DGP and the two specifications adopted for this simulation, it can be shown that $\tau_0 = 0$ and $\tau_1 = 1$ under the matching target distribution.
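One draw from this design, followed by matching and the two regressions, might look like the sketch below. It is an illustration under stated assumptions, not the authors' simulation code: the function names are invented here, and the matching step uses a greedy nearest-neighbor rule, which is a simplification that need not coincide with the matching procedure used in the article.

```python
import numpy as np


def simulate_dgp1(n_treated=50, n_control=200, seed=0):
    """One sample from Y = W X + 5 X^2 + eps with the stated covariate laws."""
    rng = np.random.default_rng(seed)
    W = np.r_[np.ones(n_treated), np.zeros(n_control)]
    X = np.r_[rng.uniform(-1, 1, n_treated), rng.uniform(-1, 2, n_control)]
    Y = W * X + 5 * X**2 + rng.normal(size=W.size)
    return Y, W, X


def greedy_match(X, W):
    """One-to-one nearest-neighbor matching on X without replacement (greedy)."""
    available = list(np.flatnonzero(W == 0))
    pairs = []
    for i in np.flatnonzero(W == 1):
        j = min(available, key=lambda c: abs(X[c] - X[i]))
        available.remove(j)            # each control is used at most once
        pairs.append((i, j))
    return pairs


def post_matching_ols(Y, W, X, pairs, quadratic=False):
    """Fit specification 1 (or 2 if quadratic=True) on the matched sample."""
    idx = np.array(pairs).ravel()      # treated_0, control_0, treated_1, ...
    cols = [np.ones(idx.size), W[idx], W[idx] * X[idx], X[idx]]
    if quadratic:
        cols.append(X[idx] ** 2)
    Z = np.column_stack(cols)
    beta = np.linalg.lstsq(Z, Y[idx], rcond=None)[0]
    cluster = np.repeat(np.arange(len(pairs)), 2)   # matched-set labels
    return beta, Z, cluster
```

The returned `Z` and `cluster` arrays are in the form needed to compute sandwich, clustered, or matched-bootstrap standard errors for the coefficients in positions 1 and 2 (the terms involving $W$).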

Table 1 reports the results of the simulation exercise. In a regression that uses the full sample without matching, the estimates of $\tau_0$ and $\tau_1$ are biased under misspecification (specification 1), while they are valid under correct specification (specification 2). After matching, both specifications yield valid estimates for $\tau_0$ and $\tau_1$. However, sandwich standard error estimates are inflated under misspecification, while average clustered and matched bootstrap standard errors (with 1000 bootstrap draws) closely approximate the standard deviation of $\hat\tau_0$ and $\hat\tau_1$. Under correct specification (specification 2), all standard error estimates perform well.

Table 1. Monte Carlo results for DGP1 (10,000 iterations).

(a) Target parameter: coefficient $\tau_0 = 0$ on $W$

                 Full sample           Post-matching         Average standard error
Specification    Mean       Std.       Mean       Std.       Sandwich   Cluster   Bootstrap
                 of τ̂_0     of τ̂_0     of τ̂_0     of τ̂_0
1                −0.85      0.404      0.00       0.204      0.359      0.197     0.199
2                 0.00      0.165      0.00       0.204      0.196      0.196     0.199

(b) Target parameter: coefficient $\tau_1 = 1$ on the interaction $WX$

                 Full sample           Post-matching         Average standard error
Specification    Mean       Std.       Mean       Std.       Sandwich   Cluster   Bootstrap
                 of τ̂_1     of τ̂_1     of τ̂_1     of τ̂_1
1                −4.00      0.646      0.99       0.358      0.728      0.340     0.348
2                 1.00      0.286      1.00       0.356      0.337      0.338     0.346

4.2. DGP2: High Treatment-Effect Heterogeneity

In the simulation in the previous section, sandwich standard errors overestimate the variation of the post-matching estimator under misspecification. In this section, we present an example in which sandwich standard errors are too small. We generate data according to

$$Y = WX + 20WX^2 - 10X^2 + \varepsilon$$

with $\varepsilon \sim N(0, 1)$ as above. For this DGP2, the conditional treatment effect is nonlinear with

$$E[Y \mid W = 1, X] - E[Y \mid W = 0, X] = X + 20X^2.$$

Sample sizes, matching settings, and regression specifications are as in DGP1. Notice that both regression specifications are incorrect relative to $E[Y \mid X, W]$, as they do not capture nonlinear conditional treatment effects. As in Section 4.1, regression coefficients represent the parameters of an $L_2$ approximation to $E[Y \mid W, X]$ over the distribution of $(W, X)$ in Proposition 1. Direct calculations yield $\tau_0 = 6.67$ and $\tau_1 = 1$ for both specifications in the matching target distribution.
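The stated target values can be checked by hand. The following sketch assumes that, under the matching target distribution, $X \sim U(-1, 1)$ in both treatment arms; because specification 1 interacts $W$ with both the constant and $X$, the $L_2$ projection can then be computed arm by arm. The arm-specific intercepts and slopes are notation introduced only for this illustration.

```latex
\begin{aligned}
E[Y \mid W=1, X] &= X + 10X^2, \qquad E[Y \mid W=0, X] = -10X^2,\\
E[X] &= 0,\quad E[X^2] = \tfrac{1}{3},\quad \operatorname{cov}(X, X^2) = E[X^3] = 0
  \quad\text{for } X \sim U(-1,1),\\
\text{treated arm:}\quad & \text{intercept } 10\,E[X^2] = \tfrac{10}{3},\quad \text{slope } 1,\\
\text{control arm:}\quad & \text{intercept } -10\,E[X^2] = -\tfrac{10}{3},\quad \text{slope } 0,\\
\tau_0 &= \tfrac{10}{3} - \left(-\tfrac{10}{3}\right) = \tfrac{20}{3} \approx 6.67,
  \qquad \tau_1 = 1 - 0 = 1.
\end{aligned}
```

Adding a common $\beta_2 X^2$ term, as in specification 2, shifts both arm intercepts by the same amount $-\beta_2/3$, so the difference $\tau_0$ and the slope difference $\tau_1$ are unchanged.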

Table 2 presents the results of the simulation exercise for DGP2. The large heterogeneity in conditional treatment effects is not captured by either regression specification, and sandwich standard errors that ignore the matching step underestimate the variation of the post-matching estimator. In contrast, the average clustered and matched bootstrap (with 1000 bootstrap draws) standard errors proposed in this article closely reflect the variability of the post-matching estimators.

Table 2. Monte Carlo results for DGP2 (10,000 iterations).

(a) Target parameter: coefficient $\tau_0 = 6.67$ on $W$

                 Full sample           Post-matching         Average standard error
Specification    Mean       Std.       Mean       Std.       Sandwich   Cluster   Bootstrap
                 of τ̂_0     of τ̂_0     of τ̂_0     of τ̂_0
1                 8.25      0.754      6.55       0.883      0.630      0.869     0.897
2                 6.70      0.857      6.55       0.883      0.630      0.869     0.897

(b) Target parameter: coefficient $\tau_1 = 1$ on the interaction $WX$

                 Full sample           Post-matching         Average standard error
Specification    Mean       Std.       Mean       Std.       Sandwich   Cluster   Bootstrap
                 of τ̂_1     of τ̂_1     of τ̂_1     of τ̂_1
1                11.00      1.209      1.01       1.950      1.330      1.848     1.932
2                 1.90      1.877      1.01       1.950      1.330      1.848     1.933

5. Application

This section reports the results of an empirical application where we look at the effect of smoking on the pulmonary function of youths. The application is based on data originally collected in Boston, Massachusetts, by Tager et al. (1979, 1983), and subsequently described and analyzed in Rosner (1995) and Kahn (2005). The sample contains 654 youth, $N_1 = 65$ who have ever smoked regularly ($W = 1$) and $N_0 = 589$ who never smoked regularly ($W = 0$). The outcome of interest is the subjects' forced expiratory volume ($Y$), ranging from 0.791 to 5.793 liters per second (l/sec). In addition, we use data on age ($X_1$, ranging from 3 to 19 with the youngest ever-smoker aged 9) and gender ($X_2$, with $X_2 = 1$ for males and $X_2 = 0$ for females).

The use of matching to study the causal effect of smoking is motivated by the likely confounding effects of age and gender. For instance, while the causal effect of smoking on respiratory volume is expected to be negative, older children are more likely to smoke and have a larger respiratory volume, which induces a positive association between smoking and respiratory volume.

We first match every smoker in the sample to a nonsmoker ($M = 1$), without replacement, based on age ($X_1$) and gender ($X_2$). Within the resulting matched sample of 65 smokers and 65 nonsmokers, we run linear regressions with the following specifications:

Specification 1:

$$Y = \alpha + \tau_0 W + \varepsilon.$$

Specification 2:

$$Y = \alpha + \tau_0 W + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.$$

Specification 3:

$$Y = \alpha + \tau_0 W + \tau_1 W(X_1 - E[X_1]) + \tau_2 W(X_2 - E[X_2]) + \beta_1 (X_1 - E[X_1]) + \beta_2 (X_2 - E[X_2]) + \varepsilon.$$

The first specification yields the matching estimator for the average treatment effect $\tau_0$ as the regression coefficient on $W$, while the second adds linear controls in $X_1$ and $X_2$. The third specification also includes interaction terms of smoking with age and gender.

Table 3 reports regression estimates of $\tau_0$, $\tau_1$, and $\tau_2$ along with standard errors (regression coefficients on terms not involving $W$ are omitted from Table 3 for brevity). Estimates for the first specification demonstrate the problem of confounding in this application. Without controlling for age and gender, there is a positive correlation between smoking and forced expiratory function. After matching on age and gender, the sign of the regression coefficient on smoking becomes negative. In this specification, the clustered standard error for the post-matching estimate is considerably smaller than the corresponding sandwich standard error.

Table 3. OLS and post-matching estimates for the smoking dataset.

Dependent variable: forced expiratory volume

                   Smoker                        Smoker×age                    Smoker×male
                   Coef.    Std. error           Coef.    Std. error           Coef.    Std. error
                            Sandwich  Cluster             Sandwich  Cluster             Sandwich  Cluster
Specification 1:
  OLS               0.711   0.099
  Post-matching    −0.066   0.132     0.095
Specification 2:
  OLS              −0.154   0.104
  Post-matching    −0.077   0.104     0.096
Specification 3:
  OLS               0.495   0.187               −0.182   0.036                0.461   0.193
  Post-matching    −0.077   0.102     0.093     −0.092   0.054     0.038     −0.021   0.249     0.212
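A short calculation suggests why the clustered standard error can be smaller than the sandwich one in specification 1. The sketch below assumes $M = 1$ and no finite-sample correction; $j(i)$ denotes the match of treated unit $i$, and $d_i$, $e_i$, $e_{j(i)}$ are notation introduced only for this illustration.

```latex
\begin{aligned}
d_i &= Y_i - Y_{j(i)}, \qquad \hat\tau_0 = \bar Y_1 - \bar Y_0 = \bar d,
  \qquad e_i = Y_i - \bar Y_1, \quad e_{j(i)} = Y_{j(i)} - \bar Y_0,\\
\widehat{\operatorname{var}}_{\text{cluster}}(\hat\tau_0)
  &= \frac{1}{N_1^2}\sum_{i=1}^{N_1}\bigl(e_i - e_{j(i)}\bigr)^2
   = \frac{1}{N_1^2}\sum_{i=1}^{N_1}\bigl(d_i - \bar d\bigr)^2,\\
\widehat{\operatorname{var}}_{\text{sandwich}}(\hat\tau_0)
  &= \frac{1}{N_1^2}\sum_{i=1}^{N_1}\bigl(e_i^2 + e_{j(i)}^2\bigr),\\
\widehat{\operatorname{var}}_{\text{cluster}}(\hat\tau_0)
  - \widehat{\operatorname{var}}_{\text{sandwich}}(\hat\tau_0)
  &= -\frac{2}{N_1^2}\sum_{i=1}^{N_1} e_i\, e_{j(i)}.
\end{aligned}
```

When matched units have positively correlated residuals, as one would expect when age and gender predict respiratory volume but are left out of the regression, the last expression is negative and the clustered standard error falls below the sandwich standard error.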

Specification 2 includes linear controls for age and gender. The sign and magnitude of the least squares estimate of the coefficient on the smoker variable change substantially between specifications 1 and 2, while the magnitude of the post-matching estimate stays roughly constant. This result illustrates the higher robustness across specifications of the post-matching estimator relative to least squares on the unmatched sample (Ho et al. 2007). When specification 2 is adopted for regression, the sign of the coefficient on the smoker variable is not affected by matching. Also, for this specification, clustered and sandwich standard errors are similar. Both findings are consistent with the adopted regression specification moving closer toward the correct specification of $E[Y \mid W, X_1, X_2]$.

In specification 3, which includes interactions between the smoker variable and age and gender, the use of matching and the use of robust standard errors matter for the substantive results of the analysis. First, notice that the coefficient on the interaction of gender with treatment is large, significant and positive without matching, suggesting that the effect of smoking is more severe for girls than for boys. After matching, the sign changes, and the estimated coefficient is small and insignificant. This suggests that the large interaction finding with OLS for this coefficient is caused by misspecification. Second, in the post-matching regression we find a negative estimate for the interaction of treatment with age. With sandwich standard errors, this effect is not significant (at the 5% level). The robust standard errors proposed in this article are smaller and result in a rejection of the null hypothesis of a zero interaction coefficient between smoker and age (at the 5% level).
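The significance statement can be reproduced from the Table 3 entries for the smoker-by-age interaction in the post-matching regression. The snippet below is a small arithmetic check; the use of the two-sided normal critical value 1.96 is an assumption made here, since the text does not state which reference distribution is used.

```python
def t_statistic(coef, std_error):
    """Ratio of a coefficient estimate to its standard error."""
    return coef / std_error


CRITICAL_5PCT = 1.96   # two-sided normal critical value (assumed)
coef = -0.092          # post-matching smoker x age coefficient, specification 3

for label, se in [("sandwich", 0.054), ("cluster", 0.038)]:
    t = t_statistic(coef, se)
    print(f"{label}: t = {t:.2f}, reject at 5% = {abs(t) > CRITICAL_5PCT}")
# sandwich: t = -1.70, reject at 5% = False
# cluster: t = -2.42, reject at 5% = True
```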

6. Conclusion

This article establishes valid inference for regression on a sample matched without replacement. Standard errors that ignore the matching step are not generally valid if the regression specification is incorrect relative to the expectation of the outcome conditional on the treatment and the matching covariates. However, using a correct specification relative to $E[Y \mid W, X]$ is not necessary to consistently estimate treatment parameters after matching. For example, under selection on observables, simple differences in means in a matched sample can be used to estimate average treatment effects.

We propose two alternatives, standard errors clustered at the matched set level and an analogous block bootstrap, that are robust to misspecification and easily implementable with standard statistical software. A simulation study and an empirical example demonstrate the usefulness of our results.

To conclude, we outline potential extensions of our results. First, in this article, we discuss only matching without replacement, and the results do not directly carry over to matching with replacement as in Abadie and Imbens (2006). Matching with replacement (i.e., allowing nontreated units to be used as a match more than once) creates additional dependencies between matched sets that are not reflected in sandwich standard errors or in the robust standard errors proposed in this article. While the negative result about post-matching standard errors extends to matching with replacement (standard errors that ignore the matching step are not generally valid when matching is done with replacement, see Abadie and Imbens 2006), the positive results we describe do not directly apply: Even when the linear regression is correctly specified, sandwich standard errors do not correctly capture the variance of the post-matching estimates, since the overlap between matched sets is not accounted for. Clustered standard errors, as well as the analogous block bootstrap that samples treated units with all their matching partners, do not provide an immediate solution since one untreated unit may now be part of multiple such clusters or bootstrap groups.

In addition, our analysis applies to the case when matching is done directly on the covariates, avoiding substantial complications created by the presence of nuisance parameters in the matching step when matching is done on the estimated propensity score (see Rosenbaum and Rubin 1983; Abadie and Imbens 2016). Finally, our analysis assumes that the quality of matches is good enough for matching discrepancies not to bias the asymptotic distribution of the post-matching regression estimator. Post-matching regression adjustments may, in practice, help eliminate the bias as in the bias-corrected matching estimator in Abadie and Imbens (2011). These are angles that we do not explore in this article and interesting avenues for future research.


Appendix: Proofs

Preliminary Lemmas A.1 and A.2 and Propositions A.1–A.3 are in a

supplementary appendix.

Proof of Proposition 1. Let $E_{Q(\cdot|W=1)}$ and $E_{Q(\cdot|W=0)}$ be expectation operators for $Q(\cdot \mid W = 1)$ and $Q(\cdot \mid W = 0)$. Notice first that for any measurable function $q$,

$$E_{Q(\cdot|W=1)}[q(Y(1), S)] = E[q(Y, S) \mid W = 1]. \tag{A.1}$$

The result holds also replacing $W = 1$ with $W = 0$, and after conditioning on $X$. In particular,

$$E_{Q(\cdot|W=0)}[q(Y(0), S) \mid X] = E[q(Y, S) \mid X, W = 0]. \tag{A.2}$$

The regression coefficient in the population defined by (a) and (b) is the minimizer of

$$\frac{1}{M+1}\,E_{Q(\cdot|W=1)}\big[(Y(1) - g(1, S)'b)^2\big] + \frac{M}{M+1}\,E_{Q(\cdot|W=1)}\big[(Y(0) - g(0, S)'b)^2\big].$$

Notice that

$$E_{Q(\cdot|W=1)}\big[(Y(1) - g(1, S)'b)^2\big] = E\big[(Y - g(1, S)'b)^2 \mid W = 1\big] = E^*\big[(Y - Z'b)^2 \mid W = 1\big],$$

where the first equality follows from Equation (A.1) and the second equality follows from the definitions of $P^*(\cdot \mid W = 1)$ and $Z$. Similarly,

$$\begin{aligned}
E_{Q(\cdot|W=1)}\big[(Y(0) - g(0, S)'b)^2\big]
&= E_{Q(\cdot|W=1)}\big[E_{Q(\cdot|W=1)}[(Y(0) - g(0, S)'b)^2 \mid X]\big]\\
&= E_{Q(\cdot|W=1)}\big[E_{Q(\cdot|W=0)}[(Y(0) - g(0, S)'b)^2 \mid X]\big]\\
&= E\big[E[(Y - g(W, S)'b)^2 \mid X, W = 0] \mid W = 1\big]\\
&= E^*\big[(Y - Z'b)^2 \mid W = 0\big].
\end{aligned}$$

In the last equation, the first equality follows from the law of iterated expectations, the second equality follows from selection on observables, the third equality follows from (A.2) and (A.1), and the last equality follows from the definition of $P^*(\cdot \mid W = 0)$. Therefore,

$$\begin{aligned}
&\frac{1}{M+1}\,E_{Q(\cdot|W=1)}\big[(Y(1) - g(1, S)'b)^2\big] + \frac{M}{M+1}\,E_{Q(\cdot|W=1)}\big[(Y(0) - g(0, S)'b)^2\big]\\
&\quad= \frac{1}{M+1}\,E^*\big[(Y - Z'b)^2 \mid W = 1\big] + \frac{M}{M+1}\,E^*\big[(Y - Z'b)^2 \mid W = 0\big] = E^*\big[(Y - Z'b)^2\big],
\end{aligned}$$

which implies the result of the proposition.

Proof of Proposition 2. This proof is based on two lemmas in the supplementary appendix about the asymptotic distribution of averages in matched samples based on a martingale representation of matching estimators similar to Abadie and Imbens (2012). Lemma A.1 establishes convergence in probability, while Lemma A.2 deals with root-n consistency and asymptotic normality. By Lemma A.1,

$$\hat H = \frac{1}{n}\sum_{i\in S^*} Z_i Z_i' \xrightarrow{p} H.$$

By Lemma A.2,

$$\sqrt{n}\,\hat H(\hat\beta - \beta) = \frac{1}{\sqrt{n}}\Big(\sum_{i\in S^*} Z_i(Y_i - Z_i'\beta)\Big) \xrightarrow{d} N(0, J),$$

where we note that $E[ZY - ZZ'\beta \mid W = 0, X = x]$ is Lipschitz. Hence,

$$\sqrt{n}(\hat\beta - \beta) = \underbrace{\hat H^{-1}}_{\xrightarrow{p}\, H^{-1}}\;\underbrace{\frac{1}{\sqrt{n}}\Big(\sum_{i\in S^*} Z_i(Y_i - Z_i'\beta)\Big)}_{\xrightarrow{d}\, N(0, J)} \xrightarrow{d} N(0, H^{-1}JH^{-1}).$$

Proof of Proposition 3. We have that





i∈S

∗

− Z





β)





i∈S

∗

− Z



β)





i∈S

∗



− Z





β)

− (Y

− Z



β)





Notice that



i∈S

∗



− Z





β)

− (Y

− Z



β)





= (



β − β)







i∈S

∗



(



β + β) − 2



i∈S

∗





By assumption, the functions

E[Z

|X = x, W = w] and E[|Y|

|X = x, W = w]

are uniformly bounded on X

,forw = 0, 1. By Hölder’s inequality,

⎡

⎣



i∈S

∗



⎤

⎦

and E

⎡

⎣



i∈S

∗



⎤

⎦

are thus nite. Then, for  ∈ (0, 1/2),byMarkov’sinequality,weobtain



i∈S

∗

((Y

− Z





β)

− (Y

− Z



β)



= n

1/2−

(



β − β)





i∈S

∗



1/2−

(



β + β)

−

i∈S

∗



1/2−



→ 0.

As a result,





i∈S

∗

− Z



β)



+ o

(1),

and the claim follows from Lemma A.1 in the supplementary appendix,

which deals with consistency of averages in matched samples.

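Proof of Proposition 4. Under correct specification, we find that

$$\Gamma_W(X) = E[Z(Y - Z'\beta) \mid W, X] = E[Z\varepsilon \mid W, X] = E\big[E[Z\varepsilon \mid Z, W, X] \mid W, X\big] = E\big[Z\,\underbrace{E[\varepsilon \mid Z, W, X]}_{=0} \mid W, X\big] = 0.$$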


Proof of Proposition 5. First, note that



J =





−Z



β) +

j∈J (i)

−Z



β)





−Z



β) +

j∈J (i)

−Z



β)





(1),

where we replace $\hat\beta$ by $\beta$ analogous to the proof of Proposition 3. Write

G = Z(Y −Z



β) 

(x) = E[Z(Y − Z



β)|W = w, X = x].

Note that 

(x) is Lipschitz on X ,andthatG

has uniformly bounded

fourth moments. We decompose



J =





j∈J (i)



j∈J (i)





+ o

(1)



(



) + M

)

)(



) + M

)





i∈S

∗



−

)



−

)







=



∈J (i)∪{i}





− 



)









− 









)









(



) + M

)



− 

) +

j∈J (i)

− 

))







− 

) +

j∈J (i)

− 

))







) + M

)







+ o

(1).

Here, the $o_p$ terms absorb the deviation due to using $\hat\beta$ instead of $\beta$, as well as the matching discrepancies in the conditional expectations. The first sum is iid with



(



) + M

)

)(



) + M

)



→



(

(X) + M

(X))(

(X) + M

(X))



|W = 1



1 + M

var(

E[·|W=1]=0

  



(X) + M

(X) |W = 1)

1 + M

while the second is a martingale with



i∈S

∗



− 

)



− 

)





→

E[var(Z(Y − Z



β)|W = 1, X)

+Mvar(Z(Y − Z



β)|W = 0, X)|W = 1]

1 + M

by Lemma A.1 in the supplementary appendix, which establishes con-

sistency of averages in matched samples. Under appropriate reordering

of the individual increments, all other sums can be represented as aver-

ages of mean-zero martingale increments. Since the second moments of

the increments are uniformly bounded, they vanish asymptotically.

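Proof of Proposition 6. In this proof, we invoke Proposition A.2 in the supplementary appendix, which establishes a general result on the validity of the matched bootstrap for averages within matched samples. Write

$$\hat H^* = \frac{1}{n}\sum_{i\in S^*} V_i Z_i Z_i'.$$

Note first that

$$H^{-1}\sqrt{n}\big(\hat H^*(\hat\beta^* - \beta) - \hat H(\hat\beta - \beta)\big) = H^{-1}\frac{1}{\sqrt{n}}\sum_{i\in S^*}(V_i - 1)Z_i(Y_i - Z_i'\beta) \xrightarrow{d} N(0, H^{-1}JH^{-1})$$

conditional on $S$, by Proposition A.2. Now,

$$\begin{aligned}
\sqrt{n}(\hat\beta^* - \hat\beta)
&= (\hat H^*)^{-1}H\Big(H^{-1}\sqrt{n}\big(\hat H^*(\hat\beta^* - \beta) - \hat H^*(\hat\beta - \beta)\big)\Big)\\
&= \underbrace{(\hat H^*)^{-1}H}_{\xrightarrow{p}\, I}\Big(H^{-1}\sqrt{n}\big(\hat H^*(\hat\beta^* - \beta) - \hat H(\hat\beta - \beta)\big)\Big) + \underbrace{\big((\hat H^*)^{-1}\hat H - I\big)}_{\xrightarrow{p}\, O}\sqrt{n}(\hat\beta - \beta)\\
&\xrightarrow{d} N(0, H^{-1}JH^{-1})
\end{aligned}$$

conditional on $S$, where we have used that $\hat H^* - \hat H \xrightarrow{p} O$ conditional on $S$.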

Proof of Proposition 7. First, P(

∗



∗

|S) ≥ P(



∗

−



H≤

|S)

→ 1asn →∞. Indeed, since Z has bounded conditional eighth

moments, we also have that E[ZZ





|W = w, X = s] is uniformly

bounded in X

. It follows with Proposition A.2 in the supplementary

appendix, which establishes the validity of the matched bootstrap, that

sup

r∈R

(dim Z)



√

n vec(



∗

−



H) ≤ r|S) − P(N (0, 

) ≤ r)



→ 0

as n →∞and thus in particular P(n





∗

−



H≤c|S)

→ 1forall

α ∈ (0, 1/2), c > 0.

Second, since for

A ∩ B = A ∩ B generally

|P(A) − P(

A)|≤|P(A ∩ B) − P(

A ∩ B)|



 

+|P(A ∩ B

) − P(

A ∩ B



 

≤P(B

)

≤ 1 − P(B),

for (r) = P



N (0, H

−1

) ≤ r



we have specically that

sup

r∈R





√

∗

−



β) ≤ r





− (r)



≤ sup

r∈R







√



∗

−



β) ≤ r





− (r)





√



∗

−



β) ≤ r





− P



√

∗

−



β) ≤ r









 

≤1−P(

∗



∗

|S)



≤ sup

r∈R





√



∗

−



β) ≤ r





− (r)





 

→0

+1 − P(

∗



∗

|S)



 

→0

→ 0.

This shows that this alternative bootstrap is valid in the sense of

Proposition 6.


Third, for the bootstrap variance, we find



∗

−



β =





∗



−1

⎛

⎝



i∈S

∗

−



∗



⎞

⎠





∗



−1



i∈S

∗

− Z





β)



−1



i∈S

∗

− Z





β)



 





∗







∗



−1

−



−1





i∈S

∗

− Z





β)



 



∗

Since

i∈S

∗

− Z





β) = 0andthusnvar



i∈S

∗

− Z





β)







nvar







∗







−1

nvar

⎛

⎝



i∈S

∗

− Z





β)



⎞

⎠



−1



−1



−1

→ H

−1

which is a valid estimate of the asymptotic variance of



β.However,the

remainder term



∗

generally does not have a bounded second moment

since



∗

is badly conditioned for some bootstrap draws.

To show that

∗

yields valid standard errors, we collect a number

of preliminary results. Consider the random variables





∗

and



∗





∗





∗

−



H≤c

√





∗

converges in distribution to N (0, ) with

 = H

−1

, conditional on S, by Proposition A.2. Since P(



∗





∗

|S)

→ 1,thesameholdstruefor

√



∗

by the above argument.

Also, we have established that



√





∗





= 0, var



√





∗





→ 

and thus E[n





∗



|S]

→ tr().SinceE[n



∗



|S]≤

E[n





∗



|S],andn



∗



and n





∗



havethesameweaklimit

(with expectation tr()) by the continuous mapping theorem,

E[n



∗



|S]

→ tr() by Proposition A.3 in the supplementary

appendix. Consequently,

E[n





∗



|S]−E[n



∗



|S]=P(n





∗

−



H

> c|S) E[n





∗





∗

−



H > c, S]

→ 0. (A.3)

Next, note that for conformable random variables A, B if

var(A|S)

→ , E[B

|S]

→ 0thenvar(A + B|S)

→ .

Indeed,

|(var(A + B|S) − var(A|S))

|=|cov(A

, B

|S)

+ cov(A

, B

|S) + cov(B

, B

|S)|

≤

var(A

|S)



var(B

|S) +



var(A

|S)

var(B

|S)

var(B

|S)



var(B

|S)

→ 0.

Hence, setting A =

√





∗

and B =

√

∗

−



β −





∗

),toestablishthe

desired result var(

√

∗

−



β)|S)

→ H

−1

it suces to show that



n

∗

−



β −





∗







→ 0 (A.4)

as n →∞.

Toward establishing (A.4), note first that whenever n





∗

−



H≤c

then also

(



∗

)

−1

−



−1

=(



∗

)

−1

(



H −



∗

)



−1



≤(



∗

)

−1





H −



∗





−1



≤ λ

−1

min

(



∗

)λ

−1

min

(



H) 



H −



∗

 dim(Z),

where

min

(



∗

) = λ

min

(



H +



∗

−



H) = min

x=1



(



H +



∗

−



H)x

≥ min

x=1





Hx + min

x=1



(



∗

−



H)x

≥ λ

min

(



H) −



∗

−



H

and thus

(



∗

)

−1

−



−1



≤ (λ

min

(



H) −



∗

−



H)

−1

min

(



H) 



∗

−



H dim(Z)

≤ (λ

min

(



H) − cn

−α

)

−1

min

(



H) cn

−α

dim(Z). (A.5)

It follows that

n

∗

−



β −





∗





= P(n





∗

−



H≤c|S) E[n



∗



∗

−



β −





∗





∗

−



H≤c, S]

+ P(n





∗

−



H > c|S) E[n

∗





−



β −





∗





∗

−



H > c, S]

= P(n





∗

−



H≤c|S)

E[n

≤(



∗

)

−1

−



−1



i∈S

∗

−Z





β)







∗





∗

−



H≤c, S]

+ P(n





∗

−



H > c|S) E[n





∗





∗

−



H > c, S]

(A.5)

≤ (λ

min

(





 

→λ

min

(H)>0

−cn

−α

)

−1

min

(



H) cn

−α

dim(Z)

P(n





∗

−



H≤c|S) E[n

−1/2

i∈S

∗

− Z





β)





∗

−



H≤c, S]



 

≤E[

√

i∈S

∗

−Z





β)

|S]=tr(



→tr(J)

+ P(n





∗

−



H > c|S) E[n





∗





∗

−



H > c, S]



 

(A.3)

→ 0

→ 0.

Hence, var(

√

∗

−



β)|S) and var(

√





∗

|S) have the same proba-

bility limit H

−1

, which is also the asymptotic variance of



β.


Supplementary Materials

The supplementary appendix contains proofs of intermediate results and

extensions.

Acknowledgments

We thank Gary King, seminar participants at Harvard, and the editor

(Hongyu Zhao) and referees for helpful comments, and Jaume Vives for

expert research assistance.

Funding

Financial support by the NSF through grant SES 0961707 is gratefully

acknowledged.

References

Abadie, A., and Imbens, G. (2006), “Large Sample Properties of Matching Estimators for Average Treatment Effects,” Econometrica, 74, 235–267.

Abadie, A., and Imbens, G. (2008), “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 76, 1537–1557.

Abadie, A., and Imbens, G. (2011), “Bias-Corrected Matching Estimators for Average Treatment Effects,” Journal of Business & Economic Statistics, 29, 1–11.

Abadie, A., and Imbens, G. (2012), “A Martingale Representation for Matching Estimators,” Journal of the American Statistical Association, 107, 833–843.

Abadie, A., and Imbens, G. (2016), “Matching on the Estimated Propensity Score,” Econometrica, 84, 781–807.

Abadie, A., Imbens, G. W., and Zheng, F. (2014), “Inference for Misspecified Models With Fixed Regressors,” Journal of the American Statistical Association, 109, 1601–1614.

Arellano, M. (1987), “Computing Robust Standard Errors for Within-Groups Estimators,” Oxford Bulletin of Economics and Statistics, 49, 431–434.

Blinder, A. S. (1973), “Wage Discrimination: Reduced Form and Structural Estimates,” Journal of Human Resources, 8, 436–455.

Cochran, W. G. (1953), “Matching in Analytical Studies,” American Journal of Public Health and the Nation’s Health, 43, 684–691.

Dehejia, R. H., and Wahba, S. (1999), “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs,” Journal of the American Statistical Association, 94, 1053–1062.

DiNardo, J., Fortin, N., and Lemieux, T. (1996), “Labor Market Institutions and the Distribution of Wages, 1973–1992: A Semiparametric Approach,” Econometrica, 64, 1001–1044.

Efron, B. (1979), “Bootstrap Methods: Another Look at the Jackknife,” The Annals of Statistics, 7, 1–26.

Eicker, F. (1967), “Limit Theorems for Regressions With Unequal and Dependent Errors,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 59–82.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007), “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference,” Political Analysis, 15, 199–236.

Huber, P. J. (1967), “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 221–233.

Imbens, G. W., and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge: Cambridge University Press.

Kahn, M. (2005), “An Exhalent Problem for Teaching Statistics,” The Journal of Statistical Education, 13.

Liang, K.-Y., and Zeger, S. L. (1986), “Longitudinal Data Analysis Using Generalized Linear Models,” Biometrika, 73, 13–22.

Oaxaca, R. (1973), “Male-Female Wage Differentials in Urban Labor Markets,” International Economic Review, 14, 693–709.

Rosenbaum, P. R., and Rubin, D. B. (1983), “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41–55.

Rosner, B. (1995), Fundamentals of Biostatistics, Belmont, CA: Duxbury Press.

Rubin, D. B. (1973), “Matching to Remove Bias in Observational Studies,” Biometrics, 29, 159–183.

Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688.

Tager, I. B., Weiss, S. T., Muñoz, A., Rosner, B., and Speizer, F. E. (1983), “Longitudinal Study of the Effects of Maternal Smoking on Pulmonary Function in Children,” New England Journal of Medicine, 309, 699–703.

Tager, I. B., Weiss, S. T., Rosner, B., and Speizer, F. E. (1979), “Effect of Parental Cigarette Smoking on the Pulmonary Function of Children,” American Journal of Epidemiology, 110, 15–26.

White, H. (1980a), “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48, 817–838.

White, H. (1980b), “Using Least Squares to Approximate Unknown Regression Functions,” International Economic Review, 21, 149–170.

White, H. (1982), “Maximum Likelihood Estimation of Misspecified Models,” Econometrica, 50, 1–25.