EVALUATION NOTES
BERGEN COURSE, SPRING 2010
Petra Todd
University of Pennsylvania, Department of Economics

The Evaluation Problem

Will study econometric methods for evaluating effects of active labor market programs
Employment, training and job search assistance programs
School subsidy programs
Health interventions

Key questions
Do program participants benefit from the program?
Do program benefits exceed costs?
What is the social return to the program?
Would an alternative program yield greater impact at the same cost?

Goals

Understand the identifying assumptions needed to justify application of different estimators
Statistical assumptions
Behavioral assumptions
Assumptions with regard to heterogeneity in how people respond to a program intervention

Potential Outcomes
Y0 = outcome without treatment
Y1 = outcome with treatment
D=1 if receive treatment, else D=0
Observed outcome: Y = D Y1 + (1-D) Y0
Treatment effect: Δ = Y1-Y0
Δ is not directly observed: a missing data problem

Parameters of Interest
Average impact of treatment on the treated (TT): E(Y1-Y0|D=1,X)
Average treatment effect (ATE): E(Y1-Y0|X)
Average effect of treatment on the untreated (UT): E(Y1-Y0|D=0,X)
ATE = Pr(D=1|X) TT + (1-Pr(D=1|X)) UT

Other parameters of interest

Proportion of people benefiting from the program: Pr(Y1>Y0|D=1) = Pr(Δ>0|D=1), where Δ = Y1-Y0
Distribution of treatment effects: F(Δ|D=1,X)
Selected quantile: inf{Δ: F(Δ|D=1,X) > q}

Model for potential outcomes with and without treatment
Model:
Y1 = Xβ1 + U1
Y0 = Xβ0 + U0
E(U1|X) = E(U0|X) = 0

Observed outcome:
Y = Y0 + D(Y1-Y0)
Y = Xβ0 + D(Xβ1 - Xβ0) + U0 + D(U1-U0)

Distinction between TT and ATE
TT = E(Δ|D=1,X) = Xβ1 - Xβ0 + E(U1-U0|D=1,X)
ATE = E(Δ|X) = Xβ1 - Xβ0
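The gap between the two parameters can be illustrated with a small simulation. The sketch below rests on assumed values that are not from these notes: there are no covariates, the common part of the gain (Xβ1 - Xβ0) is set to 1, U0 and U1 are independent standard normals, and a person participates when their own gain exceeds a hypothetical cost of 0.5. Because participation depends on U1-U0, TT exceeds ATE, and the identity ATE = Pr(D=1) TT + (1-Pr(D=1)) UT holds in the simulated sample.

```python
import random

def simulate_parameters(n=200_000, seed=0):
    """Simulate Y0 = U0 and Y1 = 1 + U1, with selection on the gain."""
    rng = random.Random(seed)
    sum_treated = sum_untreated = 0.0
    n_treated = 0
    for _ in range(n):
        u0 = rng.gauss(0.0, 1.0)
        u1 = rng.gauss(0.0, 1.0)
        gain = 1.0 + u1 - u0           # Delta = Y1 - Y0
        if gain > 0.5:                 # assumed rule: D=1 if gain exceeds cost 0.5
            sum_treated += gain
            n_treated += 1
        else:
            sum_untreated += gain
    p = n_treated / n                           # Pr(D=1)
    tt = sum_treated / n_treated                # E(Delta | D=1)
    ut = sum_untreated / (n - n_treated)        # E(Delta | D=0)
    ate = (sum_treated + sum_untreated) / n     # E(Delta)
    return p, tt, ut, ate

if __name__ == "__main__":
    p, tt, ut, ate = simulate_parameters()
    print(f"Pr(D=1)={p:.3f}  TT={tt:.3f}  UT={ut:.3f}  ATE={ate:.3f}")
    print(f"Pr(D=1)*TT + (1-Pr(D=1))*UT = {p * tt + (1 - p) * ut:.3f}")
```

If the participation rule is replaced by one that ignores the gain (for example a coin flip), TT, UT and ATE coincide up to sampling error, which corresponds to the case where D is uninformative about U1-U0.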

TT depends on structural parameters as well as means of unobservables
Parameters are the same if
(A1) U1=U0
(A2) E(U1-U0|D=1,X)=0
Condition (A2) means that D is uninformative on U1-U0, i.e. there is ex post heterogeneity, but it is not acted on ex ante

Three Commonly Made Assumptions, from least to most general
Coefficient on D is fixed (given X) and is the same for everyone (most restrictive)
U1=U0
Y = Xβ + Dα(X) + U
E(Y1-Y0|X,D) = α(X)

Coefficient on D is random given X, but U1-U0 does not help predict participation in the program Pr(D=1| U1-U0 ,X)=Pr(D=1|X) which implies E(U1-U0 |D=1,X)= E(U1-U0 |X)

Coefficient on D is random given X and U1-U0 helps predict program participation (least restrictive)
E(U1-U0|D=1,X) ≠ E(U1-U0|X)

How Can Randomization Solve the Evaluation Problem?
Comparison group selected using a randomization device to randomly exclude some fraction of program applicants from the program
Main advantage: increases comparability between program participants and nonparticipants

Have same distribution of observables and of unobservables
Satisfy program eligibility criteria

What problems can arise in social experiments?
Randomization bias: occurs when introducing randomization changes the way the program operates
Greater recruitment needs may lead to change in acceptance standards
Individuals may decide not to apply if they know they will be subject to randomization
Contamination bias: occurs when control group members seek alternative forms of treatment
Ethical considerations: there may be opposition to the experiment and some sites may refuse to participate, which poses a threat to external validity
Dropout: some of the treatment group members may drop out before completing the program
Sample attrition: there may be differential attrition between the treatment and control groups

At what stage should randomization be applied?

Randomization after acceptance into the program
Randomization of eligibility

Let R=1 if randomized in (treatment group), R=0 if randomized out (control group)
Let Y1* and Y0* denote outcomes in the presence of randomization
Let D*=1 denote someone who applies to the program and is subject to randomization
From treatment group, get E(Y1*|X,D*=1,R=1)
From control group, get E(Y0*|X,D*=1,R=0)
No randomization bias and random assignment imply
E(Y1*|X,D*=1,R=1) = E(Y1|X,D=1)
E(Y0*|X,D*=1,R=0) = E(Y0|X,D=1)
Thus, the experiment gives TT = E(Y1-Y0|X,D=1)

How does program dropout affect experiments?

Can define treatment as intent-to-treat or offer of treatment, in which case dropout is not a problem
If dropout occurs prior to receiving the program (i.e. dropouts do not get treatment), then could treat it like randomization on eligibility

Randomization on eligibility
Let e=1 if eligible, e=0 if not eligible
Let D=1 denote would-be participants if program were made available
E(Y|X,e=1) = Pr(D=1|X,e=1) E(Y1|X,e=1,D=1) + Pr(D=0|X,e=1) E(Y0|X,e=1,D=0)
E(Y|X,e=0) = Pr(D=1|X,e=0) E(Y0|X,e=0,D=1) + Pr(D=0|X,e=0) E(Y0|X,e=0,D=0)

Because eligibility is randomized,
Pr(D=1|X,e=1) = Pr(D=1|X,e=0)
Pr(D=0|X,e=1) = Pr(D=0|X,e=0)
E(Y0|X,e,D=1) = E(Y0|X,D=1)
E(Y1|X,e,D=1) = E(Y1|X,D=1)
Thus, the difference between the previous two equations gives
E(Y|X,e=1) - E(Y|X,e=0) = Pr(D=1|X,e=1) {E(Y1|X,D=1) - E(Y0|X,D=1)}
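The randomization-on-eligibility argument can be checked on simulated data. The sketch below rests entirely on hypothetical values (gains distributed N(2,1), would-be participants defined as those with a gain above 1.5, a 50 percent eligibility lottery, no covariates). It divides the difference in mean outcomes between eligibles and ineligibles by the take-up rate among eligibles and compares the result with the true TT.

```python
import random

def eligibility_experiment(n=200_000, seed=1):
    """Return (estimated TT, true TT) under randomized eligibility."""
    rng = random.Random(seed)
    y_sum = {0: 0.0, 1: 0.0}      # sum of observed Y by eligibility status e
    count = {0: 0, 1: 0}          # sample size by e
    takers_eligible = 0           # number with D=1 among e=1
    gain_sum_d1 = 0.0             # sum of Y1 - Y0 over all would-be participants
    n_d1 = 0
    for _ in range(n):
        e = 1 if rng.random() < 0.5 else 0      # eligibility is randomized
        y0 = rng.gauss(0.0, 1.0)
        gain = rng.gauss(2.0, 1.0)              # assumed heterogeneous gain
        d = 1 if gain > 1.5 else 0              # would participate if eligible
        y = y0 + gain if (e == 1 and d == 1) else y0
        y_sum[e] += y
        count[e] += 1
        if d == 1:
            gain_sum_d1 += gain
            n_d1 += 1
            takers_eligible += e
    diff = y_sum[1] / count[1] - y_sum[0] / count[0]
    take_up = takers_eligible / count[1]        # Pr(D=1 | e=1)
    return diff / take_up, gain_sum_d1 / n_d1

if __name__ == "__main__":
    tt_hat, tt_true = eligibility_experiment()
    print(f"estimated TT = {tt_hat:.3f}, true TT = {tt_true:.3f}")
```

In this design the population average gain is 2, so the rescaled difference tracks the effect for would-be participants rather than the average effect over everyone.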

$$\mathrm{TT} = \frac{E(Y \mid X, e=1) - E(Y \mid X, e=0)}{\Pr(D=1 \mid X, e=1)}$$

What about control group contamination?

Not necessarily a problem if willing to define benchmark state as being excluded from the program

What about sample attrition?
Attrition is a problem that is common to both experimental and nonexperimental studies
Attrition occurs when some people are not followed in the data (maybe due to nonresponse)
If attrition is nonrandom with respect to treatment, then it requires the use of nonexperimental evaluation methods

Sources of bias in estimating E(Δ|X), E(Δ|X,D=1)

Traditional (Simple) Regression Estimators
Cross-section
Before-after
Difference-in-differences
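A simulated two-period panel helps show how the three estimators listed above can differ. The data-generating process below is an assumption made purely for illustration: untreated outcomes are a person-specific fixed effect plus a period intercept plus noise, the true effect is 1 for everyone, the post-program intercept is 0.5 higher than the pre-program one, and people with a low fixed effect are more likely to participate.

```python
import random

def compare_estimators(n=100_000, seed=2, effect=1.0, time_shift=0.5):
    """Return (cross-section, before-after, difference-in-differences) estimates."""
    rng = random.Random(seed)
    # Running sums of pre- and post-program outcomes, by participation status D.
    pre = {0: 0.0, 1: 0.0}
    post = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)                       # time-invariant unobservable
        d = 1 if theta + rng.gauss(0.0, 1.0) < 0 else 0   # selection on theta
        y_pre = theta + rng.gauss(0.0, 0.5)
        y_post = theta + time_shift + effect * d + rng.gauss(0.0, 0.5)
        pre[d] += y_pre
        post[d] += y_post
        count[d] += 1
    mean_pre = {g: pre[g] / count[g] for g in (0, 1)}
    mean_post = {g: post[g] / count[g] for g in (0, 1)}
    cross_section = mean_post[1] - mean_post[0]           # post-period D=1 vs D=0
    before_after = mean_post[1] - mean_pre[1]             # participants only
    did = before_after - (mean_post[0] - mean_pre[0])     # net of D=0 change
    return cross_section, before_after, did

if __name__ == "__main__":
    cs, ba, did = compare_estimators()
    print(f"cross-section={cs:.3f}  before-after={ba:.3f}  diff-in-diff={did:.3f}")
```

Under these assumptions the cross-section contrast is pulled down by selection on the fixed effect, the before-after contrast absorbs the common time shift, and the difference-in-differences contrast is close to the true effect of 1.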

Ashenfelter's Dip
[Figure: mean Y over time for the D=1 and D=0 groups, with T=0 marked on the time axis]

Before-after estimators

Drawbacks and Advantages of before-after approach
Drawbacks
Identification breaks down in the presence of time-specific intercepts
Can be sensitive to choice of time periods because of Ashenfelter's Dip pattern
Advantage
Minimal data requirements: only requires data on participants

Cross-section estimators

Difference-in-difference estimators
Advantages
Allows for time-specific intercepts that are common across groups
Consistent under fixed effect error structure, therefore allows for time-invariant unobservables to affect participation decisions and program outcomes

Matching Estimators
Assume have access to data on treated and untreated individuals (D=1 and D=0)
Assume also have access to a set of X variables whose distribution is not affected by D
F(X|D,YP) = F(X|YP), where YP=(Y0,Y1) denotes potential outcomes

Matching estimators pair treated individuals with observably similar untreated individuals
Usually assumed that
(Y0,Y1) ⊥ D | X   (M-1)
or, equivalently, Pr(D=1|X,Y0,Y1) = Pr(D=1|X)
and
0 < Pr(D=1|X) < 1   (M-2)

Assumption (M-1) implies
F(Y0|D=1,X) = F(Y0|D=0,X) = F(Y0|X)
F(Y1|D=1,X) = F(Y1|D=0,X) = F(Y1|X)
also
E(Y0|D=1,X) = E(Y0|D=0,X) = E(Y0|X)
E(Y1|D=1,X) = E(Y1|D=0,X) = E(Y1|X)

Under assumptions that justify matching, can estimate TT, ATE, and UT

Let n1 denote number of observations in the treatment group
A typical matching estimator for TT takes the form:
$$\hat{\Delta}_m = \frac{1}{n_1} \sum_{i \in \{D_i=1\}} \left[ Y_{1i} - \hat{E}(Y_{0j} \mid X_j = X_i, D_j = 0) \right]$$
where $\hat{E}(Y_{0j} \mid X_j = X_i, D_j = 0)$ is an estimator for the matched no-treatment outcome
Recall that (M-1) implies E(Y0j|Xj=Xi,Dj=0) = E(Y0j|Xj=Xi,Dj=1)

How does matching compare to a randomized experiment?

Distribution of observables will by construction be the same in the matched control group as in the treatment group
However, distribution of unobservables not necessarily balanced across groups
Experiment has full support (M-2), but with matching there can be a failure of the common support condition (when matches cannot be found)

Even though matching methods assume E(Y1-Y0|D=1,X) = E(Y1-Y0|X),
could still potentially have E(Y1-Y0|D=1) ≠ E(Y1-Y0)
E(Δ|D=1) = ∫ E(Δ|D=1,X) f(X|D=1) dX
E(Δ) = ∫ E(Δ|X) f(X) dX

If interest centers on TT, (M-1) can be replaced by weaker assumption
E(Y0|X,D=1) = E(Y0|X,D=0) = E(Y0|X)
The weaker assumption allows selection into the program to depend on Y1 and allows E(Y1-Y0|X,D) ≠ E(Y1-Y0|X)
Only require Pr(D=1|X,Y0,Y1) = Pr(D=1|X,Y1)

Practical problems in Matching

Problems
How to construct match when X is of high dimension
How to choose set of X variables
What to do if Pr(D=1|X)=1 for some X (violation of common support condition (M-2))

Rosenbaum and Rubin (1983) Theorem

Provide a solution to the problem of constructing a match when X is of high dimension
Show that (Y0,Y1) ⊥ D | X implies (Y0,Y1) ⊥ D | Pr(D=1|X)
Reduces the matching problem to a univariate problem, provided Pr(D=1|X) can be parametrically estimated
Pr(D=1|X) is known as the propensity score

Proof of RR theorem
Let P(X) = Pr(D=1|X)
E(D|Y0,P(X)) = E(E(D|Y0,X)|Y0,P(X))
             = E(P(X)|Y0,P(X))
             = P(X)
where the first equality holds because X is finer than P(X), and the second because E(D|Y0,X) = E(D|X) = P(X)

Matching can be implemented in two steps
Step 1: estimate a model for program participation, estimate the propensity score P(Xi) for each person
Step 2: Select matches based on the estimated propensity score
$$\hat{\Delta}_m = \frac{1}{n_1} \sum_{i \in \{D_i=1\}} \left[ Y_{1i}(P_i) - \hat{E}(Y_{0j} \mid P_j = P_i, D_j = 0) \right]$$

Ways of constructing matched outcomes

Define a neighborhood C(Pi) for each person i ∈ {Di=1}
Neighbors are persons in {Dj=0} for whom Pj ∈ C(Pi)
Set of persons matched to i is Ai = {j ∈ {Dj=0} such that Pj ∈ C(Pi)}

Nearest Neighbor Matching
C(Pi) = min_j || Pi-Pj ||, j ∈ {Dj=0}
=> Ai is a singleton set
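A minimal sketch of nearest-neighbor matching with replacement on the propensity score follows. It is an illustration only: the function names `nn_match_tt` and `simulate` and the simulated design (X uniform on [0,1], P(X)=0.2+0.6X, Y0=2X plus noise, a constant effect of 1) are hypothetical, and the true propensity score is used in place of an estimated one to keep the example short.

```python
import bisect
import random

def nn_match_tt(p, d, y):
    """Estimate TT by matching each treated unit to the untreated unit
    with the closest propensity score (matching with replacement)."""
    controls = sorted((p[j], y[j]) for j in range(len(p)) if d[j] == 0)
    scores = [score for score, _ in controls]
    total, n_treated = 0.0, 0
    for i in range(len(p)):
        if d[i] != 1:
            continue
        k = bisect.bisect_left(scores, p[i])
        # The nearest score is either just below or just above position k.
        candidates = [c for c in (k - 1, k) if 0 <= c < len(scores)]
        j = min(candidates, key=lambda c: abs(scores[c] - p[i]))
        total += y[i] - controls[j][1]
        n_treated += 1
    return total / n_treated

def simulate(n=20_000, seed=3):
    """Hypothetical data: treated units tend to have higher X, hence higher Y0."""
    rng = random.Random(seed)
    p, d, y = [], [], []
    for _ in range(n):
        x = rng.random()
        score = 0.2 + 0.6 * x                   # assumed true P(X)
        treated = 1 if rng.random() < score else 0
        y0 = 2.0 * x + rng.gauss(0.0, 0.5)
        p.append(score)
        d.append(treated)
        y.append(y0 + 1.0 * treated)            # constant effect of 1
    return p, d, y

if __name__ == "__main__":
    p, d, y = simulate()
    print(f"matched TT estimate = {nn_match_tt(p, d, y):.3f}")
```

In this design the raw difference in mean outcomes between treated and untreated units overstates the effect, because treated units have higher X on average, while the matched estimate compares units with nearly equal scores.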

Caliper matching
Matches only made if || Pi-Pj || < ε for some prespecified tolerance ε (tries to avoid bad matches)

Kernel Matching
Estimate matched outcomes by nonparametric regression

Local Linear Regression Matching

Difference-in-difference matching

Assume (Y0t-Y0t') ⊥ D | Pr(D=1|X)
0 < Pr(D=1|X) < 1
Allows for time invariant unobservable differences between the treatment group and the control group
Selection into the program can be based on the unobservables
$$\hat{\Delta}_{DD} = \frac{1}{n_1} \sum_{i \in \{D_i=1\}} \left[ (Y_{1it} - Y_{0it'}) - \hat{E}(Y_{0jt} - Y_{0jt'} \mid P_j = P_i, D_j = 0) \right]$$

Should matches be reused?

If don't reuse, then results will not be invariant to the order in which observations were matched

Balancing Tests: Checking the Specification of the Propensity Score Model
By R&R Theorem, for any set of variables Z, D ⊥ Z | P(Z)

Different kinds of balancing tests
If conditioning on estimated value of P(Z) there is additional dependence on Z, then this can be viewed as misspecification in P(Z)
In practice, researchers often group data according to values of P(Z) (e.g. 5 strata) and compare means of Z across the D=1 and D=0 groups within each stratum (how to choose strata not entirely clear)
Or estimate regression

If balancing tests fail
Refine propensity score model and reestimate
Could use a semiparametric approach to estimate P(Z)
If estimate P(Z) fully nonparametrically, then curse of dimensionality returns and there is no gain to using the propensity score methodology.

What variables to put in the propensity score?

Econometric Models of Program Participation
Assume individuals have the option of taking training in period k
Prior to k, observe Y0j, j=1..k
After k, there are two potential outcomes (Y0t,Y1t)
To participate in training, individuals must apply and be accepted, so there may be several decision-makers determining who gets training
D=1 if participates, D=0 else
Assume participation decisions are based on maximization of future earnings

Simple model of participation

D=1 if
$$E\left[ \sum_{j=1}^{T-k} \frac{Y_{1,k+j}}{(1+r)^j} - C - \sum_{j=0}^{T-k} \frac{Y_{0,k+j}}{(1+r)^j} \,\Big|\, I_k \right] > 0$$
First term is earnings stream if participates in program
C is the direct cost of training
Last term is earnings stream if do not participate
Ik is the information set at time k used to form expectations

Implications of this simple decision rule

Past earnings are irrelevant except for value in predicting future earnings
Persons with lower foregone earnings or lower costs are more likely to participate in programs
Older persons and persons with higher discount rates are less likely to participate
The decision to take training is correlated with future earnings only through the correlation with expected future earnings

Special case of above model
Assume constant treatment effect α, so that Y1,k+j = Y0,k+j + α
D=1 if expected rewards exceed costs
$$E\left[ \sum_{j=1}^{T-k} \frac{\alpha}{(1+r)^j} \,\Big|\, I_k \right] > C + Y_{0k}$$
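As a worked illustration of this special case, the sketch below evaluates the rule "participate when the discounted sum of a constant gain α over the remaining T-k periods exceeds the direct cost C plus foregone earnings Y0k". It assumes α is known with certainty, so the expectation drops out, and all function names and numbers are hypothetical.

```python
def present_value_of_gain(alpha, r, horizon):
    """Discounted sum of a constant gain alpha over j = 1..horizon (horizon = T - k)."""
    return sum(alpha / (1.0 + r) ** j for j in range(1, horizon + 1))

def participates(alpha, r, horizon, cost, y0k):
    """D=1 if the discounted gain exceeds direct cost plus foregone earnings."""
    return present_value_of_gain(alpha, r, horizon) > cost + y0k

if __name__ == "__main__":
    base = dict(alpha=100.0, r=0.05, horizon=10, cost=300.0, y0k=400.0)
    print(participates(**base))                        # baseline case
    print(participates(**{**base, "y0k": 600.0}))      # higher foregone earnings
    print(participates(**{**base, "horizon": 3}))      # older person, shorter horizon
    print(participates(**{**base, "r": 0.20}))         # higher discount rate
```

Varying one argument at a time shows how foregone earnings, the remaining horizon and the discount rate each enter the participation decision.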

If earnings temporarily low (e.g. unemployed), people are more likely to enroll in the program
Model is consistent with Ashenfelter's Dip pattern

Model of the decision process
Let IN = H(X) - V, with D=1 if IN>0
H(X) = expected future rewards
V = costs = C + Y0k (assumed unknown)
If V assumed to be independent of X, then could estimate by logistic or probit model:
Pr(D=1|X) = e^{H(X)} / {1 + e^{H(X)}}
Pr(D=1|X) = Φ((H(X)-μ)/σv), where μ and σv are the mean and standard deviation of V

Evidence on Performance of Matching Estimators

Control function methods
References: Roy (1951), Willis and Rosen (1979), Heckman and Honore (1990), Heckman and Sedlacek (1985), Heckman and Robb (1985, 1986)
Allow selectivity into the program to be based on unobservables, explicitly model and control for potential selectivity bias
Conventional to assume unobservables are normally distributed, but can relax normality

Model for outcomes

Comparison of Control function and matching methods

Normal model
Note that assumption of normal model is inconsistent with assumption of matching estimator

Bias Function
Matching assumption implies that B(P(Z))=0.
Difference-in-difference matching assumes that the bias function differences out over time

Decomposition of Sources of Bias for B = E(Y0|D=1) - E(Y0|D=0)

Propensity Score Distribution

Pointwise Bias and Comparison with Normal Model

Pointwise bias over time, conditional on P

Nonexperimental Estimators: Regression Discontinuity Methods
A known rule determines who gets treatment, but assignment is nonrandom
Probability of getting treatment changes discontinuously as a function of underlying variables

Previous research
Introduced in Thistlethwaite and Campbell (1960)
Analyzed by Goldberger (1972) in context of evaluating education interventions
Applications in Berk and Rauma (1983), van der Klaauw (1996) and Angrist and Lavy (1996)
Many studies rely implicitly on nonlinearities or discontinuities in the treatment assignment rules (e.g. Black, 1996, and Angrist and Krueger, 1991)

[Figure: post-test score plotted against pre-test score, with cutoff c]

Questions

How does the RD design provide additional sources of identifying information?
How can treatment effects be recovered with minimal parametric restrictions?
What is the relationship between RD and IV estimators?

[Figure: Sharp design, Pr(D=1|z) plotted against z]
[Figure: Fuzzy design, Pr(D=1|z) plotted against z, with cutoff c]

IV and Local IV (LIV) Estimators
Suppose binary instrument Z
Identifying assumption is that

When does the IV estimator identify treatment-on-the-treated?
Case 1: Common effect
Case 2: Heterogeneous impacts

Examples
Angrist (1990) uses draft lottery as instrument. Could be invalid for TT parameter if
Firms take into account lottery numbers in making hiring decisions
Workers take actions to avoid draft
Moffitt (1996) uses cross-section variation in welfare benefits as an instrument for participation in a job training program
Could be invalid if anticipated benefit correlated with welfare benefit

LATE

Imbens and Angrist (1994, Econometrica) show that even if the assumptions that would justify application of IV for purpose of estimating TT are not valid, the IV estimator still identifies the LOCAL AVERAGE TREATMENT EFFECT (LATE)
LATE is the average treatment effect for the subset of individuals induced to change their treatment status by the instrument

Distinction between always-takers, compliers, never-takers

LATE is the average treatment effect for compliers only, and compliers cannot be individually identified in the data
Size of the group of compliers is unknown and the group may be instrument-dependent
LATE assumes that the instrument affects everyone's propensity to take the treatment in the same direction (monotone response to the instrument)

Examples of LATE interpretation
In Angrist (1990) example, get effect of military service for the subset induced to enter the military by the draft lottery (excludes those who always join or who never go, despite the draft)
Angrist and Krueger (1991) study effect of schooling on earnings using compulsory schooling laws as an instrument
Gives treatment effect for the subset induced to enter school by the instrument (tend to be low level schooling types)
Angrist and Evans (1998) study effect of fertility on labor supply using twins as an instrument
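The complier interpretation can be illustrated with simulated data. Everything in the sketch below is assumed for illustration: the population shares (20 percent always-takers, 50 percent compliers, 30 percent never-takers), the group-specific effects (3, 1 and 0.5), and a binary instrument assigned by a fair coin. The IV (Wald) ratio is compared with the average effect among all participants.

```python
import random

EFFECT = {"always": 3.0, "complier": 1.0, "never": 0.5}   # assumed gains Y1 - Y0

def wald_estimate(n=200_000, seed=4):
    """Return (IV/Wald estimate, mean effect among those with D=1)."""
    rng = random.Random(seed)
    y_sum = {0: 0.0, 1: 0.0}    # sum of Y by instrument value Z
    d_sum = {0: 0, 1: 0}        # number treated by Z
    count = {0: 0, 1: 0}
    treated_gain = 0.0
    for _ in range(n):
        u = rng.random()
        kind = "always" if u < 0.2 else ("complier" if u < 0.7 else "never")
        z = 1 if rng.random() < 0.5 else 0
        # Always-takers participate regardless of Z; compliers only when Z=1.
        d = 1 if kind == "always" or (kind == "complier" and z == 1) else 0
        y = rng.gauss(0.0, 1.0) + EFFECT[kind] * d
        y_sum[z] += y
        d_sum[z] += d
        count[z] += 1
        treated_gain += EFFECT[kind] * d
    reduced_form = y_sum[1] / count[1] - y_sum[0] / count[0]
    first_stage = d_sum[1] / count[1] - d_sum[0] / count[0]
    tt = treated_gain / (d_sum[0] + d_sum[1])
    return reduced_form / first_stage, tt

if __name__ == "__main__":
    late_hat, tt = wald_estimate()
    print(f"Wald estimate = {late_hat:.3f}, effect among participants = {tt:.3f}")
```

In this design the Wald ratio is close to the complier effect of 1, while the average effect among participants is larger because it also includes always-takers.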

MTE and LIV estimation (Heckman and Vytlacil, 2005 Econometrica)
Develop a unifying theory for how TT, ATE, and LATE all relate to one another
Propose a new concept called the marginal treatment effect (MTE) and show how to estimate it and how to build up the other parameters from it.

Treatment effect model

Parameters of interest

Interpret LATE in terms of MTE

TT, ATE, LATE as function of MTE

Estimation strategy
Uses fact that MTE is a limiting form of LATE

Bounding approaches (Heckman and Smith, 1995, Manski, 1997)
Recall that from experiments, we can only learn about the marginal distributions of Y0 and Y1 and not about the joint distribution
Let Y0 and Y1 denote a discrete outcome, such as employment status
(Y0,Y1) can take on the values (E,E), (E,N), (N,E) and (N,N)
Would like to know treatment effect = PNE - PEN
We only see marginals

Frechet-Hoeffding bounds
Upper bound: prob of joint event cannot exceed probability of the events that compose it
Lower bound: the sum of the individual cell probabilities must equal one
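For the binary employment example, the bounds can be computed directly from the two marginal employment rates. The sketch below is a minimal illustration; the function name `frechet_bounds` and the example rates (0.4 without treatment, 0.6 with treatment) are hypothetical. The difference PNE - PEN equals Pr(Y1=E) - Pr(Y0=E) and is therefore pinned down by the marginals, whereas each cell on its own is only bounded.

```python
def frechet_bounds(p0, p1):
    """Bounds on the joint cell probabilities of (Y0, Y1), given the marginals
    p0 = Pr(Y0 = E) and p1 = Pr(Y1 = E). Keys give (Y0, Y1) status."""
    ee_low = max(0.0, p0 + p1 - 1.0)   # cell probabilities must sum to one
    ee_high = min(p0, p1)              # joint event no more likely than its parts
    return {
        "EE": (ee_low, ee_high),
        "NE": (p1 - ee_high, p1 - ee_low),    # employed only with treatment
        "EN": (p0 - ee_high, p0 - ee_low),    # employed only without treatment
        "NN": (1.0 - p0 - p1 + ee_low, 1.0 - p0 - p1 + ee_high),
    }

if __name__ == "__main__":
    for cell, (low, high) in frechet_bounds(0.4, 0.6).items():
        print(f"P_{cell} lies in [{low:.2f}, {high:.2f}]")
```

The other three cells follow from the bounds on PEE because each marginal is the sum of two cells, so fixing PEE fixes the whole joint distribution.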