Chapter 22 Selection on Unobservables and DML-IV
In our previous chapters on selection on observables, we reviewed methods such as regression, double machine learning, matching, propensity score weighting, and doubly robust approaches. Each of these techniques assumes that all confounders—factors that simultaneously influence both the treatment decision and the observed outcome—are measurable and observable. However, in many practical applications, even with vast amounts of data and numerous variables, unmeasured confounders may still affect both treatment assignment and outcomes. In fields like economics, social sciences, and health research, countless unobservable factors drive human behavior—common examples include motivation, ability, cultural influences, and social networks—necessitating alternative quasi-experimental approaches when observable data alone falls short. These unobservable confounding variables present a major challenge in causal inference, as they limit our ability to reliably isolate true causal effects; unobservable variables may be inherently unmeasurable or simply missing from the available data.
This chapter briefly outlines classical econometric strategies designed to account for unobservable confounders. These methods, while insightful, involve trade-offs and rely on strong assumptions such as valid instruments, parallel trends, or known thresholds. We cover the core ideas of four major approaches: Instrumental Variables (IV), which addresses endogeneity by introducing an external variable that affects the outcome solely through its impact on treatment; Difference-in-Differences (DiD), which exploits before-and-after comparisons between treated and untreated groups under the parallel trends assumption; Regression Discontinuity (RD), which leverages a cutoff point to compare units just above and below the treatment threshold; and Synthetic Control, which constructs a comparison unit from a weighted combination of controls when a traditional group is unavailable.
While these methods might not always yield the causal effects we typically discuss, such as the average treatment effect (ATE), they can offer valuable insights into different treatment effects specific to each model. It is crucial to understand the assumptions of each model and the particular causal effects being estimated. Although this may not be ideal, identifying these specific effects can inform the design of targeted policies and interventions, ultimately contributing to more nuanced decision-making.
It is important to note that these chapters do not aim to provide exhaustive derivations or proofs of these classical methods. Instead, the main focus is on how ML can refine, extend, and improve their application. In this chapter, we discuss how ML techniques can aid in instrument selection, strengthen parallel trends assessments, and ultimately improve causal inference in the presence of unobservable confounders.
22.1 Instrumental Variables (IV)
Observational studies in economics often face the endogeneity problem, where treatment assignment or policy adoption is correlated with unobserved factors that also affect outcomes. When unconfoundedness—relying on the assumption that all relevant confounders are observable—fails, instrumental variables (IV) methods emerge as a powerful solution. The central idea behind this method is to find an instrumental variable that influences treatment assignment while remaining independent of unobservable/unmeasured confounders and having no direct effect on the outcome except through its effect on treatment. This approach effectively extracts variation in the treatment that is free from the bias introduced by unobserved factors, allowing for more credible causal inference.
Randomization, as discussed in Chapter 17, is a fundamental tool in economics, political and social science, and health for establishing causality. Randomized controlled trials (RCTs) have been used in labor economics to assess the impact of job training programs, in political science to evaluate voter mobilization efforts, and in development economics to measure the effects of cash transfer programs on poverty alleviation. These experiments ensure that treatment assignment is exogenous, minimizing confounding issues and providing robust causal estimates. However, as we also discussed, randomization is not always feasible due to ethical, logistical, or political constraints.
In cases where actual randomization is not possible, researchers turn to exogenous randomization from natural experiments, policy changes, or other external shocks. Historically rooted in economics and widely adopted in the social sciences and genetics, natural experiments serve as instrumental variables for addressing endogeneity. Natural experiments occur when external circumstances or policy changes create conditions that mimic random assignment, offering exogenous variation in treatment assignment that researchers can exploit to estimate causal effects. For instance, researchers have used instruments like distance to college or quarter of birth to estimate the returns to education, and random supply shocks such as weather events or natural disasters to assess policy impacts, along with many other examples we will discuss later. From the Rubin Causal Model perspective, an instrument offers a means to indirectly assign treatment in a way that is plausibly exogenous, thus facilitating the identification of causal effects under specific conditions.
We use the term treatment broadly, originating from health sciences but applicable to economic policies, social programs, and exogenous shocks that impact outcomes. Whether referring to a subsidy, a regulatory intervention, or a natural disaster, treatment in this context encompasses any externally imposed variation that influences the variables of interest.
The validity of an instrumental variable hinges on assumptions that are inherently difficult to fully verify. Researchers assess the plausibility of an IV using a combination of theoretical justification, various empirical tests, sensitivity analyses, and robustness checks. While they rely on economic theory and contextual information to support the relevance and excludability of an instrument, no formal test can confirm with absolute certainty that the instrument is completely independent of unmeasured confounders. Ultimately, although these methods help build confidence in the validity of a carefully chosen IV, they cannot guarantee it, so careful interpretation of the results remains essential.
The earliest concept of instrumental variables is generally attributed to Philip G. Wright and his son Sewall (1928), with IV methods traditionally formulated within structural equation models (SEMs) that rely on the key assumption that the instrument is uncorrelated with the structural error terms. These classical approaches typically assume homogeneous treatment effects, and methods such as two-stage least squares (2SLS) have proven powerful under these linear conditions. However, caution is warranted when applying 2SLS to non-linear models—whether involving quadratic treatment effects or logistic regressions for binary treatments—as several open questions remain. For example, if the true outcome model includes interactions between covariates and the treatment, suggesting heterogeneous treatment effects, or if data are clustered, the interpretation and consistency of the standard 2SLS estimator may be compromised, necessitating alternative estimation strategies like random effects models. In landmark papers during the 1990s, Angrist, Imbens, and Rubin advanced the field by linking IV methods to the potential outcomes framework in causal inference. Motivated by the issue of noncompliance, where individuals fail to follow assigned treatments or treatments are not perfectly assigned (e.g., in randomized controlled trials), they showed that IV offers a way to still estimate causal effects, even when randomization fails to produce perfectly controlled treatment assignment. This shift not only allowed for the accommodation of heterogeneous effects but also clarified the assumptions necessary for causal interpretation, addressing some of the limitations inherent in traditional SEM approaches.
Moreover, recent advances in machine learning have improved the application of IV methods. Techniques such as LASSO and random forests help in the data-driven selection of valid instruments from a large pool of candidates, while methods like Post-LASSO and double selection refine the estimation process by mitigating omitted variable bias. Additionally, machine learning tools—including random forests, gradient boosting, and deep neural networks (as seen in Deep IV approaches)—enable flexible, non-linear modeling of the relationship between instruments and endogenous regressors, as well as the assessment of instrument validity. In this section, we explore these classical IV strategies and consider practical examples that illustrate various IV methods for estimating causal effects. If you are already familiar with IV from econometrics, you can jump to the double machine learning IV section.
22.1.1 Core IV Concepts
Instrumental variables (IV) methods offer a solution to the endogeneity problem in observational studies, where treatment assignment is often correlated with unobserved factors that also affect outcomes. In the traditional linear model, the outcome \(Y_i\) is expressed as
\[\begin{equation} Y_i = \beta_0 + \beta_1 D_i + \beta_2' X_i + \varepsilon_i \end{equation}\]
with \(D_i\) representing the endogenous treatment variable and \(X_i\) exogenous covariates. Here, the target parameter is \(\beta_1\), the causal/treatment effect; however, because the error term \(\varepsilon_i\) is correlated with \(D_i\) (as we cannot assume independence/unconfoundedness), direct OLS estimation of \(\beta_1\) is biased. Keep in mind that in the selection-on-observables chapters, we implemented various methods to address this bias, such as regression, matching, weighting, and double machine learning. However, those methods rely on the assumption that all confounders are observable, which may not always be the case.
To address this, IV methods introduce a vector of instruments \(Z_i\) that must satisfy two key conditions. First, the instrument must be relevant, meaning it is strongly correlated with the endogenous regressor \(D_i\). Second, it must satisfy the exclusion restriction—that is, \(Z_i\) should have no direct effect on the outcome \(Y_i\) except through its effect on \(D_i\). In a sense, the instrument allows us to extract the variation in the treatment that is free from the bias introduced by unobserved confounders. When the dimension of \(Z_i\) is one, the model is just-identified, which is the most common case; if there are multiple instruments, the model is over-identified.
In the just-identified case without covariates, the IV estimator for \(\beta_1\) is computed as the ratio of the covariance between \(Y_i\) and the instrument \(Z_i\) to the covariance between \(D_i\) and \(Z_i\). The IV estimator of \(\beta_1\) is given by the ratio of covariances64:
\[\begin{equation} \hat{\beta}_{1,\text{IV}} = \frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(D_i, Z_i)} = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})(Z_i - \bar{Z})}{\frac{1}{n}\sum_{i=1}^n (D_i - \bar{D})(Z_i - \bar{Z})} \end{equation}\]
There are two common interpretations of the IV estimator—one based on the indirect least squares (ILS) approach (please see the footnote65 for details) and one based on a two-stage procedure. Essentially, since we cannot directly regress \(Y\) on \(D\) due to unobservable confounders, we use \(Z\)—a variable highly correlated with \(D\)—to isolate the variation in \(D\) that is independent of these confounders, employing a two-stage process to estimate the effect on \(Y\). In the first stage, we predict the treatment using OLS:
\[\begin{equation} \hat{D}_i = \hat{c}_{20} + \hat{c}_{21} Z_i \end{equation}\]
and in the second stage, we replace the actual treatment with \(\hat{D}_i\) in the outcome equation:
\[\begin{equation} Y_i = \beta_0 + \beta_1 \hat{D}_i + v_i \end{equation}\]
The OLS estimate of \(\beta_1\) in this second stage, \(\hat{\beta}_{1,2SLS}\), is equivalent to the IV estimator, \(\hat{\beta}_{1,IV}\). Specifically, we can derive that
\[\begin{equation} \hat{\beta}_{1,2SLS} = \frac{\text{Cov}(Y_i,Z_i)}{\text{Cov}(D_i,Z_i)} \end{equation}\]
which matches the ratio-of-covariances formulation of the IV estimator. Moreover, this estimator is also identical to the indirect least squares (ILS) estimator, demonstrating that under these conditions, the IV, ILS, and 2SLS estimators are equivalent.
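To make this equivalence concrete, here is a minimal R sketch with simulated data (a toy example of our own; the AER package provides ivreg): the ratio-of-covariances estimator and the 2SLS estimate coincide in the just-identified case, while OLS is biased by the unobserved confounder.
# Minimal sketch: IV as a ratio of covariances vs. 2SLS (just-identified case)
library(AER)                         # provides ivreg()
set.seed(1)
n <- 5000
u <- rnorm(n)                        # unobserved confounder
z <- rnorm(n)                        # instrument: moves d, excluded from y
d <- 1 + 0.8 * z + u + rnorm(n)      # endogenous treatment
y <- 2 + 1.5 * d + u + rnorm(n)      # true effect of d on y is 1.5
cov(y, z) / cov(d, z)                # ratio-of-covariances IV estimate (about 1.5)
coef(lm(y ~ d))["d"]                 # OLS estimate, biased upward by u
coef(ivreg(y ~ d | z))["d"]          # 2SLS estimate, matches the covariance ratio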
The discussion extends naturally to the case where covariates are included.66 In the 2SLS framework, the first stage includes the covariates:
\[\begin{equation} \hat{D}_i = \hat{c}_{20} + \hat{c}_{21} Z_i + \hat{c}_{2X}' X_i \end{equation}\]
and the second stage is then
\[\begin{equation} Y_i = \beta_0 + \beta_1 \hat{D}_i + \beta_2' X_i + v_i \end{equation}\]
The OLS estimate of \(\beta_1\) in this second stage is the 2SLS estimator, and it can be shown that the IV, ILS, and 2SLS estimators are also equivalent under these conditions.
Thus far, we have discussed instrumental variables within the context of structural equation models, which assume homogeneous treatment effects. While these traditional methods have been valuable, they have notable shortcomings—such as difficulties in accommodating non-linear models (e.g., quadratic effects or binary treatments estimated via logit), limited capacity to handle interactions between covariates and treatment, inflexibility in extending to more complex settings, and challenges in managing clustered data. All of these shortcomings motivate our transition to the potential outcomes framework, which accommodates heterogeneous effects and offers a clearer basis for causal interpretation.
22.1.2 IV in the Potential Outcomes Framework and RCTs
Instrumental Variables (IV) methods became particularly valuable when applied to randomized experiments that suffer from noncompliance—a situation in which individuals fail to follow their assigned treatments, or treatments are not perfectly implemented. In such experiments, traditional methods of causal inference face significant challenges because the treatment actually received, \(D_i\), does not perfectly align with the treatment assignment, \(Z_i\). Motivated by this issue, Angrist, Imbens, and Rubin (1996, JASA) connected the IV approach to the potential outcomes framework, demonstrating that random assignment itself can serve as a strong instrument. By doing so, they showed that even in the presence of noncompliance, IV methods can identify local causal effects by focusing on a subgroup of individuals—known as compliers—who adhere to their assigned treatment. For a more detailed discussion connecting the potential outcomes framework to IV, please see the last section of this chapter.
Consider a randomized trial for a new drug intended to improve blood pressure control. In this trial, 1,000 patients are randomly assigned to either the treatment group (\(Z_i=1\)) or the control group (\(Z_i=0\)), with 500 patients in each group. The treatment assignment \(Z_i\) indicates whether a patient is assigned to take the new drug. However, noncompliance is evident: in the treatment group, only 300 out of 500 patients actually take the drug (\(D_i=1\)), while in the control group, 50 out of 500 patients take the drug despite being in the control group. Thus, the observed treatment uptake rates are 0.6(=300/500) for the treatment group and 0.1(=50/500) for the control group. The outcome \(Y_i\) is a binary indicator of improvement in blood pressure, with an average outcome of 0.40 for all patients in the treatment group (including both those who take and do not take the treatment while being assigned to the treatment group) and 0.25 for those in the control group (including both those who take and do not take the treatment while being assigned to the control group).
There are two common naïve approaches to estimating causal effects in the presence of noncompliance. The first approach compares outcomes based on the actual treatment received, thus calculating the average treatment effect (the effect of treatment):
\[\begin{equation} \delta^{\text{ATE}} = E[Y_i|D_i=1] - E[Y_i|D_i=0] \end{equation}\]
The ATE is the difference between the “average outcome for all units that are actually treated” and the “average outcome for all units that are actually not treated”. However, this approach suffers from selection bias because the decision to take the drug (treatment) is not completely randomized: some people take the drug even though they are in the control group. When noncompliance occurs, simply comparing groups by the treatment actually received can lead to biased estimates because the decision to comply is self-selected, breaking the benefits of randomization. Patients who opt to take the drug may differ systematically in unobserved ways (e.g., in their health status or motivation) from those who do not, so the resulting difference in means does not yield a valid estimate of the causal effect.
The second approach uses the randomized treatment assignment to compute the Intention-to-Treat (ITT) causal/treatment effect (the effect of treatment assignment):
\[\begin{equation} \delta^{\text{ITT}} = E[Y_i|Z_i=1] - E[Y_i|Z_i=0] \end{equation}\]
Since \(Z_i\) is randomly assigned, this difference is unbiased for the effect of being assigned to take the drug. The ITT effect is the difference between the “average outcome for all units in the treatment group” and the “average outcome for all units in the control group”. In our example, the ITT is \(\delta_{\text{ITT}} = 0.40 - 0.25 = 0.15\).
Yet, while the Intention-to-Treat (ITT) approach preserves randomization by comparing outcomes based on assigned treatment, it only estimates the effect of the assignment rather than the treatment received. The ITT procedure gives a valid estimate of the causal effect of the assignment on the outcome (effectiveness), but not of the effect of the treatment received on the outcome (efficacy). This is not the effect of the treatment itself, as it mixes the effect of taking the drug with the effect of merely being assigned to take it (i.e., some people do not comply with the assignment rules). The ITT effect is often smaller than the true treatment effect because noncompliance and other factors dilute the treatment effect.
To overcome these limitations, the instrumental variable approach is employed. Instead of comparing groups solely based on \(D_i\) or \(Z_i\), instrumental variables (IV) offer a method to recover the causal effect of the treatment among those who comply with their assignment. In this context, the random assignment \(Z_i\) serves as an instrument for the treatment received \(D_i\). Under the potential outcomes framework, define \(Y_i(1)\) and \(Y_i(0)\) as the potential outcomes under treatment and control, respectively, and \(D_i(1)\) and \(D_i(0)\) as the potential treatment statuses when assigned to treatment or control. Patients who take the drug if assigned (\(D_i(1)=1\)) and do not take it if not assigned (\(D_i(0)=0\)) are called compliers, while those who do not follow their assignment—either always-takers (\(D_i(1)=D_i(0)=1\)) or never-takers (\(D_i(1)=D_i(0)=0\))—are called noncompliers.
Before introducing the LATE estimator, it is important to outline the key assumptions required for its identification. First, independence of \(Z_i\) ensures that the instrument is as good as randomly assigned; this assumption is plausible in the context of a randomized experiment. Second, the existence of compliers (i.e., a valid first stage) requires that the instrument meaningfully affects treatment uptake. This assumption is generally uncontroversial and can be empirically tested by regressing \(D_i\) on \(Z_i\) and checking for statistical significance. Third, the monotonicity assumption states that there are no defiers—individuals who would take the treatment when assigned to control but refuse it when assigned to treatment. While typically very reasonable, this assumption is untestable. Finally, the exclusion restriction requires that the instrument affects the outcome only through treatment. Unlike the first-stage condition, this assumption is impossible to verify empirically and is often the most controversial (particularly in observational studies, which are discussed below). These assumptions collectively enable the identification of the Local Average Treatment Effect (LATE).
The IV estimator (Wald Estimator) for the Local Average Treatment Effect (LATE), with the aforementioned assumptions, is given by the ratio:
\[\begin{equation} \delta_{Wald}^{\text{LATE}} = \frac{E[Y_i|Z_i=1] - E[Y_i|Z_i=0]}{E[D_i|Z_i=1] - E[D_i|Z_i=0]} \end{equation}\]
This ratio represents the average change in the outcome that can be attributed to a change in treatment status among those individuals whose treatment behavior is influenced by the instrument \(Z\) (whether assigned to treatment or control by the experimenter or exogenously). In other words, the numerator captures the difference in average outcomes between those who were assigned to take the treatment (\(Z=1\)) and those who were not (\(Z=0\)), while the denominator measures the corresponding difference in treatment uptake. This ratio tells us the causal effect of the treatment on the outcome specifically for the subgroup of compliers—those whose decision to take the treatment is driven by the assignment.67 While LATE itself measures the treatment effect only among compliers, the denominator provides an implicit measure of the proportion of compliers in the sample. This is why a weak first stage (small denominator) can lead to imprecise and unstable LATE estimates, as the estimator is dividing by a small number (see the last section for a detailed proof).
For our drug trial, the denominator is the difference between treatment uptake in the treatment group and treatment uptake in the control group (the proportion of compliers):
\[ E[D_i|Z_i=1] - E[D_i|Z_i=0] = 0.6 - 0.1 = 0.5 \]
and, as calculated above, the ITT treatment effect is 0.15. Hence, the LATE is
\[ \delta_{\text{LATE}} = \frac{0.15}{0.5} = 0.30 \]
This result implies that among the compliers—patients who take the drug only if assigned—the new drug increases the probability of improvement in blood pressure by 30 percentage points. Keep in mind that IV will not give us the ATE; it gives us the average treatment effect for a particular subgroup of the population, the compliers (LATE).
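The arithmetic above can be reproduced in a few lines of R, using the numbers from the example:
# Wald/LATE arithmetic for the drug-trial example
uptake_treated <- 300 / 500          # P(D=1 | Z=1) = 0.6
uptake_control <- 50 / 500           # P(D=1 | Z=0) = 0.1
itt            <- 0.40 - 0.25        # intention-to-treat effect = 0.15
compliers      <- uptake_treated - uptake_control   # first stage = 0.5
late           <- itt / compliers    # Wald estimator = 0.30
late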
Up to this point, our approach has relied on a single binary instrument (\(Z_i\)) and a binary treatment (\(D_i\)). Under these conditions, and with the necessary assumptions, we can estimate the Intention-to-Treat (ITT) effect and the Local Average Treatment Effect (LATE). However, many empirical applications call for a more general framework that accommodates non-binary treatments and instruments, incorporates covariates (especially when \(Z_i\) is not randomly assigned), and uses multiple instruments to strengthen identification.
A widely used method for such generalizations is Two-Stage Least Squares (2SLS). When additional covariates \(X_i\) are included, the estimation proceeds in two stages:
- First Stage:
Regress the endogenous treatment variable on the instrument(s) and covariates to obtain the predicted treatment:
\[\begin{equation} D_i = \alpha_1 + \beta_1 Z_i + \gamma_1 X_i + \epsilon_{1i} \end{equation}\]
Here, \(\hat{D}_i\) represents the fitted values that capture only the exogenous variation in \(D_i\) driven by \(Z_i\) after controlling for \(X_i\).
- Second Stage:
Regress the outcome on the predicted treatment and the same covariates:
\[\begin{equation} Y_i = \alpha_2 + \beta_2 \hat{D}_i + \gamma_2 X_i + \epsilon_{2i} \end{equation}\]
The coefficient \(\hat{\beta}_2\) provides an estimate of the causal effect—interpreted as LATE among compliers.
This method is called “Two-Stage Least Squares” (2SLS) because it involves estimating two separate Ordinary Least Squares (OLS) regressions sequentially. Unlike OLS, which suffers from bias due to endogeneity when treatment is correlated with unobserved factors, 2SLS isolates variation in treatment that is induced by the instrument—while controlling for covariates—allowing for a more credible causal estimate of treatment effects.
The intuition behind Two-Stage Least Squares (2SLS) is as follows:
The first-stage coefficient \(\hat{\beta}_1\) measures the average change in the treatment variable \(D_i\) induced by a one-unit change in the instrument \(Z_i\), holding the covariates \(X_i\) constant. This ensures that \(D_i\) is predicted only from the exogenous variation in \(Z_i\).
The variation in the fitted values \(\hat{D}_i\) reflects only the component of \(D_i\) that is driven by \(Z_i\) after accounting for \(X_i\). This removes any endogeneity present in \(D_i\) due to omitted variables or selection bias.
The second-stage coefficient \(\hat{\beta}_2\) provides an unbiased estimate of the causal effect of \(D_i\) on the outcome \(Y_i\) for compliers—the group of individuals whose treatment status changes in response to the instrument. Since the instrument changes treatment randomly, the estimated treatment effect applies specifically to those who comply with their assigned \(Z_i\).
Ultimately, \(\hat{\beta}_2\) should be interpreted as the effect of a one-unit increase in the predicted treatment \(\hat{D}_i\) on the outcome \(Y_i\), but only for the subset of individuals for whom \(Z_i\) induces a change in \(D_i\). This is why the LATE interpretation is necessary in IV settings, distinguishing it from the standard ATE.
Unlike the Wald estimator, which relies on a binary instrument and binary treatment, 2SLS can handle continuous, categorical, or multi-valued instruments and treatments. When treatment assignment is not purely randomized, controlling for covariates helps reduce residual confounding and improve precision. In such cases, 2SLS allows the inclusion of additional covariates in both the first and second stages. 2SLS can also handle settings where more than one instrument is available. This is useful when a single instrument may be weak or when multiple sources of exogenous variation exist. If multiple instruments are used, the first-stage regression includes all instruments, and the second stage continues to use the predicted values from the first stage. Overall, 2SLS is a more general framework that allows for greater flexibility in estimation, making it the preferred approach when binary instruments and treatments are not sufficient for capturing causal relationships.
We want to emphasize the importance of careful inference: while separate two-stage regressions will yield the same treatment effect estimate, the standard errors, confidence intervals, and p-values will be incorrect. In 2SLS, although the estimation conceptually proceeds in two steps, in practice we do not simply regress on the first-stage fitted values \(\hat{D}_i\) in a second step as if they were observed data. Instead, the entire system of equations is estimated simultaneously so that standard errors (along with confidence intervals and p-values) are correctly computed, accounting for the fact that \(\hat{D}_i\) is an estimated quantity. This is done using instrumental variables regression, which carries the first-stage estimation uncertainty into the second stage, yielding corrected standard errors. In practice, statistical software packages estimate 2SLS using a single command (e.g., ivreg in R, ivregress in Stata, or 2SLS in Python), which automatically adjusts the standard errors appropriately. Thus, while 2SLS is conceptually a two-step procedure, it is statistically estimated as a single model to ensure proper inference.
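As an illustration (a toy simulation of our own, using the AER package), the sketch below contrasts the single-command 2SLS estimate with a manual two-step regression: the point estimates for the treatment coincide, but only the single-command standard errors are valid.
# Sketch: one-command 2SLS vs. manual two-step OLS (simulated data)
library(AER)
set.seed(2)
n <- 5000
x <- rnorm(n)                        # observed covariate
u <- rnorm(n)                        # unobserved confounder
z <- 0.5 * x + rnorm(n)              # instrument, correlated with x
d <- 1 + x + z + u + rnorm(n)        # endogenous treatment
y <- 2 * d + x + u + rnorm(n)        # true treatment effect = 2
iv_fit <- ivreg(y ~ d + x | z + x)   # 2SLS estimated as one system
summary(iv_fit, diagnostics = TRUE)  # valid SEs plus weak-instrument diagnostics
stage1 <- lm(d ~ z + x)              # manual first stage
stage2 <- lm(y ~ fitted(stage1) + x) # same point estimate for the treatment,
coef(stage2)                         # but these second-stage SEs are not valid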
Before we conclude our discussion of IV in randomized control studies, we want to highlight an important distinction between controlling for a variable in regression adjustment and using it as an IV. While both approaches aim to address confounding, they differ fundamentally in how they isolate variation in the treatment variable. The following section explains this key difference and its implications for causal inference.
In the previous RCT chapter, we observed that when controlling for a confounder \(X_i\) in a regression to estimate the effect of \(D_i\) on \(Y_i\), the estimated coefficient \(\beta_1\) captures the relationship between \(Y_i\) and the part of \(D_i\) that is not explained by \(X_i\), meaning it reflects the residual variation in \(D_i\) after accounting for \(X_i\). This is because in an OLS regression, the coefficient on \(D_i\) measures its effect on \(Y_i\) after adjusting for other included covariates.
In contrast, when we use an instrument \(Z_i\) to estimate the effect of \(D_i\) on \(Y_i\) via 2SLS, the estimated coefficient \(\beta_{12}\) captures the relationship between \(Y_i\) and only the part of \(D_i\) that is explained by \(Z_i\)—i.e., the predicted values from the first-stage regression. This means that rather than relying on the naturally occurring variation in \(D_i\) (which may be endogenous), we isolate only the exogenous variation in \(D_i\) that is driven by \(Z_i\). This ensures that the estimated effect is free from confounding, allowing for a valid causal interpretation of \(\beta_{12}\) as the LATE among compliers.
In this section, up until now, we have focused on how IV estimators address noncompliance in randomized experiments, where some units fail to follow their assigned treatment and naïve comparisons yield biased causal estimates. IV estimators provide a solution to this problem by isolating exogenous variation in treatment and estimating the local average treatment effect. In the following section, we shift our focus to a more common application of IV methods—addressing omitted variable bias in observational data.
22.1.3 IV in Observational Data
Instrumental variables (IV) serve as a critical tool for addressing selection bias when treatment assignment is not random. In previous discussions, we explored how IV methods handle noncompliance in experiments, where treatment assignment was random but actual treatment uptake was not. This section extends the IV framework beyond experimental settings to observational studies, where treatment is not assigned randomly and selection bias is a more severe challenge.
Until now, in various chapters, we have explored ways to address omitted variable and selection bias, including randomization, which ensures unbiased treatment assignment; conditioning on observables, which controls for measured confounders; and conditioning on unobservables, where panel data methods help mitigate bias from time-invariant factors. IV methods offer an alternative solution by leveraging an external source of variation—an instrument \(Z_i\)—that influences the outcome only through its effect on treatment, isolating exogenous variation for causal inference.
In many social science and economic settings, treatment assignment is not random. However, instruments serve as exogenous nudges/shocks, influencing treatment uptake without directly affecting the outcome. This allows us to estimate the LATE by isolating only the variation in \(D_i\) driven by the instrument \(Z_i\). The estimated causal effect applies specifically to those individuals whose treatment status would change in response to the instrument. To credibly identify causal effects using IV, we rely on four key assumptions, which mirror those discussed in the RCT section. Here, we formally present them:
Assumption I: Independence of the Instrument: In an experimental setting, the instrument \(Z_i\) is randomly assigned. However, in observational settings, randomization may not hold. Instead, we assume conditional independence:
\[\begin{equation} (Y_{0i}, Y_{1i}, D_{0i}, D_{1i}) \perp\!\!\!\perp Z_i | X_i \end{equation}\]
This means that, after conditioning on covariates \(X_i\), the instrument is as good as randomly assigned. This assumption parallels the conditional independence of the treatment assumption in RCTs and the selection on observables framework.
Assumption II: First-Stage Relevance: The instrument must meaningfully induce variation in treatment:
\[\begin{equation} 0 < P(Z_i = 1) < 1 \quad \text{and} \quad P(D_i(1) = 1) \neq P(D_i(0) = 1) \end{equation}\]
This ensures that \(Z_i\) is strongly correlated with the endogenous treatment variable \(D_i\), allowing us to identify a causal effect.
Assumption III: Monotonicity: The instrument should shift individuals only in one direction—toward treatment, but never away from it:
\[\begin{equation} D_i(1) \geq D_i(0) \quad \forall i \end{equation}\]
This eliminates the possibility of defiers, individuals who would do the opposite of what the instrument assigns.
Assumption IV: Exclusion Restriction: The instrument should affect the outcome only through its impact on treatment, meaning:
\[\begin{equation} Y(D_i = 1, Z_i = 1) = Y(D_i = 1, Z_i = 0) \end{equation}\]
\[\begin{equation} Y(D_i = 0, Z_i = 1) = Y(D_i = 0, Z_i = 0) \end{equation}\]
This means \(Z_i\) must not have a direct effect on \(Y_i\), aside from influencing \(D_i\). The exclusion restriction is often the most controversial and difficult to justify in observational studies, as it requires a strong argument that the instrument does not affect the outcome through any other pathway.
In observational studies, identifying causal effects using IV relies on the same four key assumptions—independence of the instrument, first-stage relevance, monotonicity, and exclusion restriction—as in experimental settings. Since treatment assignment is not random, IV methods help isolate exogenous variation in treatment, allowing researchers to estimate the LATE. 2SLS remains the primary estimation method, enabling the correction of endogeneity while controlling for additional covariates when necessary.
The estimation process follows the same two-stage procedure outlined earlier (the first- and second-stage equations in the 2SLS discussion above). The first stage predicts treatment using the instrument, ensuring that only exogenous variation is used, while the second stage regresses the outcome on the predicted treatment values. This approach corrects for omitted variable bias and selection issues that would otherwise confound the treatment effect. The interpretation remains the same: the estimated effect applies only to compliers—those whose treatment status is influenced by the instrument.
One key difference in observational studies is that we often rely on natural experiments, policy changes, or historical events as instruments rather than controlled experimental designs. Additionally, observational data often require stronger justification for the exclusion restriction, as ensuring that the instrument affects the outcome only through treatment is more challenging. The concerns around weak instruments are particularly relevant in this setting, emphasizing the need for strong first-stage relationships and proper statistical inference. As discussed earlier, 2SLS must be estimated as a single system rather than as separate regressions to obtain valid standard errors, confidence intervals, and p-values. While the core methodology remains unchanged, observational studies demand careful instrument selection and extensive robustness checks.
Finding a valid instrument requires institutional knowledge and a deep understanding of the mechanisms driving the variable of interest. A strong instrument must be highly correlated with treatment while affecting the outcome solely through its influence on treatment. Identifying such an instrument is often challenging and requires both theoretical justification and empirical validation. One fundamental limitation of IV methods is that the exclusion restriction is untestable, meaning researchers must build a convincing argument for why their instrument satisfies this condition. Because there is no definitive test to confirm whether an instrument affects the outcome only through treatment, the credibility of an IV study hinges on the strength of its theoretical and contextual foundation.
Many influential studies in economics and social sciences have leveraged instrumental variables derived from natural experiments, policy changes, historical events, and supply-side shocks to identify causal effects. One of the most well-known examples is Card (1990), who used the Mariel Boatlift as an instrument to study the effect of immigration on wages, exploiting the sudden influx of Cuban migrants to Miami as an exogenous labor supply shock. Similarly, Angrist and Krueger (1991) used quarter of birth as an instrument for educational attainment, relying on compulsory schooling laws that dictate when students can legally drop out. Another widely cited instrument is distance to college, used by Card (1995) to estimate the returns to education, assuming that individuals living closer to a college are more likely to enroll, yet distance itself does not directly affect earnings.
Policy-driven variation has also been a rich source of instruments. Angrist (1990) used Vietnam draft lottery numbers to estimate the effects of military service on earnings, leveraging the random assignment of draft eligibility as an exogenous source of variation in veteran status. Similarly, Currie and Gruber (1996) exploited Medicaid expansions to evaluate the impact of public health insurance on infant and child health outcomes. Another example is Acemoglu, Johnson, and Robinson (2001), who used historical settler mortality rates as an instrument for institutional development, arguing that European colonization strategies shaped modern economic institutions in a way that persists today.
Supply-side shocks often serve as strong instruments in studies of credit constraints and economic development. Klemperer and Meyer (1986) used oil price shocks as an instrument for firms’ investment decisions, while Autor, Dorn, and Hanson (2013) relied on China’s export growth as an exogenous shock to study the impact of trade exposure on local labor markets. Natural disasters have also been utilized as instruments; for instance, Miguel, Satyanath, and Sergenti (2004) used rainfall variation as an instrument for economic shocks to examine the link between economic downturns and civil conflict in Africa.
More recent work has expanded the range of instrumental variables used in applied research. Nunn and Qian (2014) used historical potato cultivation patterns as an instrument for long-term agricultural productivity and economic development, demonstrating how historical shocks can provide exogenous variation. Dinkelman (2011) exploited the rollout of electrification in South Africa as an instrument to estimate the effect of household access to electricity on female labor force participation. Aizer and Doyle (2015) used random assignment of public defenders in the U.S. court system to study the impact of legal representation on case outcomes, while Dobkin et al. (2018) used hospital admission thresholds as an instrument to examine the causal effect of hospital care on long-term health outcomes. More recently, Goldsmith-Pinkham, Hull, and Kolesár (2020) provided new methods for assessing the strength and validity of instruments, helping researchers navigate common IV pitfalls in modern applications.
For a comprehensive review of instrumental variable strategies across disciplines, Borjas (2021) and Nakamura and Steinsson (2018) discuss recent advances and applications of IV in labor and macroeconomics. Mogstad and Torgovitsky (2018) provide an updated discussion of IV estimation in settings with treatment effect heterogeneity, extending the standard LATE framework. Meanwhile, Young (2022) critiques common IV approaches and offers guidance on addressing weak instruments and overfitting concerns. Kang et al. (2024) provide an in-depth examination of identification and inference when dealing with potentially invalid instruments, addressing challenges in satisfying standard IV assumptions. Levis, Kennedy, and Keele (2024) discuss nonparametric identification and efficient estimation of causal effects using instrumental variables, presenting advanced techniques for settings with minimal parametric assumptions. These reviews illustrate how instrumental variable methods continue to evolve, providing new insights while refining best practices in applied econometrics.
22.1.4 Weak Instrument
Even when an instrument satisfies the independence and exclusion restriction assumptions, it may still fail to be useful if it does not induce enough variation in treatment. A weak instrument provides little leverage for identifying causal effects, leading to large standard errors and potentially biased estimates. Thus, even with a well-justified instrument, researchers must ensure that it has a sufficiently strong first-stage relationship with treatment to make IV estimation reliable. Next, we discuss how to determine whether an instrument is weak.
In instrumental variables (IV) estimation, the strength of the instrument is crucial for obtaining reliable and unbiased estimates. An instrument is considered weak if it has a minimal impact on the endogenous explanatory variable \(D_i\), leading to instability in the IV estimator. This weakness can result in estimates that are biased toward those obtained from ordinary least squares (OLS) regression, with standard errors that are underestimated, thereby compromising statistical inference. Notably, this bias persists even as the sample size increases, making it a significant concern in empirical research (Stock & Yogo, 2005; Staiger & Stock, 1997).
To assess the strength of an instrument, researchers often examine the first-stage regression, where the endogenous variable \(D_i\) is regressed on the instrument \(Z_i\) and any additional covariates \(X_i\). The key statistic from this regression is the F-statistic, which tests the null hypothesis that the instrument has no explanatory power over the endogenous variable. A commonly used rule of thumb, as suggested by Staiger and Stock (1997), is that an F-statistic less than 10 indicates a weak instrument.68 However, this threshold is not absolute; the appropriate critical value may vary depending on the specific context and the number of instruments used. Stock and Yogo (2005) provide more refined critical values for detecting weak instruments, emphasizing that the conventional threshold may be insufficient in certain cases.
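As a small illustration of this diagnostic (simulated data of our own; with multiple instruments one would test their joint significance in the same way):
# Sketch: first-stage F-statistic for the excluded instrument
set.seed(3)
n <- 2000
x <- rnorm(n)
z <- rnorm(n)
d <- 0.15 * z + 0.5 * x + rnorm(n)   # deliberately modest first stage
unrestricted <- lm(d ~ z + x)        # first-stage regression
restricted   <- lm(d ~ x)            # drops the excluded instrument
anova(restricted, unrestricted)      # F-statistic; values below roughly 10 signal weakness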
For a more comprehensive discussion on the implications of weak instruments and alternative approaches, readers are encouraged to consult Andrews, Stock, and Sun (2018), which recommends using identification-robust Anderson-Rubin confidence intervals when only a single instrument is available. Addressing weak instruments is critical, as they can lead to biased parameter estimates and invalid inference, ultimately undermining the credibility of empirical results. Other techniques to handle weak instruments include LIML estimators, which reduce bias and improve inference in weak instrument settings.69 Machine learning methods offer data-driven solutions for identifying and constructing stronger instruments, helping to mitigate the challenges posed by weak instruments—an approach we will explore in the next section.
IV methods are a powerful solution to selection bias, but their credibility hinges on strong assumptions. In observational settings, the challenge is not only finding valid instruments but also justifying their exogeneity and ensuring their strength. Weak instruments and violations of the exclusion restriction can lead to substantial bias, making careful instrument selection essential. 2SLS provides a flexible way to implement IV, incorporating multiple instruments, covariates, and continuous treatments.
22.2 Double Machine Learning IV (DML-IV)
In econometrics, Instrumental Variables (IVs) are essential for addressing endogeneity, such as unobservable (unmeasured, hidden) confounders, when estimating causal effects. However, weak instruments—IVs with low correlation to the endogenous variable—can lead to biased estimates, inconsistent inference, and wide confidence intervals. Double Machine Learning (DML) offers a framework to handle high-dimensional confounders and weak instruments (\(Z\)) while maintaining valid statistical inference. This section focuses on the machine learning methods for IV. We’ll explore the algorithm, practical implementation, and illustrate the process with an R simulation tailored for applied researchers.
22.2.1 Structural Model and Instrumental Variables Estimation
We consider the structural equation model (Partially Linear IV Model):
\[\begin{equation} Y = D\delta + f(X) + \varepsilon \end{equation}\]
where \(Y\) is the outcome variable, \(D\) is the endogenous treatment variable, and \(\delta\) is the parameter of interest. The function \(f(X)\) captures the effect of covariates \(X\), while \(\varepsilon\) represents the error term. Since \(D\) is endogenous, we introduce an instrumental variable (an external factor or exogenous shock) \(Z\) that satisfies:
\[\begin{equation} D = Z\pi + g(X) + \eta \end{equation}\]
where \(\pi\) denotes the first-stage effect of \(Z\) on \(D\), \(g(X)\) accounts for the dependence of \(D\) on \(X\), and \(\eta\) is the first-stage residual. Given that \(Z\) may also depend on \(X\), we model:
\[\begin{equation} Z = h(X) + \nu \end{equation}\]
where \(h(X)\) captures the dependence of \(Z\) on \(X\), and \(\nu\) is the residual in the instrument equation.
22.2.2 Double Machine Learning (DML-IV) Estimation
To estimate \(\delta\), we use a double machine learning (DML) approach (i.e., using cross-fitted residualized forms), requiring three nuisance functions:
- \(m_1(X) = E[Y | X]\)
- \(m_2(X) = E[D | X]\)
- \(m_3(X) = E[Z | X] = h(X)\)
Note that \(m_1(X)\) and \(m_2(X)\) are the conditional means of \(Y\) and \(D\) given \(X\); under the model above, \(m_1(X) = \delta\, m_2(X) + f(X)\) and \(m_2(X) = \pi\, h(X) + g(X)\), so they generally differ from \(f(X)\) and \(g(X)\) themselves.
\(r_1\) corresponds to \(\tilde{Y}\), the residualized outcome after removing the conditional expectation \(m_1(X) = E[Y | X]\). In the notation often used in double machine learning, residualizing means partialing out the effect of covariates, so we have:
\[\begin{equation} r_1 = Y - m_1(X) = \tilde{Y} \end{equation}\]
Similarly:
\[\begin{equation} r_2 = D - m_2(X) = \tilde{D}, \quad r_3 = Z - m_3(X) = \tilde{Z} \end{equation}\]
Thus, the estimator for \(\delta\) can be written as:
\[\begin{equation} \hat{\delta} = \frac{\mathbb{E}[r_3 r_1]}{\mathbb{E}[r_3 r_2]} = \frac{\mathbb{E}[\tilde{Z} \tilde{Y}]}{\mathbb{E}[\tilde{Z} \tilde{D}]} \end{equation}\]
which is equivalent to a two-stage least squares (2SLS) regression using residualized variables, ensuring orthogonality between \(\tilde{Z}\) and the error term.
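To see why this ratio identifies \(\delta\), substitute the structural equations into the residualized variables (a brief sketch, assuming \(E[\varepsilon \mid X] = 0\), \(E[\eta \mid X] = 0\), \(E[\nu \mid X] = 0\), and instrument exogeneity \(E[\nu \varepsilon] = 0\)):
\[ \tilde{Y} = Y - E[Y \mid X] = \delta\,(D - E[D \mid X]) + \varepsilon = \delta \tilde{D} + \varepsilon, \qquad \tilde{Z} = Z - E[Z \mid X] = \nu \]
\[ \mathbb{E}[\tilde{Z}\tilde{Y}] = \delta\, \mathbb{E}[\tilde{Z}\tilde{D}] + \mathbb{E}[\nu \varepsilon] = \delta\, \mathbb{E}[\tilde{Z}\tilde{D}], \qquad \text{so} \qquad \delta = \frac{\mathbb{E}[\tilde{Z}\tilde{Y}]}{\mathbb{E}[\tilde{Z}\tilde{D}]} \quad \text{whenever } \mathbb{E}[\tilde{Z}\tilde{D}] \neq 0 \]
The last step uses the exogeneity/exclusion of the instrument, and the non-zero denominator is exactly the relevance condition (\(\pi \neq 0\)).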
22.2.3 Neyman Orthogonality and Robustness
The Neyman Orthogonality condition ensures that small errors in estimating \(m_1(X)\), \(m_2(X)\), and \(m_3(X)\) do not introduce bias in \(\delta\). This is crucial in double machine learning, making the estimator robust to overfitting and regularization errors.
The condition is expressed through the score function:
\[\begin{equation} \mathbb{E}[\psi(W; \theta, \eta)] = 0 \end{equation}\]
where \(W = (Y, D, Z, X)\) represents the observed data, \(\theta = \delta\) is the parameter of interest, and \(\eta = (m_1, m_2, m_3)\) denotes the nuisance functions.
A common score function for DML-IV is:
\[\begin{equation} \psi(W; \delta, \eta) = (Y - m_1(X) - \delta (D - m_2(X))) (Z - m_3(X)) \end{equation}\]
This ensures that residualized variables satisfy orthogonality, making the estimation of \(\delta\) robust to errors in the machine learning-based nuisance estimates. The Neyman-Orthogonality condition holds if small errors in estimating the nuisance functions do not affect the moment equation. Mathematically, this means:
\[\begin{equation} \frac{\partial \mathbb{E}[\psi(W; \delta, \eta)]}{\partial \eta} = 0 \end{equation}\]
ensuring that small errors in nuisance function estimation do not impact the final estimator.
To verify that the Neyman-Orthogonality condition holds, we take partial derivatives of the score function with respect to the nuisance functions \(m_1(X), m_2(X), m_3(X)\):
\[\begin{equation} \frac{\partial \psi}{\partial m_1(X)} = - (Z - m_3(X)) \end{equation}\]
This term should have an expectation of zero to ensure robustness.
\[\begin{equation} \frac{\partial \psi}{\partial m_2(X)} = \delta (Z - m_3(X)) \end{equation}\]
Since \(m_2(X)\) estimates \(E[D | X]\), any misestimation should not affect the expected score function.
\[\begin{equation} \frac{\partial \psi}{\partial m_3(X)} = - (Y - m_1(X) - \delta (D - m_2(X))) \end{equation}\]
This ensures that any error in instrument modeling does not introduce bias into the estimation of \(\delta\). Using these derivatives, we define residualized variables for estimation:
\[\begin{equation} r_1 = Y - \hat{m}_1(X), \quad r_2 = D - \hat{m}_2(X), \quad r_3 = Z - \hat{m}_3(X) \end{equation}\]
The moment condition that should hold in estimation is:
\[\begin{equation} \mathbb{E} \left[ (r_1 - \delta r_2) r_3 \right] = 0 \end{equation}\]
This formulation ensures that even if \(\hat{m}_1(X), \hat{m}_2(X), \hat{m}_3(X)\) are estimated imperfectly, the moment equation remains valid.
Below in the simulation, we explicitly compute the score function and check whether the Neyman-Orthogonality condition is satisfied. Normally, when using built-in R, Stata, or Python packages for DML-IV, this step is not shown explicitly. However, we believe covering it here helps readers understand the underlying mechanics and allows them to verify orthogonality if needed.
It is recommended to use the “partial-out” score function (the one above; see Chernozhukov et al. (2018)) over an alternative formulation (the one below, also discussed in Chernozhukov et al. (2018)), as it involves only conditional mean functions, which can be efficiently estimated using machine learning. The alternative score function is: \(\psi(W; \delta, \eta) = (Y - D\delta - m_1(X)) (Z - m_3(X))\)
We adopt a generalized DML-IV algorithm, extending the approach used in DML-Lasso, as detailed in the lasso section. This allows flexibility in implementing various machine learning methods such as lasso, random forests, and other regularized estimators for nuisance function estimation.
22.2.4 Algorithm: DML-IV
Step 1: Inputs
Observed Data:
\[ W_i = (Y_i, D_i, Z_i, X_i), \quad i = 1, \dots, N \]
where \(Y_i\) is the outcome, \(D_i\) is the endogenous treatment, \(Z_i\) is the instrument, and \(X_i\) represents control variables.
Machine Learning Estimation of Nuisance Functions:
To estimate conditional expectations, define the following models:
- \(m_1(X) = E[Y | X]\) (Outcome model)
- \(m_2(X) = E[D | X]\) (Treatment model)
- \(m_3(X) = E[Z | X]\) (Instrument model)
By defining them this way, the Neyman-Orthogonal Score Function is implicitly implemented through residualization and 2SLS estimation. In simulations, we use Lasso as a baseline, but other ML methods such as Random Forest, Neural Networks, and Gradient Boosting can also be applied. The choice of estimator depends on the complexity of the data and the desired level of flexibility.
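As a hedged illustration of swapping the learner, the sketch below uses the randomForest package in place of rlasso; fit_nuisance is a hypothetical helper of our own, not the estimator used in the simulation that follows.
# Sketch: a drop-in nuisance learner based on a random forest instead of rlasso.
# fit_nuisance() is a hypothetical helper, not part of any package.
library(randomForest)
fit_nuisance <- function(X_train, y_train, X_val) {
  rf <- randomForest(x = X_train, y = y_train, ntree = 200)  # fit on training folds
  predict(rf, X_val)                                         # predict the held-out fold
}
# Inside the cross-fitting loop of Step 2 one could then use, e.g.:
# Y_hat_val <- fit_nuisance(X_train, Y_train, X_val)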
Step 2: Train ML Predictors Using Cross-Fitting
Split the data into \(K\) folds \(\{I_k\}_{k=1}^{K}\), ensuring equal partitioning. A typical choice is \(K = 5\) for cross-fitting.
Train the Nuisance Models \(\hat{m}_{1,k}(X)\), \(\hat{m}_{2,k}(X)\), and \(\hat{m}_{3,k}(X)\), excluding observations in fold \(I_k\) to prevent overfitting. Each model is trained on \(N - N/K\) observations.
- We can tune hyperparameters using cross-validation or apply plug-in penalty selection methods like rlasso.
- The plug-in penalty method is the preferred approach, as discussed in detail in the DML-Lasso section.
Compute Cross-Fitted Residuals for each \(i \in I_k\):
\[ r_{1i} = Y_i - \hat{m}_{1,k}(X_i) \quad \text{(Residualized Outcome)} \]
\[ r_{2i} = D_i - \hat{m}_{2,k}(X_i) \quad \text{(Residualized Treatment)} \]
\[ r_{3i} = Z_i - \hat{m}_{3,k}(X_i) \quad \text{(Residualized Instrument)} \]
Cross-fitting mitigates overfitting by ensuring that each observation’s nuisance function estimate is derived from a model trained on different data. This technique is crucial in DML as it reduces bias in the final estimator.
Step 3: Estimate the Causal Parameter \(\delta_0\)
Solve for \(\delta_0\) using the moment equation:
\[ \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \psi(W_i; \delta_0, \hat{m}_{1,k}, \hat{m}_{2,k}, \hat{m}_{3,k}) = 0 \]
where:
\[ \psi(W; \delta, \eta) = (r_1 - \delta r_2) r_3 \]
This step follows the DML2 approach, where residuals from all folds are merged before estimating \(\delta_0\) using instrumental variables regression (2SLS).
Step 4: Implementing in Practice (Example: R Simulation)
- The estimation is equivalent to 2SLS using residualized variables:
- First stage: Residualized treatment \(r_2\) is regressed on residualized instrument \(r_3\).
- Second stage: Residualized outcome \(r_1\) is regressed on the predicted values from the first stage.
- The coefficient on \(r_2\) provides the IV estimate of \(\delta_0\).
- Standard errors and confidence intervals from IV regression are valid if the first-stage is strong (i.e., the instrument \(r_3\) strongly predicts \(r_2\)).
- If the first-stage is weak, standard IV inference may be misleading. In such cases, consider methods for inference with weak instruments, such as the Anderson-Rubin test, the Kleibergen-Paap statistic, or Conditional Likelihood Ratio (CLR) inference; a small Anderson-Rubin sketch follows this list.
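As a hedged sketch of one such method, an Anderson-Rubin confidence set for a single instrument can be built by test inversion. The residuals below are simulated stand-ins for \(r_1, r_2, r_3\), and reporting the set as a single interval assumes it is connected (with very weak instruments it can be unbounded or disjoint).
# Sketch: Anderson-Rubin confidence set by test inversion (single instrument)
set.seed(4)
n  <- 2000
r3 <- rnorm(n)                       # residualized instrument
r2 <- 0.2 * r3 + rnorm(n)            # residualized treatment (modest first stage)
r1 <- 1.5 * r2 + rnorm(n)            # residualized outcome, true delta = 1.5
ar_pvalue <- function(delta0) {      # H0: delta = delta0  <=>  r3 has no effect
  fit <- lm(I(r1 - delta0 * r2) ~ r3)
  summary(fit)$coefficients["r3", "Pr(>|t|)"]
}
grid <- seq(-2, 5, by = 0.01)
range(grid[sapply(grid, ar_pvalue) > 0.05])   # 95% AR set, reported as an interval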
22.2.5 DML2-IV Simulation using LASSO
In this simulation, we illustrate the complete DML-IV framework for estimating a causal effect in a high-dimensional setting. We begin by generating synthetic data that mimics a realistic empirical environment: the outcome is influenced by both the endogenous treatment and a set of high-dimensional continuous and binary covariates, while the instrument is strongly related to the treatment but only weakly related to the outcome. This setup captures the common challenge in empirical work where endogeneity must be addressed through valid instruments.
Once the data are generated, we estimate the necessary nuisance functions using Lasso. To avoid overfitting, we employ cross-fitting: the data are divided into multiple folds, and for each fold, the nuisance functions are estimated on the remaining observations. After obtaining the cross-fitted residuals, we proceed to the causal parameter estimation via 2SLS regression. Here, the key insight is that by regressing the residualized outcome on the residualized treatment—using the residualized instrument as an instrument—we effectively solve the moment equation that embodies the Neyman-Orthogonality condition. This condition guarantees that small errors in the nuisance function estimates do not influence the final causal effect estimate. We further verify this by checking that the expectation of the score function’s derivatives with respect to the nuisance functions is near zero.
We also assess the strength of the instrument through a weak instrument test. By running a first-stage regression of the residualized treatment on the residualized instrument and computing a robust F-statistic (using heteroskedasticity-robust standard errors), we can determine whether the instrument provides sufficient variation to yield reliable IV estimates. A robust F-statistic well above the conventional threshold suggests that the instrument is strong; otherwise, one might need to consider inference methods tailored to weak instruments.
Although our example uses simulated data to clearly demonstrate each step of the DML-IV process, you can simply replace the simulated data with your data and proceed with the cross-fitting procedure described in Step 2.
# Load required libraries
library(MASS) # For generating multivariate normal data
library(estimatr) # For HC3 robust SEs (lm_robust)
library(AER) # For IV regression (2SLS)
library(hdm) # For rlasso (Plug-in Penalty Selection)
set.seed(123) # Ensure reproducibility
# === Step 1: Define Data ===
N <- 10000 # Number of observations
p <- 10 # Number of covariates (high-dimensional X)
# Split X into 1/3 continuous and 2/3 binary covariates
p_cont <- round(p / 3) # Number of continuous covariates
p_bin <- p - p_cont # Number of binary covariates
# Continuous covariates ~ N(0,1)
X_cont <- mvrnorm(N, mu = rep(0, p_cont), Sigma = diag(p_cont))
# Binary covariates ~ Bernoulli(0.5)
X_bin <- matrix(rbinom(N * p_bin, 1, 0.5), nrow = N, ncol = p_bin)
# Combine X into a single matrix
X <- cbind(X_cont, X_bin)
# Generate Instrument Z (Strongly Correlated with D, Nearly Uncorrelated with Y)
Z <- 0.7 * (X_cont[,1]) + 0.3 * rnorm(N)
# Generate Treatment D (Strongly Correlated with Z)
D_prob <- plogis(25 * Z + 0.8 * X %*% runif(p, -0.5, 0.5)) # First-stage
D <- rbinom(N, 1, D_prob) # Binary treatment
# Generate Outcome Y (Affected by D, Weakly Related to Z)
delta_true <- 2 # True effect of D on Y
beta <- runif(p, -1, 1) # Coefficients for X
Y <- delta_true * D + X %*% beta + 0.1 * Z + rnorm(N) # Small correlation between Z and Y
# === Check Correlations ===
cat("Correlation(Y, D):", cor(Y, D), "\n")
## Correlation(Y, D): 0.3697644
## Correlation(Z, D): 0.797973
## Correlation(Z, Y): 0.2315305
# === Define Neyman-Orthogonal Score Function ===
score_function <- function(Y, D, Z, m1, m2, m3, delta) {
return((Y - m1 - delta * (D - m2)) * (Z - m3))
}
# === Step 2: Define Fold Assignments (Cross-Fitting) ===
K <- 5 # Number of cross-fitting folds
folds <- split(sample(1:N), rep(1:K, length.out=N)) # Create K folds
# Initialize residual vectors
Y_resid <- rep(NA, N) # Residuals for Y
D_resid <- rep(NA, N) # Residuals for D
Z_resid <- rep(NA, N) # Residuals for Z
# === Step 3: Estimate Nuisance Functions and Compute Residuals ===
for (k in 1:K) {
# Define training (all but k-th fold) and validation (k-th fold) indices
train_idx <- unlist(folds[-k])
val_idx <- folds[[k]]
# Training Data
X_train <- X[train_idx, ]
Y_train <- Y[train_idx]
D_train <- D[train_idx]
Z_train <- Z[train_idx]
# Validation Data
X_val <- X[val_idx, ]
# ---- Fit Lasso for Outcome Model (Y ~ X) ----
lasso_y <- rlasso(X_train, Y_train)
Y_hat_val <- predict(lasso_y, X_val)
# ---- Fit Lasso for Treatment Model (D ~ X) ----
lasso_d <- rlasso(X_train, D_train)
D_hat_val <- predict(lasso_d, X_val)
# ---- Fit Lasso for Instrument Model (Z ~ X) ----
lasso_z <- rlasso(X_train, Z_train)
Z_hat_val <- predict(lasso_z, X_val)
# ---- Compute Cross-Fitted Residuals ----
Y_resid[val_idx] <- Y[val_idx] - Y_hat_val # r1 = Y - m1(X)
D_resid[val_idx] <- D[val_idx] - D_hat_val # r2 = D - m2(X)
Z_resid[val_idx] <- Z[val_idx] - Z_hat_val # r3 = Z - m3(X)
}
# === Step 4: Solve Moment Equation Using 2SLS ===
library(sandwich) # Load necessary packages for robust standard errors
library(lmtest)
# The estimator is given by:
# delta_hat = E[r3 * r1] / E[r3 * r2], which is solved via IV regression
dml_iv_model <- ivreg(Y_resid ~ D_resid | Z_resid) # 2SLS estimation
# Compute robust standard errors
robust_se <- coeftest(dml_iv_model, vcov = vcovHC(dml_iv_model, type = "HC3"))
# Extract and Print Results with Robust SEs and a 95% (normal-approximation) CI
cat("Estimated IV effect using DML-IV:", robust_se[2, 1], "\n")
cat("Robust Standard Error:", robust_se[2, 2], "\n")
cat("95% CI:", robust_se[2, 1] - 1.96 * robust_se[2, 2],
    robust_se[2, 1] + 1.96 * robust_se[2, 2], "\n")
## Estimated IV effect using DML-IV: 2.242062
## Robust Standard Error: 0.06641111
## 95% CI: 2.111222 2.372903
# === Verify the Moment Condition (Check Neyman Orthogonality) ===
delta_hat <- coef(dml_iv_model)[2] # Estimated delta from 2SLS
# Compute derivatives of the score function
d_score_m1 <- -Z_resid
d_score_m2 <- delta_hat * Z_resid
d_score_m3 <- -(Y_resid - delta_hat * D_resid)
# Compute residualized expectations (should be close to zero)
orthogonality_m1 <- mean(d_score_m1)
orthogonality_m2 <- mean(d_score_m2)
orthogonality_m3 <- mean(d_score_m3)
# Orthogonality Condition Check (Should be close to 0)
cat("E[d_score_m1]:", orthogonality_m1, "\n")
cat("E[d_score_m2]:", orthogonality_m2, "\n")
cat("E[d_score_m3]:", orthogonality_m3, "\n")
## E[d_score_m1]: 2.453253e-05
## E[d_score_m2]: -5.500345e-05
## E[d_score_m3]: 0.0001864073
# Compute the expected value of the score function (should be close to zero).
# The residuals already net out the nuisance estimates, so the values plugged in
# for m1, m2, m3 below are the near-zero means computed above; the check therefore
# evaluates approximately mean((r1 - delta_hat * r2) * r3).
moment_check <- mean(score_function(Y_resid, D_resid, Z_resid, orthogonality_m1,
orthogonality_m2, orthogonality_m3, delta_hat))
cat("Moment Equation Check (Should be close to 0):", moment_check)
## Moment Equation Check (Should be close to 0): 7.050893e-08
# === Weak Instrument Test (Compute First-Stage F-statistic) ===
# First-Stage Regression using lm()
first_stage <- lm(D_resid ~ Z_resid)
# Compute heteroskedasticity-robust variance-covariance matrix
robust_vcov <- vcovHC(first_stage, type = "HC3")
# Compute robust standard errors
robust_se <- sqrt(diag(robust_vcov))
# Compute robust t-statistic for Z_resid coefficient
t_stat_robust <- coef(first_stage)[2] / robust_se[2]
# Compute robust F-statistic
f_stat_robust <- t_stat_robust^2 # Since F = t^2 in single regressor case
cat("Robust First-Stage F-statistic (Weak Instrument Test):", f_stat_robust)
## Robust First-Stage F-statistic (Weak Instrument Test): 2444.535
22.2.6 DML2-IV Simulation using RANDOM FOREST
Below is a complete DML-IV simulation that uses Random Forest to estimate the nuisance functions instead of Lasso. You can review Random Forest as covered in Chapter 15 and the DML2 method with Random Forest in the previous chapter. We use the exact same synthetic data as above. Cross-fitting is performed by splitting the data into folds. For each fold, we estimate the outcome, treatment, and instrument models using Random Forest on the training set and predict on the held-out data to obtain cross-fitted residuals. These residuals are then used in a 2SLS regression, just as in the previous example, to solve the moment equation and obtain an estimate of the causal (LATE) effect, \(\delta\). Robust standard errors are computed to ensure valid inference.
# Load required libraries
library(MASS) # For generating multivariate normal data
library(estimatr) # For HC3 robust SEs (lm_robust)
library(AER) # For IV regression (2SLS)
library(randomForest) # For Random Forest estimation of nuisance functions
library(sandwich) # For robust standard errors
library(lmtest) # For robust inference
set.seed(123) # Ensure reproducibility
# === Step 1: Define Data ===
N <- 10000 # Number of observations
p <- 10 # Number of covariates (high-dimensional X)
# Split X into 1/3 continuous and 2/3 binary covariates
p_cont <- round(p / 3) # Number of continuous covariates
p_bin <- p - p_cont # Number of binary covariates
# Continuous covariates ~ N(0,1)
X_cont <- mvrnorm(N, mu = rep(0, p_cont), Sigma = diag(p_cont))
# Binary covariates ~ Bernoulli(0.5)
X_bin <- matrix(rbinom(N * p_bin, 1, 0.5), nrow = N, ncol = p_bin)
# Combine X into a single matrix
X <- cbind(X_cont, X_bin)
# Generate Instrument Z (Strongly Correlated with D, Nearly Uncorrelated with Y)
Z <- 0.7 * (X_cont[,1]) + 0.3 * rnorm(N)
# Generate Treatment D (Strongly Correlated with Z)
D_prob <- plogis(25 * Z + 0.8 * X %*% runif(p, -0.5, 0.5))
D <- rbinom(N, 1, D_prob)
# Generate Outcome Y (Affected by D, Weakly Related to Z)
delta_true <- 2
beta <- runif(p, -1, 1)
Y <- delta_true * D + X %*% beta + 0.1 * Z + rnorm(N)
# === Step 2: Define Fold Assignments (Cross-Fitting) ===
K <- 5 # Number of cross-fitting folds
folds <- split(sample(1:N), rep(1:K, length.out = N))
# Initialize residual vectors
Y_resid <- rep(NA, N) # Residuals for Y
D_resid <- rep(NA, N) # Residuals for D
Z_resid <- rep(NA, N) # Residuals for Z
# === Step 3: Estimate Nuisance Functions and Compute Residuals using Random Forest ===
for (k in 1:K) {
# Define training (all but k-th fold) and validation (k-th fold) indices
train_idx <- unlist(folds[-k])
val_idx <- folds[[k]]
# Training Data
X_train <- X[train_idx, ]
Y_train <- Y[train_idx]
D_train <- D[train_idx]
Z_train <- Z[train_idx]
# Validation Data
X_val <- X[val_idx, ]
# ---- Fit Random Forest for Outcome Model (Y ~ X) ----
rf_y <- randomForest(x = X_train, y = Y_train)
Y_hat_val <- predict(rf_y, newdata = X_val)
# ---- Fit Random Forest for Treatment Model (D ~ X) ----
rf_d <- randomForest(x = X_train, y = D_train)
D_hat_val <- predict(rf_d, newdata = X_val)
# ---- Fit Random Forest for Instrument Model (Z ~ X) ----
rf_z <- randomForest(x = X_train, y = Z_train)
Z_hat_val <- predict(rf_z, newdata = X_val)
# ---- Compute Cross-Fitted Residuals ----
Y_resid[val_idx] <- Y[val_idx] - Y_hat_val # r1 = Y - m1(X)
D_resid[val_idx] <- D[val_idx] - D_hat_val # r2 = D - m2(X)
Z_resid[val_idx] <- Z[val_idx] - Z_hat_val # r3 = Z - m3(X)
}
# === Step 4: Solve Moment Equation Using 2SLS ===
dml_iv_model <- ivreg(Y_resid ~ D_resid | Z_resid)
# Compute robust standard errors
robust_se <- coeftest(dml_iv_model, vcov = vcovHC(dml_iv_model, type = "HC3"))
# Extract and Print Results with Robust SEs and a 95% (normal-approximation) CI
cat("Estimated IV effect using DML-IV (Random Forest):", robust_se[2, 1], "\n")
cat("Robust Standard Error:", robust_se[2, 2], "\n")
cat("95% CI:", robust_se[2, 1] - 1.96 * robust_se[2, 2],
    robust_se[2, 1] + 1.96 * robust_se[2, 2], "\n")
## Estimated IV effect using DML-IV (Random Forest): 2.108413
## Robust Standard Error: 0.07169011
## 95% CI: 1.967593 2.249233
22.3 Closing Remarks: ML for IV
Instrumental Variables (IV) methods address endogeneity in observational studies where treatment assignment is not ignorable. Traditional approaches like 2SLS rely on linear models, but modern data environments require more flexible methods. ML-based approaches like DML-IV improve IV estimation by modeling nonlinearities, handling high-dimensional covariates, and identifying heterogeneous treatment effects, expanding IV beyond classical assumptions. Recent nonparametric IV methods extend IV estimation to nonlinear and non-separable models. Deep Instrumental Variables (DeepIV) uses deep learning to approximate structural functions under endogeneity, while Kernel IV provides a nonparametric alternative when both first- and second-stage relationships are complex.
Recent work, such as Qian et al. (2024) Learning Decision Policies with Instrumental Variables through Double Machine Learning, implemented DML-IV on various real-world datasets and compared it to other IV regression methods, including Deep IV, DeepGMM, Kernel IV, and DFIV. The real-world datasets include aeroplane ticket sales and pricing data in a ticket demand scenario (Hartford et al., 2017), the IHDP dataset (747 units, 470 training samples, available at IHDP), and the PM-CMR dataset (2132 counties, 1350 training samples, available at PM-CMR). The findings show that DML-IV outperforms the other IV regression methods across all datasets. The authors also demonstrate that a computationally efficient version of DML-IV can estimate local treatment/causal effects as accurately as standard DML-IV on low-dimensional datasets. We recommend reading this paper if you are interested in a detailed comparison of different machine learning methods for IV regression, along with their implementation (Code available at GitHub, using PyTorch).
Deep IV (Hartford et al., 2017), Deep IV: A flexible approach for estimating individual causal effects, also uses neural networks but may suffer from bias in high-dimensional settings; DML-IV’s use of cross-fitting and orthogonal scores provides better bias reduction than Deep IV. The Python library for Deep IV, built on Keras, is available at GitHub, highlighting practical implementation differences.
Chen et al. (2021), Debiased/Double Machine Learning for Instrumental Variable Quantile Regressions, demonstrates how DML techniques can be used for quantile treatment effects, revealing how the impact of a treatment varies across the distribution of the outcome. They further demonstrate the robustness of DML-IVQR through Monte Carlo experiments and an application to the 401(k) dataset, reporting detailed quantile treatment effects, standardized by the outcome SD, that vary across wealth quantiles.
Instrumental Variable Regression Trees (IV Trees) and Generalized Random Forests (GRF) allow localized treatment effect estimation, uncovering how the Local Average Treatment Effect (LATE) varies across different subpopulations. Understanding heterogeneous LATE effects has critical policy implications, which we will cover in the next chapter. Many real-world interventions—such as education subsidies, job training programs, or medical treatments—do not affect all individuals equally. ML-based IV methods can detect and prioritize subgroups where policy interventions are most effective, ensuring efficient allocation of resources. By systematically searching across covariate spaces, these methods allow policymakers to better target interventions.
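As a hedged sketch of this approach, the example below uses the grf package’s instrumental_forest() on simulated data; the data-generating process, variable names, and parameter values are illustrative assumptions rather than part of the text:
# Sketch: heterogeneous LATE estimation with generalized random forests (grf).
# Simulated data; names and coefficients are illustrative assumptions.
library(grf)
set.seed(123)
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p) # observed covariates
U <- rnorm(n) # unobserved confounder
Z <- rbinom(n, 1, 0.5) # binary instrument (e.g., random encouragement)
D <- rbinom(n, 1, plogis(-0.5 + 1.5 * Z + X[, 1] + U)) # endogenous treatment uptake
tau <- 1 + X[, 1] # treatment effect varies with the first covariate
Y <- tau * D + X[, 2] + 0.8 * U + rnorm(n) # outcome with unobserved confounding
iv_forest <- instrumental_forest(X, Y, D, Z) # forest-based IV (treatment, then instrument)
tau_hat <- predict(iv_forest)$predictions # covariate-specific (conditional) LATE estimates
summary(tau_hat) # inspect heterogeneity in the estimated effects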
Several software tools support machine learning-based instrumental variable (IV) methods across different programming environments. In R, the DoubleML package (documentation) implements double machine learning (DML-IV), while AER::ivreg() provides classical IV estimation, and grf supports random forest-based IV. In Stata, ddml (documentation) and crhdreg (Yuya Sasaki’s crhdreg) implement DML-IV with LASSO. Standard IV methods are available via ivregress 2sls, while ivlasso supports LASSO-based IV estimation, and cfreg links to Chernozhukov’s DML-IV framework.
In Python, the econml package (documentation) includes DMLIV, DeepIV, and random forest-based IV estimators. The DeepIV approach (Keras-based) is available on GitHub, while DML-IV (PyTorch) is at GitHub. For Julia, the IVModels.jl package provides IV estimators for scalable computation. These tools enable efficient implementation of IV methods, from classical econometric techniques to modern machine learning approaches.
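As a rough sketch of how the R workflow with the DoubleML package might look for a partially linear IV model with random forest learners (the argument names ml_l, ml_m, and ml_r follow recent DoubleML releases and should be checked against the installed version’s documentation; Y, D, Z, and X are assumed to be defined as in the simulations above):
# Sketch: DML-IV with the DoubleML R package (partially linear IV model).
# Assumes Y, D, Z, X exist as in the earlier simulations; check argument names
# against the installed DoubleML version.
library(DoubleML)
library(mlr3)
library(mlr3learners)
dml_data <- double_ml_data_from_matrix(X = X, y = as.numeric(Y), d = D, z = Z)
learner <- lrn("regr.ranger") # random forest learner (requires the ranger package)
dml_pliv <- DoubleMLPLIV$new(dml_data,
                             ml_l = learner, # nuisance model for E[Y | X]
                             ml_m = learner, # nuisance model for E[Z | X]
                             ml_r = learner, # nuisance model for E[D | X]
                             n_folds = 5)
dml_pliv$fit()
dml_pliv$summary() # estimate, SE, and confidence interval for the causal effect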
ML strengthens IV estimation by capturing nonlinearities, selecting relevant instruments, and identifying local average treatment effects and heterogeneous effects, but valid inference requires careful adjustment. Cross-fitting, orthogonality conditions, and robust standard errors are essential to prevent overfitting and bias. As empirical economics embraces big data, combining ML with IV methods provides deeper causal insights—provided that the core IV assumptions (relevance, exogeneity, exclusion restriction) remain rigorously defended.
22.4 Technical: Connecting IV to Potential Outcomes and 2SLS
We begin by defining the framework for using an instrument in the presence of noncompliance. For each unit \(i\), we consider a binary instrument \(Z_i\) defined as follows:
\(Z_i = 1\) if unit \(i\) is assigned to the treatment group,
\(Z_i = 0\) if unit \(i\) is assigned to the control group.
Now, let’s define \(D_i(z)\), the potential treatment status of unit \(i\) (either “take” or “not take” the treatment) given \(Z_i=z\). Potential treatment statuses matter because not everyone complies with their assignment in a randomized controlled trial (A/B test): under perfect compliance, everyone assigned to the treatment group would receive the treatment and everyone assigned to the control group would not. Specifically, there are four possible values of the potential treatment statuses for each unit \(i\):
When \(Z_i=1\): either \(D_i(1)=1\), the unit takes the treatment when assigned to the treatment group, or \(D_i(1)=0\), the unit does not take the treatment despite being assigned to the treatment group.
When \(Z_i=0\): either \(D_i(0)=0\), the unit does not take the treatment when assigned to the control group, or \(D_i(0)=1\), the unit takes the treatment although it was assigned to the control group.
Assigning units to treatment or control groups (by the experimenter) does not mean each unit will actually comply. During the experiment there may be noncompliance, a situation in which individuals fail to follow their assigned treatments or treatments are not perfectly implemented. Thus the observed treatment \(D_i\) is related to the potential treatments \(D_i(1)\) and \(D_i(0)\) by:
\[ D_i = Z_i \cdot D_i(1) + (1 - Z_i) \cdot D_i(0) \]
This equation simply states that if a unit is assigned to the treatment group (\(Z_i=1\)), we observe \(D_i = D_i(1)\). However, we cannot observe what would have happened—whether the unit would have taken the treatment or not—had it been assigned to the control group instead. Similarly, if the unit is assigned to the control group (\(Z_i=0\)), we observe \(D_i = D_i(0)\).
Based on the values of the potential treatment statuses \(D_i(1)\) and \(D_i(0)\), units are classified into four types:
Compliers: Units for which \(D_i(1)=1\) and \(D_i(0)=0\) (i.e., \(D_i(1) > D_i(0)\)); they take the treatment when assigned to the treatment group and do not take it when assigned to the control group.
Always-takers: Units for which \(D_i(1)=D_i(0)=1\); they take the treatment regardless of whether they are assigned to the treatment or control group.
Never-takers: Units for which \(D_i(1)=D_i(0)=0\); they do not take the treatment regardless of assignment.
Non-compliers/Defiers: Units for which \(D_i(1)=0\) and \(D_i(0)=1\); these units act contrary to the assignment (i.e., they take the treatment when not assigned to it and vice versa).
Because we only observe one of \(D_i(1)\) or \(D_i(0)\) for each individual (the fundamental problem of causal inference), we cannot directly determine each unit’s compliance type. However, under certain assumptions we can identify the proportion of each type in the RCT sample.
Assumptions for Estimating Compliance Proportions
To recover the causal effect among compliers, we require several key assumptions:
- Independence of the Instrument:
\[ (Y_i(0), Y_i(1), D_i(0), D_i(1)) \perp Z_i \]
This means that the instrument is as good as randomly assigned, so that its effect on both outcomes and treatment statuses is not confounded by other variables.
- First-Stage (Relevance):
The instrument must induce variation in the treatment:
\[ 0 < P(Z_i=1) < 1 \quad \text{and} \quad P(D_i(1)=1) \neq P(D_i(0)=1) \]
In other words, there must be some compliers for the instrument to be useful.
- Monotonicity:
We assume that
\[ D_i(1) \ge D_i(0) \quad \text{for all } i \]
This eliminates defiers (thus \(\pi_D=0\)), meaning no one systematically does the opposite of what the instrument assigns.
Under these assumptions, we can estimate the proportions of each type. For example:
- The proportion of always-takers is given by:
\[ \pi_A = E[D_i \mid Z_i = 0] \]
since among those not encouraged, any unit taking the treatment must be an always-taker (assuming monotonicity, i.e., no defiers).
- The proportion of compliers is:
\[ \pi_C = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0] \]
- The proportion of never-takers is:
\[ \pi_N = 1 - E[D_i \mid Z_i = 1] \]
With these proportions in hand—and with one further assumption (the exclusion restriction, which we describe next)—we can move from the intention-to-treat (ITT) effect to a causal effect for compliers.
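A minimal numerical sketch of the proportion formulas above (with simulated compliance types; the variable names, sample size, and type shares are illustrative assumptions):
# Sketch: estimating compliance proportions from observed (D, Z) under monotonicity.
# Simulated data; names and shares are illustrative assumptions.
set.seed(123)
n <- 100000
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.5, 0.2, 0.3)) # true shares; no defiers (monotonicity)
Z <- rbinom(n, 1, 0.5) # randomized binary instrument
D1 <- as.numeric(type != "never") # D_i(1): all but never-takers take treatment if assigned
D0 <- as.numeric(type == "always") # D_i(0): only always-takers take treatment if not assigned
D <- Z * D1 + (1 - Z) * D0 # observed treatment
pi_A <- mean(D[Z == 0]) # E[D | Z = 0] -> share of always-takers (~0.20)
pi_C <- mean(D[Z == 1]) - mean(D[Z == 0]) # first-stage difference -> share of compliers (~0.50)
pi_N <- 1 - mean(D[Z == 1]) # 1 - E[D | Z = 1] -> share of never-takers (~0.30)
c(pi_A = pi_A, pi_C = pi_C, pi_N = pi_N)
With a randomized instrument and no defiers, the sample analogues recover the true shares used in the simulation.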
From ITT to the Local Average Treatment Effect (LATE)
The exclusion restriction assumes that the instrument affects the outcome only through its effect on the treatment. Formally, for each unit \(i\):
\[ Y_i(D, Z=1) = Y_i(D, Z=0) \quad \text{for all}\quad D = 0, 1 \]
This means that, apart from its effect on the treatment received, the instrument has no direct effect on the outcome.
Under these assumptions, the Local Average Treatment Effect (LATE) is defined as the average causal effect of the treatment among compliers:
\[ \tau_{\text{LATE}} = E[Y_i(1) - Y_i(0) \mid D_i(1) > D_i(0)] \]
Because we cannot observe compliance types directly, LATE is estimated using the ratio of the difference in mean outcomes by instrument value to the difference in mean treatment uptake by instrument value. That is,
\[ \tau_{\text{LATE}} = \frac{E[Y_i \mid Z_i=1] - E[Y_i \mid Z_i=0]}{E[D_i \mid Z_i=1] - E[D_i \mid Z_i=0]} \]
An alternative way to express this estimator, when the instrument \(Z_i\) is binary, is by using a ratio-of-coefficients (or covariances) representation:
\[ \hat{\beta}_{1,\text{IV}} = \frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(D_i, Z_i)} \]
Since for a binary \(Z_i\) the difference in conditional means is equivalent to the covariance (up to a scaling factor), these two expressions are mathematically identical.
To see the connection verbally: the numerator, \(E[Y_i \mid Z_i=1] - E[Y_i \mid Z_i=0]\), is the Intention-to-Treat (ITT) effect, the effect of the instrument on the outcome, while the denominator, \(E[D_i \mid Z_i=1] - E[D_i \mid Z_i=0]\), reflects how much the instrument shifts the probability of receiving the treatment. Their ratio isolates the effect of the treatment on the outcome for the specific group called compliers, whose treatment status changes with the assignment.
Expressing ITT as a Weighted Average
It is also informative to note that the overall ITT effect can be viewed as a weighted average of the causal effects for different subgroups:
\[ \text{ITT} = \tau_{\text{LATE}} \cdot \pi_C + \tau_A \cdot \pi_A + \tau_N \cdot \pi_N \]
where \(\tau_A\) and \(\tau_N\) are the effects of the instrument on the outcome for always-takers and never-takers, respectively. Under the exclusion restriction, the instrument has no effect on the outcome for always-takers and never-takers (their treatment status does not change with \(Z_i\)), so their contributions drop out, leaving:
\[ \text{ITT} = \tau_{\text{LATE}} \cdot \pi_C \]
Thus, rearranging gives the LATE as:
\[ \tau_{\text{LATE}} = \frac{\text{ITT}}{\pi_C} = \frac{E[Y_i \mid Z_i=1] - E[Y_i \mid Z_i=0]}{E[D_i \mid Z_i=1] - E[D_i \mid Z_i=0]} \]
This detailed discussion—incorporating definitions, potential outcomes, assumptions (independence, first stage, monotonicity, exclusion restriction), and the derivation of LATE both as a ratio of differences in means and via the ratio-of-covariances approach—provides a comprehensive overview of the identification of causal effects using instrumental variables. The derivation shows that the denominator of LATE corresponds to the proportion of compliers in the population. This also highlights why weak instruments (small denominator) lead to imprecise LATE estimates, as dividing by a small \(\pi_C\) amplifies statistical noise.
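To make the identification result concrete, here is a short, self-contained R sketch (simulated data; all names and parameter values are illustrative assumptions) showing that the Wald ratio ITT\(/\pi_C\) recovers the complier effect and coincides with the 2SLS estimate from ivreg() for a binary instrument:
# Sketch: LATE as ITT / pi_C (Wald estimator) versus 2SLS; simulated, illustrative values.
library(AER) # for ivreg()
set.seed(123)
n <- 100000
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.5, 0.2, 0.3)) # no defiers (monotonicity)
Z <- rbinom(n, 1, 0.5) # randomized binary instrument
D <- Z * as.numeric(type != "never") + (1 - Z) * as.numeric(type == "always") # observed treatment
tau <- ifelse(type == "complier", 2, 0.5) # complier effect is 2
U <- rnorm(n) + as.numeric(type == "always") # always-takers differ in baseline outcomes (selection)
Y <- tau * D + U + rnorm(n) # exclusion restriction: Z affects Y only through D
itt <- mean(Y[Z == 1]) - mean(Y[Z == 0]) # intention-to-treat effect
pi_C <- mean(D[Z == 1]) - mean(D[Z == 0]) # complier share (first stage)
late_wald <- itt / pi_C # Wald estimator, ~2 (the complier effect)
late_2sls <- unname(coef(ivreg(Y ~ D | Z))["D"]) # identical point estimate via 2SLS
c(ITT = itt, pi_C = pi_C, LATE_Wald = late_wald, LATE_2SLS = late_2sls)
Since only compliers change treatment status with \(Z_i\), the ITT equals \(\tau_{\text{LATE}} \cdot \pi_C\), and the ratio recovers the complier effect of 2 used in the simulation.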
When the instrument \(Z_i\) is binary, this simplifies to the Wald estimator: \[ \hat{\beta}_{1,\text{IV}} = \frac{\bar{Y}_1 - \bar{Y}_0}{\bar{D}_1 - \bar{D}_0} \] where \(\bar{Y}_1\) and \(\bar{Y}_0\) are the sample means of the outcome for the two groups defined by \(Z_i\), and \(\bar{D}_1\) and \(\bar{D}_0\) are the corresponding means of the treatment variable.↩︎
The first is the indirect least squares (ILS) approach, which involves estimating two reduced-form regressions: \[ Y_i = c_{10} + c_{11} Z_i + u_{1i} \] \[ D_i = c_{20} + c_{21} Z_i + u_{2i} \] The ILS estimator is then obtained as \[ \hat{\beta}_{1,\text{ILS}} = \frac{\hat{c}_{11}}{\hat{c}_{21}} \] This ratio of reduced-form coefficients can be interpreted as the ratio of the intent-to-treat (ITT) estimates in a randomized trial with a binary treatment.↩︎
In the indirect least squares (ILS) approach: with covariates, the reduced-form regressions become \[ Y_i = c_{10} + c_{11} Z_i + c_{1X}' X_i + u_{1i} \] \[ D_i = c_{20} + c_{21} Z_i + c_{2X}' X_i + u_{2i} \] and the corresponding ILS estimator remains \(\hat{\beta}_{1,\text{ILS}} = \hat{c}_{11} / \hat{c}_{21}\).↩︎
An alternative way to express the LATE Wald-estimator is through a ratio of coefficients approach. The first-stage equation models the relationship between the endogenous treatment \(D_i\) and the instrument \(Z_i\): \[ D_i = \alpha_1 + \beta_1 Z_i + \epsilon_{1i} \] The reduced-form equation captures how the instrument affects the outcome \(Y_i\): \[ Y_i = \alpha_2 + \beta_2 Z_i + \epsilon_{2i} \] By taking the ratio of the two coefficients, we obtain the LATE estimator: \[ \delta_{\text{LATE}} = \frac{\hat{\beta}_2}{\hat{\beta}_1} = \frac{\text{Effect of } Z_i \text{ on } Y_i}{\text{Effect of } Z_i \text{ on } D_i} \] This ratio-based interpretation aligns with the previous derivation of LATE as the causal effect of treatment on the outcome for compliers. Thus, \(\hat{\beta}_1\) captures the first-stage effect of the instrument \(Z_i\) on treatment \(D_i\), while \(\hat{\beta}_2\) represents the reduced-form effect of the instrument on the outcome \(Y_i\). The LATE estimator, given by the ratio \(\hat{\beta}_2 / \hat{\beta}_1\), isolates the causal effect of treatment on the outcome for compliers—those who take the treatment only when assigned to it. As seen from the equation, if \(\beta_1 = 0\), then the effect of \(Z\) in the reduced-form equation must also be zero. If the first-stage coefficient \(\beta_1\) is zero (or very small) while the reduced-form effect \(\beta_2\) is significant, this implies that the instrument is affecting the outcome through a channel other than the treatment. This would be a violation of the exclusion restriction.↩︎
To assess whether an instrument is weak, we compare two models. The full first-stage model includes the instrument \(Z_i\) and covariates \(X_i\): \(D_i = \alpha_1 + \beta_Z Z_i + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_{1i}\); the restricted model excludes the instrument: \(D_i = \alpha_1 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_{1i}\). Using an F-test, we check whether adding \(Z_i\) significantly improves the fit: \[ F = \frac{(R^2_{\text{full}} - R^2_{\text{restricted}}) / q}{(1 - R^2_{\text{full}}) / (n - k_{\text{full}} - 1)} \] where \(q\) is the number of excluded instruments (here \(q=1\)) and \(k_{\text{full}}\) is the number of regressors in the full model. Rule of thumb: if \(F \geq 10\), the instrument is sufficiently strong.↩︎
LIML and Fuller estimators provide alternatives to 2SLS in the presence of weak instruments, reducing bias and improving inference, particularly in small samples. LIML, derived from a likelihood-based framework, is less biased than 2SLS, especially when multiple instruments are used. The Fuller estimator, a modification of LIML (Fuller, 1977), applies a shrinkage factor to further correct small-sample bias, improving efficiency. These estimators are frequently used in health economics and epidemiology, particularly in Mendelian randomization and policy-based IV studies (e.g., Medicaid expansions, hospital admissions) where instrument strength may be weak. While LIML is less sensitive to over-identification bias, its computational demands make it less common than 2SLS. Readers interested in a detailed discussion of these methods are encouraged to consult Stock & Yogo (2005) for weak instrument diagnostics and the relative performance of LIML and Fuller estimators.↩︎