Preface

The book will be available at this link. Pre-orders open on December 1, 2025.

Why read this book

This book is designed to fill a crucial gap at the intersection of econometrics, machine learning, and causal inference for applied economists and social and health scientists. While there are excellent texts in each of these areas, few books provide a unified, conceptually coherent, and practically grounded guide that connects classical statistical reasoning with modern machine learning tools, especially in the context of health, economics, and social science research.

Many existing resources on machine learning and causal inference are highly technical, often dense with notation, and filled with field-specific jargon. They may assume a deep familiarity with underlying concepts, reducing foundational explanations to just a few sentences or paragraphs. In contrast, we have deliberately chosen to explain even the most basic concepts in detail, especially in the initial chapters. If you are already comfortable with these topics, feel free to skip ahead. However, our experience has shown that many beginners lack clarity on these foundational ideas, and we believe a thorough understanding is essential for progressing to more advanced methods. Additionally, while most texts focus on using pre-built packages and discussing results, we go a step further: we often show in detail what these programs are doing behind the scenes, from raw data to algorithmic decisions. Whether you are exploring established methods or the latest approaches from papers published in 2025, this book is designed to be a comprehensive learning companion.

Our motivation for writing this book stems from years of working at the intersection of econometrics and machine learning. We found that while machine learning provides powerful tools, its application in causal research is often misunderstood. Many researchers hesitate to adopt these methods due to concerns about interpretability and identification. With this book, we aim to bridge that gap—providing clear, practical guidance for applying machine learning in a way that complements traditional econometric approaches.

What sets this book apart is its layered approach to both teaching and understanding modeling. It builds from first principles, starting with foundational topics like estimation, model error, and bias-variance trade-offs, and walks the reader forward into high-dimensional predictive models, algorithmic tuning, and cutting-edge methods for causal inference using machine learning. Along the way, it maintains a careful balance between theoretical clarity and practical implementation: each method is not only explained but also demonstrated through simulation, visualization, applied examples, and case studies, so that readers gain practical experience alongside a solid theoretical foundation.

Simulation plays a central role throughout the book. For every concept—from OLS bias to double machine learning—we show how estimators behave when the truth is known. This simulation-based pedagogy allows readers to build intuition, evaluate estimator performance, and visually understand identification failures and model misspecification. This is not commonly emphasized in econometrics texts, and it makes the learning process more transparent and engaging.
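
To give a flavor of this approach, here is a minimal sketch of a "truth is known" simulation for OLS omitted variable bias. It is only an illustration under assumptions we chose for this example: the data-generating process, the parameter values (beta, gamma, rho), and the function name simulate_ols_slopes are hypothetical and are not taken from the chapters of the book.

```python
import numpy as np


def simulate_ols_slopes(n=500, reps=500, beta=2.0, gamma=1.5, rho=0.6, seed=0):
    """Monte Carlo comparison of the OLS slope on x with and without z.

    Assumed data-generating process (illustrative only):
        z ~ N(0, 1)
        x = rho * z + noise
        y = beta * x + gamma * z + noise
    Returns the average slope on x from the short regression (z omitted)
    and from the long regression (z included).
    """
    rng = np.random.default_rng(seed)
    short = np.empty(reps)
    long_ = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        x = rho * z + rng.normal(size=n)  # x is correlated with z
        y = beta * x + gamma * z + rng.normal(size=n)

        X_short = np.column_stack([np.ones(n), x])     # omits the confounder
        X_long = np.column_stack([np.ones(n), x, z])   # includes the confounder

        short[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
        long_[r] = np.linalg.lstsq(X_long, y, rcond=None)[0][1]
    return short.mean(), long_.mean()


short_mean, long_mean = simulate_ols_slopes()

# Under this design the omitted variable bias is
# gamma * Cov(x, z) / Var(x) = gamma * rho / (rho**2 + 1), about 0.66 here.
theoretical_bias = 1.5 * 0.6 / (0.6**2 + 1.0)

print(f"true beta:             {2.0:.3f}")
print(f"mean slope, z omitted: {short_mean:.3f}")
print(f"mean slope, z included:{long_mean:.3f}")
print(f"theoretical bias:      {theoretical_bias:.3f}")
```

Because the true coefficient is fixed by construction, the gap between the short-regression average and beta can be compared directly with the textbook bias formula, which is the kind of check that is impossible with real data.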

Another distinguishing feature is the book’s integration of predictive and causal goals, rather than treating them as opposing paradigms. We show how predictive tools (e.g., penalized regression, trees, boosting) are used not just for forecasting, but also as core components of modern causal inference—particularly in estimating heterogeneous treatment effects, handling selection bias, and conducting robust policy evaluation. Rather than separating statistical modeling and machine learning into silos, we provide a framework that connects them, rooted in both theory and empirical needs.

We also place special emphasis on interpretability, identification, and transparency, especially for readers in economics, public health, and policy fields. Throughout, we draw attention to assumptions, estimation targets, and evaluation strategies. This helps bridge the gap between abstract modeling and applied credibility.

Finally, the book includes several chapters that are rarely covered in causal ML texts aimed at economists and social scientists—including causal inference for time series, deep learning for policy problems, causal feature engineering, and text-based causal analysis. These additions reflect both emerging research needs and the evolving nature of empirical data in the social sciences.

In short, this book is not just a technical manual or a high-level overview. It is a comprehensive, practical, and intellectually coherent guide for students, researchers, and practitioners who want to understand and apply modern modeling strategies with confidence, clarity, and rigor. By the end of this book, readers will be able to critically evaluate machine learning methods, apply them to empirical problems, and interpret results in a credible and transparent manner. Whether it’s estimating the effect of a policy intervention or forecasting economic trends, readers will gain both the theoretical understanding and practical skills needed to conduct rigorous, data-driven research.

Structure of the book

This book is structured to guide readers from core modeling ideas to advanced machine learning and causal inference techniques, always anchored in practical applications from economics, health, and the social sciences. We begin with fundamental statistical and causal reasoning, gradually progressing to high-dimensional models, algorithmic tools, and real-world policy applications. The emphasis throughout is on both understanding the mechanics of methods and implementing them thoughtfully in applied work.

The first few chapters introduce the building blocks of data-driven analysis—how we move from ideas to formal models, distinguish estimation from prediction, and build learning systems that generalize well. Core concepts such as correlation, regression, error decomposition, overfitting, and the bias-variance trade-off are developed with simulations and clear explanations. These early chapters form the conceptual foundation for later techniques.

The next group of chapters covers classical and modern estimation and prediction strategies. We begin with parametric and nonparametric estimators, then transition to topics like hyperparameter tuning, classification models, and model selection. This section includes an extended treatment of penalized regression methods—Ridge, Lasso, Elastic Net, and Adaptive Lasso—and ensemble methods such as bagging, random forests, and boosting. Each method is developed from the ground up with hands-on simulations and technical derivations.

From Chapter 16 onward, the book turns to causal inference. We begin with the Rubin Causal Model and the logic of counterfactual comparisons, followed by the design and analysis of randomized controlled trials, including regression adjustment, randomization inference, and covariate balance checks. We then cover causal inference under selection on observables, introducing regression adjustment, matching, and inverse probability weighting (IPW). These chapters build toward doubly robust methods—particularly Augmented IPW (AIPW)—that combine modeling and weighting strategies for more reliable estimation. All methods are demonstrated through intuitive illustrations and simulation-based exercises, and many include machine learning implementations, such as XGBoost-assisted matching or Lasso-based adjustment.
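
As a rough preview of the estimators named above, the following sketch compares a naive difference in means, IPW, and AIPW on simulated data with a known treatment effect. Everything specific here is an assumption made for illustration: the data-generating process, the true effect tau = 1, the helper names fit_logit and fit_ols, and the use of plain logistic and linear nuisance models instead of the machine learning learners the book discusses.

```python
import numpy as np


def fit_logit(X, d, iters=25):
    """Logistic regression by Newton-Raphson; X must include an intercept."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (d - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        b += np.linalg.solve(hess, grad)
    return b


def fit_ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Simulated data: selection on an observable x (illustrative assumption).
rng = np.random.default_rng(42)
n, tau = 20_000, 1.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-0.8 * x))       # treatment more likely when x is high
d = rng.binomial(1, p_true)
y = tau * d + 2.0 * x + rng.normal(size=n)    # x also raises the outcome

# Naive comparison ignores confounding by x.
naive = y[d == 1].mean() - y[d == 0].mean()

# IPW: reweight by the estimated propensity score.
e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, d)))
ipw = np.mean(d * y / e_hat - (1 - d) * y / (1 - e_hat))

# AIPW: outcome-model predictions plus a weighted residual correction.
m1 = X @ fit_ols(X[d == 1], y[d == 1])        # predicted outcome if treated
m0 = X @ fit_ols(X[d == 0], y[d == 0])        # predicted outcome if untreated
aipw = np.mean(
    m1 - m0
    + d * (y - m1) / e_hat
    - (1 - d) * (y - m0) / (1 - e_hat)
)

print(f"true effect: {tau:.3f}")
print(f"naive:       {naive:.3f}")
print(f"IPW:         {ipw:.3f}")
print(f"AIPW:        {aipw:.3f}")
```

In this design the naive contrast is biased upward because treated units have higher x on average, while IPW and AIPW should land close to the true effect. Both nuisance models are correctly specified here, so the sketch does not by itself demonstrate the doubly robust property.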

After establishing tools for dealing with observed confounding, we turn to models that handle selection on unobservables. Chapter 22 introduces instrumental variables (IV) in both the potential outcomes framework and classical linear models. We connect the logic of IV to randomization, explain how to assess instrument validity, and develop the core 2SLS estimator before extending it to high-dimensional and nonlinear settings. The chapter also introduces Double Machine Learning for IV using Lasso and Random Forests, emphasizing the role of orthogonal scores, robustness to model misspecification, and finite sample performance. Simulations and algorithmic steps provide readers with the tools to implement DML-IV in applied research. We then explore how to estimate heterogeneous treatment effects using interaction models, causal forests, and meta-learners such as S-, T-, X-, R-, and DR-learners.
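
The logic of the 2SLS estimator mentioned above can be sketched in a few lines on simulated data. This is a hedged illustration, not the book's implementation: the data-generating process, the coefficient values, and the helper name ols are assumptions chosen for this example, and the instrument is valid by construction.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Simulated data with an unobserved confounder u (illustrative assumption).
rng = np.random.default_rng(7)
n, beta = 50_000, 1.0
z = rng.normal(size=n)                     # instrument: affects y only through x
u = rng.normal(size=n)                     # unobserved confounder
x = 0.7 * z + u + rng.normal(size=n)       # endogenous regressor
y = beta * x + 2.0 * u + rng.normal(size=n)

ones = np.ones(n)

# OLS of y on x is biased because x and the error term share u.
b_ols = ols(np.column_stack([ones, x]), y)[1]

# First stage: project x on the instrument.
Z = np.column_stack([ones, z])
x_hat = Z @ ols(Z, x)

# Second stage: regress y on the fitted values from the first stage.
b_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"true beta: {beta:.3f}")
print(f"OLS:       {b_ols:.3f}")
print(f"2SLS:      {b_2sls:.3f}")
```

Running the two stages by hand gives the correct point estimate, but the standard errors from the second-stage regression are not valid because they ignore that x_hat is estimated. Applied work should rely on a dedicated IV routine for inference.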

Chapters 26 and 27 introduce modern causal designs such as Difference-in-Differences and Regression Discontinuity Designs, including machine-learning extensions (e.g., DML-DiD, synthetic control, synthetic DiD). These chapters link traditional identification strategies with predictive methods.

The final part of the book focuses on advanced and emerging areas: causal inference for time series, neural networks and deep learning, matrix decomposition tools like PCA and factor analysis, and core optimization algorithms such as gradient descent. Together, these topics expand the book’s reach to high-dimensional, time-dependent, and unstructured data applications.

By organizing the book this way, we aim to support a wide range of readers—from those seeking clarity on core modeling logic to researchers implementing frontier methods. Throughout, examples, simulations, and applied illustrations reinforce each concept and provide a clear path from theory to practice.

Who can use this book?

This book is designed for a wide range of readers who work with data to answer empirical questions—particularly in the fields of economics, public health, policy, and the social sciences. It is accessible to learners with a basic background in statistics or econometrics and is intended to support both classroom instruction and self-guided learning.

  • Undergraduate and Master’s students in economics, data science, public policy, or health disciplines will benefit from the step-by-step development of modeling concepts, supported by simulation and intuitive explanations.

  • Ph.D. students and early-career researchers can use this book to bridge the gap between traditional econometrics and modern machine learning techniques, especially for empirical work focused on prediction, estimation, or causal inference.

  • Instructors and course designers can use the modular structure to build full-semester courses or topical modules—spanning estimation theory, machine learning, and causal inference with modern computational tools.

  • Industry professionals and government analysts who rely on data-driven decision-making in fields such as finance, healthcare, or policy evaluation will find practical insights for applying machine learning to real-world problems.

  • Readers with coding experience in R or Python will find the simulations, examples, and implementations easy to follow, though prior programming experience is not strictly required to understand the concepts.

Whether you are building predictive models, estimating treatment effects, or exploring new forms of data like text or time series, this book provides a comprehensive and flexible resource to support rigorous, transparent, and effective empirical research.

Acknowledgments

We would like to extend our heartfelt gratitude to our loved ones for their constant support during the creation of this book. Their unwavering belief in our abilities and vision has been invaluable, and we could not have reached this milestone without them.

Yigit is deeply grateful for the sacrifices Isik has made and for her steadfast encouragement throughout the pursuit of this dream. He is also thankful for the opportunity to share his passion for learning with his son, Ege.

Mutlu would like to express his profound thanks to his wife, Mevlude, whose love, patience, and understanding have been a constant source of strength and inspiration. He also extends his gratitude to his sons, Eren and Kaan, whose laughter, curiosity, and boundless energy have fueled his determination to work harder and build a lasting legacy.

We’re grateful to the colleagues and friends who read early drafts and offered thoughtful comments on specific chapters—your insights helped sharpen our arguments and improve the clarity of our writing. We also thank those who generously shared code, datasets, and lecture notes online; your openness made our work easier and this book possible.

Yigit Aydede
Mutlu Yuksel