7  Augmented Inverse Propensity Weighting

An alternative estimator is the augmented inverse probability weighted (AIPW) estimator. It combines the properties of the regression-based estimator and of the inverse probability weighted estimator, and it is a ‘doubly robust’ method in that it requires only one of the two models, the propensity model or the outcome model, to be correctly specified, not both. It relies on the same identification assumptions as the Inverse Probability Weighted estimator (IPW).

The AIPW estimator is particularly useful in observational studies with complex interactions and many confounders. The outcome-model term adjusts for confounding directly, while the inverse-probability-weighted residual term corrects the remaining bias of the outcome model; together they mitigate bias and typically yield more accurate estimates of treatment effects.

It is important to note that, while the AIPW estimator is a powerful tool for causal inference, double robustness only protects against misspecification of one of the two nuisance models: it still requires that there be no unmeasured confounders. Careful consideration of the model assumptions, together with sensitivity analyses, is therefore typical when using the AIPW estimator in practice.

Definition 7.1: Oracle Augmented Inverse Propensity Weighting estimator
We denote \(\hat \tau_{\text{AIPW}}^*\) the oracle AIPW estimator, in which all the nuisance functions are assumed to be known, such that: \[\begin{align*} \hat \tau_{\text{AIPW}}^* &= \frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}}\left(\mu_{(1)}(X_i) - \mu_{(0)}(X_i) + \frac{T_i(Y_i - \mu_{(1)}(X_i))}{e(X_i)}- \frac{(1 - T_i)(Y_i-\mu_{(0)}(X_i))}{1 - e(X_i)}\right), \end{align*}\] where \(\mu_{(t)}(X) = \mathbb{E}\left[Y \mid T=t,X\right]\).
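To make the definition concrete, here is a minimal Python sketch of the oracle estimator on simulated data; the data-generating process (binary covariate, true effect \(\tau = 2\)) and all function names are illustrative choices, not part of the text. Since the nuisances are plug-in arguments, the same function also illustrates the double-robustness claim: a deliberately misspecified outcome model combined with the true propensity score still targets \(\tau\).

```python
import random

random.seed(0)

def e(x):
    """True propensity score (toy choice)."""
    return 0.3 + 0.4 * x

def mu(t, x):
    """True outcome models mu_(t)(x) (toy choice), so tau = 2."""
    return (2.0 + x) if t == 1 else float(x)

def aipw(data, mu_model, e_model):
    """AIPW estimate of tau for given plug-in nuisance functions.

    With the true mu and e plugged in, this is the oracle estimator
    of Definition 7.1."""
    total = 0.0
    for x, t, y in data:
        total += (mu_model(1, x) - mu_model(0, x)
                  + t * (y - mu_model(1, x)) / e_model(x)
                  - (1 - t) * (y - mu_model(0, x)) / (1 - e_model(x)))
    return total / len(data)

# simulate n observations with a binary covariate
n = 50_000
data = []
for _ in range(n):
    x = random.randint(0, 1)
    t = 1 if random.random() < e(x) else 0
    data.append((x, t, mu(t, x) + random.gauss(0.0, 1.0)))

tau_oracle = aipw(data, mu, e)

# double robustness: a deliberately misspecified outcome model (identically
# zero) combined with the true propensity score still targets tau
tau_mu_wrong = aipw(data, lambda t, x: 0.0, e)
```

Both estimates land near \(\tau = 2\); the misspecified-outcome version is consistent but noisier, since it reduces to the plain IPW estimator.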

Proposition 7.1: Asymptotic properties of \(\hat \tau_{\text{AIPW}}^*\)
The oracle AIPW, denoted \(\hat \tau_{\text{AIPW}}^*\), is consistent, \[\hat \tau_{\text{AIPW}}^* \stackrel{p}{\longrightarrow} \tau, \]

and is an asymptotically normal estimator, that is,

\[\sqrt{n}\left(\hat \tau_{\text{AIPW}}^* - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0, V_{\text {\tiny AIPW }}^*\right),\]

where, \[V_{\text{AIPW}}^* =\mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)} \right]+\mathbb{E}\left[ \frac{(Y^{(0)}-\mu_0(X))^2}{1-e(X)} \right] + \operatorname{Var}[\mu_1(X) - \mu_0(X)].\]

Proof
This oracle doubly-robust estimator can be understood as an M-estimation problem, introducing a function \(\psi(\cdot)\) such that

\[\psi(X, T, Y, \boldsymbol{\theta})=\left(\begin{array}{l} T\frac{Y-\mu_1(X)}{e(X)} + \mu_1(X)-\theta_{0} \\ (1-T)\frac{Y-\mu_0(X)}{1-e(X)} + \mu_0(X)-\theta_{1} \\ \left(\theta_{0}-\theta_{1}\right)-\theta_{2} \end{array}\right).\]

Ensuring the condition \(\mathbb{E}[\psi(X, T, Y, \boldsymbol{\theta})] = \boldsymbol{0}\), where \(\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)\), gives, \[\begin{align*} \theta_0 &= \mathbb{E}\left[ T\frac{Y-\mu_1(X)}{e(X)} + \mu_1(X)\right] \\ &= \mathbb{E}\left[ \frac{TY}{e(X)}\right] - \mathbb{E}\left[ \frac{T\mu_1(X)}{e(X)}\right] + \mathbb{E}\left[\mu_1(X)\right] \\ &= \mathbb{E}\left[ \mathbb{E}\left[\frac{TY}{e(X)} \,\Big|\, X \right]\right] - \mathbb{E}\left[ \frac{\mathbb{E}[T \mid X]}{e(X)}\,\mu_1(X)\right] + \mathbb{E}\left[\mu_1(X)\right] && \text{Total expectation} \\ &= \mathbb{E}\left[ \frac{e(X)\,\mathbb{E}[Y \mid X, T=1]}{e(X)}\right] - \mathbb{E}\left[\mu_1(X)\right] + \mathbb{E}\left[\mu_1(X)\right] && \text{Definition of $e(X)$} \\ &= \mathbb{E}\left[\mu_1(X)\right] \\ &= \mathbb{E}\left[Y^{(1)}\right], && \text{Unconf. \& consistency} \end{align*}\]

and similarly \[\theta_1 = \mathbb{E}\left[\mu_0(X)\right] = \mathbb{E}\left[Y^{(0)}\right],\] so that \[\theta_2 = \tau.\]

Differentiating \(\psi(\cdot)\) with respect to \(\boldsymbol{\theta}\) gives,

\[\frac{\partial \psi}{\partial \boldsymbol{\theta}} (X, T, Y, \boldsymbol{\theta})=\left(\begin{array}{ccc} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 1 & -1 & -1 \end{array}\right) \quad \Rightarrow \quad A\left(\boldsymbol{\theta} \right)= \left(\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 1 & 1 \end{array}\right).\]

In particular,

\[A^{-1}\left(\boldsymbol{\theta} \right)= \left(\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & -1 & 1 \end{array}\right)\]

\[\left(A^{-1}\left(\boldsymbol{\theta} \right)\right)^{T}= \left(\begin{array}{ccc} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{array}\right)\]

On the other hand, denoting

\[m_1(X,T,Y) := T\frac{Y-\mu_1(X)}{e(X)} + \mu_1(X),\] and \[m_0(X,T,Y) := (1-T)\frac{Y-\mu_0(X)}{1-e(X)} + \mu_0(X),\]

one can obtain \[\mathbb{E}[\psi \cdot \psi^T] =\left(\begin{array}{ccc} \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)^{2}] & \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)\left(m_0(X,T,Y)-\theta_{1}\right)] & 0 \\ \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)\left(m_0(X,T,Y)-\theta_{1}\right)] & \mathbb{E}[\left(m_0(X,T,Y)-\theta_{1}\right)^{2}] & 0 \\ 0 & 0 & 0 \end{array}\right).\]

Computing \(V = A^{-1}\, \mathbb{E}[\psi \cdot \psi^T]\, \left(A^{-1}\right)^{T}\), the asymptotic variance of the estimator of \(\theta_2\) is the \((3,3)\) entry,

\[\begin{equation*} V_{3,3} = \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)^{2}] - 2\, \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)\left(m_0(X,T,Y)-\theta_{1}\right)] + \mathbb{E}[\left(m_0(X,T,Y)-\theta_{1}\right)^{2}]. \end{equation*}\]

This expression can be further developed noting that,

\[\begin{equation*} \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)^{2}] = \mathbb{E}\left[ \left(T\frac{Y-\mu_1(X)}{e(X)} \right)^2 \right]+ 2\,\mathbb{E}[(T\frac{Y-\mu_1(X)}{e(X)})\,(\mu_1(X) - \theta_0)] + \mathbb{E}[(\mu_1(X)- \theta_0)^2]. \end{equation*}\] Using Consistency, Unconfoundedness, and the definition of \(\mu_1(X) = \mathbb{E}[Y \mid X, T=1]\), \[\begin{align*} \mathbb{E}\left[ \left(T\frac{Y-\mu_1(X)}{e(X)} \right)^2 \right] &= \mathbb{E}\left[ \left(T\frac{Y^{(1)}-\mu_1(X)}{e(X)} \right)^2 \right] && \text{Consistency} \\ &=\mathbb{E}\left[ \mathbb{E}\left[\left(T\frac{Y^{(1)}-\mu_1(X)}{e(X)} \right)^2 \mid X \right]\right] && \text{Total expectation}\\ &= \mathbb{E}\left[ \mathbb{E}\left[T\left(\frac{Y^{(1)}-\mu_1(X)}{e(X)} \right)^2 \mid X \right]\right] && \text{$T$ is binary}\\ &= \mathbb{E}\left[ \mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}}\left(\frac{Y^{(1)}-\mu_1(X)}{e(X)} \right)^2 \mid X \right]\right] && \text{$T$ written as an indicator} \\ &= \mathbb{E}\left[ \frac{1}{e(X)^2} \mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}}\left(Y^{(1)}-\mu_1(X)\right)^2 \mid X \right]\right] && \text{$e(X)$ is a function of $X$} \\ &= \mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)^2} \mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}}\mid X \right]\right] && \text{Unconf. \& $\mu_1(.)$ is func. of $X$} \\ &= \mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)^2} e(X)\right] && \text{Definition of $e(X)$} \\ &= \mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)} \right], \\ \end{align*}\]

The middle (cross) term vanishes:

\[\begin{align*} 2\,\mathbb{E}\left[T\frac{Y-\mu_1(X)}{e(X)}\,(\mu_1(X) - \theta_0)\right] &= 2\,\mathbb{E}\left[\mathbb{E}\left[T\frac{Y-\mu_1(X)}{e(X)} \,\Big|\, X\right](\mu_1(X) - \theta_0)\right] && \text{$\mu_1(X) - \theta_0$ is a function of $X$} \\ &= 2\,\mathbb{E}\left[\frac{e(X)\,\mathbb{E}\left[Y-\mu_1(X) \mid X, T=1\right]}{e(X)}\,(\mu_1(X) - \theta_0)\right] && \text{Definition of $e(X)$} \\ &= 2\,\mathbb{E}\left[\left(\mathbb{E}\left[Y \mid X, T=1\right]-\mu_1(X)\right)(\mu_1(X) - \theta_0)\right] \\ &= 0. && \text{Definition of $\mu_1(X)$} \end{align*}\]

Finally, \[\begin{equation*} \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)^{2}] = \mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2}{e(X)} \right] + \mathbb{E}[(\mu_1(X)- \theta_0)^2]. \end{equation*}\]

On the other side, and similarly, one can show that \[\begin{equation*} \mathbb{E}[\left(m_0(X,T,Y)-\theta_{1}\right)^{2}] = \mathbb{E}\left[\frac{(Y^{(0)}-\mu_0(X))^2}{1-e(X)} \right] + \mathbb{E}[(\mu_0(X)- \theta_1)^2]. \end{equation*}\]

The cross term of \(V_{3,3}\) can be handled with the same conditioning arguments: since \(T(1-T) = 0\), the product of the two inverse-propensity-weighted residual terms vanishes, and the two mixed terms have zero expectation, so that

\[\begin{equation*} \mathbb{E}[\left(m_1(X,T,Y)-\theta_{0}\right)\left(m_0(X,T,Y)-\theta_{1}\right)] = \mathbb{E}[(\mu_1(X)- \theta_0)(\mu_0(X)- \theta_1)] = \operatorname{Cov}[\mu_1(X), \mu_0(X)]. \end{equation*}\]

Standard M-estimation theory then gives \(\hat \tau_{\text{AIPW}}^* \stackrel{p}{\longrightarrow} \tau\) and \(\sqrt{n}\left(\hat \tau_{\text{AIPW}}^* - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0, V_{\text{AIPW}}^*\right)\), where, collecting these terms,

\[\begin{equation*} V_{\text{AIPW}}^* =\mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)} \right]+\mathbb{E}\left[ \frac{(Y^{(0)}-\mu_0(X))^2}{1-e(X)} \right] + \mathbb{E}[(\mu_1(X)- \theta_0)^2] + \mathbb{E}[(\mu_0(X)- \theta_1)^2] - 2\operatorname{Cov}[\mu_1(X), \mu_0(X)]. \end{equation*}\]

The last three terms can be written more compactly: since \(\theta_0 = \mathbb{E}[\mu_1(X)]\) and \(\theta_1 = \mathbb{E}[\mu_0(X)]\),

\[\begin{equation*} \operatorname{Var}[\mu_1(X) - \mu_0(X)] = \mathbb{E}[(\mu_1(X)- \theta_0)^2] + \mathbb{E}[(\mu_0(X)- \theta_1)^2] - 2\operatorname{Cov}[\mu_1(X), \mu_0(X)]. \end{equation*}\]

Finally,

\[\begin{equation*} V_{\text{AIPW}}^* =\mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)} \right]+\mathbb{E}\left[ \frac{(Y^{(0)}-\mu_0(X))^2}{1-e(X)} \right] + \operatorname{Var}[\mu_1(X) - \mu_0(X)]. \end{equation*}\] Another proof of consistency: the consistency of the oracle AIPW estimator follows directly from the weak law of large numbers. Denoting \[\begin{equation*} Z_i = \mu_{(1)}(X_i) - \mu_{(0)}(X_i) + \frac{T_i(Y_i - \mu_{(1)}(X_i))}{e(X_i)}- \frac{(1 - T_i)(Y_i-\mu_{(0)}(X_i))}{1 - e(X_i)}, \end{equation*}\] the variables \(Z_1, Z_2, \dots, Z_n\) are i.i.d. with finite mean \(\mathbb{E}[Z] = \tau\), so the weak law of large numbers gives \[\begin{equation*} \bar{Z} \stackrel{p}{\longrightarrow} \tau \quad \text { as } n \rightarrow \infty. \end{equation*}\]

This ensures the consistency of the oracle AIPW estimator.

Asymptotic normality: the sequence \(\left\{Z_{1}, Z_{2}, \ldots, Z_{n}\right\}\) is i.i.d. with mean \(\tau\) and variance \(V_{\text{\tiny AIPW}}^*\). The central limit theorem then ensures that the sample average \(\bar{Z}_{n}\), which is exactly the oracle AIPW estimator, is asymptotically normal with mean \(\tau\) and variance \(\frac{V_{\text{\tiny AIPW}}^*}{n}\); that is,

\[\begin{equation*} \sqrt{n}\left(\hat{\tau}_{\text{\tiny AIPW}}^* - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left( 0,V_{\text{\tiny AIPW}}^* \right). \end{equation*}\]
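Both statements can be checked numerically. The sketch below assumes a toy data-generating process (binary covariate, unit-variance Gaussian noise, constant effect \(\tau = 2\)); for that design \(\operatorname{Var}[\mu_1(X)-\mu_0(X)] = 0\), so \(V_{\text{\tiny AIPW}}^* = \mathbb{E}[1/e(X)] + \mathbb{E}[1/(1-e(X))] \approx 4.76\), and the empirical variance of \(\sqrt{n}(\hat\tau_{\text{\tiny AIPW}}^* - \tau)\) across replications should land near that value.

```python
import random
import statistics

random.seed(42)

def e(x):
    """True propensity score (toy choice)."""
    return 0.3 + 0.4 * x

def mu(t, x):
    """True outcome models (toy choice); tau = 2 and mu1 - mu0 is constant."""
    return (2.0 + x) if t == 1 else float(x)

def oracle_aipw(n):
    """Draw one sample of size n and return the oracle AIPW estimate."""
    total = 0.0
    for _ in range(n):
        x = random.randint(0, 1)
        t = 1 if random.random() < e(x) else 0
        y = mu(t, x) + random.gauss(0.0, 1.0)
        total += (mu(1, x) - mu(0, x)
                  + t * (y - mu(1, x)) / e(x)
                  - (1 - t) * (y - mu(0, x)) / (1 - e(x)))
    return total / n

tau, n, reps = 2.0, 2000, 400
draws = [(n ** 0.5) * (oracle_aipw(n) - tau) for _ in range(reps)]

mean_hat = statistics.mean(draws)    # should be close to 0
v_hat = statistics.variance(draws)   # should be close to V*_AIPW (about 4.76 here)
```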

Now, the oracle AIPW estimator will be compared to the crossfitted AIPW estimator, in which the nuisance functions are fitted on held-out folds of the data. First, we need a new assumption on the product of the estimation errors of the propensity score and of the response functions.

Definition 7.2: Crossfitted Augmented Inverse Propensity Weighting estimator
We denote \(\tilde\tau_{\text{AIPW}}\) the crossfitted AIPW estimator, where \(\mathcal{I}_1, \mathcal{I}_2,\dots, \mathcal{I}_K\) form a partition of the data: \[\forall i \neq j, \quad \mathcal{I}_i \cap \mathcal{I}_j = \varnothing \quad \text{and} \quad \bigcup_{k=1}^K \mathcal{I}_k = \mathcal{I},\]

\[\begin{align*} \tilde \tau_{\text{AIPW}} &=\frac{1}{|\mathcal{I}|} \sum_{k = 1}^K \sum_{i \in \mathcal{I_k}}\left(\hat\mu_{(1)}^{\bar{\mathcal{I}_k}}(X_i) - \hat\mu_{(0)}^{\bar{\mathcal{I}_k}}(X_i) + \frac{T_i(Y_i - \hat\mu_{(1)}^{\bar{\mathcal{I}_k}}(X_i))}{\hat e^{\bar{\mathcal{I}_k}}(X_i)}- \frac{(1 - T_i)(Y_i-\hat\mu_{(0)}^{\bar{\mathcal{I}_k}}(X_i))}{1 - \hat e^{\bar{\mathcal{I}_k}}(X_i)}\right) \end{align*}\] where \(\hat\mu_{(t)}^{\bar{\mathcal{I}_k}}(X)\) and \(\hat e^{\bar{\mathcal{I}_k}}(X)\) are the estimates of \(\mu_{(t)}\) and \(e\) obtained using only the complementary sample \(\bar{\mathcal{I}_k} = \mathcal{I} \setminus \mathcal{I}_k\).
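A minimal Python sketch of this cross-fitting scheme, under an assumed toy design in which the covariate is binary, so that \(\hat\mu_{(t)}^{\bar{\mathcal{I}_k}}\) and \(\hat e^{\bar{\mathcal{I}_k}}\) can be fitted by simple stratum means on the complementary folds; in practice any regression and classification learners can be plugged in. All names and the data-generating process are illustrative.

```python
import random

random.seed(1)

def true_e(x):
    """True propensity score (toy choice)."""
    return 0.3 + 0.4 * x

def true_mu(t, x):
    """True outcome models (toy choice), so tau = 2."""
    return (2.0 + x) if t == 1 else float(x)

# simulate data; X is binary so nuisances can be fitted by stratum means
n = 20_000
data = []
for _ in range(n):
    x = random.randint(0, 1)
    t = 1 if random.random() < true_e(x) else 0
    data.append((x, t, true_mu(t, x) + random.gauss(0.0, 1.0)))

def fit_nuisances(train):
    """Fit mu_hat[(t, x)] and e_hat[x] by stratum means on the training folds."""
    def mean(vals):
        return sum(vals) / len(vals)
    mu_hat = {(t, x): mean([yi for xi, ti, yi in train if ti == t and xi == x])
              for t in (0, 1) for x in (0, 1)}
    e_hat = {x: mean([ti for xi, ti, yi in train if xi == x]) for x in (0, 1)}
    return mu_hat, e_hat

def crossfit_aipw(data, K=2):
    """Cross-fitted AIPW: nuisances for fold I_k are fitted on its complement."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    total = 0.0
    for k in range(K):
        held_out = set(folds[k])
        mu_hat, e_hat = fit_nuisances([data[i] for i in idx if i not in held_out])
        for i in folds[k]:
            x, t, y = data[i]
            total += (mu_hat[(1, x)] - mu_hat[(0, x)]
                      + t * (y - mu_hat[(1, x)]) / e_hat[x]
                      - (1 - t) * (y - mu_hat[(0, x)]) / (1 - e_hat[x]))
    return total / len(data)

tau_tilde = crossfit_aipw(data)
```

With \(K = 2\) folds and \(n = 20000\), the estimate lands close to the true \(\tau = 2\), as the proposition below suggests it should.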

Proposition 7.2: Convergence of the crossfitted AIPW estimator toward its oracle estimator
Assume that Overlap is satisfied and that, for \(t\in\{0,1\}\) and every fold \(k\), \[\sup _{x \in \mathcal{X}}\left|\mu_t(x)-\hat{\mu}_t^{\bar{\mathcal{I}_k}}(x)\right|^{\frac{1}{2}}\,\left|e(x)-\hat{e}^{\bar{\mathcal{I}_k}}(x)\right|^{\frac{1}{2}}=\mathcal{O}_{P}\left(\frac{1}{\sqrt{n}}\right);\] then \[\sqrt{n}(\tilde{\tau}_{\text{\tiny AIPW}}-\hat{\tau}_{\text{\tiny AIPW}}^{*}) \stackrel{p}{\longrightarrow} 0. \]

Proof
First we introduce,

\[m_1(X,T,Y) := T\frac{Y- \mu_1(X)}{ e(X)} + \mu_1(X),\] \[m_0(X,T,Y) := (1-T)\frac{Y- \mu_0(X)}{1- e(X)} + \mu_0(X),\]

and,

\[\hat m_1^k(X,T,Y) := T\frac{Y-\hat \mu_1^{\bar{\mathcal{I}_k}}(X)}{\hat e^{\bar{\mathcal{I}_k}}(X)} + \hat \mu_1^{\bar{\mathcal{I}_k}}(X),\] \[\hat m_0^k(X,T,Y) := (1-T)\frac{Y-\hat \mu_0^{\bar{\mathcal{I}_k}}(X)}{1-\hat e^{\bar{\mathcal{I}_k}}(X)} + \hat \mu_0^{\bar{\mathcal{I}_k}}(X),\] the same quantities as previously, but with the nuisance parameters estimated on the complementary sample \(\bar{\mathcal{I}_k}\). The difference between the crossfitted estimator and the oracle estimator can be decomposed as,

\[\begin{align*} \sqrt{n}\, \left( \tilde \tau_{\text{AIPW}} - \hat\tau_{\text{AIPW}}^*\right) &= \sqrt{n}\, \frac{1}{|\mathcal{I}|}\sum_{k = 1}^K \sum_{i \in \mathcal{I_k}} \left[\left( \hat m_1^k(X_i,T_i,Y_i) - m_1(X_i,T_i,Y_i)\right) - \left( \hat m_0^k(X_i,T_i,Y_i) - m_0(X_i,T_i,Y_i)\right)\right]. \end{align*}\]

Then, \[\begin{align*} \sqrt{n}\, \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I_k}} (\hat m^k_1- m_1)(X_i,T_i,Y_i) =& \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I_k}}\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)+T_{i} \frac{Y_{i}-\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\mu_{1}\left(X_{i}\right)-T_{i} \frac{Y_{i}-\mu_{1}\left(X_{i}\right)}{e\left(X_{i}\right)}\right) \\ =& \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right) && \text{Further denoted $A_n^k$}\\ &+\frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I_k}} T_{i}\left(\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right)\right) && \text{Further denoted $B_n^k$}\\ &-\frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I_k}} T_{i}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right)\right).&& \text{Further denoted $C_n^k$} \end{align*}\]

The first two terms can be shown to converge to 0 in probability; for each of them, it suffices to show convergence to 0 in \(L^2\) norm.

First, one can show that the expectation of \(\frac{A_n^k}{\sqrt{n}}\) is zero,

\[\begin{align*} \mathbb{E}\left[\frac{1}{n} \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right) \right] &= \frac{1}{n} \sum_{i \in \mathcal{I_k}}\mathbb{E}\left[\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right] \\ &= \frac{|\mathcal{I_k}|}{n} \mathbb{E}\left[\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X\right)-\mu_{1}\left(X\right)\right)\left(1-\frac{T}{e\left(X\right)}\right)\right] && \text{i.i.d.} \\ &= \frac{|\mathcal{I_k}|}{n} \mathbb{E}\left[ \mathbb{E}\left[ \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X\right)-\mu_{1}\left(X\right)\right)\left(1-\frac{T}{e\left(X\right)}\right)\mid X\right] \right] && \text{Total expectation} \\ &= \frac{|\mathcal{I_k}|}{n} \mathbb{E}\left[ \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X\right)-\mu_{1}\left(X\right)\right)\left(1-\frac{\mathbb{E}\left[T\mid X \right] }{e\left(X\right)}\right)\right] \\ &= \frac{|\mathcal{I_k}|}{n} \mathbb{E}\left[ \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X\right)-\mu_{1}\left(X\right)\right)\left(1-\frac{e\left(X\right) }{e\left(X\right)}\right)\right] \\ &= 0. \end{align*}\]

This will be helpful in the next series of derivations.

Now, consider the expectation of the square of \(\frac{A_n^k}{\sqrt{n}}\),

\[\begin{align*} \mathbb{E}\left[ \left(\frac{1}{n} \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right)\right)^2 \right] &= \operatorname{Var}\left[ \frac{1}{n} \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right) \right] \\ &= \frac{1}{n^2} \operatorname{Var}\left[ \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right) \right]\\ &= \frac{1}{n^2} \sum_{i \in \mathcal{I_k}} \operatorname{Var}\left[ \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right) \right] &&\text{iid} \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[\left( \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right)^2 \right] \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[\mathbb{E}\left[\left( \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)\right)^2 |X \right] \right] \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[ \left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2\mathbb{E}\left[\left(1-\frac{T_{i}}{e\left(X_{i}\right)}\right)^2 |X \right] \right] \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2\frac{1}{e\left(X_{i}\right)^2}\mathbb{E}\left[ \left(e\left(X_{i}\right)-T_{i}\right)^2 |X \right] \right] \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[\left(\hat 
\mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2\frac{e\left(X_{i}\right)(1-e\left(X_{i}\right))}{e\left(X_{i}\right)^2} \right] \\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2\left(\frac{1}{e\left(X_{i}\right)} -1\right) \right] \\ &\leq \frac{|\mathcal{I_k}|}{\eta n^2} \mathbb{E}\left[\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2 \right] && \text{Overlap} \\ &\leq \frac{|\mathcal{I_k}|}{ n^2}o_{\mathbb{P}}(1) \end{align*}\]

Therefore, because convergence in \(L^{2}\)-norm implies convergence in probability (by Chebyshev's inequality), we have for \(k \in \{1, \dots, K\}\):

\[\sqrt{n}\, \frac{1}{n} \sum_{i \in \mathcal{I_k}}\left(\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(1-\frac{T_i}{e\left(X_{i}\right)}\right)\right) \stackrel{p}{\longrightarrow} 0\]

The second term can also be controlled using similar arguments. Before the detailed derivation, note that, due to the uniform convergence of \(\hat e^{\bar{\mathcal{I}_k}}(\cdot)\) and the overlap assumption, there exists \(M\) such that for all \(n > M\) and all \(X_i\),

\[ \frac{\eta}{2} \le \hat e(X_i) \le 1- \frac{\eta}{2}.\]

Therefore, there exists \(M\) such that for all \(n > M\) and all \(X\), \[\begin{align*} \left|\frac{1}{\hat e(X)} - \frac{1}{e(X)}\right| &= \frac{\left|e(X) - \hat e(X)\right|}{\hat e(X)\, e(X)} \\ & \le \frac{2\left|e(X) - \hat e(X)\right|}{\eta^2}. \end{align*}\]

Derivations are very close to the ones for the first term, noting that, \[\mathbb{E}\left[ \mathbb{E}\left[\frac{1}{n} \sum_{i \in \mathcal{I_k}} T_i\left(\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right)\right)\mid X_i \right]\right] =0,\] so that,

\[\begin{align*} \mathbb{E}\left[ \left( \frac{1}{n} \sum_{i \in \mathcal{I_k}} T_i\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right) \right)^2 \right] &= \operatorname{Var}\left[ \frac{1}{n} \sum_{i \in \mathcal{I_k}} T_i\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right) \right]\\ &= \frac{1}{n^2} \sum_{i \in \mathcal{I_k}} \operatorname{Var}\left[ T_i\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right) \right] && \text{i.i.d.}\\ &= \frac{|\mathcal{I_k}|}{n^2} \mathbb{E}\left[ T\left(Y-\mu_{1}\left(X\right)\right)^2\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X\right)}-\frac{1}{e\left(X\right)}\right)^2 \right] \\ &\leq \frac{4|\mathcal{I_k}|}{\eta^4 n^2} \mathbb{E}\left[ T\left(Y-\mu_{1}\left(X\right)\right)^2\left(\hat{e}^{\bar{\mathcal{I}_k}}\left(X\right)-e\left(X\right)\right)^2 \right]\\ &= \frac{4|\mathcal{I_k}|}{\eta^4 n^2} \mathbb{E}\left[\mathbb{E}\left[ T\left(Y-\mu_{1}\left(X\right)\right)^2 \mid X \right]\left(\hat{e}^{\bar{\mathcal{I}_k}}\left(X\right)-e\left(X\right)\right)^2 \right]\\ &= \frac{4|\mathcal{I_k}|}{\eta^4 n^2} \mathbb{E}\left[ e(X)\operatorname{Var}\left[Y \mid X, T=1 \right]\left(\hat{e}^{\bar{\mathcal{I}_k}}\left(X\right)-e\left(X\right)\right)^2 \right]\\ &\leq \frac{4\,\sigma^2\,|\mathcal{I_k}|}{\eta^4 n^2} \mathbb{E}\left[ \left(\hat{e}^{\bar{\mathcal{I}_k}}\left(X\right)-e\left(X\right)\right)^2 \right] && \text{$\sigma^2 := \sup_x \operatorname{Var}[Y \mid X = x, T=1]$, assumed finite}\\ &\leq \frac{4\,\sigma^2\,|\mathcal{I_k}|}{\eta^4 n^2}\, o_{\mathbb{P}}(1). \end{align*}\]

Therefore, for \(k \in \{1, \dots, K\}\):

\[\sqrt{n}\, \frac{1}{n} \sum_{i \in \mathcal{I_k}} T_i\left(Y_{i}-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right) \stackrel{p}{\longrightarrow} 0.\]

For the last term, the approach is different and relies on the assumption made on the product of the estimation errors,

\[\begin{align*} \left|C_n^k\right| &= \left|\sqrt{n}\frac{1}{n} \sum_{i \in \mathcal{I_k}}T_i\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)\left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right)\right| \\ & \le \sqrt{n}\sqrt{\frac{1}{n} \sum_{i \in \mathcal{I_k}}T_i\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2 } \sqrt{\frac{1}{n} \sum_{i \in \mathcal{I_k}} \left(\frac{1}{\hat{e}^{\bar{\mathcal{I}_k}}\left(X_{i}\right)}-\frac{1}{e\left(X_{i}\right)}\right)^2} && \text{Cauchy--Schwarz} \\ & \le \frac{2\sqrt{n}}{ \eta^2} \sqrt{\frac{1}{n} \sum_{i \in \mathcal{I_k}}T_i\left(\hat \mu_1^{\bar{\mathcal{I}_k}}\left(X_{i}\right)-\mu_{1}\left(X_{i}\right)\right)^2 } \sqrt{\frac{1}{n} \sum_{i \in \mathcal{I_k}} \left(e(X_i) - \hat e^{\bar{\mathcal{I}_k}}(X_i)\right)^2 } && \text{Overlap} \\ & = \frac{2\sqrt{n}}{ \eta^2}\, o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right) && \text{Assumption} \\ & = \frac{2}{ \eta^2}\, o_{\mathbb{P}}(1). \end{align*}\]

Each term \(A_n^k\), \(B_n^k\), and \(C_n^k\) has been shown to be \(o_{\mathbb{P}}(1)\). The remainder terms involving \(\hat m^k_0(X,T,Y) - m_0(X,T,Y)\) can be controlled with the same derivations, using instead the uniform convergence of \(\hat \mu^{\bar{\mathcal{I}_k}}_0(\cdot)\); denote the corresponding quantities \(\tilde A_n^k\), \(\tilde B_n^k\), \(\tilde C_n^k\). Since \[\sqrt{n} (\tilde \tau_{\text{AIPW}} - \hat\tau_{\text{AIPW}}^*) = \sum_{k=1}^K \left[\left(A_n^k + B_n^k - C_n^k\right) - \left(\tilde A_n^k + \tilde B_n^k - \tilde C_n^k\right)\right],\] we obtain \(\sqrt{n} (\tilde \tau_{\text{AIPW}} - \hat\tau_{\text{AIPW}}^*) \stackrel{p}{\longrightarrow} 0\), so that \(\tilde \tau_{\text{AIPW}}\) has the same large-sample properties as the oracle estimator.

To conclude,

\[\begin{equation*} \sqrt{n} (\tilde \tau_{\text{AIPW}} - \tau) = \underbrace{\sqrt{n} (\tilde \tau_{\text{AIPW}} - \hat\tau_{\text{AIPW}}^*)}_\textrm{$\stackrel{p}{\longrightarrow} 0$} + \underbrace{\sqrt{n}(\hat\tau_{\text{AIPW}}^* - \tau)}_\textrm{$\stackrel{d}{\rightarrow} \mathcal{N}\left(0, V_{\text{AIPW}}^* \right)$}, \end{equation*}\]

where \[V_{\text{AIPW}}^* =\mathbb{E}\left[ \frac{\left(Y^{(1)}-\mu_1(X)\right)^2 }{e(X)} \right]+\mathbb{E}\left[ \frac{(Y^{(0)}-\mu_0(X))^2}{1-e(X)} \right] + \operatorname{Var}[\mu_1(X) - \mu_0(X)].\]