6  Inverse Propensity Weighting

A first way to understand Inverse Propensity Weighting (IPW) is to study its oracle version. Suppose that we somehow know, for every \(x \in \mathcal{X}\), the true propensity score \(e(x)\). Then we can define the oracle Inverse Propensity Weighting estimator as:

Definition 6.1: Oracle Inverse Propensity Weighting estimator
Assume the propensity score \(e(X)\) is a known function, then \(\hat{\tau}_{\text{\tiny IPW }}^{*}\) denotes the oracle IPW estimator, \[\begin{equation*} \hat{\tau}_{\text {\tiny IPW }}^{*}=\frac{1}{n} \sum_{i=1}^{n}\left(\frac{T_{i} Y_{i}}{e\left(X_{i}\right)}-\frac{\left(1-T_{i}\right) Y_{i}}{1-e\left(X_{i}\right)}\right) . \end{equation*}\]
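As a concrete illustration, here is a minimal numpy sketch that evaluates the oracle IPW estimator on one large simulated sample. The data-generating process (uniform covariate, logistic propensity, ATE \(\tau = 1\)) is a hypothetical choice for the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: the true ATE is tau = 1.
X = rng.uniform(-1, 1, n)
e = 1 / (1 + np.exp(-2 * X))          # known (oracle) propensity score e(X)
T = rng.binomial(1, e)                # treatment T | X ~ Bernoulli(e(X))
Y1 = 1 + X + rng.normal(0, 1, n)      # potential outcome Y^(1)
Y0 = X + rng.normal(0, 1, n)          # potential outcome Y^(0)
Y = T * Y1 + (1 - T) * Y0             # observed outcome (consistency)

# Oracle IPW: weight each observed outcome by the inverse probability
# of the treatment actually received.
tau_hat = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
print(tau_hat)                        # close to tau = 1
```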

Proposition 6.1: Oracle IPW unbiasedness
The oracle IPW, denoted \(\hat{\tau}_{\text {\tiny IPW }}^{*}\), is an unbiased estimator of the average treatment effect \(\tau\), that is, \[\begin{equation*} \mathbb{E}[\hat{\tau}_{\text{\tiny IPW }}^{*}] = \tau. \end{equation*}\]

Proof
Taking the expectation of the first term of the IPW estimator gives, \[\begin{align*} \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n}\left(\frac{T_{i} Y_{i}}{e\left(X_{i}\right)} \right)\right] &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\frac{T_{i} Y_{i}}{e\left(X_{i}\right)} \right] && \text{Linearity of $\mathbb{E}[.]$} \\ &= \mathbb{E}\left[\frac{T_{i} Y_{i}^{(1)}}{e\left(X_{i}\right)} \right] && \text{\textit{iid} and consistency} \\ &= \mathbb{E}\left[ \mathbb{E}\left[\frac{T_{i} Y_{i}^{(1)}}{e\left(X_{i}\right)} \mid X_i \right] \right] && \text{Tower property}\\ &= \mathbb{E}\left[ \frac{1}{e(X_i)}\mathbb{E}\left[T_{i} Y_{i}^{(1)} \mid X_i \right] \right] && \text{$e(X)$ is a function of $X$}\\ &= \mathbb{E}\left[ \frac{1}{e(X_i)}\mathbb{E}\left[Y_{i}^{(1)} \mid X_i \right] \mathbb{E}\left[T_{i}\mid X_i \right] \right] && \text{Unconfoundedness}\\ &= \mathbb{E}\left[ \mathbb{E}\left[Y_{i}^{(1)} \mid X_i \right] \right] && \text{Def. of $e(X)$} \\ &= \mathbb{E}[Y_{i}^{(1)} ]. \end{align*}\]

Similarly one can show that,

\[\begin{equation*} \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n}\left(\frac{(1-T_{i}) Y_{i}}{1-e\left(X_{i}\right)} \right)\right] = \mathbb{E}[Y_{i}^{(0)} ], \end{equation*}\]

such that \(\mathbb{E}[\hat{\tau}_{\text {\tiny IPW }}^{*}] = \mathbb{E}[Y_{i}^{(1)}] - \mathbb{E}[Y_{i}^{(0)}] = \tau\).
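Unbiasedness is a statement about the average over repeated samples, and it holds even at small \(n\). A quick Monte Carlo sketch under a hypothetical data-generating process with \(\tau = 1\):

```python
import numpy as np

rng = np.random.default_rng(1)

def oracle_ipw(n):
    """One draw of the oracle IPW estimator on a simulated sample (tau = 1)."""
    X = rng.uniform(-1, 1, n)
    e = 1 / (1 + np.exp(-2 * X))                         # known propensity score
    T = rng.binomial(1, e)
    Y = T * (1 + X) + (1 - T) * X + rng.normal(0, 1, n)  # observed outcome
    return np.mean(T * Y / e - (1 - T) * Y / (1 - e))

# Even at n = 50 each estimate is noisy but centered: the average over
# many independent replications recovers tau.
estimates = [oracle_ipw(50) for _ in range(4000)]
print(np.mean(estimates))   # close to 1
```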

In addition, the oracle IPW estimator is consistent and asymptotically normal.

Proposition 6.2: Consistency and asymptotic normality of the oracle IPW estimator
The oracle IPW, denoted \(\hat{\tau}_{\text {\tiny IPW }}^{*}\), is consistent, \[\hat{\tau}_{\text{\tiny IPW}}^* \stackrel{p}{\longrightarrow} \tau, \]

and is an asymptotically normal estimator, that is,

\[\sqrt{n}\left(\hat{\tau}_{\text{\tiny IPW}}^* - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0, V_{\text {\tiny IPW }}^{*}\right),\]

where, \[V_{\text {\tiny IPW }}^{*} = \mathbb{E}\left[\frac{\left( Y^{(0)} \right)^2}{1-e(X)} + \frac{\left( Y^{(1)} \right)^2}{e(X)}\right] - \tau^2.\]
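The formula for \(V_{\text{\tiny IPW}}^{*}\) can be checked numerically: in a simulation both potential outcomes are visible, so the expectation above can be evaluated by plug-in and compared with the empirical variance of \(\sqrt{n}(\hat{\tau}_{\text{\tiny IPW}}^{*}-\tau)\) over replications. A sketch under a hypothetical data-generating process with \(\tau = 1\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 2000
tau = 1.0

def one_draw():
    X = rng.uniform(-1, 1, n)
    e = 1 / (1 + np.exp(-2 * X))
    T = rng.binomial(1, e)
    Y1 = 1 + X + rng.normal(0, 1, n)   # both potential outcomes are
    Y0 = X + rng.normal(0, 1, n)       # observable inside a simulation
    Y = T * Y1 + (1 - T) * Y0
    tau_hat = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
    V_plugin = np.mean(Y0**2 / (1 - e) + Y1**2 / e) - tau**2
    return tau_hat, V_plugin

draws = np.array([one_draw() for _ in range(reps)])
empirical = n * draws[:, 0].var()      # Var of sqrt(n) * (tau_hat - tau)
theoretical = draws[:, 1].mean()       # plug-in value of V_IPW*
print(empirical / theoretical)         # ratio close to 1
```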

Proof
Consistency: We have already shown that the oracle IPW estimator is unbiased for \(\tau\). We now compute its variance:

\[\begin{align*} \operatorname{Var}\left[\hat{\tau}_{\text{\tiny IPW}}^{*}\right] &= \operatorname{Var}\left[ \frac{1}{n} \sum_{i=1}^{n}\left(\frac{T_{i} Y_{i}}{e\left(X_{i}\right)}-\frac{\left(1-T_{i}\right) Y_{i}}{1-e\left(X_{i}\right)}\right) \right] \\ &= \frac{1}{n^2} \operatorname{Var}\left[ \sum_{i=1}^{n}\left(\frac{T_{i} Y_{i}}{e\left(X_{i}\right)}-\frac{\left(1-T_{i}\right) Y_{i}}{1-e\left(X_{i}\right)}\right) \right] && \text{Variance property} \\ &= \frac{1}{n} \operatorname{Var}\left[ \frac{T Y}{e\left(X\right)}-\frac{\left(1-T\right) Y}{1-e\left(X\right)} \right] && \text{iid} \\ &= \frac{1}{n} \mathbb{E}\left[ \left(\frac{(1-T) Y}{1-e(X)}-\mathbb{E}[Y^{(0)}]\right)^{2} + \left(\frac{T Y}{e(X)}-\mathbb{E}[Y^{(1)}]\right)^{2} \right] \\ &\quad- \frac{2}{n} \mathbb{E}\left[ \left(\frac{(1-T) Y}{1-e(X)}-\mathbb{E}[Y^{(0)}]\right)\left(\frac{T Y}{e(X)}-\mathbb{E}[Y^{(1)}]\right) \right] \\ &= \frac{1}{n} \left( \mathbb{E}\left[ \left(\frac{(1-T) Y}{1-e(X)}\right)^{2} \right] - \mathbb{E}[Y^{(0)}]^2 \right)+ \frac{1}{n} \left(\mathbb{E}\left[ \left(\frac{T Y}{e(X)}\right)^{2} \right] - \mathbb{E}[Y^{(1)}]^2\right) \\ & \quad \quad - \frac{2}{n} \left(\underbrace{\mathbb{E}\left[ \frac{T (1-T) Y^2}{e(X)(1-e(X))}\right] }_{=0} - \underbrace{\left(\mathbb{E}[Y^{(1)}]\mathbb{E}\left[\frac{(1-T) Y}{1-e(X)}\right]+\mathbb{E}[Y^{(0)}]\mathbb{E}\left[\frac{T Y}{e(X)}\right] -\mathbb{E}[Y^{(1)}]\mathbb{E}[Y^{(0)}]\right)}_{=\mathbb{E}[Y^{(1)}]\mathbb{E}[Y^{(0)}]} \right) \\ &= \frac{1}{n} \left(\mathbb{E}\left[ \left(\frac{(1-T) Y}{1-e(X)}\right)^{2} \right] + \mathbb{E}\left[ \left(\frac{T Y}{e(X)}\right)^{2} \right] - \mathbb{E}[Y^{(1)}]^2 - \mathbb{E}[Y^{(0)}]^2 + 2\, \mathbb{E}[Y^{(0)}]\mathbb{E}[Y^{(1)}] \right)\\ &= \frac{1}{n} \left(\mathbb{E}\left[ \left(\frac{(1-T) Y}{1-e(X)}\right)^{2} \right] + \mathbb{E}\left[ \left(\frac{T Y}{e(X)}\right)^{2} \right] - \left(\mathbb{E}[Y^{(1)}] - \mathbb{E}[Y^{(0)}] \right)^2\right). \end{align*}\] We can simplify this expression further by noting that,

\[\begin{align*} \mathbb{E}\left[ \left( \frac{T Y}{e(X)} \right)^2 \right] &=\mathbb{E}\left[\left( \frac{T Y^{(1)}}{e\left(X\right)} \right)^2\right] && \text{Consistency} \\ &=\mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}} \left(\frac{ Y^{(1)}}{e\left(X\right)} \right)^2\right] && \text{$T$ is binary} \\ &=\mathbb{E}\left[\mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}} \left(\frac{ Y^{(1)}}{e\left(X\right)} \right)^2 \mid X\right]\right] && \text{Tower property} \\ &=\mathbb{E}\left[\frac{1}{e(X)^2}\mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}} \left( Y^{(1)} \right)^2 \mid X\right]\right] \\ &=\mathbb{E}\left[\frac{1}{e(X)^2}\,\mathbb{E}\left[\left( Y^{(1)} \right)^2 \mid X\right]\mathbb{E}\left[\mathbb{1}_{\left\{T=1\right\}} \mid X\right]\right] &&\text{Unconfoundedness} \\ &=\mathbb{E}\left[\frac{1}{e(X)^2}\,\mathbb{E}\left[\left( Y^{(1)} \right)^2 \mid X\right]e(X)\right] &&\text{Definition of $e(X)$} \\ &=\mathbb{E}\left[\frac{\left( Y^{(1)} \right)^2}{e(X)}\right]. && \text{Tower property} \end{align*}\]

Similarly, \[\mathbb{E}\left[ \left( \frac{(1-T)Y}{1-e(X)} \right)^2 \right] = \mathbb{E}\left[\frac{\left( Y^{(0)} \right)^2}{1-e(X)}\right]. \]

Therefore, the variance of the oracle IPW estimator \(\hat{\tau}_{\text {\tiny IPW}}^{*}\) is

\[ \operatorname{Var}\left[\hat{\tau}_{\text{\tiny IPW}}^{*}\right] = \frac{1}{n} \left( \mathbb{E}\left[\frac{\left( Y^{(0)} \right)^2}{1-e(X)} + \frac{\left( Y^{(1)} \right)^2}{e(X)}\right] - \tau^2 \right) = \frac{V_{\text{\tiny IPW}}^{*}}{n}. \]

Since this variance converges to \(0\) as the sample size \(n\) grows and the estimator is unbiased, Chebyshev's inequality yields the consistency of \(\hat \tau_{\text{\tiny IPW}}^*\).

Another proof of consistency: The consistency of the oracle IPW estimator follows directly from the weak law of large numbers. Denoting \(Z_i = \frac{T_{i}Y_{i}}{e(X_{i})} - \frac{(1-T_{i})Y_{i}}{1-e(X_{i})}\), the sequence \(Z_1, Z_2, \dots, Z_n\) is iid with finite mean \(\mathbb{E}[Z] = \tau\), so the weak law of large numbers gives \[\begin{equation*} \bar{Z}_n \stackrel{p}{\longrightarrow} \tau \quad \text { as } n \rightarrow \infty. \end{equation*}\]

This ensures the consistency of the oracle IPW estimator.

Asymptotic normality: Denote \(Z_i = \frac{T_{i}Y_{i}}{e(X_{i})} - \frac{(1-T_{i})Y_{i}}{1-e(X_{i})}\); the sequence \(\left\{Z_{1}, Z_{2}, \ldots, Z_{n}\right\}\) is iid with mean \(\tau\) and variance \(V_{\text{\tiny IPW}}^*:= \mathbb{E}\left[\frac{\left( Y^{(0)} \right)^2}{1-e(X)} + \frac{\left( Y^{(1)} \right)^2}{e(X)}\right] - \tau^2\). The central limit theorem then ensures that the sample average \(\bar{Z}_{n}\), which is exactly the oracle IPW estimator, is asymptotically normal with mean \(\tau\) and variance \(\frac{V_{\text{\tiny IPW}}^*}{n}\). We write this as,

\[\begin{equation*} \sqrt{n}\left(\hat{\tau}_{\text{\tiny IPW}}^* - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left( 0,V_{\text{\tiny IPW}}^* \right) \end{equation*}\] Note that another proof is possible using M-estimation theory.

Definition 6.2: Inverse Propensity Weighting estimator
We denote by \(\hat{\tau}_{\text{\tiny IPW }}\) the IPW estimator, \[\begin{equation*} \hat{\tau}_{\text {\tiny IPW }}=\frac{1}{n} \sum_{i=1}^{n}\left(\frac{T_{i} Y_{i}}{\hat{e}\left(X_{i}\right)}-\frac{\left(1-T_{i}\right) Y_{i}}{1-\hat{e}\left(X_{i}\right)}\right) , \end{equation*}\] where \(\hat{e}\) is our estimate of the propensity score.
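In practice \(\hat{e}\) can come from any probabilistic classifier. As a self-contained sketch on hypothetical simulated data (\(\tau = 1\)), here is the IPW estimator with \(\hat{e}\) fitted by a numpy-only logistic regression trained with gradient ascent; this minimal fit stands in for whatever propensity model one would actually use:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
X = rng.uniform(-1, 1, n)
e_true = 1 / (1 + np.exp(-2 * X))                        # unknown in practice
T = rng.binomial(1, e_true)
Y = T * (1 + X) + (1 - T) * X + rng.normal(0, 1, n)      # true ATE = 1

# Fit a logistic regression e_hat(x) = sigmoid(w0 + w1 * x) by gradient
# ascent on the log-likelihood (a minimal stand-in for any classifier).
Z = np.column_stack([np.ones(n), X])
w = np.zeros(2)
for _ in range(1000):
    p = 1 / (1 + np.exp(-Z @ w))
    w += 1.0 * Z.T @ (T - p) / n                         # gradient step
e_hat = 1 / (1 + np.exp(-Z @ w))

# Plug the estimated propensity score into the IPW formula.
tau_hat = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(tau_hat)   # close to 1
```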

Furthermore, we can show that if our estimate of the propensity score satisfies a uniform convergence condition then, under additional assumptions, the IPW estimator converges to its oracle counterpart.

Proposition 6.3: Convergence of the IPW estimator toward its oracle counterpart
Assume that Overlap is satisfied with parameter \(\eta\) (that is, \(\eta \leq e(x) \leq 1-\eta\) for all \(x \in \mathcal{X}\)), that there exists \(M\in\mathbb{R}_{+}^{*}\) such that \(\left|Y\right| \leq M\), and that \(\sup _{x \in \mathcal{X}}|e(x)-\hat{e}(x)|= \mathcal{O}_{P}\left(a_{n}\right)\) where \(a_{n} \rightarrow 0\). Then: \[|\hat{\tau}_{\text{IPW}} - \hat{\tau}_{\text{IPW}}^*| = \mathcal{O}_p\left(\frac{a_nM}{\eta^2}\right).\]

Proof
For \(n \in \mathbb{N}^*\) we have: \[\begin{align*} |\hat{\tau}_{\text{IPW}} - \hat{\tau}_{\text{IPW}}^*|&= \left|\frac{1}{n}\sum_{i=1}^{n}T_{i} Y_{i}\left(\frac{1}{\hat{e}(X_{i})} -\frac{1}{e(X_{i})}\right) - (1 - T_{i})Y_{i}\left(\frac{1}{1-\hat{e}(X_{i})} -\frac{1}{1-e(X_{i})}\right)\right|\\ &= \left|\frac{1}{n}\sum_{i=1}^{n}T_{i} Y_{i}\left(\frac{e(X_{i})-\hat{e}(X_{i})}{\hat{e}(X_{i})e(X_{i})}\right) - (1 - T_{i})Y_{i}\left(\frac{\hat{e}(X_{i})-e(X_{i})}{(1-\hat{e}(X_{i}))(1-e(X_{i}))}\right)\right|\\ &\leq \max_{1 \leq i \leq n} \left|T_{i} Y_{i}\left(\frac{e(X_{i})-\hat{e}(X_{i})}{\hat{e}(X_{i})e(X_{i})}\right) - (1 - T_{i})Y_{i}\left(\frac{\hat{e}(X_{i})-e(X_{i})}{(1-\hat{e}(X_{i}))(1-e(X_{i}))}\right)\right|\\ &\leq \frac{1}{\eta}\max_{1 \leq i \leq n} \left|T_{i} Y_{i}\left(\frac{e(X_{i})-\hat{e}(X_{i})}{\hat{e}(X_{i})}\right) - (1 - T_{i})Y_{i}\left(\frac{\hat{e}(X_{i})-e(X_{i})}{1-\hat{e}(X_{i})}\right)\right| && \text{Overlap} \\ &\leq \frac{2}{\eta^2}\max_{1 \leq i \leq n} \left|T_{i} Y_{i}\left(e(X_{i})-\hat{e}(X_{i})\right) - (1 - T_{i})Y_{i}\left(\hat{e}(X_{i})-e(X_{i})\right)\right| && \text{(*)} \\ &\leq \frac{2M}{\eta^2} \max_{1 \leq i \leq n} |e(X_{i})-\hat{e}(X_{i})|, && \left|Y\right| \leq M \end{align*}\]

where in \((*)\) we used that, with probability tending to one, \(\frac{\eta}{2} \leq \hat{e}(X_{i}) \leq 1- \frac{\eta}{2}\) for all \(i\) once \(n\) is large enough, since \(\sup _{x \in \mathcal{X}}|e(x)-\hat{e}(x)|= \mathcal{O}_{P}\left(a_{n}\right)\) with \(a_n \rightarrow 0\).

Therefore, we have that: \[|\hat{\tau}_{\text{IPW}} - \hat{\tau}_{\text{IPW}}^*| = \mathcal{O}_p\left(\frac{a_nM}{\eta^2}\right).\]

Hence, combining the two previous propositions, we can state the main theorem for the IPW estimator.

Theorem 6.1: Asymptotic properties of \(\hat{\tau}_{\text{\tiny IPW }}\)
Assume that Overlap is satisfied, that there exists \(M\in\mathbb{R}_{+}^{*}\) such that \(\left|Y\right| \leq M\), and that \(\sup _{x \in \mathcal{X}}|e(x)-\hat{e}(x)|= \mathcal{O}_{P}\left(a_{n}\right)\) where \(a_{n} \rightarrow 0\). Then the IPW estimator \(\hat{\tau}_{\text{\tiny IPW}}\) is consistent, \[\hat{\tau}_{\text{\tiny IPW}} \stackrel{p}{\longrightarrow} \tau, \]

and, if moreover \(\sqrt{n}\,a_{n} \rightarrow 0\), it is asymptotically normal, that is,

\[\sqrt{n}\left(\hat{\tau}_{\text{\tiny IPW}} - \tau \right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0, V_{\text {\tiny IPW }}^{*}\right),\]

where, \[V_{\text {\tiny IPW }}^{*} = \mathbb{E}\left[\frac{\left( Y^{(0)} \right)^2}{1-e(X)} + \frac{\left( Y^{(1)} \right)^2}{e(X)}\right] - \tau^2.\]

Furthermore, when \(\sqrt{n}\,a_{n} \rightarrow 0\), we can show that \(\hat{\tau}_{\text{\tiny IPW}}\) is \(\sqrt{n}\)-consistent.

Proof
We compare the IPW estimator with the oracle one, and we write several inequalities being,

\[\begin{align*} |\hat{\tau}_{\text{\tiny IPW}} - \tau| &= |\hat{\tau}_{\text{\tiny IPW}} - \hat{\tau}_{\text{\tiny IPW}}^* + \hat{\tau}_{\text{\tiny IPW}}^* - \tau| \\ &\leq |\hat{\tau}_{\text{\tiny IPW}}- \hat{\tau}_{\text{\tiny IPW}}^*| + |\hat{\tau}_{\text{\tiny IPW}}^* - \tau|. \end{align*}\]

We have already shown that the oracle IPW estimator is consistent and asymptotically normal. We have also shown that \(|\hat{\tau}_{\text{IPW}} - \hat{\tau}_{\text{IPW}}^*| = \mathcal{O}_p\left(\frac{a_nM}{\eta^2}\right)\). Therefore, using the inequality above, \(\hat{\tau}_{\text{\tiny IPW}}\) is consistent.

Furthermore, if we assume that \(\sqrt{n}\,a_{n} \rightarrow 0\), we can write:

\[\begin{equation*} \sqrt{n} (\hat \tau_{\text{\tiny IPW}} - \tau) = \underbrace{\sqrt{n} (\hat \tau_{\text{\tiny IPW}} - \hat{\tau}_{\text{\tiny IPW}} ^*)}_\textrm{$\stackrel{p}{\longrightarrow} 0$} + \underbrace{\sqrt{n}(\hat{\tau}_{\text{\tiny IPW}} ^* - \tau)}_\textrm{$\stackrel{d}{\longrightarrow} \mathcal{N}\left(0, V_{\text{\tiny IPW}}^* \right)$}, \end{equation*}\] and Slutsky's theorem lets us conclude that \(\hat{\tau}_{\text{\tiny IPW}}\) is \(\sqrt{n}\)-consistent with the same asymptotic variance as the oracle estimator \(\hat{\tau}_{\text{\tiny IPW}}^*\).

Therefore, if we can estimate the propensity score for all individuals, we can use the IPW estimator to compute the ATE. However, the IPW estimator has several drawbacks. In particular, it is not invariant to translations of the outcome. For example, if we replace \(Y\) with \(Y + c\) for a constant \(c\), we get: \[\begin{align*} \hat{\tau}_{\text{\tiny IPW}}^{bis} &= \frac{1}{n} \sum_{i=1}^{n}\left(\frac{T_{i} (Y_{i} + c)}{\hat{e}\left(X_{i}\right)}-\frac{\left(1-T_{i}\right) (Y_{i}+c)}{1-\hat{e}\left(X_{i}\right)}\right)\\ &= \hat{\tau}_{\text{\tiny IPW}} + c\left( \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat{e}\left(X_{i}\right)} - \frac{1-T_i}{1-\hat{e}\left(X_{i}\right)}\right). \end{align*}\] With the true propensity score, \(\frac{1}{n}\sum_{i=1}^n \frac{T_i}{e\left(X_{i}\right)} - \frac{1-T_i}{1-e\left(X_{i}\right)}\) converges to \(0\), but in general it is not equal to \(0\) in finite samples. Since adding a constant to every outcome should not change the average causal effect, this dependence on \(c\) makes the estimator unreasonable.
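This failure is easy to reproduce numerically. The sketch below (hypothetical simulated data) shifts every outcome by \(c = 100\) and shows that the plain IPW estimate moves by exactly \(c\) times the empirical weight imbalance, which is nonzero in a finite sample even when the true propensity score is used:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500                                # small sample: weights don't balance
X = rng.uniform(-1, 1, n)
e = 1 / (1 + np.exp(-2 * X))           # true propensity score
T = rng.binomial(1, e)
Y = T * (1 + X) + (1 - T) * X + rng.normal(0, 1, n)

def ipw(y):
    return np.mean(T * y / e - (1 - T) * y / (1 - e))

c = 100.0
shift = ipw(Y + c) - ipw(Y)            # change caused by translating Y
imbalance = np.mean(T / e - (1 - T) / (1 - e))
print(shift, c * imbalance)            # equal, and nonzero in finite samples
```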

A simple fix to the problem is to normalize the weights of our IPW estimator:

Definition 6.3: Oracle Inverse Propensity Weighting estimator with normalization
Assume the propensity score \(e(X)\) is a known function, then \(\hat{\tau}_{\text {\tiny IPW norm.}}^{*}\) denotes the oracle normalized IPW estimator, \[\begin{equation*} \hat{\tau}_{\text {\tiny IPW norm.}}^*=\left(\sum_{i=1}^{n} \frac{T_{i}}{e\left(X_{i}\right)}\right)^{-1} \sum_{i=1}^{n} \frac{T_{i} Y_{i}}{e\left(X_{i}\right)}-\left(\sum_{i=1}^{n} \frac{1-T_{i}}{1- e\left(X_{i}\right)}\right)^{-1} \sum_{i=1}^{n} \frac{\left(1-T_{i}\right) Y_{i}}{1-e\left(X_{i}\right)} . \end{equation*}\]
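The normalized estimator uses the same data as plain IPW but rescales the weights so they sum to one within each arm. A minimal sketch (hypothetical simulated data) checking the translation invariance that plain IPW lacks:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(-1, 1, n)
e = 1 / (1 + np.exp(-2 * X))           # known (oracle) propensity score
T = rng.binomial(1, e)
Y = T * (1 + X) + (1 - T) * X + rng.normal(0, 1, n)

def ipw_norm(y):
    """Oracle normalized IPW: weights are rescaled to sum to 1 in each arm."""
    w1, w0 = T / e, (1 - T) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Adding a constant to every outcome leaves the estimate unchanged,
# because the constant is multiplied by weights that sum to exactly 1.
c = 100.0
print(ipw_norm(Y + c) - ipw_norm(Y))   # 0 up to floating-point error
```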

Proposition 6.4: Oracle normalized IPW unbiasedness
The oracle normalized IPW, denoted \(\hat{\tau}_{\text {\tiny IPW, norm }}^{*}\), is an unbiased estimator of the average treatment effect \(\tau\), that is,

\[\begin{equation*} \mathbb{E}[\hat{\tau}_{\text {\tiny IPW, norm }}^{*}] = \tau. \end{equation*}\]