1  Statistical introduction

1.1 Statistical models

Definition 1.1: Statistical model
A statistical model \(\mathcal{M}\) is the pair consisting of:

  • a state space \(\mathcal{X}\)
  • a family of probabilities \(\mathcal{C}\) on \((\mathcal{X}, \mathcal{T}(\mathcal{X}))\)

where \(\mathcal{T}(\mathcal{X})\) is the \(\sigma\)-field induced by \(\mathcal{X}\). We say that a model \(\mathcal{M} = (\mathcal{X}, \mathcal{C})\) is parametric if \(\mathcal{C}\) corresponds to a family of probabilities \(\{\mathbb{P}_\theta, \theta \in \Theta\}\), where \(\Theta\) is a subset of \(\mathbb{R}^d\) with \(d\geq1\); that is, there exists a function mapping each \(\theta\) to an element \(\mathbb{P}_\theta \in \mathcal{C}\).

The model is identifiable if this mapping is injective, i.e., if \(\mathbb{P}_\theta = \mathbb{P}_{\theta'}\Rightarrow \theta = \theta'\).

In other words, \(\mathcal{X}\) is the space in which the data, namely the observed random variables, take their values. For instance, if we want to study whether a coin is balanced or not, we can set \(\mathcal{X} =\{0,1\}\) (0 for heads and 1 for tails) and \(\mathcal{C} = \{\mathbb{P}_{\theta}, \theta \in [0,1]\}\) where \(\mathbb{P}_\theta\) is the Bernoulli distribution of parameter \(\theta\).
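As an illustration, the coin model can be simulated in Python; the parameter value, sample size and seed below are arbitrary choices for this sketch:

```python
import random

def sample_coin(theta, n, seed=0):
    """Draw n observations from the Bernoulli model P_theta on X = {0, 1}.

    Each observation equals 1 (tails) with probability theta, 0 (heads) otherwise.
    The seed is fixed only to make the illustration reproducible.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

observations = sample_coin(theta=0.5, n=10)
print(observations)
```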

Definition 1.2: Statistic
Let \(\mathcal{M} = (\mathcal{X}, \mathcal{C})\) be a statistical model and \((\mathcal{Z}, \mathcal{T}(\mathcal{Z}))\) a measurable space. We define a statistic \(\mathsf{S}\) on the model \(\mathcal{M}\) as a measurable function from \((\mathcal{X}, \mathcal{T}(\mathcal{X}))\) to \((\mathcal{Z}, \mathcal{T}(\mathcal{Z}))\).

Definition 1.3: \(n\)-sample
Let \(\mathcal{M} = (\mathcal{X}, \mathcal{C})\) be a statistical model and \(n \in \mathbb{N}^*\). We say that \((X_1, \ldots, X_n)\) is an \(n\)-sample of \(\mathcal{M}\) if \(X_1, \ldots, X_n\) are \(n\) independent and identically distributed random variables with common distribution \(\mathbb{Q} \in \mathcal{C}\).

1.2 Estimators

Definition 1.4: Estimator
Let \(\mathcal{M} = (\mathcal{X}, \{\mathbb{P}_\theta, \theta \in \Theta\})\) be a parametric statistical model and \(g : \Theta \rightarrow \mathbb{R}^q\) a function. An estimator \(\mathsf{T}\) of \(g(\theta)\) is a statistic with values in \(\mathbb{R}^q\).

1.2.1 Bias and Variance of an estimator

Definition 1.5: Bias of an estimator
Let \(\mathcal{M} = (\mathcal{X}, \{\mathbb{P}_\theta, \theta \in \Theta\})\) be a parametric statistical model, \(g: \Theta\rightarrow \mathbb{R}^q\) a measurable function and \(\mathsf{T}\) an estimator of the function \(g: \theta\mapsto g(\theta)\). The bias of the estimator \(\mathsf{T}\) is defined as: \[ b_g(\theta, \mathsf{T}) = \mathbb{E}_\theta[\mathsf{T}] - g(\theta)\] We say that \(\mathsf{T}\) is an unbiased estimator of the function \(g: \theta\mapsto g(\theta)\) if: \[\forall \theta \in \Theta \quad \mathbb{E}_\theta[\mathsf{T}] = g(\theta)\]

Returning to the previous example, let \((X_1, \ldots, X_n)\) be an \(n\)-sample of the statistical model \(\mathcal{M} = (\mathcal{X}, \{\mathbb{P}_\theta, \theta \in \Theta\})\), where \(\mathcal{X} =\{0,1\}\) and \(\mathbb{P}_\theta\) is the Bernoulli distribution of parameter \(\theta\). Defining \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\) as the empirical mean, we then have:

\[\begin{align*} \mathbb{E}[\bar{X}_n] &= \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] \\ &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_{i}] && \text{linearity of $\mathbb{E}[\cdot]$}\\ &= \frac{1}{n} \sum_{i=1}^{n} \theta && \text{i.i.d. according to $\mathbb{P}_\theta$} \\ &= \frac{1}{n} \, n \, \theta\\ &= \theta \end{align*}\]

Therefore the empirical mean is an unbiased estimator of the function \(g:\theta \mapsto \theta\). Could we find another unbiased estimator of the mean? Of course: the estimator that simply returns the \(k^{\text{th}}\) value of the sample is also an unbiased estimator of \(\theta\). However, it suffers from what we will call a high variance.
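A small Monte Carlo sketch (with illustrative choices of \(\theta\), \(n\), the number of replications and the seed) shows that both estimators average out to \(\theta\), while the single-observation one fluctuates far more:

```python
import random
from statistics import mean, pvariance

def draw_estimates(theta=0.3, n=50, reps=20000, seed=1):
    """For reps simulated n-samples of Bernoulli(theta) observations, record
    the empirical mean and the estimator that keeps only the first value."""
    rng = random.Random(seed)
    means, firsts = [], []
    for _ in range(reps):
        sample = [1 if rng.random() < theta else 0 for _ in range(n)]
        means.append(sum(sample) / n)  # empirical mean X_bar_n
        firsts.append(sample[0])       # estimator returning X_1
    return means, firsts

means, firsts = draw_estimates()
print(mean(means), mean(firsts))            # both close to theta = 0.3: unbiased
print(pvariance(means), pvariance(firsts))  # the single-value estimator spreads far more
```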

Definition 1.6: Variance of an estimator
Let \(\mathcal{M} = (\mathcal{X}, \{\mathbb{P}_\theta, \theta \in \Theta\})\) be a parametric statistical model, \(g: \Theta\rightarrow \mathbb{R}^q\) a measurable function and \(\mathsf{T}\) an estimator of the function \(g: \Theta\mapsto g(\theta)\). The variance of the estimator \(\mathsf{T}\) is defined as: \[\operatorname{Var}_\theta(\mathsf{T}) = \mathbb{E}_\theta[(T - \mathbb{E}_\theta[T])^2]\]

Beyond bias, a very important property of an estimator is how its values spread around their average. For example, a natural way to choose between two unbiased estimators is to take the more precise one, that is, the one with the smaller variance. Illustrating this again with the coin example, we have:

\[\begin{align*} \operatorname{Var}\left[\bar{X}_n\right] &= \operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} X_{i}\right] \\ &= \frac{1}{n^2} \operatorname{Var}\left[ \sum_{i=1}^{n} X_{i}\right] && \text{properties of variance} \\ &= \frac{1}{n^2} n\,\operatorname{Var}\left[ X\right] && \text{independent observations} \\ &= \frac{\operatorname{Var}\left[X \right]}{n}. \end{align*}\]

This formula shows that the properties of the estimator in this case depend on both the underlying distribution and the sample size. In particular, the more observations we use in our estimator, the smaller its variance is, and therefore the more precise our estimator.
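The \(1/n\) scaling of the variance can be checked numerically; the parameter values, sample sizes and seed below are illustrative choices:

```python
import random
from statistics import pvariance

def var_of_mean(theta, n, reps=20000, seed=2):
    """Monte Carlo estimate of Var(X_bar_n) for Bernoulli(theta) observations."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        means.append(sum(1 if rng.random() < theta else 0 for _ in range(n)) / n)
    return pvariance(means)

theta = 0.5
for n in (10, 40, 160):
    # theory: Var(X_bar_n) = Var(X) / n = theta * (1 - theta) / n
    print(n, var_of_mean(theta, n), theta * (1 - theta) / n)
```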

1.3 Large sample properties

These properties characterize estimators when the size of the \(n\)-sample gets large; they are also called asymptotic properties. A very famous one is consistency: when it holds, it ensures that the probability of finding the estimator \(\mathsf{T}_n\) close to its estimand grows with \(n\).

Definition 1.7: Consistency
Let \(\mathcal{M}_n = (\mathcal{X}_{n}, \{\mathbb{P}_{n,\theta}, \theta \in \Theta\})\) be a sequence of parametric statistical models and \(g: \Theta\rightarrow \mathbb{R}^q\) a measurable function. Let \(\{\mathsf{T}_n \mid n \in \mathbb{N}^* \}\) be a sequence of estimators of \(g: \theta\mapsto g(\theta)\). We say that \(\mathsf{T}_n\) is a consistent estimator if: \[\forall \varepsilon > 0 \quad \forall \theta \in \Theta \quad \lim_{n \rightarrow \infty} \mathbb{P}_{n,\theta}\left(\left|\mathsf{T}_n-g(\theta)\right|>\varepsilon\right) = 0\]

Note that this can be linked to \(L^2\) convergence of random variables: a sufficient condition for consistency [2] is \(L^2\) convergence.

Proposition 1.1: Sufficient condition to achieve consistency
Let \(\{\mathsf{T}_n \mid n \in \mathbb{N}^* \}\) be a sequence of estimators of \(g: \theta\mapsto g(\theta)\) such that: \[\begin{equation*} \lim_{n \rightarrow \infty} \mathbb{E}\left[\left(\mathsf{T}_n-g(\theta)\right)^{2}\right] = 0 \end{equation*}\]

Then \(\mathsf{T}_n\) is a consistent estimator of \(g(\theta)\).

Proof
This is a direct consequence of Chebyshev’s inequality:

\[\begin{equation*} \mathbb{P}\left(\left|\mathsf{T}_n-g(\theta)\right| \geq \varepsilon\right) \leq \frac{\mathbb{E}\left[\left(\mathsf{T}_n-g(\theta)\right)^{2}\right]}{\varepsilon^{2}}. \end{equation*}\]

This property means that the distribution of \(\mathsf{T}_n\) concentrates around \(g(\theta)\) as \(n\) gets bigger; we say that \(\mathsf{T}_n\) converges in probability to \(g(\theta)\). Note that unlike unbiasedness, which states a property of an estimator at any sample size, consistency characterizes the distribution when \(n\) is sufficiently large. If an estimator is not consistent, there is little interest, if any, in gathering a lot of data.

Returning once more to the coin flip example, the consistency of the empirical mean estimator follows from the strong law of large numbers.
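Consistency can also be observed empirically: for a fixed \(\varepsilon\), the simulated probability that \(|\bar{X}_n-\theta|\) exceeds \(\varepsilon\) shrinks as \(n\) grows. The values of \(\theta\), \(\varepsilon\), the sample sizes and the seed below are illustrative choices:

```python
import random

def miss_probability(theta, n, eps, reps=5000, seed=3):
    """Monte Carlo estimate of P(|X_bar_n - theta| > eps)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xbar = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
        if abs(xbar - theta) > eps:
            misses += 1
    return misses / reps

# the estimated probability decreases toward 0 as n grows
for n in (10, 100, 1000):
    print(n, miss_probability(theta=0.5, n=n, eps=0.1))
```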

We have now defined mathematically the fact that an estimator’s distribution clusters around a target parameter, but this says nothing about how it does so: it can happen at different rates and with different shapes. Therefore, we define below what is meant by a rate of convergence.

Definition 1.8: Rate of convergence
Let \(\mathcal{M}_n = (\mathcal{X}_{n}, \{\mathbb{P}_{n,\theta}, \theta \in \Theta\})\) be a sequence of parametric statistical models and \(g: \Theta\rightarrow \mathbb{R}^q\) a measurable function. Let \(\{\mathsf{T}_n \mid n \in \mathbb{N}^* \}\) be a sequence of estimators of \(g: \theta\mapsto g(\theta)\) and \(\{r_n \mid n \in \mathbb{N}^* \}\) a sequence that converges to infinity. We say that \(\mathsf{T}_n\) has rate of convergence \(r_{n}\) if the sequence \(r_n(\mathsf{T}_n-g(\theta))\) is bounded in probability, i.e.: \[\forall \theta \in \Theta \quad \forall \delta > 0 \quad \exists M > 0 \quad \limsup_{n \rightarrow \infty} \mathbb{P}_{n,\theta}\left(\left|r_n(\mathsf{T}_n-g(\theta))\right|>M\right) \leq \delta\]

The rate \(r_n\) tells us how fast our estimator clusters around its estimand. This matters because the faster \(r_n\) grows, the more information we can extract from a limited amount of data. One often speaks of \(\sqrt{n}\)-consistent estimators.

Another property we would like to have, beyond the rate, concerns the shape of the asymptotic distribution: we then characterize not only the speed at which \(\mathsf{T}_n\) clusters around \(g(\theta)\), but also the shape of its fluctuations. A very famous such property is asymptotic normality, under which the fluctuations around the target parameter are Gaussian.

Definition 1.9: Asymptotic normality
Let \(\mathcal{M}_n = (\mathcal{X}_{n}, \{\mathbb{P}_{n,\theta}, \theta \in \Theta\})\) be a sequence of parametric statistical models and \(g: \Theta\rightarrow \mathbb{R}^q\) a measurable function. Let \(\{\mathsf{T}_n \mid n \in \mathbb{N}^* \}\) be a sequence of estimators of \(g: \theta\mapsto g(\theta)\). We say that \(\mathsf{T}_n\) is asymptotically normal if: \[\forall \theta \in \Theta \quad \sqrt{n}(\mathsf{T}_n-g(\theta))\stackrel{\mathcal{L}}{\longrightarrow } \mathcal{N}(0, \Gamma(\theta)) \quad \text{under } \mathbb{P}_{n,\theta}\] Where for all \(\theta \in \Theta\), \(\Gamma(\theta)\) is a symmetric positive semi-definite matrix.

Usually, \(\sqrt{n}\) consistency together with asymptotic normality is the best we can achieve.
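For the coin example, asymptotic normality of \(\bar{X}_n\) follows from the central limit theorem, with \(\Gamma(\theta)=\theta(1-\theta)\). A quick simulation (with illustrative choices of \(\theta\), \(n\), the number of replications and the seed) agrees with this prediction:

```python
import random
from statistics import mean, pvariance

def scaled_errors(theta=0.5, n=400, reps=20000, seed=4):
    """Draw reps copies of sqrt(n) * (X_bar_n - theta) for the coin model."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xbar = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
        out.append((n ** 0.5) * (xbar - theta))
    return out

errors = scaled_errors()
# the CLT predicts mean ~ 0 and variance ~ theta * (1 - theta) = 0.25
print(mean(errors), pvariance(errors))
```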