1 Statistical introduction
1.1 Statistical models
Definition 1.1: Statistical model
- a state space \(\mathcal{X}\)
- a family of probabilities \(\mathcal{C}\) on \((\mathcal{X}, \mathcal{T}(\mathcal{X}))\)
where \(\mathcal{T}(\mathcal{X})\) is the \(\sigma\)-field induced by \(\mathcal{X}\). We say that a model \(\mathcal{M} = (\mathcal{X}, \mathcal{C})\) is parametric if \(\mathcal{C}\) corresponds to a family of probabilities \(\{\mathbb{P}_\theta, \theta \in \Theta\}\), where \(\Theta\) is a subset of \(\mathbb{R}^d\) with \(d\geq1\); in other words, there exists a mapping sending each \(\theta\) to an element \(\mathbb{P}_\theta \in \mathcal{C}\).
The model is identifiable if this mapping is injective, i.e., if \(\mathbb{P}_\theta = \mathbb{P}_{\theta'}\Rightarrow \theta = \theta'\).
In other words, \(\mathcal{X}\) is the space in which the data, namely the observed random variables, take their values. For instance, if we want to study whether a coin is balanced or not, we can set \(\mathcal{X} =\{0,1\}\) (0 for heads and 1 for tails) and \(\mathcal{C} = \{\mathbb{P}_{\theta}, \theta \in [0,1]\}\) where \(\mathbb{P}_\theta\) is the Bernoulli distribution of parameter \(\theta\).
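To make the coin model concrete, here is a small Python sketch (the function name `sample_bernoulli` and the use of the standard `random` module are our own choices, not part of the text): it draws an \(n\)-sample from \(\mathbb{P}_\theta\) on the state space \(\{0,1\}\).

```python
import random

# Sketch of the coin model: the state space is {0, 1} and P_theta is the
# Bernoulli distribution of parameter theta.
def sample_bernoulli(theta, n, seed=0):
    """Draw an n-sample from P_theta on the state space {0, 1}."""
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

draws = sample_bernoulli(theta=0.5, n=10)
print(draws)  # a list of ten 0/1 values
```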
Definition 1.2: Statistic
Definition 1.3: \(n\)-sample
1.2 Estimators
Definition 1.4: Estimator
1.2.1 Bias and Variance of an estimator
Definition 1.5: Bias of an estimator
Returning to the previous example, let \((X_1, \ldots, X_n)\) be an \(n\)-sample of the statistical model \(\mathcal{M} = (\mathcal{X}, \{\mathbb{P}_\theta, \theta \in \Theta\})\) where \(\mathcal{X} =\{0,1\}\) and \(\mathbb{P}_\theta\) is the Bernoulli distribution of parameter \(\theta\). We define \(\bar{X}_n\) as the empirical mean and then have:
\[\begin{align*} \mathbb{E}[\bar{X}_n] &= \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] \\ &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_{i}] && \text{linearity of $\mathbb{E}[\cdot]$}\\ &= \frac{1}{n} \sum_{i=1}^{n} \theta && \text{i.i.d. according to $\mathbb{P}_\theta$} \\ &= \frac{1}{n} \, n \, \theta\\ &= \theta \end{align*}\]
Therefore the empirical mean is an unbiased estimator of the function \(g:\theta \mapsto \theta\). Could we find another unbiased estimator of the mean? Of course: the estimator taking the \(k^{th}\) value of the sample is also an unbiased estimator of \(\theta\). However, it suffers from what we will call a high variance.
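A quick Monte Carlo sketch (ours, not from the text; all names are illustrative) confirms that both estimators are unbiased: averaged over many repeated samples, both the empirical mean and the "take the \(k^{th}\) observation" estimator land close to \(\theta\).

```python
import random

# Monte Carlo check: both the empirical mean and the "k-th value" estimator
# (here k = 1) average out to theta over many repeated samples.
rng = random.Random(42)
theta, n, reps = 0.3, 50, 20000

means, firsts = [], []
for _ in range(reps):
    sample = [1 if rng.random() < theta else 0 for _ in range(n)]
    means.append(sum(sample) / n)   # empirical mean X_bar_n
    firsts.append(sample[0])        # estimator taking the k-th value, k = 1

avg_mean = sum(means) / reps
avg_first = sum(firsts) / reps
print(avg_mean, avg_first)  # both close to theta = 0.3
```

The difference between the two only shows up in their spread, which is the subject of the next definition.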
Definition 1.6: Variance of an estimator
Beyond bias, a very important property is how estimates spread around the average value. For example, between two unbiased estimators, a natural way to choose is to pick the more precise one, that is, the one with the smaller variance. Again, if we illustrate this with the coin example we have:
\[\begin{align*} \operatorname{Var}\left[\bar{X}_n\right] &= \operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} X_{i}\right] \\ &= \frac{1}{n^2} \operatorname{Var}\left[ \sum_{i=1}^{n} X_{i}\right] && \text{properties of variance} \\ &= \frac{1}{n^2} n\,\operatorname{Var}\left[ X\right] && \text{independent observations} \\ &= \frac{\operatorname{Var}\left[X \right]}{n}. \end{align*}\]
This formula illustrates that the properties of the estimator in this case depend on both the underlying distribution and the sample size. In particular, we see that the more observations we use in our estimator, the smaller its variance is, and therefore the more precise our estimator is.
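The formula \(\operatorname{Var}[\bar{X}_n] = \operatorname{Var}[X]/n\) can be checked empirically; for the coin model \(\operatorname{Var}[X] = \theta(1-\theta)\). The following sketch (our own, with illustrative names) estimates the variance of \(\bar{X}_n\) over repeated samples and compares it with \(\theta(1-\theta)/n\):

```python
import random

# Empirical check of Var[X_bar_n] = Var[X] / n for the coin model,
# where Var[X] = theta * (1 - theta).
rng = random.Random(7)
theta, reps = 0.5, 20000

def var_of_mean(n):
    vals = []
    for _ in range(reps):
        s = sum(1 if rng.random() < theta else 0 for _ in range(n))
        vals.append(s / n)
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / reps

results = {n: var_of_mean(n) for n in (10, 40)}
for n, v in results.items():
    print(n, v, theta * (1 - theta) / n)  # empirical vs theoretical variance
```

Quadrupling \(n\) divides the variance by four, as the formula predicts.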
1.3 Large sample properties
These properties characterize estimators as the size \(n\) of the sample gets large, and are also called asymptotic properties. A very famous one is consistency: when it holds, it ensures that the probability of finding the estimator \(\mathsf{T}_n\) close to its estimand grows toward 1 with \(n\).
Definition 1.7: Consistency
Note that this can be linked to \(L^2\) convergence of random variables. Hence, a sufficient condition to achieve consistency [2] is \(L^2\) convergence:
Proposition 1.1: Sufficient condition to achieve consistency
Then \(\mathsf{T}_n\) is a consistent estimator of \(g(\theta)\).
Proof
By Chebyshev's inequality, for any \(\varepsilon > 0\),
\[\begin{equation*} \mathbb{P}\left(\left|\mathsf{T}_n-g(\theta)\right| \geq \varepsilon\right) \leq \frac{\mathbb{E}\left[\left(\mathsf{T}_n-g(\theta)\right)^{2}\right]}{\varepsilon^{2}}, \end{equation*}\]
and the right-hand side tends to \(0\) under \(L^2\) convergence.
This property means that the distribution of \(\mathsf{T}_n\) concentrates around \(g(\theta)\) as \(n\) gets larger. We also say that \(\mathsf{T}_n\) converges in probability toward \(g(\theta)\). Note that unlike unbiasedness, which states a property of the estimator at any sample size, consistency characterizes the distribution when \(n\) is sufficiently large. If an estimator is not consistent, there is little interest, if any, in gathering a lot of data.
Again, if we illustrate this with the coin-flip example, the consistency of the empirical-mean estimator follows from the strong law of large numbers.
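Consistency can be visualized numerically. In the sketch below (ours; the function name and parameter values are illustrative), we estimate \(\mathbb{P}(|\bar{X}_n - \theta| \geq \varepsilon)\) by repeated simulation and watch it shrink as \(n\) grows:

```python
import random

# Illustration of consistency: the fraction of runs in which
# |X_bar_n - theta| >= eps shrinks as n grows.
rng = random.Random(1)
theta, eps, reps = 0.5, 0.05, 2000

def miss_rate(n):
    misses = 0
    for _ in range(reps):
        mean = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
        if abs(mean - theta) >= eps:
            misses += 1
    return misses / reps

rates = [miss_rate(n) for n in (10, 100, 1000)]
print(rates)  # decreasing toward 0
```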
We have now defined mathematically the fact that an estimator's distribution clusters around a target parameter, but this says nothing about how it does so: the convergence can happen at different rates and with different shapes. Therefore, we define below what we mean by a rate of convergence.
Definition 1.8: Rate of convergence
The rate \(r_n\) tells us how fast our estimator clusters around its estimand. This is very important: the faster \(r_n\) grows, the more information we can extract from a limited amount of data. One often speaks of \(\sqrt{n}\)-consistent estimators.
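For the coin model, the empirical mean is \(\sqrt{n}\)-consistent: the spread of \(\bar{X}_n\) shrinks like \(1/\sqrt{n}\), so the rescaled quantity \(\sqrt{n}\,(\bar{X}_n - \theta)\) keeps a stable spread as \(n\) grows. The following sketch (our own, illustrative names) checks this:

```python
import random

# Sketch of the sqrt(n) rate: the standard deviation of
# sqrt(n) * (X_bar_n - theta) stays roughly constant in n.
rng = random.Random(3)
theta, reps = 0.5, 4000

def scaled_std(n):
    vals = []
    for _ in range(reps):
        mean = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
        vals.append(n ** 0.5 * (mean - theta))
    mu = sum(vals) / reps
    return (sum((v - mu) ** 2 for v in vals) / reps) ** 0.5

stds = [scaled_std(n) for n in (25, 100, 400)]
print(stds)  # all close to sqrt(theta * (1 - theta)) = 0.5
```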
Another property we would like to have, beyond the rate, is the shape of the asymptotic distribution: we characterize not only the speed at which \(\mathsf{T}_n\) clusters around \(g(\theta)\), but also the form of its fluctuations around the target. A very famous such property is asymptotic normality, meaning that these fluctuations are asymptotically Gaussian around the target parameter.
Definition 1.9: Asymptotic normality
Usually, \(\sqrt{n}\) consistency together with asymptotic normality is the best we can do.
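For the coin model, asymptotic normality of the empirical mean is exactly the central limit theorem: \(\sqrt{n}\,(\bar{X}_n - \theta)\) is approximately \(\mathcal{N}(0, \theta(1-\theta))\) for large \(n\). The sketch below (ours; names and values are illustrative) standardizes this quantity and compares its empirical concentration with the standard normal values:

```python
import random

# Sketch of asymptotic normality: sqrt(n) * (X_bar_n - theta) / sigma
# should be approximately standard normal, sigma^2 = theta * (1 - theta).
rng = random.Random(11)
theta, n, reps = 0.5, 400, 5000
sigma = (theta * (1 - theta)) ** 0.5

zs = []
for _ in range(reps):
    mean = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
    zs.append(n ** 0.5 * (mean - theta) / sigma)

# Empirical frequencies of |Z| <= 1 and |Z| <= 2.
frac_within_1 = sum(abs(z) <= 1 for z in zs) / reps
frac_within_2 = sum(abs(z) <= 2 for z in zs) / reps
print(frac_within_1, frac_within_2)  # compare with the N(0,1) values 0.683 and 0.954
```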