The estimate for \(\boldsymbol \theta\) can be found using standard optimisation routines.
Outliers
Mean-shift formulation
Over-parameterise the observation process, \[\mathbf{y}_t = A \mathbf{x}_t + {\color{#e64626}{\boldsymbol{\gamma}_t}} + \mathbf{v}_t\] for \(t=1,\ldots,n\) where \({\color{#e64626}{\boldsymbol{\gamma}_1}},\dots,{\color{#e64626}{\boldsymbol{\gamma}_n}}\) are shift parameters.
If \(\mathbf{y}_t\) is identified as an outlier then \(\color{#e64626}{\boldsymbol{\gamma}_t}\) will be non-zero.
Want to encourage some sparsity here!
Estimation
Minimize over \(\boldsymbol\theta\) and \(\boldsymbol\gamma_{t}\) (\(t=1,\dots,n\)) the penalized negative log-likelihood, \[ -\operatorname{Log-Likelihood}(\boldsymbol{\theta}, \boldsymbol{\gamma}_1, \dots, \boldsymbol{\gamma}_n) + \sum_{t=1}^{n} P(\boldsymbol{\gamma}_t; \lambda). \]
We’re using an \(L_0\)-style penalty, which is “hard” in the sense that there is a fixed cost for activating an outlier at time \(t\), \[
P(\boldsymbol{\gamma}_t; \lambda) =\begin{cases}\lambda^2 & \text{if } \boldsymbol{\gamma}_t \neq 0 \\ 0 & \text{otherwise} \end{cases}
\]
Contrast with a “soft” type of penalty, e.g. \(L_1\) (Lasso), where the penalty grows with the size of the outlier and gives continuous shrinkage.
Benefit: hard thresholding gives better control over the number of outliers identified, mitigating the masking and swamping issues that affect soft penalties (She & Owen, 2011).
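The contrast between the hard and soft penalties can be sketched numerically; the function names and the value \(\lambda = 1.5\) below are illustrative, not from the source.

```python
import numpy as np

def penalty_l0(gamma, lam):
    """Hard (L0-style) penalty: fixed cost lambda^2 whenever the shift is active."""
    return lam**2 if np.any(np.asarray(gamma) != 0) else 0.0

def penalty_l1(gamma, lam):
    """Soft (L1/Lasso-style) penalty: cost grows with the size of the shift."""
    return lam * np.sum(np.abs(gamma))

# A large shift pays the same hard penalty as a small one,
# but an increasingly large soft penalty.
small, large = np.array([0.1, 0.0]), np.array([5.0, 2.0])
lam = 1.5
print(penalty_l0(small, lam), penalty_l0(large, lam))  # 2.25 2.25
print(penalty_l1(small, lam), penalty_l1(large, lam))
```

This is exactly why the hard penalty controls the *number* of flagged outliers rather than their magnitude.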
Alternating estimation algorithm
At iteration \(k\):
Estimate \(\color{#e64626}{\boldsymbol\theta^{(k)}}\) given all \(\boldsymbol\gamma^{(k-1)}_{t}\) by minimizing the penalized objective with the shifts held fixed; then update each \(\boldsymbol\gamma^{(k)}_{t}\) given \(\color{#e64626}{\boldsymbol\theta^{(k)}}\), and repeat until convergence.
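As a rough sketch of the alternating idea, the toy example below replaces the state space model with a simple mean model \(y_t = \mu + \gamma_t + v_t\); the hard-thresholding update for the \(\gamma_t\) is the standard one for an \(L_0\)-style penalty, not the authors' exact implementation, and all numbers are illustrative.

```python
import numpy as np

# Toy stand-in for the SSM: y_t = mu + gamma_t + v_t with v_t ~ N(0, 1).
rng = np.random.default_rng(1)
n = 200
y = rng.normal(0.0, 1.0, n) + 10.0               # mu = 10
y[[20, 80, 150]] += np.array([8.0, -9.0, 7.0])   # inject three outliers

lam = 4.0                 # penalty level (illustrative choice)
gamma = np.zeros(n)
for k in range(20):
    mu = np.mean(y - gamma)                      # theta-step: ML fit given gamma
    r = y - mu                                   # residuals
    gamma = np.where(np.abs(r) > lam, r, 0.0)    # gamma-step: hard thresholding

print(np.flatnonzero(gamma))                     # indices flagged as outliers
```

In the full method, the \(\theta\)-step is a Kalman-filter-based likelihood maximization rather than a sample mean, but the alternation has the same shape.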
Choosing \(\lambda\)
The Bayesian Information Criterion (BIC) is a data-driven way of selecting \(\lambda\), \[\begin{align}
\text{BIC}(\lambda) = - 2 \operatorname{Log-Likelihood}(\hat{\boldsymbol{\theta}}_\lambda, \hat{\boldsymbol{\gamma}}_\lambda) + |\operatorname{nz}(\lambda)|\log n,
\end{align}\] where \(|\operatorname{nz}(\lambda)|\) is the number of timepoints that have been flagged as outliers.
Interpretation:
First term: model fit (better fit => smaller value)
Second term: penalty for model complexity (more detected outliers => higher penalty).
This balances fit against parsimony, avoiding both over-detection (too many flagged outliers) and under-detection (missing true outliers).
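A minimal sketch of selecting \(\lambda\) over a grid by BIC, again using a toy mean model with unit noise variance (so \(-2\times\) the log-likelihood reduces to the residual sum of squares up to constants); the full method would refit the state space model at each \(\lambda\), and the grid values are illustrative.

```python
import numpy as np

# Toy data: y_t = mu + gamma_t + v_t with three injected outliers.
rng = np.random.default_rng(1)
n = 200
y = rng.normal(0.0, 1.0, n) + 10.0
y[[20, 80, 150]] += np.array([8.0, -9.0, 7.0])

def fit(lam, iters=20):
    """Alternating fit: mean given shifts, then hard-threshold residuals."""
    gamma = np.zeros(n)
    for _ in range(iters):
        mu = np.mean(y - gamma)
        gamma = np.where(np.abs(y - mu) > lam, y - mu, 0.0)
    return mu, gamma

def bic(lam):
    mu, gamma = fit(lam)
    resid = y - mu - gamma
    neg2_loglik = np.sum(resid**2)   # -2 log-lik up to constants when sigma^2 = 1
    return neg2_loglik + np.count_nonzero(gamma) * np.log(n)

grid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best = min(grid, key=bic)
print(best)   # lambda minimizing BIC over the grid
```

Small \(\lambda\) flags many noise points and pays a large complexity penalty; large \(\lambda\) misses true outliers and pays in fit.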
First difference correlated random walk model
A specific example of a state space model, \[\begin{align}
\mathbf{x}_t & = \mathbf{x}_{t-1} + \alpha (\mathbf{x}_{t-1} - \mathbf{x}_{t-2}) + \mathbf{w}_{t} \\
\mathbf{y}_t & = \mathbf{x}_t + \mathbf{v}_t.
\end{align}\]
\(\sigma^2_{\mathbf{v},\text{lon}}\) and \(\sigma^2_{\mathbf{v},\text{lat}}\) are observation noise variances
\(\sigma^2_{\mathbf{w},\text{lon}}\) and \(\sigma^2_{\mathbf{w},\text{lat}}\) are state noise variances
The position at time \(t\), \(\mathbf{x}_t\), depends on the position at time \(t-1\), \(\mathbf{x}_{t-1}\), and the velocity at time \(t-1\), \(\mathbf{x}_{t-1} - \mathbf{x}_{t-2}\).
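The first difference correlated random walk is straightforward to simulate; the values of \(\alpha\) and the noise standard deviations below are illustrative, not from the source.

```python
import numpy as np

# Simulate the first difference correlated random walk in two dimensions
# (lon, lat); alpha and the noise scales are illustrative values.
rng = np.random.default_rng(0)
n, alpha = 500, 0.7
sigma_w = np.array([0.05, 0.05])   # state (process) noise sd: lon, lat
sigma_v = np.array([0.2, 0.2])     # observation noise sd: lon, lat

x = np.zeros((n, 2))
for t in range(2, n):
    # new position = previous position + damped previous velocity + state noise
    x[t] = x[t - 1] + alpha * (x[t - 1] - x[t - 2]) + rng.normal(0.0, sigma_w)

y = x + rng.normal(0.0, sigma_v, size=(n, 2))   # noisy observed track
```

With \(0 < \alpha < 1\) successive displacements are positively correlated, giving the directional persistence typical of animal tracks.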
Benchmark methods
Classical SSM (maximum likelihood estimator)
ML estimator with Huber function replacing the quadratic loss (Crevits & Croux, 2019)
Maximum trimmed likelihood method discarding a fraction of the largest Mahalanobis distances from the likelihood (Crevits & Croux, 2019)
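For reference, the Huber function used by the second benchmark is quadratic for small residuals and linear for large ones, which caps the influence of outlying observations; the tuning constant \(k = 1.345\) below is the conventional default for 95% efficiency under Gaussian noise, an assumption not stated in the source.

```python
import numpy as np

# Huber loss: quadratic for |r| <= k, linear beyond.
# k = 1.345 is the conventional tuning constant (illustrative choice).
def huber(r, k=1.345):
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= k, 0.5 * a**2, k * a - 0.5 * k**2)

print(huber([0.5, 3.0]))   # small residual: quadratic; large residual: linear
```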