The estimate for \(\boldsymbol \theta\) can be found using standard optimisation routines.
Outliers
Mean-shift formulation
Over-parameterise the observation process, \[\mathbf{y}_t = A \mathbf{x}_t + {\color{#e64626}{\boldsymbol{\gamma}_t}} + \mathbf{v}_t\] for \(t=1,\ldots,n\) where \({\color{#e64626}{\boldsymbol{\gamma}_1}},\dots,{\color{#e64626}{\boldsymbol{\gamma}_n}}\) are shift parameters.
If \(\mathbf{y}_t\) is identified as an outlier then \(\color{#e64626}{\boldsymbol{\gamma}_t}\) will be non-zero.
Want to encourage some sparsity here!
Estimation
Minimize over \(\boldsymbol\theta\) and \(\boldsymbol\gamma_{t}\) (\(t=1,\dots,n\)) the penalized negative log-likelihood, \[ -\operatorname{Log-Likelihood}(\boldsymbol{\theta}, \boldsymbol{\gamma}_1, \dots, \boldsymbol{\gamma}_n) + \sum_{t=1}^{n} P(\boldsymbol{\gamma}_t; \lambda). \]
We’re using an \(L_0\)-style penalty, which is “hard” in the sense that there is a fixed cost for activating an outlier at time \(t\), \[
P(\boldsymbol{\gamma}_t; \lambda) =\begin{cases}\lambda^2 & \text{if } \boldsymbol{\gamma}_t \neq 0 \\ 0 & \text{otherwise} \end{cases}
\]
Contrast with a “soft” type of penalty, e.g. \(L_1\) (Lasso), where the penalty grows with the size of the outlier and gives continuous shrinkage.
Benefit: hard thresholding gives better control over the number of outliers identified, mitigating the masking and swamping issues that affect soft penalties (She & Owen, 2011).
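The contrast between the hard and soft penalties can be sketched numerically; the function names and the value \(\lambda = 1.5\) below are illustrative, not from the source.

```python
import numpy as np

def penalty_l0(gamma, lam):
    """Hard (L0-style) penalty: fixed cost lambda^2 whenever the shift is active."""
    return lam**2 if np.any(np.asarray(gamma) != 0) else 0.0

def penalty_l1(gamma, lam):
    """Soft (L1/Lasso-style) penalty: cost grows with the size of the shift."""
    return lam * np.sum(np.abs(gamma))

# A large shift pays the same hard penalty as a small one,
# but an increasingly large soft penalty.
small, large = np.array([0.1, 0.0]), np.array([5.0, 2.0])
lam = 1.5
print(penalty_l0(small, lam), penalty_l0(large, lam))  # 2.25 2.25
print(penalty_l1(small, lam), penalty_l1(large, lam))
```

This is exactly why the hard penalty controls the *number* of flagged outliers rather than their magnitude.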
Alternating estimation algorithm
At iteration \(k\):
Estimate \(\color{#e64626}{\boldsymbol\theta^{(k)}}\) given all \(\boldsymbol\gamma^{(k-1)}_{t}\) by minimizing the penalized objective with the shifts held fixed; then update each \(\boldsymbol\gamma^{(k)}_{t}\) given \(\color{#e64626}{\boldsymbol\theta^{(k)}}\), and repeat until convergence.
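As a rough sketch of the alternating idea, the toy example below replaces the state space model with a simple mean model \(y_t = \mu + \gamma_t + v_t\); the hard-thresholding update for the \(\gamma_t\) is the standard one for an \(L_0\)-style penalty, not the authors' exact implementation, and all numbers are illustrative.

```python
import numpy as np

# Toy stand-in for the SSM: y_t = mu + gamma_t + v_t with v_t ~ N(0, 1).
rng = np.random.default_rng(1)
n = 200
y = rng.normal(0.0, 1.0, n) + 10.0               # mu = 10
y[[20, 80, 150]] += np.array([8.0, -9.0, 7.0])   # inject three outliers

lam = 4.0                 # penalty level (illustrative choice)
gamma = np.zeros(n)
for k in range(20):
    mu = np.mean(y - gamma)                      # theta-step: ML fit given gamma
    r = y - mu                                   # residuals
    gamma = np.where(np.abs(r) > lam, r, 0.0)    # gamma-step: hard thresholding

print(np.flatnonzero(gamma))                     # indices flagged as outliers
```

In the full method, the \(\theta\)-step is a Kalman-filter-based likelihood maximization rather than a sample mean, but the alternation has the same shape.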
Choosing \(\lambda\)
The Bayesian Information Criterion (BIC) is a data-driven way of selecting \(\lambda\), \[\begin{align}
\text{BIC}(\lambda) = - 2 \operatorname{Log-Likelihood}(\hat{\boldsymbol{\theta}}_\lambda, \hat{\boldsymbol{\gamma}}_\lambda) + |\operatorname{nz}(\lambda)|\log n,
\end{align}\] where \(|\operatorname{nz}(\lambda)|\) is the number of timepoints that have been flagged as outliers.
Interpretation:
First term: model fit (better fit => smaller value)
Second term: penalty for model complexity (more detected outliers => higher penalty).
This balances fit against parsimony, avoiding both over-detection (too many flagged outliers) and under-detection (missing true outliers).
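A minimal sketch of selecting \(\lambda\) over a grid by BIC, again using a toy mean model with unit noise variance (so \(-2\times\) the log-likelihood reduces to the residual sum of squares up to constants); the full method would refit the state space model at each \(\lambda\), and the grid values are illustrative.

```python
import numpy as np

# Toy data: y_t = mu + gamma_t + v_t with three injected outliers.
rng = np.random.default_rng(1)
n = 200
y = rng.normal(0.0, 1.0, n) + 10.0
y[[20, 80, 150]] += np.array([8.0, -9.0, 7.0])

def fit(lam, iters=20):
    """Alternating fit: mean given shifts, then hard-threshold residuals."""
    gamma = np.zeros(n)
    for _ in range(iters):
        mu = np.mean(y - gamma)
        gamma = np.where(np.abs(y - mu) > lam, y - mu, 0.0)
    return mu, gamma

def bic(lam):
    mu, gamma = fit(lam)
    resid = y - mu - gamma
    neg2_loglik = np.sum(resid**2)   # -2 log-lik up to constants when sigma^2 = 1
    return neg2_loglik + np.count_nonzero(gamma) * np.log(n)

grid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best = min(grid, key=bic)
print(best)   # lambda minimizing BIC over the grid
```

Small \(\lambda\) flags many noise points and pays a large complexity penalty; large \(\lambda\) misses true outliers and pays in fit.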
First difference correlated random walk model
A specific example of a state space model, \[\begin{align}
\mathbf{x}_t & = \mathbf{x}_{t-1} + \alpha (\mathbf{x}_{t-1} - \mathbf{x}_{t-2}) + \mathbf{w}_{t} \\
\mathbf{y}_t & = \mathbf{x}_t + \mathbf{v}_t.
\end{align}\]
\(\sigma^2_{\mathbf{v},\text{lon}}\) and \(\sigma^2_{\mathbf{v},\text{lat}}\) are observation noise variances
\(\sigma^2_{\mathbf{w},\text{lon}}\) and \(\sigma^2_{\mathbf{w},\text{lat}}\) are state noise variances
The position at time \(t\), \(\mathbf{x}_t\), depends on the position at time \(t-1\), \(\mathbf{x}_{t-1}\), and the velocity at time \(t-1\), \(\mathbf{x}_{t-1} - \mathbf{x}_{t-2}\).
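The first difference correlated random walk is straightforward to simulate; the values of \(\alpha\) and the noise standard deviations below are illustrative, not from the source.

```python
import numpy as np

# Simulate the first difference correlated random walk in two dimensions
# (lon, lat); alpha and the noise scales are illustrative values.
rng = np.random.default_rng(0)
n, alpha = 500, 0.7
sigma_w = np.array([0.05, 0.05])   # state (process) noise sd: lon, lat
sigma_v = np.array([0.2, 0.2])     # observation noise sd: lon, lat

x = np.zeros((n, 2))
for t in range(2, n):
    # new position = previous position + damped previous velocity + state noise
    x[t] = x[t - 1] + alpha * (x[t - 1] - x[t - 2]) + rng.normal(0.0, sigma_w)

y = x + rng.normal(0.0, sigma_v, size=(n, 2))   # noisy observed track
```

With \(0 < \alpha < 1\) successive displacements are positively correlated, giving the directional persistence typical of animal tracks.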
Benchmark methods
Classical SSM (maximum likelihood estimator)
ML estimator with Huber function replacing the quadratic loss (Crevits & Croux, 2019)
Maximum trimmed likelihood method discarding a fraction of the largest Mahalanobis distances from the likelihood (Crevits & Croux, 2019)
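For reference, the Huber function used by the second benchmark is quadratic for small residuals and linear for large ones, which caps the influence of outlying observations; the tuning constant \(k = 1.345\) below is the conventional default for 95% efficiency under Gaussian noise, an assumption not stated in the source.

```python
import numpy as np

# Huber loss: quadratic for |r| <= k, linear beyond.
# k = 1.345 is the conventional tuning constant (illustrative choice).
def huber(r, k=1.345):
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= k, 0.5 * a**2, k * a - 0.5 * k**2)

print(huber([0.5, 3.0]))   # small residual: quadratic; large residual: linear
```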