Outlier-robust estimation of state-space models using a penalized approach


Rajan Shankar¹, Garth Tarr¹, Ines Wilms², Jakob Raymaekers³

¹University of Sydney, ²Maastricht University, ³University of Antwerp

Blue whale data

Goal: Recover the true path taken by the whale

Why? Animal scientists need accurate movement paths to study animal behaviour


Modelling considerations:

  • Randomness of the whale
  • Measurement error of the satellite

Goal

State-space model

\[\begin{align} \mathbf x_t &= \Phi \mathbf x_{t-1} + \mathbf w_t \\ \mathbf y_t &= A \mathbf x_t + \mathbf v_t \\ \end{align}\]

  • \(\mathbf x_t\): state vector
    • true position of object
  • \(\Phi\): state transition matrix
  • \(\mathbf w_t \sim N(\mathbf 0, \Sigma_\mathbf{w})\)
  • \(\mathbf y_t\): observation vector
    • measured position of object
  • \(A\): observation / measurement matrix
  • \(\mathbf v_t \sim N(\mathbf 0, \Sigma_\mathbf{v})\)

Usually, \(\Phi\), \(\Sigma_\mathbf{w}\) and \(\Sigma_\mathbf{v}\) depend on model parameters \(\boldsymbol \theta\).
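As a concrete illustration, the two equations above can be simulated directly. The sketch below (function name and parameter choices are our own, not from any package) draws a latent path and noisy observations, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ssm(Phi, A, Sigma_w, Sigma_v, x0, n):
    """Simulate n steps of the linear-Gaussian state-space model above."""
    d_x, d_y = Phi.shape[0], A.shape[0]
    x = np.empty((n, d_x))   # latent states x_t
    y = np.empty((n, d_y))   # observations y_t
    x_prev = np.asarray(x0, dtype=float)
    for t in range(n):
        x_prev = Phi @ x_prev + rng.multivariate_normal(np.zeros(d_x), Sigma_w)
        x[t] = x_prev
        y[t] = A @ x_prev + rng.multivariate_normal(np.zeros(d_y), Sigma_v)
    return x, y

# Example: a 2-D random walk (Phi = A = I) observed with noise
x, y = simulate_ssm(np.eye(2), np.eye(2),
                    0.1 * np.eye(2), 0.5 * np.eye(2),
                    np.zeros(2), n=200)
```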

Where are state-space models used?

  • Econometrics & Finance
    • Unobserved components of GDP, inflation, interest rates
  • Engineering & Control
    • Signal processing and target tracking (radar, sonar, GPS)
  • Environmental & Ecological Science
    • Animal movement from telemetry data
  • Biomedicine
    • EEG/ECG signal analysis

Key Idea: SSMs apply whenever a time series is driven by an unobserved dynamic state.

Classical estimation

Given a parameter vector \(\boldsymbol \theta\):

  • \(\mathbf {\hat y}_{t|t-1}(\boldsymbol\theta)\) is the prediction for observation \(t\) given data up to time \(t-1\)
  • \(\mathbf r_t(\boldsymbol\theta) := \mathbf y_{t} - \mathbf {\hat y}_{t|t-1}(\boldsymbol\theta)\) is the residual
  • \(S_{t|t-1}(\boldsymbol\theta)\) is the prediction variance

\(\mathbf {\hat y}_{t|t-1}(\boldsymbol\theta)\) and \(S_{t|t-1}(\boldsymbol\theta)\) are computed using the Kalman filter

\[ \min_{\boldsymbol\theta}\sum_{t=1}^n \left\{\log \left|S_{t|t-1}(\boldsymbol\theta)\right| + \mathbf r_t(\boldsymbol\theta)^\top S_{t|t-1}^{-1}(\boldsymbol\theta) \mathbf r_t(\boldsymbol\theta)\right\} \]

The estimate for \(\boldsymbol \theta\) can be found using standard optimisation routines.
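A minimal sketch of the objective above, assuming a textbook Kalman filter recursion (the function and initialisation arguments are illustrative, not from any particular package):

```python
import numpy as np

def kalman_nll(y, Phi, A, Sigma_w, Sigma_v, x0, P0):
    """Run the Kalman filter over y and return
    sum_t { log|S_{t|t-1}| + r_t' S_{t|t-1}^{-1} r_t }."""
    x, P = np.asarray(x0, float), np.asarray(P0, float)
    nll = 0.0
    for yt in y:
        # Prediction step
        x_pred = Phi @ x
        P_pred = Phi @ P @ Phi.T + Sigma_w
        S = A @ P_pred @ A.T + Sigma_v        # prediction variance S_{t|t-1}
        r = yt - A @ x_pred                   # residual r_t
        _, logdet = np.linalg.slogdet(S)
        nll += logdet + r @ np.linalg.solve(S, r)
        # Update step
        K = P_pred @ A.T @ np.linalg.inv(S)   # Kalman gain
        x = x_pred + K @ r
        P = (np.eye(len(x)) - K @ A) @ P_pred
    return nll

# Tiny example: a 1-D local-level model on constant data
nll = kalman_nll(np.ones((10, 1)), np.eye(1), np.eye(1),
                 0.1 * np.eye(1), np.eye(1), np.zeros(1), np.eye(1))
```

Passing this function to a standard optimiser over \(\boldsymbol\theta\) gives the classical estimate.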

Outliers

Mean-shift formulation

Over-parameterise the observation process, \[\mathbf{y}_t = A \mathbf{x}_t + {\color{#e64626}{\boldsymbol{\gamma}_t}} + \mathbf{v}_t\] for \(t=1,\ldots,n\), where \({\color{#e64626}{\boldsymbol{\gamma}_1}},\dots,{\color{#e64626}{\boldsymbol{\gamma}_n}}\) are shift parameters.


  • If \(\mathbf{y}_t\) is identified as an outlier then \(\color{#e64626}{\boldsymbol{\gamma}_t}\) will be non-zero.


  • Want to encourage some sparsity here!

Estimation

Minimize over \(\boldsymbol\theta, \boldsymbol\gamma_{t}\) (\(t=1,\dots,n\)):

\[ \small{ \sum_{t=1}^n \left\{{\color{#e64626}{1_{\{\boldsymbol\gamma_t = \mathbf0\}}}}\log \left|S_{t|t-1}(\boldsymbol\theta)\right| + \left(\mathbf r_t(\boldsymbol\theta) - {\color{#e64626}{\boldsymbol\gamma_t}}\right)^\top S_{t|t-1}^{-1}(\boldsymbol\theta) \left(\mathbf r_t(\boldsymbol\theta) - \color{#e64626}{\boldsymbol\gamma_t}\right)\right\} + {\color{#e64626}{\sum_{t=1}^n P(\boldsymbol{\gamma}_t; \lambda)}} } \]

  • \(P\) is the hard penalty:

\[ P(\boldsymbol{\gamma}_t; \lambda) =\begin{cases}\lambda^2 & \text{if } \boldsymbol{\gamma}_t \neq \mathbf{0} \\ 0 & \text{otherwise} \end{cases} \]

  • \(\lambda\) is a tuning parameter

The “hard” penalty

  • We’re using an \(L_0\)-style penalty, which is “hard” in the sense that there is a fixed cost for activating an outlier at time \(t\), \[ P(\boldsymbol{\gamma}_t; \lambda) =\begin{cases}\lambda^2 & \text{if } \boldsymbol{\gamma}_t \neq \mathbf 0 \\ 0 & \text{otherwise} \end{cases} \]

  • Contrast with a “soft” type of penalty, e.g. \(L_1\) (Lasso), where the penalty grows with the size of the outlier and gives continuous shrinkage.

  • Benefit: hard thresholding should give better control over the number of outliers identified, helping to avoid masking and swamping issues (She & Owen, 2011).
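The contrast between the two penalty types is easy to state in code (illustrative helper functions, not from the paper):

```python
import numpy as np

def hard_penalty(gamma_t, lam):
    """'Hard' (L0-style) penalty: fixed cost lam^2 once gamma_t is active."""
    return lam ** 2 if np.any(gamma_t != 0) else 0.0

def soft_penalty(gamma_t, lam):
    """'Soft' (L1, Lasso-style) penalty for contrast: grows with |gamma_t|."""
    return lam * np.sum(np.abs(gamma_t))
```

Note the hard penalty charges the same amount for a small shift as for a huge one, so large outliers are flagged without being shrunk.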

Alternating estimation algorithm

At iteration \(k\):

  1. Estimate \(\color{#e64626}{\boldsymbol\theta^{(k)}}\) given all \(\boldsymbol\gamma^{(k-1)}_{t}\) by finding the minimizer of:

\[ \small{ \sum_{t=1}^n \left\{1_{\{\boldsymbol\gamma^{(k-1)}_t = \mathbf 0\}}\log \left|S_{t|t-1}({\color{#e64626}{\boldsymbol\theta}})\right| + \left(\mathbf r_t({\color{#e64626}{\boldsymbol\theta}}) - \boldsymbol\gamma^{(k-1)}_t\right)^\top S_{t|t-1}^{-1}({\color{#e64626}{\boldsymbol\theta}}) \left(\mathbf r_t({\color{#e64626}{\boldsymbol\theta}}) - \boldsymbol\gamma^{(k-1)}_t\right)\right\} } \]

  2. Update each \(\color{#e64626}{\boldsymbol\gamma^{(k)}_{t}}\) given \(\boldsymbol\theta^{(k)}\) via hard-thresholding:

\[ \small{{\color{#e64626}{\boldsymbol\gamma_t^{(k)}}} = \begin{cases} \mathbf r_t(\boldsymbol\theta^{(k)}) & \text{if } \sqrt{\log\Big|S_{t|t-1}^{(k)}\Big| + \text{MD}^2\left(\mathbf{r}_t^{(k)}, S_{t|t-1}^{(k)}\right)} > \color{#e64626}{\lambda} \\ \mathbf 0 & \text{otherwise}\end{cases} } \]

Here \(\text{MD}^2(\mathbf r, S) = \mathbf r^\top S^{-1} \mathbf r\) is the squared Mahalanobis distance of the residual.

  3. Repeat until convergence.
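Step 2 of the algorithm can be sketched as a single function (names are our own; the score mirrors the thresholding condition above and assumes \(\log|S_t| + \text{MD}^2 \ge 0\)):

```python
import numpy as np

def update_gamma(r_t, S_t, lam):
    """Hard-threshold gamma_t given the current residual r_t and
    prediction variance S_t = S_{t|t-1}. MD^2 is the squared Mahalanobis
    distance r_t' S_t^{-1} r_t (assumes log|S_t| + MD^2 >= 0)."""
    md2 = r_t @ np.linalg.solve(S_t, r_t)
    score = np.sqrt(np.log(np.linalg.det(S_t)) + md2)
    return r_t.copy() if score > lam else np.zeros_like(r_t)
```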

Choosing \(\lambda\)

Bayesian Information Criterion (BIC) is a data-driven way of selecting \(\lambda\), \[\begin{align} \text{BIC}(\lambda) = - 2 \operatorname{Log-Likelihood}(\hat{\boldsymbol{\theta}}_\lambda, \hat{\boldsymbol{\gamma}}_\lambda) + |\text{nz}(\lambda)|\log n, \end{align}\] where \(|\text{nz}(\lambda)|\) is the number of time points that have been flagged as outliers.

  • Interpretation:
    • First term: model fit (better fit => smaller value)
    • Second term: penalty for model complexity (more detected outliers => higher penalty).
  • This balances fit vs parsimony, avoiding both over-detection (too many outliers) and under-detection (missing true outliers).
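In practice one fits the model over a grid of \(\lambda\) values and keeps the minimiser of the criterion; a sketch with made-up numbers (the grid and fit values are purely illustrative):

```python
import numpy as np

def bic(neg2_loglik, n_flagged, n):
    """BIC(lambda): -2 log-likelihood plus log(n) per flagged time point."""
    return neg2_loglik + n_flagged * np.log(n)

# Hypothetical grid: (lambda, -2 log-lik at the fit, # flagged time points)
grid = [(1.0, 250.0, 30), (2.0, 280.0, 12), (4.0, 340.0, 2)]
best_lam = min(grid, key=lambda g: bic(g[1], g[2], n=200))[0]
```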

First difference correlated random walk model

A specific example of a state-space model, \[\begin{align} \mathbf{x}_t & = \mathbf{x}_{t-1} + \alpha (\mathbf{x}_{t-1} - \mathbf{x}_{t-2}) + \mathbf{w}_{t} \\ \mathbf{y}_t & = \mathbf{x}_t + \mathbf{v}_t. \end{align}\]

Parameter vector \(\boldsymbol\theta = (\alpha, \sigma^2_{\mathbf{v},\text{lon}}, \sigma^2_{\mathbf{v},\text{lat}}, \sigma^2_{\mathbf{w},\text{lon}}, \sigma^2_{\mathbf{w},\text{lat}} )\), where

  • \(\alpha\) is the autocorrelation parameter
  • \(\sigma^2_{\mathbf{v},\text{lon}}\) and \(\sigma^2_{\mathbf{v},\text{lat}}\) are observation noise variances
  • \(\sigma^2_{\mathbf{w},\text{lon}}\) and \(\sigma^2_{\mathbf{w},\text{lat}}\) are state noise variances

The position at time \(t\), \(\mathbf{x}_t\), depends on the position at time \(t-1\), \(\mathbf{x}_{t-1}\), and the velocity at time \(t-1\), \(\mathbf{x}_{t-1} - \mathbf{x}_{t-2}\).
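One standard way to cast this model in the generic linear SSM form is to stack \(\mathbf z_t = (\mathbf x_t, \mathbf x_{t-1})\) per coordinate; the helper below is our own illustrative construction (state order lon\(_t\), lon\(_{t-1}\), lat\(_t\), lat\(_{t-1}\)):

```python
import numpy as np

def dcrw_matrices(alpha, s2_w, s2_v):
    """Build (Phi, A, Sigma_w, Sigma_v) for the first-difference correlated
    random walk, stacking z_t = (x_t, x_{t-1}) per coordinate.
    s2_w = (lon, lat) state noise variances, s2_v = observation variances."""
    # One coordinate: x_t = (1 + alpha) x_{t-1} - alpha x_{t-2} + w_t
    F = np.array([[1 + alpha, -alpha],
                  [1.0,       0.0]])
    Phi = np.kron(np.eye(2), F)                      # 4x4, lon then lat block
    A = np.kron(np.eye(2), np.array([[1.0, 0.0]]))   # observe current position
    Sigma_w = np.diag([s2_w[0], 0.0, s2_w[1], 0.0])  # noise enters x_t only
    Sigma_v = np.diag(s2_v)
    return Phi, A, Sigma_w, Sigma_v

Phi, A, Sigma_w, Sigma_v = dcrw_matrices(0.5, (0.1, 0.2), (0.3, 0.4))
```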

Benchmark methods

  • Classical SSM (maximum likelihood estimator)
  • ML estimator with Huber function replacing the quadratic loss (Crevits & Croux, 2019)
  • Maximum trimmed likelihood method discarding a fraction of the largest Mahalanobis distances from the likelihood (Crevits & Croux, 2019)

Simulation settings

  • First difference correlated random walk model
    • \(n\) = 100, 200, 500, 1000
  • Outlier configurations:
    • Clean data
    • Fixed distance outliers
    • Cluster outliers
    • Multilevel outliers
  • True contamination rate 10%
  • Tuning parameter selected using BIC

Fixed distance – data

Fixed distance – true outliers

Fixed distance – tuning parameter selection

Fixed distance – flagged outliers

Fixed distance – comparison

Cluster – data

Cluster – true outliers

Cluster – tuning parameter selection

Cluster – flagged outliers

Cluster – comparison

Multilevel – data

Multilevel – true outliers

Multilevel – tuning parameter selection

Multilevel – flagged outliers

Multilevel – comparison

Detecting outliers

Parameter estimation: relative RMSE

Application to animal tracking

How might this be used in practice?

  • Increasingly common for cows to have wearable trackers (e.g. for virtual fencing)
  • Potentially use this data to proactively identify health issues

Advantages:

  • Robustly estimate parameters (e.g. the autoregressive coefficient)
  • Automatic identification of outliers (problematic devices?)

Summary

  • Robustly estimate parameters of a state-space model in the presence of measurement outliers
  • Outlier identification
  • Applications to animal tracking
  • Play with the app: garthtarr.shinyapps.io/robularized

Acknowledgements

  • Rajan Shankar (U of Sydney)
  • Ines Wilms (Maastricht U)
  • Jakob Raymaekers (U of Antwerp)

garth.tarr@sydney.edu.au

arXiv:2511.15155

References

Cipra, T., & Romera, R. (1997). Kalman filter with outliers and missing observations. Test, 6(2), 379–395. DOI: 10.1007/BF02564705
Crevits, R., & Croux, C. (2019). Robust estimation of linear state space models. Communications in Statistics - Simulation and Computation, 48(6), 1694–1705. DOI: 10.1080/03610918.2017.1422752
Laska, J.N., Davenport, M.A., & Baraniuk, R.G. (2009). Exact signal recovery from sparsely corrupted measurements through the Pursuit of Justice. In 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers (pp. 1556–1560). DOI: 10.1109/ACSSC.2009.5470141
McCann, L., & Welsch, R.E. (2007). Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis, 52(1), 249–257. DOI: 10.1016/j.csda.2007.01.012
Petris, G. (2010). An R package for dynamic linear models. Journal of Statistical Software, 36(12), 1–16. https://www.jstatsoft.org/v36/i12/
Shankar, R., Wilms, I., Raymaekers, J., & Tarr, G. (2025). Robust outlier-adjusted mean-shift estimation of state-space models. https://arxiv.org/abs/2511.15155
She, Y., & Owen, A.B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494), 626–639. DOI: 10.1198/jasa.2011.tm10390

Out-of-sample performance

  • 20 new observations
  • Compute one-step ahead predictions
  • Compute Mean Squared Forecast Error (MSFE)
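The MSFE computation is the standard one (illustrative helper, assuming one row per hold-out time point):

```python
import numpy as np

def msfe(y_test, y_pred):
    """Mean squared forecast error over hold-out one-step-ahead predictions."""
    return np.mean(np.sum((np.asarray(y_test) - np.asarray(y_pred)) ** 2, axis=1))
```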

Fixed distance – out-of-sample

Fixed distance – out-of-sample comparison

Cluster – out-of-sample

Cluster – out-of-sample comparison

Multilevel – out-of-sample

Multilevel – out-of-sample comparison

Outliers in the out-of-sample data

We’ve discussed outliers in the training data, but what if there are outliers in the testing data?

  • Use a “threshold” filter on new observations, analogous to the hard-thresholding step applied to the training data
  • Observations too far from their one-step-ahead predictions are flagged as outliers and excluded
  • Excluding observations can create “cascading” outliers, which we mitigate by inflating the prediction variance to help the filter “snap back” to the true trajectory
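The steps above can be sketched as a single filter step; the function, the Mahalanobis-distance cutoff, and the inflation factor are illustrative assumptions, not the paper’s exact rule:

```python
import numpy as np

def filter_step(y_t, y_pred, S, lam, inflate=10.0):
    """One test-time 'threshold' filter step (illustrative): flag y_t when its
    squared Mahalanobis distance from the prediction exceeds lam^2, and
    inflate the prediction variance so the filter can snap back."""
    r = y_t - y_pred
    md2 = r @ np.linalg.solve(S, r)
    if md2 > lam ** 2:
        return True, inflate * S   # outlier: exclude y_t, widen uncertainty
    return False, S

flagged, S_out = filter_step(np.array([5.0, 0.0]), np.zeros(2), np.eye(2), 3.0)
clean, _ = filter_step(np.array([1.0, 0.0]), np.zeros(2), np.eye(2), 3.0)
```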

Out-of-sample performance: clean data

Multilevel – out-of-sample with outlier

Multilevel – OOS comparison with outlier

Out-of-sample performance: contaminated data (5 units away)