TL;DR
Use a log transform on the target and any strictly-positive, heavy-tailed features, then standardize using statistics computed on the training period only. Add causal rolling features (past-only rolling mean/std/z-score) for local context. Rolling “normalization” is fine only if it’s causal and you apply the same procedure at inference. Avoid leakage. Evaluate with a scale-aware loss or model on the log scale.
Longer answer
1) “Standard practice” for wide-range series
If values are strictly positive (or non-negative): use log1p or a power transform (Box–Cox requires strictly positive data; Yeo–Johnson also handles zeros and negatives). This compresses 100×–1000× ranges, so small values don’t vanish relative to huge ones.
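A minimal sketch with numpy/scikit-learn; the synthetic `y` is just a stand-in for your series, and in practice you would fit the transformer on the training window only (next point):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Toy heavy-tailed, strictly positive series (illustration only)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.5, size=1000)

# Simple option: log1p compresses the 100x-1000x range
y_log = np.log1p(y)

# Power-transform option: Box-Cox needs strictly positive data,
# Yeo-Johnson also handles zeros/negatives
pt = PowerTransformer(method="box-cox")          # or method="yeo-johnson"
y_bc = pt.fit_transform(y.reshape(-1, 1)).ravel()

# Invert later with np.expm1(y_log) or pt.inverse_transform(...)
```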
Fit a scaler on the training window only and reuse it for validation/test. A standard choice is StandardScaler; if you have extreme outliers, consider RobustScaler (median/IQR) or a QuantileTransformer mapping to a normal or uniform output distribution.
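A sketch of the train-only fit, assuming a chronological split on a hypothetical 1-D array `series`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical series ordered in time
series = np.random.default_rng(1).lognormal(3.0, 1.5, size=1000)
split = int(0.8 * len(series))            # chronological split, no shuffling
train, test = series[:split], series[split:]

scaler = StandardScaler()                 # or RobustScaler() for extreme outliers
train_scaled = scaler.fit_transform(np.log1p(train).reshape(-1, 1))
test_scaled = scaler.transform(np.log1p(test).reshape(-1, 1))   # reuse train stats
```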
Train LSTM on log1p(y) and exponentiate predictions with expm1. If you need unbiased back-transform under squared-error training, consider a smearing correction (Duan).
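A sketch of the back-transform and Duan's smearing correction, assuming a model trained with squared error on log1p(y); the `y_*` and `pred_log_*` arrays below are toy stand-ins for real targets and model outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: y_* are raw targets, pred_log_* are model outputs on the
# log1p scale (e.g. from an LSTM trained with MSE on log1p(y))
y_train = rng.lognormal(3.0, 1.0, size=800)
pred_log_train = np.log1p(y_train) + rng.normal(0, 0.3, size=800)   # pretend fit
y_test = rng.lognormal(3.0, 1.0, size=200)
pred_log_test = np.log1p(y_test) + rng.normal(0, 0.3, size=200)

# Simple back-transform
pred_naive = np.expm1(pred_log_test)

# Duan's smearing estimator: average exponentiated training residuals to
# correct retransformation bias under squared-error training on the log scale
resid = np.log1p(y_train) - pred_log_train
smear = np.exp(resid).mean()
pred_smeared = np.exp(pred_log_test) * smear - 1.0   # because log1p(y) = log(1 + y)
```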
- Consider differences or returns
If level non-stationarity dominates, model first differences or percentage changes (returns). This removes trend/scale but changes the forecasting target, so convert predictions back to levels when reporting.
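For example, with pandas on a hypothetical series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series as a pandas Series indexed by time
s = pd.Series(np.random.default_rng(3).lognormal(3.0, 1.0, size=500))

diffs = s.diff()             # first differences (removes level/trend)
returns = s.pct_change()     # percentage changes / returns

# To report forecasts in levels, invert:
#   next_level = last_level + predicted_diff
#   next_level = last_level * (1 + predicted_return)
```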
2) Rolling/“online” normalization
Good idea if done causally.
Compute rolling stats using only past data at each timestamp (no future look-ahead). Example: an expanding mean/std or a fixed-width rolling window aligned to the past. Use the same real-time procedure at inference.
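A sketch of causal rolling normalization with pandas; `shift(1)` is what keeps the statistics strictly in the past:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(4).lognormal(3.0, 1.0, size=500))

# shift(1) makes the statistics past-only: the window at time t ends at t-1,
# so the current value never contributes to its own normalization
roll_mean = s.rolling(window=30, min_periods=10).mean().shift(1)
roll_std = s.rolling(window=30, min_periods=10).std().shift(1)

# Expanding (all-of-the-past) alternative
exp_mean = s.expanding(min_periods=10).mean().shift(1)

s_causal = (s - roll_mean) / roll_std    # causal rolling z-score
```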
Bad idea if it leaks.
Do not compute mean/std over the whole series before splitting. Do not let the window include the current target or any future observations.
3) Add local context instead of normalizing everything away
- Create rolling features: past-window mean, std, min/max, and a past-only z-score:
z_t = (x_t − rolling_mean_past) / rolling_std_past.
This gives the model local scale info while global scaling keeps optimization stable.
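A sketch of building those features on a hypothetical column `x`; the window length 30 is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.default_rng(5).lognormal(3.0, 1.0, size=500)})

# Past-only window: shifting by one step makes the window end at t-1
past = df["x"].shift(1)
win = past.rolling(window=30, min_periods=10)

df["roll_mean"] = win.mean()
df["roll_std"] = win.std()
df["roll_min"] = win.min()
df["roll_max"] = win.max()
df["z"] = (df["x"] - df["roll_mean"]) / df["roll_std"]   # past-only z-score

df = df.dropna()   # drop warm-up rows before training
```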
4) Loss choice
When scale varies, L1 on the log scale, log-cosh, or MAPE/SMAPE can be more informative than plain MSE on raw levels (note that MAPE is unstable when targets are near zero). If you train on log(y), plain MSE on that scale often works well.
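Plain-numpy versions of these losses as a reference sketch; deep-learning frameworks have or can wrap equivalents, and `y_true`/`y_pred` are assumed to be raw-level arrays:

```python
import numpy as np

def mae_log(y_true, y_pred, eps=1e-8):
    # L1 on the log scale: roughly a relative error for positive targets
    return np.mean(np.abs(np.log1p(y_true) - np.log1p(np.maximum(y_pred, 0.0))))

def log_cosh(y_true, y_pred):
    # Behaves like MSE for small errors and like MAE for large ones
    return np.mean(np.log(np.cosh(y_pred - y_true)))

def smape(y_true, y_pred, eps=1e-8):
    # Symmetric MAPE: less prone than MAPE to blowing up near zero targets
    return np.mean(2.0 * np.abs(y_pred - y_true)
                   / (np.abs(y_true) + np.abs(y_pred) + eps))
```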