TL;DR
Use a log transform on the target and any strictly-positive, heavy-tailed features, then standardize using statistics computed on the training period only. Add causal rolling features (past-only rolling mean/std/z-score) for local context. Rolling “normalization” is fine only if it’s causal and you apply the same procedure at inference. Avoid leakage. Evaluate with a scale-aware loss or model on the log scale.
Longer answer
1) “Standard practice” for wide-range series
If values are strictly positive (or non-negative): use log1p or a power transform (Box–Cox requires strictly positive data; Yeo–Johnson also handles zeros and negatives). This compresses 100×–1000× ranges, so small values don’t vanish relative to huge ones.
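A minimal sketch with numpy/scikit-learn; the synthetic `y` is just a stand-in for your series, and in practice you would fit the transformer on the training window only (next point):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Toy heavy-tailed, strictly positive series (illustration only)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.5, size=1000)

# Simple option: log1p compresses the 100x-1000x range
y_log = np.log1p(y)

# Power-transform option: Box-Cox needs strictly positive data,
# Yeo-Johnson also handles zeros/negatives
pt = PowerTransformer(method="box-cox")          # or method="yeo-johnson"
y_bc = pt.fit_transform(y.reshape(-1, 1)).ravel()

# Invert later with np.expm1(y_log) or pt.inverse_transform(...)
```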
Fit a scaler on the training window only and reuse it for validation/test. A standard choice is StandardScaler; if you have extreme outliers, consider RobustScaler (median/IQR) or a QuantileTransformer mapping to a normal or uniform output distribution.
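A sketch of the train-only fit, assuming a chronological split on a hypothetical 1-D array `series`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical series ordered in time
series = np.random.default_rng(1).lognormal(3.0, 1.5, size=1000)
split = int(0.8 * len(series))            # chronological split, no shuffling
train, test = series[:split], series[split:]

scaler = StandardScaler()                 # or RobustScaler() for extreme outliers
train_scaled = scaler.fit_transform(np.log1p(train).reshape(-1, 1))
test_scaled = scaler.transform(np.log1p(test).reshape(-1, 1))   # reuse train stats
```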
Train LSTM on log1p(y) and exponentiate predictions with expm1. If you need unbiased back-transform under squared-error training, consider a smearing correction (Duan).
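A sketch of the back-transform and Duan's smearing correction, assuming a model trained with squared error on log1p(y); the `y_*` and `pred_log_*` arrays below are toy stand-ins for real targets and model outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: y_* are raw targets, pred_log_* are model outputs on the
# log1p scale (e.g. from an LSTM trained with MSE on log1p(y))
y_train = rng.lognormal(3.0, 1.0, size=800)
pred_log_train = np.log1p(y_train) + rng.normal(0, 0.3, size=800)   # pretend fit
y_test = rng.lognormal(3.0, 1.0, size=200)
pred_log_test = np.log1p(y_test) + rng.normal(0, 0.3, size=200)

# Simple back-transform
pred_naive = np.expm1(pred_log_test)

# Duan's smearing estimator: average exponentiated training residuals to
# correct retransformation bias under squared-error training on the log scale
resid = np.log1p(y_train) - pred_log_train
smear = np.exp(resid).mean()
pred_smeared = np.exp(pred_log_test) * smear - 1.0   # because log1p(y) = log(1 + y)
```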
- Consider differences or returns
If level non-stationarity dominates, model first differences or percentage changes (returns). This removes trend/scale but changes the forecasting target, so convert predictions back to levels when reporting.
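For example, with pandas on a hypothetical series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series as a pandas Series indexed by time
s = pd.Series(np.random.default_rng(3).lognormal(3.0, 1.0, size=500))

diffs = s.diff()             # first differences (removes level/trend)
returns = s.pct_change()     # percentage changes / returns

# To report forecasts in levels, invert:
#   next_level = last_level + predicted_diff
#   next_level = last_level * (1 + predicted_return)
```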
2) Rolling/“online” normalization
Good idea if done causally.
Compute rolling stats using only past data at each timestamp (no future look-ahead). Example: an expanding mean/std or a fixed-width rolling window aligned to the past. Use the same real-time procedure at inference.
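A sketch of causal rolling normalization with pandas; `shift(1)` is what keeps the statistics strictly in the past:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(4).lognormal(3.0, 1.0, size=500))

# shift(1) makes the statistics past-only: the window at time t ends at t-1,
# so the current value never contributes to its own normalization
roll_mean = s.rolling(window=30, min_periods=10).mean().shift(1)
roll_std = s.rolling(window=30, min_periods=10).std().shift(1)

# Expanding (all-of-the-past) alternative
exp_mean = s.expanding(min_periods=10).mean().shift(1)

s_causal = (s - roll_mean) / roll_std    # causal rolling z-score
```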
Bad idea if it leaks.
Do not compute mean/std over the whole series before splitting. Do not let the window include the current target or any future observations.
3) Add local context instead of normalizing everything away
- Create rolling features: past-window mean, std, min/max, and a past-only z-score:
z_t = (x_t − rolling_mean_past) / rolling_std_past.
This gives the model local scale info while global scaling keeps optimization stable.
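A sketch of building those features on a hypothetical column `x`; the window length 30 is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.default_rng(5).lognormal(3.0, 1.0, size=500)})

# Past-only window: shifting by one step makes the window end at t-1
past = df["x"].shift(1)
win = past.rolling(window=30, min_periods=10)

df["roll_mean"] = win.mean()
df["roll_std"] = win.std()
df["roll_min"] = win.min()
df["roll_max"] = win.max()
df["z"] = (df["x"] - df["roll_mean"]) / df["roll_std"]   # past-only z-score

df = df.dropna()   # drop warm-up rows before training
```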
4) Loss choice
When scale varies, L1 on the log scale, log-cosh, or MAPE/SMAPE can be more informative than plain MSE on raw levels (note that MAPE is unstable when targets are near zero). If you train on log(y), plain MSE on that scale often works well.
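Plain-numpy versions of these losses as a reference sketch; deep-learning frameworks have or can wrap equivalents, and `y_true`/`y_pred` are assumed to be raw-level arrays:

```python
import numpy as np

def mae_log(y_true, y_pred, eps=1e-8):
    # L1 on the log scale: roughly a relative error for positive targets
    return np.mean(np.abs(np.log1p(y_true) - np.log1p(np.maximum(y_pred, 0.0))))

def log_cosh(y_true, y_pred):
    # Behaves like MSE for small errors and like MAE for large ones
    return np.mean(np.log(np.cosh(y_pred - y_true)))

def smape(y_true, y_pred, eps=1e-8):
    # Symmetric MAPE: less prone than MAPE to blowing up near zero targets
    return np.mean(2.0 * np.abs(y_pred - y_true)
                   / (np.abs(y_true) + np.abs(y_pred) + eps))
```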