I'm experimenting with conformal prediction over high-frequency time data, using the following forest-based regression models for an in-sample forecasting task. The univariate (1D) time-series data has shape (8640, 1), meaning a single value column observed over 8640 time steps.
Short note: $8640 = 30 \times 288$ (each day has 288 observations). The data are collected at 5-minute epochs, which gives $12 \times 24 = 288$ observations per day.
The following code generates the time data:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# create data
df = pd.DataFrame({
    "TS": np.arange(0, 288 * 30),
    "value": np.abs(np.sin(2 * np.pi * np.arange(0, 288 * 30) / 7)
                    + np.random.normal(0, 20.1, size=288 * 30))  # generate seasonality
})

# Set 'TS' column as the index
df = df.set_index('TS')

# Plot 'value' column over the new index
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['value'])
plt.xlabel('Timesteps (epochs=5mins)')
plt.ylabel('value')
plt.title('1D data over Time')
plt.grid(True)
plt.show()

df.shape  # (8640, 1)
```

Let's say I split the data into train and test sets, where I want to forecast the tail of the data: the last 7 days $= 7 \times 288$ observations.
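For concreteness, here is a minimal sketch of the split I mean, using positional slicing on the `TS` index (the variable names `train_df` / `test_df` are just for illustration):

```python
# hold out the last 7 days (7 * 288 five-minute steps) as the test set
test_size = 7 * 288

train_df = df.iloc[:-test_size]   # first 23 days
test_df = df.iloc[-test_size:]    # last 7 days to be forecasted

train_df.shape, test_df.shape     # ((6624, 1), (2016, 1))
```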
I apply the following forest-based models with n_estimators=1, due to the high runtime and the high-frequency nature of the data (epoch = 5 mins).
```python
from sklearn.linear_model import Ridge
from lineartree import LinearForestRegressor

model = LinearForestRegressor(base_estimator=Ridge(random_state=42),
                              n_estimators=1,
                              n_jobs=-1,
                              max_features='sqrt')

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=1, random_state=123, n_jobs=-1)
```

My questions:
- Do forest models act like a single Decision Tree if they are set up with one tree (n_estimators=1)? (see the sketch after this list)
- In general, is it valid to use a forest model with a single tree for a forecasting task on univariate (1D) time data? What are the consequences of using a single tree within forest models, e.g. over-/under-fitting?
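Regarding the first question, this is a minimal sketch I used to compare a single-tree forest against a plain decision tree. The lag-1 feature construction and the `bootstrap=False` setting are my own illustrative choices (not part of the setup above); it reuses `df` from the snippet at the top. With bootstrapping disabled and all features considered, the single tree inside the forest should typically coincide with a standalone `DecisionTreeRegressor`, whereas the default `bootstrap=True` fits the tree on a resampled copy of the training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# toy supervised framing (illustrative): predict value at t from the value at t-1
vals = df['value'].to_numpy()
X = vals[:-1].reshape(-1, 1)   # lag-1 feature
y = vals[1:]                   # target

# single-tree forest WITHOUT bootstrapping vs. a plain decision tree
rf_single = RandomForestRegressor(n_estimators=1, bootstrap=False,
                                  max_features=None, random_state=123).fit(X, y)
tree = DecisionTreeRegressor(random_state=123).fit(X, y)
print(np.allclose(rf_single.predict(X), tree.predict(X)))  # expected True in this setup

# with the default bootstrap=True, the single tree is fit on a bootstrap resample,
# so its predictions generally differ from the plain tree's
rf_boot = RandomForestRegressor(n_estimators=1, random_state=123).fit(X, y)
print(np.allclose(rf_boot.predict(X), tree.predict(X)))    # typically False
```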
Note 1: I'm not interested in changing the nature of the data by downsampling with resample() and an aggregate function like mean() (e.g. from 5 mins to 1 hour) to reduce the number of observations, because I would lose information at small time resolutions.
```python
# resample() needs a DatetimeIndex, so first map the 5-minute steps to timestamps
# (the start date is arbitrary, just for illustration)
df_dt = df.copy()
df_dt.index = pd.date_range("2023-01-01", periods=len(df), freq="5min")

resampled_df = (df_dt
                .resample('1H')      # resample with a frequency of 1 hour
                .mean()              # aggregate with mean()
                .interpolate())      # fill NaNs / missing values [just in case]

resampled_df.shape  # (720, 1) -- 30 days x 24 hourly values
```
Note 2: With (8640, 1) data, and knowing the CPU and memory consumption from this post, the runtime is still very high even with n_estimators=5, n_jobs=-1 (5 trees) on a Google Colab medium instance, and I have many time series (df1, df2, ..., df10000) that need to be forecasted, so I have to set n_estimators=1. Technically and theoretically, with n_estimators=1 there are, in the worst case, no further tree results to average over, right? But the question is: does it still make sense to use such a model?
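For completeness, this is the kind of split-conformal setup I have in mind around the single-tree model. The lag-1 feature and the 7-day calibration window are my own illustrative choices, reusing `df` from above; it is a sketch, not a fixed pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# lag-1 supervised framing again (illustrative)
vals = df['value'].to_numpy()
X, y = vals[:-1].reshape(-1, 1), vals[1:]

test_size = 7 * 288   # last 7 days to forecast
cal_size = 7 * 288    # hypothetical calibration window, also 7 days

X_train, y_train = X[:-(test_size + cal_size)], y[:-(test_size + cal_size)]
X_cal, y_cal = X[-(test_size + cal_size):-test_size], y[-(test_size + cal_size):-test_size]
X_test, y_test = X[-test_size:], y[-test_size:]

model = RandomForestRegressor(n_estimators=1, random_state=123, n_jobs=-1)
model.fit(X_train, y_train)

# split conformal: a quantile of the absolute calibration residuals gives the interval half-width
# (note: standard split conformal assumes exchangeability, which an autocorrelated
#  series only approximately satisfies)
alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))
q_level = min(1.0, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))
q = np.quantile(scores, q_level)

pred = model.predict(X_test)
lower, upper = pred - q, pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage on the last 7 days: {coverage:.3f}")
```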