
I'm experimenting with conformal prediction on high-frequency time data, using the following forest-based regression models for an in-sample forecasting task. The univariate (1D) time series has shape (8640, 1), i.e. a single value column over 8640 time steps.

Short note: 8640 $= 30 \times 288$ (each day has 288 observations) -- the data are collected at 5-minute epochs, which gives $12 \times 24 = 288$ observations per day.

The following code generates the time data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# create data: noisy seasonal signal, 288 observations per day for 30 days
df = pd.DataFrame({
    "TS": np.arange(0, 288 * 30),
    "value": np.abs(np.sin(2 * np.pi * np.arange(0, 288 * 30) / 7)
                    + np.random.normal(0, 20.1, size=288 * 30))  # generate seasonality
})

# Set 'TS' column as the index
df = df.set_index('TS')

# Plot 'value' column over the new index
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['value'])
plt.xlabel('Timesteps (epochs=5mins)')
plt.ylabel('value')
plt.title('1D data over Time')
plt.grid(True)
plt.show()

df.shape  # (8640, 1)

Let's say I split the data into train and test sets, where the test set is the tail of the data covering the last 7 days ($7 \times 288$ observations).
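A minimal sketch of that split, assuming a plain positional slice and using the integer time index as the single feature (the variable names are illustrative, not from the original post):

n_test = 7 * 288  # last 7 days of 5-minute observations
train, test = df.iloc[:-n_test], df.iloc[-n_test:]

# use the integer time index as the single feature column
X_train, y_train = train.index.to_numpy().reshape(-1, 1), train['value'].to_numpy()
X_test, y_test = test.index.to_numpy().reshape(-1, 1), test['value'].to_numpy()

train.shape, test.shape  # ((6624, 1), (2016, 1))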

I apply the following forest-based models with n_estimators=1, because of the long runtime and the high-frequency nature of the data (one epoch = 5 minutes).

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from lineartree import LinearForestRegressor

model = LinearForestRegressor(
    base_estimator=Ridge(random_state=42),
    n_estimators=1,
    n_jobs=-1,
    max_features='sqrt'
)

model = RandomForestRegressor(n_estimators=1, random_state=123, n_jobs=-1)
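A minimal fit/predict usage sketch (an assumption on my side, since the post does not show the training step), reusing the illustrative X_train/X_test arrays from the split sketch above:

model.fit(X_train, y_train)     # fit on the first 23 days
y_pred = model.predict(X_test)  # point forecasts for the 7-day tail
y_pred.shape                    # (2016,)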

My questions:

  • Do forest models behave like a single Decision Tree if they are set up with only one tree? (See the check sketch after this list.)
  • In general, is it valid to use a forest model with a single tree for a forecasting task on univariate (1D) time data? What are the consequences of using a single tree within forest models, e.g. over-/under-fitting?
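A quick empirical check one could run (a sketch, not from the original post; it reuses the illustrative X_train/X_test arrays defined above): with bootstrap=False and no feature subsampling, a one-tree RandomForestRegressor should reduce to a plain DecisionTreeRegressor, while the default bootstrap=True makes the single tree fit a resampled copy of the data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# a forest with a single tree, no bootstrap and all features per split
rf_one = RandomForestRegressor(n_estimators=1, bootstrap=False,
                               max_features=None, random_state=123)
dt = DecisionTreeRegressor(random_state=123)

rf_one.fit(X_train, y_train)
dt.fit(X_train, y_train)

# with bootstrap disabled and no feature subsampling, the single-tree forest
# and the plain decision tree should give identical predictions
print(np.allclose(rf_one.predict(X_test), dt.predict(X_test)))  # expected: True

# with the default bootstrap=True, the single tree is trained on a bootstrap
# resample, so it will generally differ from a plain decision tree
rf_boot = RandomForestRegressor(n_estimators=1, random_state=123).fit(X_train, y_train)
print(np.allclose(rf_boot.predict(X_test), dt.predict(X_test)))  # typically False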

Note 1: I do not want to touch the nature of the data by downsampling with resample() and an aggregate function like mean() (e.g. from 5 minutes to 1 hour) to reduce the number of observations, because I would lose information at the small time resolution.

# build a datetime index from the 5-minute epochs (illustrative start date; needed for resample)
df['datetime'] = pd.date_range('2024-01-01', periods=len(df), freq='5min')

resampled_df = (df.set_index('datetime')  # datetime index required for resample
                  .resample('1H')         # resample with a frequency of 1 hour
                  .mean()                 # use mean() to aggregate
                  .interpolate())         # fill NaNs and missing values [just in case]

resampled_df.shape  # (720, 1) -- 24 hourly observations per day for 30 days
  • Please note that I'm aware of the logic of forest models averaging the results of their trees, addressed here. Considering the size of my sub-data (8640, 1), and knowing the CPU and memory consumption from this post, the runtime is still very high even with n_estimators=5, n_jobs=-1 (5 trees) on a Google Colab medium instance! So I have to set n_estimators=1, because I have many time-data samples like df1, df2, ..., df10000 that need to be forecasted. Commented Dec 15, 2024 at 18:10
  • Theoretically we know that "... Gradient boosting (along with any tree-based method) can be used to find relative feature importances (based on how much error is reduced after each split)", which is not the case for 1D time data such as my sample sub-datasets. Thus I could use n_estimators=1 technically and theoretically; in the worst case there are simply no further trees' results to average over. Right? But the question is: does it still make sense to use such a model? Commented Dec 15, 2024 at 18:29
