I'm experimenting with conformal prediction over high-frequency time data, using the following forest-based regression models for an in-sample forecasting task. The univariate (1D) time-series data has shape (8640, 1), meaning a single value column observed over 8640 time steps.
Short note: $8640 = 30 \times 288$ (each day has 288 observations). The data are collected at 5-minute epochs, which gives $12 \times 24 = 288$ observations per day.
The following code generates the time data:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# create data
df = pd.DataFrame({
    "TS": np.arange(0, 288 * 30),
    "value": np.abs(np.sin(2 * np.pi * np.arange(0, 288 * 30) / 7)
                    + np.random.normal(0, 20.1, size=288 * 30))  # generate seasonality
})

# Set 'TS' column as the index
df = df.set_index('TS')

# Plot 'value' column over the new index
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['value'])
plt.xlabel('Timesteps (epochs=5mins)')
plt.ylabel('value')
plt.title('1D data over Time')
plt.grid(True)
plt.show()

df.shape  # (8640, 1)
```

Let's say I split the data into train and test sets, where I want to forecast the tail of the data: the last 7 days $= 7 \times 288$ observations.
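For concreteness, here is a minimal sketch of the split I mean, using positional slicing on the `TS` index (the variable names `train_df` / `test_df` are just for illustration):

```python
# hold out the last 7 days (7 * 288 five-minute steps) as the test set
test_size = 7 * 288

train_df = df.iloc[:-test_size]   # first 23 days
test_df = df.iloc[-test_size:]    # last 7 days to be forecasted

train_df.shape, test_df.shape     # ((6624, 1), (2016, 1))
```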
I apply the following forest-based models with n_estimators=1, due to the high runtime and the high-frequency nature of the data (epoch = 5 mins).
```python
from sklearn.linear_model import Ridge
from lineartree import LinearForestRegressor

model = LinearForestRegressor(base_estimator=Ridge(random_state=42),
                              n_estimators=1,
                              n_jobs=-1,
                              max_features='sqrt')

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=1, random_state=123, n_jobs=-1)
```

My questions:
- Do forest models act like a single Decision Tree if they are set up with one tree (n_estimators=1)? (see the sketch after this list)
- In general, is it valid to use a forest model with a single tree for a forecasting task on univariate (1D) time data? What are the consequences of using a single tree within forest models, e.g. over-/under-fitting?
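Regarding the first question, this is a minimal sketch I used to compare a single-tree forest against a plain decision tree. The lag-1 feature construction and the `bootstrap=False` setting are my own illustrative choices (not part of the setup above); it reuses `df` from the snippet at the top. With bootstrapping disabled and all features considered, the single tree inside the forest should typically coincide with a standalone `DecisionTreeRegressor`, whereas the default `bootstrap=True` fits the tree on a resampled copy of the training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# toy supervised framing (illustrative): predict value at t from the value at t-1
vals = df['value'].to_numpy()
X = vals[:-1].reshape(-1, 1)   # lag-1 feature
y = vals[1:]                   # target

# single-tree forest WITHOUT bootstrapping vs. a plain decision tree
rf_single = RandomForestRegressor(n_estimators=1, bootstrap=False,
                                  max_features=None, random_state=123).fit(X, y)
tree = DecisionTreeRegressor(random_state=123).fit(X, y)
print(np.allclose(rf_single.predict(X), tree.predict(X)))  # expected True in this setup

# with the default bootstrap=True, the single tree is fit on a bootstrap resample,
# so its predictions generally differ from the plain tree's
rf_boot = RandomForestRegressor(n_estimators=1, random_state=123).fit(X, y)
print(np.allclose(rf_boot.predict(X), tree.predict(X)))    # typically False
```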
Note 1: I'm not interested in changing the nature of the data by downsampling with resample() and an aggregate function like mean() (e.g. from 5 mins to 1 hour) to reduce the number of observations, because I would lose information at small time resolutions.
```python
# resample() needs a DatetimeIndex, so first map the 5-minute steps to timestamps
# (the start date is arbitrary, just for illustration)
df_dt = df.copy()
df_dt.index = pd.date_range("2023-01-01", periods=len(df), freq="5min")

resampled_df = (df_dt
                .resample('1H')      # resample with a frequency of 1 hour
                .mean()              # aggregate with mean()
                .interpolate())      # fill NaNs / missing values [just in case]

resampled_df.shape  # (720, 1) -- 30 days x 24 hourly values
```
Note 2: With (8640, 1) data, and knowing the CPU and memory consumption from this post, the runtime is still very high even with n_estimators=5, n_jobs=-1 (5 trees) on a Google Colab medium instance, and I have many time series (df1, df2, ..., df10000) that need to be forecasted, so I have to set n_estimators=1. Technically and theoretically, with n_estimators=1 there are, in the worst case, no further tree results to average over, right? But the question is: does it still make sense to use such a model?
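For completeness, this is the kind of split-conformal setup I have in mind around the single-tree model. The lag-1 feature and the 7-day calibration window are my own illustrative choices, reusing `df` from above; it is a sketch, not a fixed pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# lag-1 supervised framing again (illustrative)
vals = df['value'].to_numpy()
X, y = vals[:-1].reshape(-1, 1), vals[1:]

test_size = 7 * 288   # last 7 days to forecast
cal_size = 7 * 288    # hypothetical calibration window, also 7 days

X_train, y_train = X[:-(test_size + cal_size)], y[:-(test_size + cal_size)]
X_cal, y_cal = X[-(test_size + cal_size):-test_size], y[-(test_size + cal_size):-test_size]
X_test, y_test = X[-test_size:], y[-test_size:]

model = RandomForestRegressor(n_estimators=1, random_state=123, n_jobs=-1)
model.fit(X_train, y_train)

# split conformal: a quantile of the absolute calibration residuals gives the interval half-width
# (note: standard split conformal assumes exchangeability, which an autocorrelated
#  series only approximately satisfies)
alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))
q_level = min(1.0, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))
q = np.quantile(scores, q_level)

pred = model.predict(X_test)
lower, upper = pred - q, pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage on the last 7 days: {coverage:.3f}")
```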