I'm fairly new to the world of ML and data science. I've completed a certification course on Coursera (IBM) and I'm trying to hone my skills with some exercises from Kaggle. The course did not introduce the concept of PCA, so I'm learning it on my own. From what I understand, PCA requires the data to be scaled.

I have some data that I would like to build a model for and then run a prediction on. I'm planning to use Ridge regression as my model, and I would also like to use GridSearchCV to tune its hyperparameters. Based on some reading I've done, I think I should call fit_transform() on the scaler and the PCA separately on the training data, and then call only transform() on both for the test data before calling the model's predict() function. Is that correct?

I'm using a pipeline to run the scaler, the PCA and the regression together. Before that, I create the three objects (StandardScaler, PCA, Ridge) separately, so I have an object to reference for each. I assumed that GridSearchCV would call fit_transform() on the scaler and the PCA, and that the scaler and PCA objects could then be used later to call transform() on the test data. But that doesn't look like it's possible. I'm pasting my code below for reference. I get an error (NotFittedError: This StandardScaler instance is not fitted yet.) indicating that the scaler's fit_transform() was never called. What is the correct way to accomplish my objective?

```
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score, root_mean_squared_error
import pandas as pd

pca = PCA()
scaler = StandardScaler()
ridge = Ridge()
params = [{'ridgereg__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0,
                               1000.0, 10000.0, 100000.0]}]
pipeline = Pipeline([('scaler', scaler), ('pca', pca), ('ridgereg', ridge)])

X = df[['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF',
        'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea']]
y = df['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)

model = GridSearchCV(pipeline, params, cv=4)
model.fit(x_train, y_train)

# Now we must scale and transform the testing data
x_scaled = scaler.transform(x_test)   # assuming that scaler.fit_transform() was called by the pipeline
X_scaled_df = pd.DataFrame(x_scaled, columns=X.columns)
X_pca = pca.transform(X_scaled_df)    # assuming that pca.fit_transform() was called by the pipeline
X_pca_df = pd.DataFrame(X_pca, columns=X.columns)

y_pred = model.predict(X_pca_df)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print('MSE: %.2f; R2: %.2f; rmse: %.2f\n' % (mse, r2, rmse))
```

1 Answer


You are correct: the fit should be done on the train set only, and the transform applied to both the train and the test (or validation) sets. Otherwise the model would overfit because of data leakage from the test/validation set, which should not be used during training since its role is to evaluate the model on unseen data.
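For illustration, here is a minimal sketch of that principle without a pipeline, assuming x_train and x_test are already defined as in your code:

```
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
pca = PCA()

# fit_transform on the training data only: the scaler and PCA learn their parameters here
x_train_scaled = scaler.fit_transform(x_train)
x_train_pca = pca.fit_transform(x_train_scaled)

# transform only on the test data, reusing what was learned on the train set
x_test_scaled = scaler.transform(x_test)
x_test_pca = pca.transform(x_test_scaled)
```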

It can indeed be convenient to use a pipeline, as it deals with this issue automatically.

When you fit the grid search, since refit=True by default, it will refit the best model (including the scaler and the PCA) on the training data after finding the best hyperparameters. To get the predictions, you can therefore use this directly:

model.predict(x_test) 

As "model" comes from the GridSearch that received the pipeline (maybe less confusing to call it "gs_model"), the x_test data will go through the pipeline and be transformed correctly using the scaler/pca/ridge fitted on the train set.

If you had refit=False, best_estimator_ would not be available; you would have to refit the pipeline yourself with the best hyperparameters (model.best_params_) before calling predict(x_test).
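A minimal sketch of that manual refit, reusing the pipeline and grid from the snippet above (with the default single-metric scoring, best_params_ is still recorded even when refit=False):

```
# refit=False: GridSearchCV only records the best hyperparameters, nothing is refit
gs_model = GridSearchCV(pipeline, params, cv=4, refit=False)
gs_model.fit(x_train, y_train)

# Refit the pipeline manually with the best alpha, then predict on the test set
best_pipeline = pipeline.set_params(**gs_model.best_params_)
best_pipeline.fit(x_train, y_train)
y_pred = best_pipeline.predict(x_test)
```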

