I'm fairly new to the world of ML and data science. I've completed a certification course through Coursera/IBM, and I'm trying to hone my skills with some exercises from Kaggle. The course did not cover PCA, so I'm learning it on my own. From what I understand, PCA requires the data to be scaled.

I have some data that I would like to build a model for and then run predictions on. I'm planning to use Ridge regression as the model, and I would also like to use GridSearchCV to tune its hyperparameters. Based on my reading, I think I should call fit_transform() on the scaler and on the PCA using the training data only, and then call just transform() on both for the test data before calling the model's predict(). Is that correct?

I'm using a pipeline to run the scaler, the PCA and the regression together. Before building it, I create the three objects (StandardScaler, PCA, Ridge) separately, so that I have a reference to each one. I assumed that GridSearchCV would call fit_transform() on the scaler and the PCA, so that I could later use those same scaler and PCA objects to call transform() on the test data. That doesn't seem to be possible.

I'm pasting my code below for reference. I get an error (NotFittedError: This StandardScaler instance is not fitted yet.), which indicates that the scaler I created was never fitted. What is the correct way to accomplish my objective?
```python
pca = PCA()
scaler = StandardScaler()
ridge = Ridge()

params = [{'ridgereg__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0]}]
pipeline = Pipeline([('scaler', scaler), ('pca', pca), ('ridgereg', ridge)])

X = df[['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF',
        'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea']]
y = df['SalePrice']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)

model = GridSearchCV(pipeline, params, cv=4)
model.fit(x_train, y_train)

# Now we must scale and transform the testing data
x_scaled = scaler.transform(x_test)  # assuming that scaler.fit_transform() was called by the pipeline
X_scaled_df = pd.DataFrame(x_scaled, columns=X.columns)
X_pca = pca.transform(X_scaled_df)  # assuming that pca.fit_transform() was called by the pipeline
X_pca_df = pd.DataFrame(X_pca, columns=X.columns)

y_pred = model.predict(X_pca_df)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print('MSE: %.2f; R2: %.2f; rmse: %.2f\n' % (mse, r2, rmse))
```
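For comparison, here is a minimal sketch of one way this could be restructured. It rests on two points that I believe hold for scikit-learn: GridSearchCV fits clones of the pipeline it is given, so the original scaler and pca objects stay unfitted, and calling predict() on the fitted search object already applies the refitted scaler and PCA to whatever data is passed in. Since df is not available here, the data comes from make_regression as a stand-in (an assumption, not the Kaggle housing data), the alpha grid is shortened, and the names original_scaler_fitted, fitted_scaler, fitted_pca and x_test_pca are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in for the housing data (assumption: 10 numeric features).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1
)

scaler = StandardScaler()
pipeline = Pipeline([("scaler", scaler), ("pca", PCA()), ("ridgereg", Ridge())])
params = {"ridgereg__alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}

model = GridSearchCV(pipeline, params, cv=4)
model.fit(x_train, y_train)

# The search fitted clones of the pipeline, so the scaler object created
# above is still unfitted. This is what triggers the NotFittedError.
try:
    check_is_fitted(scaler)
    original_scaler_fitted = True
except NotFittedError:
    original_scaler_fitted = False

# Pass the raw test data: predict() on the search object delegates to the
# refitted best pipeline, which scales, projects and then predicts.
y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # computed by hand to avoid depending on a recent sklearn
r2 = r2_score(y_test, y_pred)
print("MSE: %.2f; R2: %.2f; rmse: %.2f" % (mse, r2, rmse))

# If the transformed test data is needed, use the fitted steps of the best
# pipeline rather than the original objects.
fitted_scaler = model.best_estimator_.named_steps["scaler"]
fitted_pca = model.best_estimator_.named_steps["pca"]
x_test_pca = fitted_pca.transform(fitted_scaler.transform(x_test))
```

With this layout there is no manual transform step for the test set at all. During cross-validation the scaler and PCA are also refitted on each training fold only, which avoids leaking information from the validation fold into the preprocessing.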