Variable selection and NA

Question

I have a very large dataset with a lot of NAs in the data. I want to perform an analysis and have to select the variables that are of most interest.

I feel like I have to take three steps before I can start the analysis: perform PCA, fill in the NAs, and delete the variables that contain very little data (almost half of the variables have fewer than 50% of their observations filled in). My question is: what is the most appropriate order?

My intuition is to delete variables with very little data, then do PCA with remaining variables and then fill in the NAs using Nearest Neighbor. But I would like to hear your takes on it! :)

Also, when deleting variables with little data, what is an appropriate percentage of NAs per variable?

PCA will not select the main variables; it will create new variables from the existing ones. So you need to fill in the NAs before performing PCA.
– manu190466, Commented Nov 16, 2022 at 11:28
@manu190466 Thanks for your answer. I misunderstood PCA apparently, thought it was a feature selection tool. Does that also mean that variables will be difficult to interpret after the analysis?
– Wilko, Commented Nov 16, 2022 at 12:51
Yes. On the one hand, the new variables (or components) are linear combinations of the existing variables and may not be easy to interpret. On the other hand, PCA may help you better understand the relationships between your variables and learn more about your data. To learn more, read about scatter plots of the first two principal components (PC1 vs. PC2).
– manu190466, Commented Nov 16, 2022 at 13:23
@manu190466 So if I understand correctly, PCA is not the best way to reduce the number of features when the main goal is to find which factors influence a given dependent variable. It's better to use some other feature selection method.
– Wilko, Commented Nov 16, 2022 at 14:10

liakoyras · Accepted Answer · 2022-11-16 16:05:39Z

My question is, what is the most appropriate order?

PCA creates "new" variables in a different dimensional space. I am not sure most PCA libraries can even handle missing values, and it would not make much sense anyway: each output is a linear combination of all the input variables, so any observation with a missing input would come out as NaN. So you first need to do some sort of imputation and only then try to reduce the dimensions with PCA.
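To illustrate the point about linear combinations, here is a minimal sketch with NumPy. The data and the weight vector `w` are made up for the example; `w` just stands in for one principal component's loadings and is not the result of an actual PCA.

```python
import numpy as np

# Toy data: 4 observations, 3 variables, one missing value in the second row.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [2.0, 1.0, 0.0],
])

# Hypothetical loading vector (made up, not estimated from the data).
w = np.array([0.5, 0.5, 0.7])

# Each score is a weighted sum over all variables of one observation,
# so a single NaN input makes that observation's score NaN.
scores = X @ w
print(scores)
```

Only the row with the missing input is lost here, but with many NAs spread across many variables most rows would end up this way.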

when deleting variables with little data, what is an appropriate percentage of NAs per variable?

I don't think there is a silver bullet. You can try several thresholds and see what produces the best outcomes. In my experience, more than 20% NaN is too much, but it also depends on the size of your data: 50% missing on a dataset with a billion rows might still provide enough information, while 50% missing on 100 rows might be too little.
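A quick way to see what a given threshold would cost you is to compute the share of NAs per variable first. This is a small sketch assuming the data sits in a pandas DataFrame; the column names and values are invented for the example, and the two thresholds are just the 20% and 50% figures mentioned above.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up data: "a" is complete, "b" is 60% missing, "c" is 30% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [1.0, nan, 3.0, nan, nan, 6.0, nan, nan, 9.0, nan],
    "c": [nan, 2.0, 3.0, nan, 5.0, 6.0, 7.0, nan, 9.0, 10.0],
})

# Fraction of missing values per variable.
na_share = df.isna().mean()
print(na_share)

# Which variables survive each candidate threshold?
kept_by_threshold = {
    t: na_share[na_share <= t].index.tolist() for t in (0.2, 0.5)
}
print(kept_by_threshold)
```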

In general, I would propose the following order:

First, drop the features with too much missing data (what counts as too much is up to you to find out).
Then, impute the remaining missing values using whatever method you want (you mentioned KNN, but even simpler methods like the mean, the mode or a constant might work while taking a lot less time).
Finally, run PCA, or any other feature selection or dimensionality reduction method (you mentioned in the comments that you want to drop features rather than reduce dimensionality).
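The three steps could be sketched like this with pandas and scikit-learn. Treat it as an illustration under assumptions, not a recipe: the data is random, the 20% cutoff (`max_na_share`) and `n_components=2` are arbitrary choices, and the standardization step is an addition of this sketch that the steps above do not mention.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up data: 200 rows, 6 numeric variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=list("abcdef"))
df.loc[df.index[:140], "f"] = np.nan   # "f" is 70% missing
df.loc[df.index[::10], "b"] = np.nan   # "b" is 10% missing

# Step 1: drop variables with too many NAs (threshold is an assumption).
max_na_share = 0.2
kept = df.loc[:, df.isna().mean() <= max_na_share]

# Step 2: impute what is left; mean imputation is the cheap option.
imputed = SimpleImputer(strategy="mean").fit_transform(kept)

# Standardize so no variable dominates because of its scale
# (added in this sketch, not part of the steps described above).
scaled = StandardScaler().fit_transform(imputed)

# Step 3: PCA on the complete matrix.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(list(kept.columns))
print(components.shape)
print(pca.explained_variance_ratio_)
```

If you prefer nearest-neighbour imputation, `sklearn.impute.KNNImputer` can be swapped in for `SimpleImputer` in step 2; expect it to be noticeably slower on a very large dataset.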
