I have a very large dataset with a lot of NAs. I want to run an analysis on it and need to select the variables that are of most interest.
I feel like I have to take three steps before I can start the analysis: perform PCA, fill in the NAs, and deal with variables that contain very little data. Almost half of the variables have fewer than 50% of their observations filled in, so I think I have to delete these at some point. My question is: what is the most appropriate order for these steps?
My intuition is to delete the variables with very little data first, then do PCA on the remaining variables, and then fill in the NAs using nearest-neighbor imputation. But I would like to hear your takes on it! :)
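
To make the three steps concrete, here is a minimal sketch in Python, assuming pandas and scikit-learn (the function name `drop_impute_pca`, its parameters, and the toy data are all made up for illustration). One caveat: scikit-learn's `PCA` does not accept NaN input, so this sketch has to impute before PCA, which is not the order I described above. It only illustrates the steps, not a recommended order.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def drop_impute_pca(df, max_na_fraction=0.5, n_neighbors=5, n_components=2):
    """Hypothetical helper: drop sparse variables, KNN-impute, then run PCA."""
    # Step 1: keep only variables whose share of NAs is at or below the cutoff.
    kept = df.loc[:, df.isna().mean() <= max_na_fraction]
    # Standardize so distances and variances are comparable across variables.
    # StandardScaler ignores NaN when fitting and leaves them in place.
    scaled = StandardScaler().fit_transform(kept)
    # Step 2: fill the remaining NAs from the nearest rows.
    filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    # Step 3: PCA on the completed matrix.
    scores = PCA(n_components=n_components).fit_transform(filled)
    return kept.columns.tolist(), scores


# Toy data (assumption): 30 rows, 4 variables, with NAs injected by hand.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(30, 4)), columns=["a", "b", "c", "d"])
toy.iloc[::3, 1] = np.nan   # "b": every third row missing (10 of 30)
toy.iloc[5:, 3] = np.nan    # "d": 25 of 30 rows missing, above the 50% cutoff

kept_cols, scores = drop_impute_pca(toy)
```

With these toy values, `d` is dropped in step 1 and the remaining three variables go through imputation and PCA.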
Also, when deleting variables with little data, what share of NAs per variable is a reasonable cutoff?
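
For the cutoff question, this is roughly how I would look at the NA share per variable and at how many variables survive under a few candidate cutoffs. It is only a sketch assuming pandas; the helper names, the cutoff values, and the toy data are placeholders I picked, not recommendations.

```python
import numpy as np
import pandas as pd


def na_fraction(df):
    """Share of missing values per variable, most sparse first."""
    return df.isna().mean().sort_values(ascending=False)


def variables_kept(df, cutoffs=(0.2, 0.5, 0.8)):
    """Number of variables whose NA share is at or below each candidate cutoff."""
    frac = df.isna().mean()
    return {c: int((frac <= c).sum()) for c in cutoffs}


# Toy data (assumption): three variables with 0%, 25% and 75% NAs.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, np.nan, 3.0, 4.0],
    "z": [np.nan, np.nan, np.nan, 4.0],
})

fractions = na_fraction(toy)   # z: 0.75, y: 0.25, x: 0.0
kept = variables_kept(toy)     # {0.2: 1, 0.5: 2, 0.8: 3}
```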