For instance, column x has 50 values and all of them are the same. Is it a good idea to delete variables like this before building machine learning models? If so, how can I spot these variables in a large data set? I guess a formula or function is needed to do so. I am thinking of using nunique, which can take the whole data set into account.
Answer 0 (score: 0)
Yes, you can generally delete such columns, because a constant column provides no information about how one data point differs from another. Leaving the column in is harmless for some machine learning models, such as random forests, because of how the algorithm works: a constant column can never be selected to split the data.
To spot them, especially among categorical or nominal variables (with a fixed number of possible values), you can count the occurrences of each unique value. If the most frequent value (the mode) accounts for more than a certain share of the rows (say 95%), you drop that column from your model.
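The check described above could be sketched in pandas roughly as follows. This is only an illustration, not a definitive recipe: the helper name low_variation_columns, the sample DataFrame, and the choice to count missing values as a value of their own (dropna=False) are all assumptions made for this example, and the 95% threshold is just the figure mentioned above.

```python
import pandas as pd


def low_variation_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most frequent value covers at least `threshold` of the rows.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    flagged = []
    for col in df.columns:
        # value_counts(normalize=True) returns each value's share of the rows,
        # sorted from most to least frequent. dropna=False counts NaN as a value.
        shares = df[col].value_counts(normalize=True, dropna=False)
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(col)
    return flagged


# Made-up sample data: x is constant, y is all distinct, z is 96% one value.
df = pd.DataFrame({
    "x": [1] * 50,
    "y": list(range(50)),
    "z": ["a"] * 48 + ["b"] * 2,
})

# Strictly constant columns: at most one distinct value (the nunique idea).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Near-constant columns: the mode covers at least 95% of the rows.
near_constant_cols = low_variation_columns(df, threshold=0.95)

# Drop the flagged columns before modelling.
reduced = df.drop(columns=near_constant_cols)
```

Setting the threshold to 1.0 reduces this to the strictly constant case, which is what the nunique check finds directly. Whether NaN should count as its own value depends on how missing data is handled elsewhere in the pipeline.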
Personally, I go through the variables one by one if there aren't too many, so that I fully understand each variable in the model, but the systematic approach above is an option when the number of features is too large.