Question

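How do I handle missing values in a dataset before applying a machine learning algorithm?

I have noticed that dropping rows with missing NaN values is not a smart thing to do. I usually impute with pandas (filling each gap with the column mean), which improves classification accuracy but may not be the best choice.

This is a very important question. What is the best way to handle missing values in a dataset?

For example, in this dataset only about 30% of the rows have complete original data.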

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object

Answer 1

What is the best way to handle missing values in data set?

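There is no best way; every solution/algorithm has its own pros and cons. You can even mix some of them to create your own strategy and tune the relevant parameters to best fit your data. There is a lot of research and many papers on this topic.

For example, Mean Imputation is quick and simple, but it underestimates the variance, and replacing NaN with the mean distorts the shape of the distribution. KNN Imputation may not be ideal for large datasets in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value, and it assumes that the NaN attribute is correlated with the other attributes.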
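The variance claim is easy to check on synthetic data. The following is a minimal sketch, assuming a toy column named `x` (hypothetical, not from the dataset in the question) with 60% of its values removed at random; it fills the gaps with pandas `fillna` and compares the variance before and after.

```python
import numpy as np
import pandas as pd

# Toy data: one normally distributed column (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(loc=50.0, scale=10.0, size=1000)})

# Knock out roughly 60% of the values at random to mimic heavy missingness.
mask = rng.random(len(df)) < 0.6
df.loc[mask, "x"] = np.nan

# Mean imputation: every NaN becomes the mean of the observed values.
filled = df["x"].fillna(df["x"].mean())

var_observed = df["x"].var()  # pandas skips NaN, so this uses observed values only
var_filled = filled.var()     # variance after mean imputation

print(f"observed variance: {var_observed:.2f}")
print(f"imputed variance:  {var_filled:.2f}")
```

The mean is unchanged, but every imputed point sits exactly on it, so the spread shrinks and the histogram gains a spike at the mean.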

How to handle missing values in datasets before applying machine learning algorithm??

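Besides the mean imputation you mentioned, you could also look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to the powerful Imputer class in scikit-learn to check the existing APIs you can use.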

KNN Imputation

Compute the mean of the k nearest neighbors of the NaN point.
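To make the mechanism concrete, here is a minimal NumPy sketch of the idea. The function name `knn_impute` and its distance rule (root-mean-square difference over the columns both rows have observed) are assumptions of this sketch, not any library's API. Recent scikit-learn versions ship `sklearn.impute.KNNImputer`, which is the better choice in practice.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows.

    Distance between two rows is the root-mean-square difference over the
    columns that both rows have observed (a simplifying assumption).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            # A donor must be a different row with column j observed.
            if r == i or np.isnan(X[r, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda t: t[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
X_filled = knn_impute(X, k=2)
print(X_filled)
```

The nested loop over rows for every NaN is also why this approach scales poorly on large datasets.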

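Regression Imputation

A regression model is estimated to predict the observed values of a variable from the other variables; that model is then used to impute values where that variable is missing.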
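As a rough illustration, the sketch below fits an ordinary least-squares model with `numpy.linalg.lstsq` on the rows where the target column is observed and predicts it where it is missing. The helper `regression_impute` is hypothetical, and it assumes the predictor columns have no gaps of their own.

```python
import numpy as np

def regression_impute(X, target):
    """Impute NaNs in column `target` from the other columns via least squares.

    Assumes all other columns are fully observed (a simplification).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    others = [c for c in range(X.shape[1]) if c != target]
    missing = np.isnan(X[:, target])

    # Design matrix: intercept column plus the predictor columns.
    A = np.column_stack([np.ones(X.shape[0]), X[:, others]])

    # Fit only on rows where the target is observed.
    coef, *_ = np.linalg.lstsq(A[~missing], X[~missing, target], rcond=None)

    # Predict the target where it is missing.
    filled[missing, target] = A[missing] @ coef
    return filled

# Toy data following y = 2x + 1, with one y value missing.
data = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, np.nan],
    [4.0, 9.0],
])
result = regression_impute(data, target=1)
print(result)
```

Because every imputed value lies exactly on the fitted line, this method also understates the variability of the imputed column unless noise is added to the predictions.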

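See the "Imputation of missing values" section of the scikit-learn documentation. I have also heard of the Orange library for imputation, but have not had a chance to use it yet.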

Answer 2

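There is no single best way to handle missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework such as PyMC. That way you get a distribution over possible values instead of just a single answer. Here is an example of handling missing data with PyMC: http://stronginference.com/missing-data-imputation.html

If you really want to plug those holes with point estimates, then what you are looking for is "imputation". I would steer away from simple imputation methods such as mean filling, since they badly distort the joint distribution of your features. Instead, try something like softImpute (which tries to infer the missing values via a low-rank approximation). The original version of softImpute was written for R, but I have made a Python version (along with other methods such as kNN imputation) here: https://github.com/hammerlab/fancyimpute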
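The low-rank idea can be sketched in a few lines of NumPy: repeatedly take an SVD of the current filled-in matrix, shrink the singular values, and overwrite only the missing entries with the low-rank reconstruction. This is a simplified illustration, not the fancyimpute implementation or its API; the function name and the parameters `shrinkage`, `n_iter`, and `tol` are assumptions.

```python
import numpy as np

def soft_impute(X, shrinkage=1.0, n_iter=100, tol=1e-5):
    """Fill NaNs by iterating a soft-thresholded SVD (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    Z = np.where(missing, 0.0, X)  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)  # soft-threshold the singular values
        low_rank = (U * s) @ Vt
        # Keep observed entries fixed; update only the missing ones.
        Z_new = np.where(missing, low_rank, X)
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z

# Toy rank-1 matrix with one entry removed.
M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
X_missing = M.copy()
X_missing[0, 0] = np.nan
Z = soft_impute(X_missing, shrinkage=0.1)
print(Z)
```

Unlike mean filling, the imputed entries here depend on the relationships between columns, which is the reason this approach preserves more of the joint structure.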
