Question

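I am having performance problems filling in missing values in a dataset. The dataset is about 500 MB with 5,000,000 rows (Kaggle: Expedia 2013).

Using df.fillna() would be the simplest option, but it seems I cannot use it to fill each NaN with a different value.

I created a lookup table: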

srch_destination_id    Value
2                      0.0110
3                      0.0000
5                      0.0207
7                      NaN
8                      NaN
9                      NaN
10                     0.1500
12                     0.0114

For each srch_destination_id, this table contains the value that should replace NaN in dataset.

# Iterate over the dataset row by row. If a value is missing (NaN), fill in the
# min. value found in the lookup table.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        # .at replaces DataFrame.set_value, which was removed in pandas 1.0
        dataset.at[row, 'prop_location_score2'] = lookuptable.loc[cell]

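This code works when iterating over 1,000 rows, but when iterating over all 5 million rows my computer never finishes (I waited for hours).

Is there a better way to do what I am doing? Am I making a mistake somewhere?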

Answer 1

pd.Series.fillna does accept a Series or a dictionary, as well as a scalar replacement value.
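As a small illustration of the three kinds of input, here is a minimal sketch. The Series `scores` and the replacement values are made up for the example and are not taken from the Expedia data. When a Series or dictionary is passed, the replacement values are matched on the index labels of the Series being filled.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing values at index labels 1 and 2.
scores = pd.Series([0.5, np.nan, np.nan, 0.2])

# Scalar: every NaN gets the same value.
filled_scalar = scores.fillna(0.0)

# Series: each NaN gets the value found at the same index label.
replacements = pd.Series([9.0, 8.0, 7.0, 6.0])
filled_series = scores.fillna(replacements)

# Dictionary: keys are index labels of the Series being filled.
filled_dict = scores.fillna({1: 1.5, 2: 2.5})

print(filled_scalar.tolist())
print(filled_series.tolist())
print(filled_dict.tolist())
```

Only the positions that were NaN are touched; existing values such as 0.5 and 0.2 are left as they are.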

So you can create a Series mapping from lookup:

s = lookup.set_index('srch_destination_id')['Value']

Then use it to fill in the NaN values in dataset:

dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(dataset['srch_destination_id'].map(s.get))

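Note that in the fillna input we are mapping the identifiers from dataset, so each row gets the lookup value for its own srch_destination_id. We use pd.Series.map to perform that mapping, which avoids the Python-level loop over rows.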
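To make the approach concrete, here is a minimal end-to-end sketch on a tiny frame. The lookup rows are a subset of the table in the question; the `dataset` rows are invented for illustration. It passes the Series `s` directly to `map`, which should give the same result as `s.get` here.

```python
import numpy as np
import pandas as pd

# Subset of the lookup table from the question.
lookup = pd.DataFrame({
    "srch_destination_id": [2, 3, 5, 7, 10],
    "Value": [0.0110, 0.0000, 0.0207, np.nan, 0.1500],
})

# Hypothetical sample of the dataset, with some missing scores.
dataset = pd.DataFrame({
    "srch_destination_id": [2, 5, 7, 10, 3, 2],
    "prop_location_score2": [np.nan, 0.30, np.nan, np.nan, 0.05, 0.40],
})

# Series indexed by destination id, holding the replacement values.
s = lookup.set_index("srch_destination_id")["Value"]

# One replacement candidate per row, aligned with dataset's index.
fill_values = dataset["srch_destination_id"].map(s)

# Only rows where the score is NaN take their candidate value.
dataset["prop_location_score2"] = dataset["prop_location_score2"].fillna(fill_values)

print(dataset)
```

Rows whose destination has NaN in the lookup table (id 7 here) remain NaN, so a second fillna with a global default may still be needed afterwards.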
