Question

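I am trying to merge two DataFrames, and for each id_key I want to use the row with the most recent date. Note that the dates are not sorted, so groupby.first() or groupby.last() cannot be used.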

Left DataFrame  (n=834,570)     |         Right DataFrame (n=1,592,005)
id_key                          |         id_key    date           other_vars
  1                             |           1       2015-07-06        ...
  2                             |           1       2015-07-07        ...
  3                             |           1       2014-04-04        ...

Using the groupby / agg example below takes 8 minutes! When I convert the dates to integers, it still takes 6 minutes.

gb = right.groupby('id_key')
gb.agg(lambda x: x.iloc[x.date.argmax()])
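For comparison, here is a minimal sketch of the same selection without a Python-level lambda per group. It assumes the date column holds comparable values (datetimes here) and uses a small made-up frame in place of the real data; the column name other is a placeholder for other_vars.

```python
import pandas as pd

# Toy stand-in for the right DataFrame (values are hypothetical).
right = pd.DataFrame({
    "id_key": [1, 1, 1, 2],
    "date": pd.to_datetime(
        ["2015-07-06", "2015-07-07", "2014-04-04", "2016-01-01"]
    ),
    "other": ["a", "b", "c", "d"],
})

# idxmax returns, for each id_key, the index label of the row with the latest date.
latest_labels = right.groupby("id_key")["date"].idxmax()

# Select those rows in a single .loc call.
latest_rows = right.loc[latest_labels]
print(latest_rows)
```

Whether this is faster on the full 1.6 million rows would need to be measured; the sketch only shows the shape of the call.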

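I wrote my own version that builds a dictionary keyed by id, in which I store the highest date seen so far together with its index. The whole data set only has to be iterated once, and at the end you use the dictionary {id_key : [highest_date, index]}.

This way, finding the required rows is really fast.

Producing the final merged data takes only 6 seconds; a speedup of roughly 85x.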

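I have to admit I was very surprised, because I assumed pandas would be optimized for this. Does anyone know what is going on, and whether the dictionary approach should also be an option in pandas? Of course, it could easily be adapted to other criteria as well, such as sum, min, etc.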

My code:

# 1. Create dictionary
dc = {}
for ind, (ik, d) in enumerate(zip(right['id_key'], right['date'])):
    if ik not in dc:
        dc[ik] = (d, ind)
        continue
    if (d, ind) > dc[ik]:
        dc[ik] = (d, ind)

# 2. Collect all row positions first, so that right is subset only once
# (subsetting inside the loop was slow). inds gets one entry per row of left.
inds = []
for x in left['id_key']:
    # Missing strategy (affects very, very few ids): if x is not in dc, row keeps
    # its previous value, so the last row found is appended again.
    # This assumes the first id_key in left is present in dc.
    if x in dc:
        row = dc[x][1]
    inds.append(row)

# 3. Take the values
result = right.iloc[inds]
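As a point of comparison, the three steps above can also be sketched with built-in pandas operations: sort by date, keep the last row per id_key, then left-merge. This is only a sketch on made-up data, not a benchmark. The frames below are hypothetical stand-ins, and ids missing from right get NaN/NaT here, which differs from the "reuse the previous row" strategy in the loop above.

```python
import pandas as pd

# Toy stand-ins for the two DataFrames (values are hypothetical).
left = pd.DataFrame({"id_key": [1, 2, 3]})
right = pd.DataFrame({
    "id_key": [1, 1, 1, 2],
    "date": pd.to_datetime(
        ["2015-07-06", "2015-07-07", "2014-04-04", "2016-01-01"]
    ),
    "other": ["a", "b", "c", "d"],
})

# A stable sort keeps the original order among equal dates, so keep="last"
# breaks ties in favour of the later row, like the (d, ind) comparison.
latest = (
    right.sort_values("date", kind="mergesort")
         .drop_duplicates("id_key", keep="last")
)

# Left merge: one output row per row of left, since latest has unique id_key.
result = left.merge(latest, on="id_key", how="left")
print(result)
```

The left merge keeps the row order of left, so result lines up with left in the same way as right.iloc[inds] does.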

Pandas really slow join

0 answers: