Question

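I have two dataframes, df_a and df_b, each with one million rows.

df_a has two columns: uniqueID, start_time
df_b has two columns: uniqueID, end_time

My goal is to produce a dataframe (df_final) with the following columns:

uniqueID, start_time, end_time

df_final must contain all the data in df_a (i.e. uniqueID, start_time). For each (uniqueID, start_time) in df_final, the end_time must come from the df_b rows with the same uniqueID. df_b has multiple end_time values per uniqueID. For each uniqueID, df_final["end_time"] must be the end_time closest to the start_time. If df_b has no such closest end_time, df_final["end_time"] must be NULL.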

My approaches:

Approach 1:

#define a function
def get_end_date(uniqueID, start_time):
    sub_sectiondata = df_b[(df_b['uniqueID']==uniqueID) & (df_b["end_time"]>start_time)]
    if len(sub_sectiondata) == 0:
        return None
    else:
        return min(sub_sectiondata["end_time"])

Then apply the function above to df_a.

df_a['end_time'] = df_a.apply(lambda x: get_end_date(x['uniqueID'], x['start_time']), axis=1)

Approach 2:

import pandas as pd

# Collect the rows in a list and build df_final once at the end
# (DataFrame.append was removed in pandas 2.0).
rows = []

for row in df_a.itertuples():
    sub_sectiondata = df_b[(df_b['uniqueID']==row.uniqueID) & (df_b["end_time"]>row.start_time)]
    if len(sub_sectiondata)>0:
        end_time = sub_sectiondata["end_time"].min()
    else:
        end_time = None
    rows.append({'uniqueID':row.uniqueID, 'start_time':row.start_time, 'end_time': end_time})

df_final = pd.DataFrame(rows, columns=['uniqueID', 'start_time', 'end_time'])

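Both approaches give the expected result, but they take a very long time to run. The time grows linearly (about 20 minutes per 10,000 records), so one million records would take about 33 hours, which is far too long. Is there another way to solve this in Python? Any help from the community would be appreciated.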

Answer 1

This logic probably takes so long because pandas has to scan the whole dataframe many times. Try profiling it to see where the time goes (https://docs.python.org/2/library/profile.html + https://jiffyclub.github.io/snakeviz/).
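As a rough illustration of that profiling step, here is a minimal sketch that profiles the question's get_end_date on a tiny sample with the standard-library cProfile and pstats modules. The sample values and the run_sample helper are made up for the example; on real data you would pass something like df_a.head(1000) instead.

```python
import cProfile
import io
import pstats

import pandas as pd

# Tiny stand-in frames (hypothetical values) using the question's column names.
df_a = pd.DataFrame({"uniqueID": [1, 1, 2], "start_time": [1, 5, 2]})
df_b = pd.DataFrame({"uniqueID": [1, 1, 2], "end_time": [3, 7, 1]})

def get_end_date(uniqueID, start_time):
    # Same logic as Approach 1: a full scan of df_b for every row of df_a.
    matches = df_b[(df_b["uniqueID"] == uniqueID) & (df_b["end_time"] > start_time)]
    return None if matches.empty else matches["end_time"].min()

def run_sample(sample):
    # Hypothetical helper: the row-wise apply from the question, on a sample.
    return sample.apply(lambda x: get_end_date(x["uniqueID"], x["start_time"]), axis=1)

profiler = cProfile.Profile()
profiler.enable()
result = run_sample(df_a)
profiler.disable()

# Print the ten entries with the highest cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The same statistics can be written to a file with pstats.Stats.dump_stats and opened in snakeviz for a graphical view.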

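What you could try is:

Copy A
Add an end_time column filled with null values
Then iterate over B and, for the rows of A with the same ID, replace the stored end_time whenever B's end_time is lower.

That way you only iterate over B once. It might be faster, but it would need checking. If you provide a sample dataset, I can give it a try.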

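I don't know whether dataframes are the most efficient data structure in terms of speed. Using NumPy arrays directly might be faster. If you are really desperate, Cython might also be a way to speed things up, but I have no experience with it.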
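One possible reading of the NumPy suggestion, offered only as a sketch: sort each ID's end times once, then use np.searchsorted to find the first end_time strictly after each start_time by binary search instead of rescanning df_b. The sample values and the first_end_after helper are assumptions made for the example, and it has not been benchmarked against the original approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the question's column names.
df_a = pd.DataFrame({"uniqueID": [1, 1, 2, 3], "start_time": [10, 50, 20, 5]})
df_b = pd.DataFrame({"uniqueID": [1, 1, 1, 2], "end_time": [30, 10, 70, 15]})

# Sort each ID's end times once, up front.
end_times_by_id = {
    uid: np.sort(group["end_time"].to_numpy())
    for uid, group in df_b.groupby("uniqueID")
}

def first_end_after(uid, start):
    # Hypothetical helper: smallest end_time strictly greater than start, else None.
    ends = end_times_by_id.get(uid)
    if ends is None:
        return None
    pos = np.searchsorted(ends, start, side="right")
    return ends[pos] if pos < len(ends) else None

df_final = df_a.copy()
df_final["end_time"] = [
    first_end_after(uid, start)
    for uid, start in zip(df_a["uniqueID"], df_a["start_time"])
]
print(df_final)
```

This still loops over df_a in Python, but each lookup is a binary search over one ID's end times rather than a boolean mask over all of df_b.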

Answer 2

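Iterating over a dataframe row by row is a slow operation. The magic word here is merge_asof: for each row of the first dataframe, it selects the row of the second dataframe whose numeric or date key comes just before it, just after it, or nearest to it.

So if you want the end_time closest to the start_time, you can do: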

df_final = pd.merge_asof(df_a.sort_values('start_time'), df_b.sort_values('end_time'),
    left_on='start_time', right_on='end_time',
    by='uniqueID', direction='nearest')

But that allows an end_time that falls before the start_time. If you want to make sure that end_time >= start_time, use direction='forward':

df_final = pd.merge_asof(df_a.sort_values('start_time'), df_b.sort_values('end_time'),
    left_on='start_time', right_on='end_time',
    by='uniqueID', direction='forward')
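To make the behaviour concrete, here is a small sketch on made-up timestamps. It also passes allow_exact_matches=False, on the assumption that you want the strict end_time > start_time comparison used in the question's code; drop that argument to allow end_time == start_time.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df_a = pd.DataFrame({
    "uniqueID": ["x", "x", "y", "z"],
    "start_time": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 12:00",
        "2021-01-01 09:00", "2021-01-01 08:00",
    ]),
})
df_b = pd.DataFrame({
    "uniqueID": ["x", "x", "x", "y"],
    "end_time": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 11:00",
        "2021-01-01 13:00", "2021-01-01 08:30",
    ]),
})

df_final = pd.merge_asof(
    df_a.sort_values("start_time"),   # both frames must be sorted on their key
    df_b.sort_values("end_time"),
    left_on="start_time",
    right_on="end_time",
    by="uniqueID",                    # only match rows with the same ID
    direction="forward",              # look for an end_time after start_time
    allow_exact_matches=False,        # strict: end_time > start_time
)
print(df_final)
```

Every row of df_a is kept. Rows with no later end_time for their ID (here "y" and "z") get NaT in end_time, and the result comes back ordered by start_time rather than in df_a's original order.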
