Question

Dataset description:

I have a dataset that looks like this (sorted by key, which also means it is sorted by offset_time_stamp):

key       offset_time_stamp       person          7_sec_count
1         0                       A               0
2         0                       B               0
3         0                       A               1
4         1                       A               2
5         2                       B               1
6         7                       A               1
7         9                       B               0

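There are about 20 million rows and about 4 million unique persons (the number of records per person within the past 7 seconds ranges from 0 to 10k+).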

I want to count the rows that appeared for the same person within the past 7 seconds (computed from offset_time_stamp).
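To pin down the target, here is a small brute-force sketch that reproduces the 7_sec_count column of the sample above. The strict lower bound (timestamp > current - 7) and the tie-breaking by key are inferred from the sample values, not stated rules, and the name brute_force_count is only illustrative. It is far too slow for the real data and is meant purely as a reference.

```python
import pandas as pd

# The sample rows from the table above.
data = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6, 7],
    "offset_time_stamp": [0, 0, 0, 1, 2, 7, 9],
    "person": list("ABAABAB"),
})

def brute_force_count(df, window=7):
    """Count earlier rows (smaller key) of the same person inside the window."""
    counts = []
    for row in df.itertuples(index=False):
        earlier = df[
            (df.person == row.person)
            & (df.key < row.key)  # ties on the timestamp are broken by key
            & (df.offset_time_stamp > row.offset_time_stamp - window)  # assumed strict bound
        ]
        counts.append(len(earlier))
    return counts

expected = brute_force_count(data)
```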

Here is what I tried:

def get_count(x):
  return [data[the_condition].count() for row in x.iterrows()]

counts = data.groupby(person).apply(get_count)

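This code takes about 6 hours. I considered resampling at 1-second intervals, but that did not work because the dataset has several rows for the same person within the same second, and I do not have microsecond-level data. I need to break ties between identical timestamps using the key.

What do I want to do now?

I now want to rerun the same exercise with 100 million rows and increase the 7-second window to 7000 seconds. The existing code is estimated to take 5 days.

How can I make this computation faster? I would like it to finish within 2-3 hours so that I can analyze larger datasets. I am open to porting the solution to another language or to using a pure numpy solution instead of pandas.

Can I also exploit the sorted nature of the data? I tried cumcount: group by person, then subtract the cumcount of the last row that falls just outside the 7-second window. This took too long, probably because of the way I wrote it: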

data_cumcount = data.groupby(person).cumcount() + 1

def get_subtractor(current_ts, person, index):
    global data_cumcount
    to_consider = sorted(person_level_indices[person][:person_level_indices[person].index(index)], reverse=True)
    for index in to_consider:
        if data_dict[index] <= current_ts: # -7 was already done and passed to this function
            return data_cumcount[index]
    return 0

def get_count(x):
    global data_cumcount
    return data_cumcount[x.name] - get_subtractor(x[TIME_STAMP]-7, x[person], x.name) - 1

counts = data.apply(get_count, axis=1)

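I suspect the for loop in get_subtractor is what makes this approach so slow.

I have tried a few other approaches, including recursion, but data.groupby(person).apply(get_count) is the one that performed best among my apparently inefficient attempts.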

Answer 1

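Edit: Following the suggestion to exploit the sort order of the timestamps, and given that searchsorted is vectorized, I was able to cut the execution time by a factor of 3.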

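import numpy as np
import pandas as pd

def get_count(dfg):
    # Position of the first row of this person that is still inside the window
    window_start = dfg.offset_time_stamp.searchsorted(dfg.offset_time_stamp - 6)
    # Everything between that position and the current row is an earlier in-window row
    return pd.Series(np.arange(len(dfg)) - window_start, index=dfg.index)

df['count'] = df.groupby('person').apply(get_count).reset_index(level=0, drop=True)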

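Note that searchsorted does not return index labels of the Series but positions in the underlying numpy array. The count is therefore just the difference between the current position (i.e. np.arange) and the position returned by searchsorted.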
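As a small worked illustration of that position arithmetic, here are person A's timestamps from the question's sample. The array name ts_a is just for this sketch, and the offset of 6 assumes integer timestamps, so that "greater than t - 7" is the same as "at least t - 6".

```python
import numpy as np

# Person A's timestamps from the sample, already sorted.
ts_a = np.array([0, 0, 1, 7])

# For each row, the position of the first row with timestamp >= current - 6
# (the default side='left' picks the leftmost such position).
start = np.searchsorted(ts_a, ts_a - 6)

# Current position minus window start = number of earlier rows in the window.
counts = np.arange(len(ts_a)) - start
```

For the last row (timestamp 7), the window starts at position 2, so only the row at timestamp 1 is counted.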

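Converting the result to a Series adds some overhead, but I need it to reassign the result at the end. If you find a way to work directly with the numpy values, you could probably cut the time in half again.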
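One possible way to stay in numpy throughout is sketched below. It is not the code timed above and is untested at the scale in the question. It assumes integer timestamps, rows already in time order, and that a composite key of person code and timestamp fits in int64. The function name window_counts and the composite-key trick are my own illustration.

```python
import numpy as np
import pandas as pd

def window_counts(df, window=7):
    """Per-row count of earlier same-person rows with timestamp > current - window."""
    codes, _ = pd.factorize(df["person"])  # integer code per person
    ts = df["offset_time_stamp"].to_numpy(dtype=np.int64)
    ts = ts - ts.min()  # shift so the smallest timestamp is 0

    # A stable sort by person keeps the original (time/key) order within each person.
    order = np.argsort(codes, kind="stable")

    # One sorted key for all rows; span is wide enough that a lookup of
    # "timestamp - (window - 1)" can never land inside the previous person's rows.
    span = ts.max() + window + 1
    composite = codes[order].astype(np.int64) * span + ts[order]

    start = np.searchsorted(composite, composite - (window - 1), side="left")
    counts_sorted = np.arange(len(df)) - start

    # Scatter the counts back to the original row order.
    counts = np.empty(len(df), dtype=np.int64)
    counts[order] = counts_sorted
    return counts

# The sample from the question.
sample = pd.DataFrame({
    "offset_time_stamp": [0, 0, 0, 1, 2, 7, 9],
    "person": list("ABAABAB"),
})
counts = window_counts(sample)
```

This replaces the per-group apply with a single searchsorted call over all rows. The costs are one stable sort and the int64 composite key.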

Original answer:

There may be something to the cumcount approach you mention, but in the meantime this may already be faster:

def get_count(dfg):
    return dfg.apply(lambda row: dfg[dfg.offset_time_stamp > row['offset_time_stamp'] - 7].loc[:row.name-1].count(), axis=1)

df.groupby('person').apply(get_count)

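I use native slicing and selection instead of an iterrows loop. Note that when a function is applied row by row to a DataFrame, row.name is the row's index label in the original DataFrame.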
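A tiny sketch of the row.name behaviour, using a made-up frame with index labels 5, 6 and 7: after filtering, each row still reports its original label, not its position in the filtered frame.

```python
import pandas as pd

# Hypothetical frame; the index labels deliberately do not start at 0.
df_demo = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 6, 7])

# Keep only the rows with x > 10, then ask every row for its name.
subset = df_demo[df_demo.x > 10]
names = subset.apply(lambda row: row.name, axis=1).tolist()
```

This is why .loc[:row.name-1] in the snippet above selects the rows before the current one, provided the index is an increasing integer index.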

Count rows matching a condition in a pandas DataFrame (using the sorted nature of the data if possible)
