Question

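I have a dataframe "A" (approx. 500k records). It contains two columns: "fromTimestamp" and "toTimestamp".

I have a dataframe "B" (approx. 5M records). It has some values and a column named "actualTimestamp".

I want to flag every row in dataframe "B" whose "actualTimestamp" value lies between the values of any "fromTimestamp" and "toTimestamp" pair.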

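I want something like this, but with much more efficient code:

for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .loc replaces .ix, which was removed in pandas 1.0
    B.loc[cond1 & cond2, 'corrupted_flag'] = True

What is the fastest / most efficient way to do this in python / pandas?

Update: sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp    corrupted_flag
a       2                   False
b       3                   True
c       4                   True
d       5                   False
e       8                   True
f       10                  True
g       11                  False
h       12                  False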

Answer 1

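You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check whether each timestamp in DataFrame B overlaps any interval in the tree: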

from intervaltree import IntervalTree

tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))

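Note that you need to pad A['to_timestamp'] slightly, because in the intervaltree package the upper bound of an interval is not part of the interval (although the lower bound is).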
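The effect of the padding can be illustrated without the package. The sketch below uses a hypothetical helper, `in_half_open`, which only mimics an interval that includes its lower bound and excludes its upper bound; it is not part of intervaltree.

```python
def in_half_open(begin, end, x):
    """Return True if x lies in [begin, end): lower bound in, upper bound out."""
    return begin <= x < end

# Interval (3, 4) from the sample data, without and with padding.
print(in_half_open(3, 4, 4))        # False: the upper bound itself is excluded
print(in_half_open(3, 4 + 0.1, 4))  # True: the padded upper bound covers 4
```

With integer timestamps any padding smaller than 1 works, since it cannot reach the next valid value.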

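On some sample data I generated (A = 10k rows, B = 100k rows), this method improved performance by a factor of 14. The more rows I added, the greater the improvement.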
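The gain comes from doing one interval lookup per timestamp instead of scanning all of B once for every row of A. If installing intervaltree is not an option, a similar effect can be sketched with the standard library alone. This is a different technique from the interval tree above: it merges overlapping intervals and then binary-searches them. The names `merge_intervals` and `flag_points` are made up for this example, and the intervals are treated as closed, so no padding is needed.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping closed intervals into a sorted, disjoint list."""
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous interval: extend it if necessary.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return merged

def flag_points(intervals, points):
    """Return one boolean per point: True if it lies in any interval."""
    merged = merge_intervals(intervals)
    starts = [begin for begin, _ in merged]
    flags = []
    for x in points:
        # Index of the last merged interval that starts at or before x.
        i = bisect_right(starts, x) - 1
        flags.append(i >= 0 and x <= merged[i][1])
    return flags

# Sample data from the question.
intervals = [(3, 4), (6, 9), (8, 10)]
points = [2, 3, 4, 5, 8, 10, 11, 12]
flags = flag_points(intervals, points)
print(flags)  # [False, True, True, False, True, True, False, False]
```

With DataFrames, the same call should accept `zip(A['from_timestamp'], A['to_timestamp'])` and `B['actual_timestamp']` as arguments, and the returned list can be assigned to a new column of B.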

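I have used the intervaltree package with datetime objects before, so the code above should still work if your timestamps are not integers as in the sample data; you may just need to change how the upper bound is padded.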
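One possible way to pad datetime bounds, as a sketch only: add the smallest step the data can resolve instead of 0.1. The one-microsecond step, the `pad_upper` helper, and the dates below are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: timestamps are resolved to microseconds at most, so one
# microsecond is the smallest padding that makes the upper bound inclusive.
PAD = timedelta(microseconds=1)

def pad_upper(begin, end, pad=PAD):
    """Return the interval with its upper bound shifted up by `pad`."""
    return begin, end + pad

begin, end = pad_upper(datetime(2020, 1, 1, 3), datetime(2020, 1, 1, 4))

# Half-open check, mirroring how the padded interval would be queried.
print(begin <= datetime(2020, 1, 1, 4) < end)  # True
```

If the timestamps are coarser, for example whole seconds, a correspondingly larger padding such as `timedelta(seconds=1)` minus one microsecond would be just as valid.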

Answer 2

Based on the idea above, my final solution is as follows (it does not raise a MemoryError on large datasets):

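from intervaltree import IntervalTree

def flagDataWithGaps(A, B):
    # Work on float copies so the upper bound can be padded by a fraction.
    A['from_ts'] = A['from'].astype(float)
    # intervaltree excludes the upper bound of an interval, so pad it slightly.
    A['to_ts'] = A['to'].astype(float) + 0.1
    B['actual_ts'] = B['actual'].astype(float)

    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))

    # Assign a plain list so the flags are matched to rows by position,
    # not by index label (B may not have a default 0..n-1 index).
    B['is_gap'] = [tree.overlaps(x) for x in B['actual_ts']]
    return B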

Most efficient way to flag a dataframe column by the values of another dataframe in python / pandas?
