Efficiently checking two conditions on a dataset in Python

Time: 2018-05-03 13:15:20

Tags: python performance pandas dataset

I want to find the number of rows in a dataset that satisfy two conditions:

counter = len(train[(train['tag'] == label) & (train['word'] == word)])

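But given the amount of data I have and the number of times I need to run this count, it takes a very long time.

Is there a faster way?

Update: @jezrael's solution made it almost twice as fast, but it still takes a long time.

Here is the more complete code: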

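for index, row in tqdm(test.iterrows()):
    word = row['word']
    best_score = float('-inf')  # reset for each row; avoids shadowing the built-in max
    for label in labels:
        # labelDict and LastLable are defined elsewhere in my code
        temp1 = train.eval('tag == @label and word == @word').sum() / labelDict[label]
        temp2 = train.eval('tag == @label and tag1 == @LastLable').sum() / labelDict[label]
        temp = temp1 * temp2
        if temp > best_score:  # keep the highest-scoring label
            best_score = temp
            bestLabel = label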

1 answer:

Answer 0 (score: 1)

Use DataFrame.eval, which can use the numexpr module, and sum the True values:

counter = train.eval('tag == @label and word == @word').sum()

Another solution, which is slower:

counter = ((train['tag'] == label) & (train['word'] == word)).sum()
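As a quick sanity check, the three ways of counting discussed here can be compared on a tiny frame. This is only a minimal sketch: the six-row `train` frame mirrors the sample data used in the timings, and the names `mask`, `count_len`, `count_sum` and `count_eval` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the real training data (same shape as the timing sample).
train = pd.DataFrame({'tag': list('abaaea'), 'word': list('baabbb')})
label, word = 'a', 'b'

# Boolean mask: True where both conditions hold.
mask = (train['tag'] == label) & (train['word'] == word)

# 1) Filter the frame, then take its length (the question's approach).
count_len = len(train[mask])

# 2) Sum the mask directly; each True counts as 1, and no filtered copy is built.
count_sum = int(mask.sum())

# 3) Let DataFrame.eval build the mask; @name refers to a local Python variable.
count_eval = int(train.eval('tag == @label and word == @word').sum())

print(count_len, count_sum, count_eval)
```

All three should agree; they differ only in how much intermediate work pandas has to do to get there.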

Timings:

import pandas as pd

train = pd.DataFrame({'tag': list('abaaea'),
                      'word': list('baabbb')})

print(train)

# Repeat the 6 sample rows 100,000 times: 600k rows
train = pd.concat([train] * 100000, ignore_index=True)

label = 'a'
word = 'b'

In [214]: %timeit (((train['tag'] == label) & (train['word'] == word)).sum())
10 loops, best of 3: 84.6 ms per loop

In [215]: %timeit (train.eval('tag == @label and word == @word').sum())
10 loops, best of 3: 25.8 ms per loop

In [216]: %timeit (len(train[(train['tag'] == label) & (train['word'] == word)]))
10 loops, best of 3: 90.9 ms per loop
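If the same kind of count is needed many times, as in the loop from the question, it may be cheaper to count every (tag, word) pair once and then look the numbers up, instead of scanning the whole frame for each pair. The following is a rough sketch of that idea and not part of the timings above; `pair_counts` and `count_pair` are hypothetical names, and the small `train` frame is the same sample data as before.

```python
import pandas as pd

# Small sample; in practice this would be the full training frame.
train = pd.DataFrame({'tag': list('abaaea'), 'word': list('baabbb')})

# One pass over the data: number of rows for every (tag, word) combination.
# The resulting dict is keyed by (tag, word) tuples.
pair_counts = train.groupby(['tag', 'word']).size().to_dict()

def count_pair(label, word):
    """Return how many rows have this tag and this word (0 if never seen)."""
    return pair_counts.get((label, word), 0)

label, word = 'a', 'b'
fast = count_pair(label, word)

# Cross-check against the mask-based count from the answer.
slow = int(((train['tag'] == label) & (train['word'] == word)).sum())

print(fast, slow)
```

Each later lookup is then a plain dictionary access, so the cost of scanning the frame is paid once rather than once per word and label. The same pattern could be applied to the (tag, tag1) pair from the question, assuming those columns exist in the real data.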