I want to find the number of rows in a dataset that satisfy two conditions:
counter = len(train[(train['tag'] == label) & (train['word'] == word)])
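But given the amount of data I have and the number of times I need to run it, this takes a very long time to compute.
Is there a faster way?
Update: @jezrael's solution made it almost three times as fast, but it still takes a long time.
Here is the more complete code: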
for index, row in tqdm(test.iterrows()):
    word = row['word']
    for label in labels:
        temp1 = train.eval('tag == @label and word == @word').sum()/labelDict[label]
        temp2 = train.eval('tag == @label and tag1 == @LastLable').sum()/labelDict[label]
        temp = temp1 * temp2
        if max > temp:
            max = temp
            bestLabel = label
Answer 0 (score: 1)
Use DataFrame.eval, which uses the numexpr module, and sum the True values:
counter = train.eval('tag == @label and word == @word').sum()
Another solution, slower:
counter = ((train['tag'] == label) & (train['word'] == word)).sum()
Performance:
import pandas as pd

train = pd.DataFrame({'tag': list('abaaea'),
                      'word': list('baabbb')})
print (train)
#600k rows
train = pd.concat([train] * 100000, ignore_index=True)
label = 'a'
word = 'b'
In [214]: %timeit (((train['tag'] == label) & (train['word'] == word)).sum())
10 loops, best of 3: 84.6 ms per loop
In [215]: %timeit (train.eval('tag == @label and word == @word').sum())
10 loops, best of 3: 25.8 ms per loop
In [216]: %timeit (len(train[(train['tag'] == label) & (train['word'] == word)]))
10 loops, best of 3: 90.9 ms per loop
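
If the same count is needed for many (label, word) pairs, as in the loop from the question, it may be cheaper to count every pair once and then look the values up, rather than scanning train on each iteration. This was not timed above. The following is a minimal sketch on the small sample frame; the names pair_counts and count_rows are made up for illustration.

```python
import pandas as pd

# Same small sample frame as in the timings above.
train = pd.DataFrame({'tag': list('abaaea'),
                      'word': list('baabbb')})

# Count every (tag, word) pair in a single pass over the data.
# The resulting dict is keyed by (tag, word) tuples.
pair_counts = train.groupby(['tag', 'word']).size().to_dict()

def count_rows(label, word):
    # Pairs that never occur are absent from the dict, so default to 0.
    return pair_counts.get((label, word), 0)

label = 'a'
word = 'b'
counter = count_rows(label, word)
print(counter)
```

Each lookup is then a dictionary access instead of a pass over the whole DataFrame. Whether this helps in practice depends on how many distinct pairs there are and how often they are queried.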