I have a DataFrame of sentences and a dictionary of terms grouped by topic, and I want to count the number of term matches for each topic.
import pandas as pd
terms = {'animals':["fox","deer","eagle"],
'people':['John', 'Rob','Steve'],
'games':['basketball', 'football', 'hockey']
}
df=pd.DataFrame({
'Score': [4,6,2,7,8],
'Foo': ['The quick brown fox was playing basketball today','John and Rob visited the eagles nest, the foxes ran away','Bill smells like a wet dog','Steve threw the football at a deer. But the football missed','Sheriff John does not like hockey']
})
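So far I have created a column for each topic and, by iterating over the dictionary, marked it as 1 if any of that topic's words is present.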
df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])
for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1
print(df)
I get:
>>>
                                                 Foo  Score animals games  \
0   The quick brown fox was playing basketball today      4       1     1
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN
2                         Bill smells like a wet dog      2     NaN   NaN
3  Steve threw the football at a deer. But the fo...      7       1     1
4                  Sheriff John does not like hockey      8     NaN     1

  people
0    NaN
1      1
2    NaN
3      1
4      1
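What is the best way to count how many words from each topic appear in a sentence? And is there a more efficient way to loop over the dictionary without using cython?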
Answer 0 (score: 1)
You can use split with stack, which is about 5 times faster than the Counter solution:
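# one row per word, keeping the original row number in the 'index' column
df1 = (df.Foo.str.split(expand=True)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='Foo'))
for k, v in terms.items():
    # True where the word contains any term of topic k (substring match)
    df1[k] = df1.Foo.str.contains('|'.join(v))
#print(df1)
# sum only the topic columns so the string column Foo is not aggregated
print(df1.groupby('index')[list(terms)].sum().astype(int))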
       games  animals  people
index
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
Timings:
In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop
In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop
Code:
def a(df):
    # split/stack solution: one row per word, then substring match per topic
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    # sum only the topic columns so the string column Foo is not aggregated
    return df1.groupby('index')[list(terms)].sum().astype(int)

def b(df):
    # Counter/replace solution from the other answer
    from collections import Counter
    df1 = pd.DataFrame(terms)
    res = []
    for i, r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)
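
A shorter variant of the same idea, sketched under the assumption that substring matches are what you want to count (so `foxes` counts for `fox`, as in the output above). `Series.str.count` takes a regex and returns the number of matches per row, so the split and groupby steps are not needed. The name `counts` is only illustrative, and `re.escape` is an added precaution in case a term contains regex metacharacters; it is not part of the original code.

```python
import re
import pandas as pd

terms = {'animals': ["fox", "deer", "eagle"],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

# One alternation pattern per topic; str.count returns the number of
# non-overlapping regex matches in each sentence (case-sensitive).
counts = pd.DataFrame({
    k: df['Foo'].str.count('|'.join(re.escape(t) for t in v))
    for k, v in terms.items()
})
print(counts)
```

Note that this counts every occurrence of a term, whereas the split/stack version counts the words that contain a term. For this data the two give the same numbers.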
Answer 1 (score: 1)
I would go with Counter and replace:
from collections import Counter

df1 = pd.DataFrame(terms)
res = []
for i, r in df.iterrows():
    # replace each term with its count in the sentence, zero out the rest, then sum per topic
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
    res.append(pd.DataFrame(s).T)
In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
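
These numbers differ from the split/stack answer for the sentences containing `eagles`, `foxes` and `deer.`, and the likely reason is the matching rule: `str.contains` matches substrings, while the `Counter` lookup only matches whole whitespace-separated tokens. A minimal plain-Python sketch of the difference (the helper names `substring_hits` and `token_hits` are made up for illustration):

```python
from collections import Counter

animals = ["fox", "deer", "eagle"]
sentence = "John and Rob visited the eagles nest, the foxes ran away"
tokens = sentence.split()

def substring_hits(tokens, words):
    # Count tokens that contain any of the terms as a substring
    return sum(any(w in tok for w in words) for tok in tokens)

def token_hits(tokens, words):
    # Count tokens that are exactly equal to one of the terms
    c = Counter(tokens)
    return sum(c[w] for w in words)

print(substring_hits(tokens, animals))  # 2 ("eagles" and "foxes")
print(token_hits(tokens, animals))      # 0 (no token equals a term exactly)
```

Which rule is right depends on whether plurals and words with trailing punctuation should count as matches.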