Question

I have a dataframe of sentences and a dictionary of terms grouped by topic, and I want to count the number of term matches for each topic.

import pandas as pd

terms = {'animals':["fox","deer","eagle"],
'people':['John', 'Rob','Steve'],
'games':['basketball', 'football', 'hockey']
}

df=pd.DataFrame({
'Score': [4,6,2,7,8],
'Foo': ['The quick brown fox was playing basketball today','John and Rob visited the eagles nest, the foxes ran away','Bill smells like a wet dog','Steve threw the football at a deer. But the football missed','Sheriff John does not like hockey']
})

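So far I have created a column for each topic and, by iterating over the dictionary, marked it with 1 if any of the topic's words is present.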

df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])


for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1


print (df)

I get:

>>> 
                                                 Foo  Score animals games  \
0   The quick brown fox was playing basketball today      4       1     1   
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN   
2                         Bill smells like a wet dog      2     NaN   NaN   
3  Steve threw the football at a deer. But the fo...      7       1     1   
4                  Sheriff John does not like hockey      8     NaN     1   

  people  
0    NaN  
1      1  
2    NaN  
3      1  
4      1

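What is the best way to count the number of words from each topic that appear in a sentence? And is there a more efficient way to loop over the dictionary without using cython?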

Answer 1

You can use split with stack, which is about 5 times faster than the Counter solution:

df1 = (df.Foo.str.split(expand=True).stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='Foo'))

for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(terms[k]))
#print df1

print(df1.groupby('index').sum().astype(int))
       games  animals  people
index                        
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
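
For comparison, here is a minimal sketch of the same substring-based counting done directly on the original column with `Series.str.count`, skipping the split/stack step. It is an alternative, not the method timed in this answer. `count_terms` is a hypothetical helper name, and the `re.escape` call is an added assumption in case a term ever contains regex metacharacters.

```python
import re

import pandas as pd

terms = {'animals': ["fox", "deer", "eagle"],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})


def count_terms(frame, terms, column='Foo'):
    """Count substring matches per topic (hypothetical helper)."""
    counts = {}
    for topic, words in terms.items():
        # One alternation pattern per topic, e.g. 'fox|deer|eagle'.
        pattern = '|'.join(re.escape(w) for w in words)
        # str.count returns the number of non-overlapping matches per row.
        counts[topic] = frame[column].str.count(pattern)
    return pd.DataFrame(counts, index=frame.index)


result = count_terms(df, terms)
print(result)
```

Like the split/stack version, this matches substrings, so `foxes` and `eagles` count as hits for `fox` and `eagle`.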

Timings:

In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop

In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop

Code:

def a(df):
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(terms[k]))
    return df1.groupby('index').sum().astype(int)

def b(df):
    from collections import Counter

    df1 = pd.DataFrame(terms)

    res = []
    for i,r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w',0,regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)

Answer 2

I would go with Counter and replace:

from collections import Counter

df1 = pd.DataFrame(terms)

res = []
for i,r in df.iterrows():
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w',0,regex=True).sum()
    res.append(pd.DataFrame(s).T)


In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
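
Note that `split()` yields whole whitespace-separated tokens, so `eagles`, `foxes` and `deer.` are not equal to `eagle`, `fox` and `deer`, which is why the animals column is 0 for those rows. A minimal sketch of the same exact-token counting without the `replace` step, in case that is easier to follow; `count_exact` is a hypothetical helper name, not part of pandas:

```python
from collections import Counter

import pandas as pd

terms = {'animals': ["fox", "deer", "eagle"],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})


def count_exact(frame, terms, column='Foo'):
    """Count exact whitespace-separated token matches per topic (hypothetical helper)."""
    rows = []
    for sentence in frame[column]:
        # Counter returns 0 for tokens that are absent, so no KeyError handling is needed.
        tokens = Counter(sentence.split())
        rows.append({topic: sum(tokens[w] for w in words)
                     for topic, words in terms.items()})
    return pd.DataFrame(rows, index=frame.index)


exact = count_exact(df, terms)
print(exact)
```

Unlike the output above, this sketch keeps the original row index instead of repeating 0 for every row.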
