My DataFrame looks like this:
Utterance                          Frequency
Directions to Starbucks            1045
Show me directions to Starbucks    754
Give me directions to Starbucks    612
Navigate me to Starbucks           498
Display navigation to Starbucks    376
Direct me to Starbucks             201
Navigate to Starbucks              180
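Here, I have some data showing what people said, along with how often each utterance occurred.
That is, "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, and so on.
I was able to get the per-word frequencies with the following: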
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
     )
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
However, I am also trying to get the following output:
print(df)
Words Frequency
0 Directions 2411
1 Give_Show_Direct_Navigate 2245
2 Display 376
3 Starbucks 3666
4 me 2065
5 navigation 376
6 to 3666
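That is, I am trying to find a way to combine certain words and get the combined frequency for each group. For example, if a speaker said "Seattles_Best" or "Tullys", I would ideally add those counts to "Starbucks" and rename the group "coffee_shop" or something similar.
Thanks!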
Answer 0 (score: 2)
Here is a solution that starts from your current result set and edits it as needed:
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
First, create a dictionary that maps each current word to the new label of your choice:
phrase_map = {'Starbucks': 'coffee_shop',
              'Seattles_Best': 'coffee_shop',
              'Tullys': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}
Then look up each word, replacing it with the new value if one is found and keeping the original otherwise:
df['Words'] = df['Words'].map(lambda x: phrase_map.get(x, x))
Then group and sum:
df.groupby('Words').sum()
Result:
Frequency
Words
Directions 1045
Display 376
Give_Show_Direct_Navigate 2245
coffee_shop 3666
directions 1366
me 2065
navigation 376
to 3666
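Note that this result still lists `Directions` and `directions` separately, whereas the desired output in the question combines them (1045 + 1366 = 2411). A minimal sketch of one way to close that gap, assuming the per-word table from the question and an extra `'directions': 'Directions'` map entry (my addition, not part of the answer above). It uses `Series.replace` with a dict, which leaves unmapped words untouched:

```python
import pandas as pd

# Per-word totals copied from the question.
words = pd.DataFrame({
    'Words': ['Direct', 'Directions', 'Display', 'Give', 'Navigate', 'Show',
              'Starbucks', 'directions', 'me', 'navigation', 'to'],
    'Frequency': [201, 1045, 376, 612, 678, 754, 3666, 1366, 2065, 376, 3666],
})

phrase_map = {
    'Starbucks': 'coffee_shop',
    'Give': 'Give_Show_Direct_Navigate',
    'Show': 'Give_Show_Direct_Navigate',
    'Direct': 'Give_Show_Direct_Navigate',
    'Navigate': 'Give_Show_Direct_Navigate',
    'directions': 'Directions',  # assumption: fold the lowercase variant in
}

# Relabel the words, then sum the frequencies of rows that now share a label.
merged = (words.assign(Words=words['Words'].replace(phrase_map))
               .groupby('Words', as_index=False)['Frequency'].sum())
print(merged)
```

Because the dict keys are matched against whole values, `Direct` is relabelled while `Directions` is not.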
Answer 1 (score: 1)
IIUC:
(df.set_index('Frequency')['Utterance'].str.lower()
   .str.split(expand=True)
   .stack()
   .reset_index(name='Words')
   .groupby('Words', as_index=False)['Frequency'].sum()
)
Output:
Words Frequency
0 direct 201
1 directions 2411
2 display 376
3 give 612
4 me 2065
5 navigate 678
6 navigation 376
7 show 754
8 starbucks 3666
9 to 3666
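For comparison, the same case-insensitive count can be written with `DataFrame.explode` (available in pandas 0.25 and later) instead of `set_index`/`stack`. This is a minimal sketch of an alternative formulation, assuming the column names from the question; it is not the answerer's code:

```python
import pandas as pd

df = pd.DataFrame({
    'Utterance': ['Directions to Starbucks',
                  'Show me directions to Starbucks',
                  'Give me directions to Starbucks',
                  'Navigate me to Starbucks',
                  'Display navigation to Starbucks',
                  'Direct me to Starbucks',
                  'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

# Split each lowercased utterance into a list of words, emit one row per
# word (the Frequency value is repeated), then sum per word.
counts = (df.assign(Words=df['Utterance'].str.lower().str.split())
            .explode('Words')
            .groupby('Words', as_index=False)['Frequency'].sum())
print(counts)
```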
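Answer 2 (score: 1)
Here is one approach, sticking with collections.Counter from your previous question.
You can add any number of tuples to lst to append extra results for the combinations you choose.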
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

# Accumulate each word's total frequency across all utterances.
c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    # Append one row per tuple, holding the summed count of its member words.
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('Give', 'Show', 'Navigate', 'Direct')]
res = add_combinations(res, lst)
Result
Count
to 3666
Starbucks 3666
Give_Show_Navigate_Direct 2245
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
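In this result the combined row is appended while `Show`, `Navigate`, `Give`, and `Direct` remain as separate rows. If the goal is to merge them, as in the question's desired output, one option is to apply a word-to-group mapping while counting. A minimal sketch under that assumption; the `groups` dict is illustrative and not part of the answer above:

```python
from collections import Counter

utterances = [
    ('Directions to Starbucks', 1045),
    ('Show me directions to Starbucks', 754),
    ('Give me directions to Starbucks', 612),
    ('Navigate me to Starbucks', 498),
    ('Display navigation to Starbucks', 376),
    ('Direct me to Starbucks', 201),
    ('Navigate to Starbucks', 180),
]

# Hypothetical mapping: every member word points at its group label.
group_label = 'Give_Show_Navigate_Direct'
groups = {w: group_label for w in ('Give', 'Show', 'Navigate', 'Direct')}

c = Counter()
for text, freq in utterances:
    for word in text.split():
        # Count under the group label when the word is mapped, else as itself.
        c[groups.get(word, word)] += freq

for word, count in c.most_common():
    print(word, count)
```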
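Answer 3 (score: 1)
My solution iterates over each word, so if you are thinking of searching for many more words, you should switch to an NLP library such as spaCy or NLTK, which should have functionality for counting word occurrences.
But here is my solution: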
# A is assumed to be the DataFrame, with columns 'Phrase' and 'Frequency'.
lst = ['Directions', 'Give', 'Show', 'Direct', 'Navigate', 'Display', 'Starbucks', 'me', 'navigation', 'to']
for word in lst:
    A[word + '_score'] = A['Phrase'].str.contains(word).astype(int) * A['Frequency'].astype(int)
A.iloc[:, 2:].sum()
This gives:
Directions_score 1045
Give_score 612
Show_score 754
Direct_score 1246
Navigate_score 678
Display_score 376
Starbucks_score 3666
me_score 2065
navigation_score 376
to_score 3666
dtype: int64
You only need to sum the columns to get the number of occurrences.
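One caveat: `str.contains` matches substrings, which is why `Direct_score` is 1246 (201 for `Direct` plus 1045 for `Directions`) rather than 201. A minimal sketch of a whole-word variant, assuming the same `Phrase` and `Frequency` column names and wrapping each word in `\b` word boundaries via `re.escape`:

```python
import re
import pandas as pd

A = pd.DataFrame({
    'Phrase': ['Directions to Starbucks',
               'Show me directions to Starbucks',
               'Give me directions to Starbucks',
               'Navigate me to Starbucks',
               'Display navigation to Starbucks',
               'Direct me to Starbucks',
               'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

lst = ['Directions', 'Give', 'Show', 'Direct', 'Navigate', 'Display',
       'Starbucks', 'me', 'navigation', 'to']

scores = {}
for word in lst:
    # \b on both sides restricts the match to the whole word (case-sensitive).
    pattern = r'\b' + re.escape(word) + r'\b'
    mask = A['Phrase'].str.contains(pattern)
    scores[word] = int(A.loc[mask, 'Frequency'].sum())

print(pd.Series(scores))
```

Collecting the totals in a dict also avoids adding one helper column per word to `A`.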