My DataFrame looks like this:
Utterance Frequency
Directions to Starbucks 1045
Show me directions to Starbucks 754
Give me directions to Starbucks 612
Navigate me to Starbucks 498
Display navigation to Starbucks 376
Direct me to Starbucks 201
Navigate to Starbucks 180
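Here, the data shows utterances people made and how often each utterance was made.
That is, "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, and so on.
I am trying to get the frequency with which each individual word was uttered.
I tried using .value_counts(), but that only gave me the following: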
Utterance Frequency
Starbucks 7
Directions 3
Navigate 2
.
.
.
Instead, I am trying to get the following output:
Utterance Frequency
Starbucks 3666
Directions 2411
Navigate 678
.
.
.
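In other words, I am trying to get the number of times each word was uttered, not the number of rows it appears in, which is what value_counts() gives me. Thanks for your help!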
Answer 0 (score: 2)
I think you need:
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .groupby(level=0)
        .value_counts()
        .reset_index(name='new')
        .assign(Frequency = lambda x: x.Frequency * x['new'])
        .groupby('level_1', as_index=False)['Frequency'].sum()
        .rename(columns={'level_1':'Words'})
      )
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
If each row contains only unique words, the solution simplifies to:
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
      )
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
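Explanation:
1. Set the Frequency column as the index with set_index.
2. split the sentences into a DataFrame of words.
3. Reshape with stack.
4. Count the words within each group with SeriesGroupBy.value_counts.
5. Use assign to multiply the counts column by Frequency.
6. Aggregate with GroupBy.sum.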
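The reshape-then-sum idea in this answer can also be sketched with DataFrame.explode (available since pandas 0.25). This is an alternative technique, not the code the answer uses, and the DataFrame below simply rebuilds the question's sample data. Because every word occurrence becomes its own row that carries the row's Frequency, words repeated within one utterance are weighted once per occurrence.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Utterance': ['Directions to Starbucks',
                  'Show me directions to Starbucks',
                  'Give me directions to Starbucks',
                  'Navigate me to Starbucks',
                  'Display navigation to Starbucks',
                  'Direct me to Starbucks',
                  'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

# Split each utterance into a list of words, then give every word its own row;
# the Frequency value is repeated alongside each word.
words = df.assign(Words=df['Utterance'].str.split()).explode('Words')

# Sum the repeated Frequency values per word to get the weighted totals.
result = words.groupby('Words', as_index=False)['Frequency'].sum()
print(result)
```

Like the answer's code, this sketch is case-sensitive, so "Directions" and "directions" are reported separately.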
Answer 1 (score: 2)
For an O(n) complexity solution, use collections.Counter.
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns = ['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    # row[0] is the index, row[1] the utterance, row[2] its frequency
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)
Result:
Count
to 3666
Starbucks 3666
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
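Explanation: the loop visits each row once, splits the utterance on whitespace, and adds that row's frequency to the Counter entry of every word it contains.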
Answer 2 (score: 0)
This should do the trick:
output = {}
for i in ['starbucks','directions','navigate']:
    output[i] = df[df['Utterance'].str.lower().str.contains(i)]['Frequency'].sum()
Yields:
{'starbucks': 3666, 'directions': 2411, 'navigate': 678}
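One caveat with str.contains is that it matches substrings, so a search for 'direct' would also match rows containing 'directions'. A hedged sketch of a whole-word variant follows. The helper name weighted_total is hypothetical, and the regex word boundary \b is my addition rather than part of the answer.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Utterance': ['Directions to Starbucks',
                  'Show me directions to Starbucks',
                  'Give me directions to Starbucks',
                  'Navigate me to Starbucks',
                  'Display navigation to Starbucks',
                  'Direct me to Starbucks',
                  'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

def weighted_total(frame, word):
    """Sum Frequency over rows whose Utterance contains `word` as a whole word."""
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = rf'\b{re.escape(word)}\b'
    mask = frame['Utterance'].str.contains(pattern, case=False, regex=True)
    return int(frame.loc[mask, 'Frequency'].sum())

output = {w: weighted_total(df, w)
          for w in ['starbucks', 'directions', 'navigate', 'direct']}
print(output)
```

Note that this counts each matching row once, so a word repeated inside a single utterance would not be weighted twice.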