Mapping words to groups of words to get word frequencies

Time: 2018-03-26 16:18:30

Tags: python pandas dataframe statistics frequency

My dataframe looks like this:

Utterance                         Frequency   
Directions to Starbucks           1045
Show me directions to Starbucks   754
Give me directions to Starbucks   612
Navigate me to Starbucks          498
Display navigation to Starbucks   376
Direct me to Starbucks            201
Navigate to Starbucks             180

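Here, the data shows the utterances people made and how often each utterance occurred.

That is, "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, and so on.

I was able to get the desired per-word output with the following: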

df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
        )

print (df)
         Words  Frequency
0       Direct        201
1   Directions       1045
2      Display        376
3         Give        612
4     Navigate        678
5         Show        754
6    Starbucks       3666
7   directions       1366
8           me       2065
9   navigation        376
10          to       3666

However, I am also trying to get the following output:

print (df)
                        Words        Frequency
0                  Directions        2411
1   Give_Show_Direct_Navigate        2245
2                     Display        376
3                   Starbucks        3666
4                          me        2065
5                  navigation        376
6                          to        3666

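That is, I am trying to find a way to combine certain words and get the frequency of the combined group. For example, if speakers also said "Seattles_Best" or "Tullys", I would ideally add those to "Starbucks" and rename the group to "coffee_shop" or something similar.

Thanks!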

4 Answers:

Answer 0 (score: 2)

Here is a solution that starts from your current result set and edits it as needed:

print (df)
         Words  Frequency
0       Direct        201
1   Directions       1045
2      Display        376
3         Give        612
4     Navigate        678
5         Show        754
6    Starbucks       3666
7   directions       1366
8           me       2065
9   navigation        376
10          to       3666

First, create a dictionary that maps the current words to the new words of your choice:

phrase_map = {'Starbucks': 'coffee_shop',
              'Seattles_Best': 'coffee_shop',
              'Tullys': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}

Then look up each word, replacing it with the new value if it is found and keeping the original value otherwise:

df['Words'] = df['Words'].apply(lambda x: phrase_map.get(x, x))

Then group and sum:

df.groupby('Words').sum()

Result:

                           Frequency
Words                               
Directions                      1045
Display                          376
Give_Show_Direct_Navigate       2245
coffee_shop                     3666
directions                      1366
me                              2065
navigation                       376
to                              3666
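Note that the result above still lists "Directions" and "directions" separately, while the output requested in the question combines them (1045 + 1366 = 2411). A minimal sketch of one way to cover that case, assuming the case variant is simply added to the mapping; it uses `Series.map` with `fillna` instead of `apply`, and the `'directions': 'Directions'` entry is my addition, not part of the answer:

```python
import pandas as pd

# Word-level counts copied from the question's first output.
df = pd.DataFrame({
    'Words': ['Direct', 'Directions', 'Display', 'Give', 'Navigate', 'Show',
              'Starbucks', 'directions', 'me', 'navigation', 'to'],
    'Frequency': [201, 1045, 376, 612, 678, 754, 3666, 1366, 2065, 376, 3666],
})

phrase_map = {
    'Give': 'Give_Show_Direct_Navigate',
    'Show': 'Give_Show_Direct_Navigate',
    'Direct': 'Give_Show_Direct_Navigate',
    'Navigate': 'Give_Show_Direct_Navigate',
    # Assumption: fold the lowercase variant into 'Directions'.
    'directions': 'Directions',
}

# map() yields NaN for words missing from the dict;
# fillna() puts the original word back in those rows.
df['Words'] = df['Words'].map(phrase_map).fillna(df['Words'])

result = df.groupby('Words', as_index=False)['Frequency'].sum()
print(result)
```

With this mapping the "Starbucks" row keeps its name; adding the `coffee_shop` entries from `phrase_map` above would rename it in the same way.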

Answer 1 (score: 1)

IIUC, lowercasing the utterances before splitting merges "Directions" and "directions":

(df.set_index('Frequency')['Utterance'].str.lower()
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
        )

Output:

        Words  Frequency
0      direct        201
1  directions       2411
2     display        376
3        give        612
4          me       2065
5    navigate        678
6  navigation        376
7        show        754
8   starbucks       3666
9          to       3666
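The lowercasing step can be combined with a word-to-group mapping in a single pass over the raw utterances. The following is only a sketch: it assumes pandas 0.25 or later (for `DataFrame.explode`), and `verb_group` with its label is a hypothetical name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Utterance': ['Directions to Starbucks',
                  'Show me directions to Starbucks',
                  'Give me directions to Starbucks',
                  'Navigate me to Starbucks',
                  'Display navigation to Starbucks',
                  'Direct me to Starbucks',
                  'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

# Hypothetical grouping: every listed verb is counted under one shared label.
verb_group = dict.fromkeys(['give', 'show', 'direct', 'navigate'],
                           'give_show_direct_navigate')

# One row per (word, utterance frequency) pair.
words = (df.assign(Words=df['Utterance'].str.lower().str.split())
           .explode('Words'))

# Replace grouped words with their label; leave all other words unchanged.
words['Words'] = words['Words'].map(lambda w: verb_group.get(w, w))

out = words.groupby('Words', as_index=False)['Frequency'].sum()
print(out)
```

Mapping before the `groupby` means the grouped total is computed in the same aggregation as the individual words, so no second pass over the result is needed.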

Answer 2 (score: 1)

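Here is one approach, sticking with collections.Counter from your previous question.

You can add any number of tuples to lst to append further results for the combinations of your choice.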

from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns = ['Utterance', 'Frequency'])

c = Counter()

for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
                  .rename(columns={0: 'Count'})\
                  .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('Give', 'Show', 'Navigate', 'Direct')]

res = add_combinations(res, lst)

Result

                           Count
to                          3666
Starbucks                   3666
Give_Show_Navigate_Direct   2245
me                          2065
directions                  1366
Directions                  1045
Show                         754
Navigate                     678
Give                         612
Display                      376
navigation                   376
Direct                       201
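The result above appends the combined row while keeping the individual "Give", "Show", "Navigate" and "Direct" rows. If the group should replace its members instead, as in the output requested in the question, one option is to apply the mapping while counting. This is a sketch under that assumption; `group_of` is a hypothetical name, and an utterance containing two grouped words would be counted twice under the label.

```python
from collections import Counter

utterances = [
    ('Directions to Starbucks', 1045),
    ('Show me directions to Starbucks', 754),
    ('Give me directions to Starbucks', 612),
    ('Navigate me to Starbucks', 498),
    ('Display navigation to Starbucks', 376),
    ('Direct me to Starbucks', 201),
    ('Navigate to Starbucks', 180),
]

# Hypothetical mapping from a word to the label of the group it belongs to.
group_of = {w: 'Give_Show_Navigate_Direct'
            for w in ('Give', 'Show', 'Navigate', 'Direct')}

c = Counter()
for text, freq in utterances:
    for word in text.split():
        # Count grouped words under their label, all other words as themselves.
        c[group_of.get(word, word)] += freq

for word, count in c.most_common():
    print(word, count)
```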

Answer 3 (score: 1)

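My solution loops over each word, so if you plan to look for many more words you should switch to an NLP library such as spaCy or NLTK, which should have functionality for counting word occurrences.

But here is my solution (A is assumed to be the dataframe, with the utterances in a column named 'Phrase'):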

lst = ['Directions','Give','Show','Direct','Navigate','Display','Starbucks','me','navigation','to']
for word in lst:
    A[word +'_score'] = A['Phrase'].str.contains(word).astype(int)*A['Frequency'].astype(int)

A.iloc[:,2:].sum()

This gives:

Directions_score    1045
Give_score           612
Show_score           754
Direct_score        1246
Navigate_score       678
Display_score        376
Starbucks_score     3666
me_score            2065
navigation_score     376
to_score            3666
dtype: int64

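You only need to sum the columns to get the number of occurrences. Note that str.contains matches substrings, so Direct_score (1246) also includes the utterance containing "Directions" (1045 + 201).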
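If whole-word matches are wanted, one option is to wrap each word in regex word boundaries. The sketch below is not part of the answer: it rebuilds `A` from the question's data, uses a shortened word list, and collects the totals in a plain dict named `scores` (a hypothetical name) rather than in extra columns.

```python
import re
import pandas as pd

A = pd.DataFrame({
    'Phrase': ['Directions to Starbucks',
               'Show me directions to Starbucks',
               'Give me directions to Starbucks',
               'Navigate me to Starbucks',
               'Display navigation to Starbucks',
               'Direct me to Starbucks',
               'Navigate to Starbucks'],
    'Frequency': [1045, 754, 612, 498, 376, 201, 180],
})

words = ['Directions', 'Direct', 'me', 'to']

scores = {}
for word in words:
    # \b anchors the match at word boundaries, so 'Direct' no longer
    # matches inside 'Directions'. The match stays case-sensitive.
    pattern = r'\b' + re.escape(word) + r'\b'
    mask = A['Phrase'].str.contains(pattern, regex=True)
    scores[word] = int(A.loc[mask, 'Frequency'].sum())

print(scores)
```

Because the match is case-sensitive, "Directions" here counts only the capitalised form; lowercasing the phrases first would merge the two variants.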