Getting token frequencies by group in a pandas df

Time: 2018-03-19 19:48:23

Tags: python pandas

I have a pandas DF that contains tokenized lists of Reddit comments. I want to group by the 'subreddit' column and get a list of the most frequent tokens in the 'tokenized_text' column. Here is what the data looks like:

list(df['tokenized_text'].groupby(df['subreddit']))[25:30]

This produces the following output:

[('15SecondStories',
  745124     [honestly, happened, write, fucking, complaint...
  997789                    [im, sorry, man, first, one, sure]
  1013206                       [little, bit, stupid, deadass]
  1177475                                                [lol]
  1179558    [native, spanish, speaker, school, taught, muc...
  1184372                     [format, incorrect, please, fix]
  1396579    [read, rules, posting, along, announcements, p...
  1859785                                                [lol]
  Name: tokenized_text, dtype: object),
 ('181920', 360480    [pretty, great, body]
  Name: tokenized_text, dtype: object),
 ('182637777', 1628100               [username, created, months, christmas]
  1632561    [approximate, value, mass, ratio, real, value,...
  1634853                                               [http]
  1665160                                           [hiw, whi]
  Name: tokenized_text, dtype: object),

I want to aggregate by subreddit and get a frequency dictionary of the most common words for each subreddit. The output I am hoping for is a pandas df with one column holding the subreddit name and another holding the dictionary of most frequent words (like the one FreqDist generates).

I have tried df['tokenized_text'].groupby(df['subreddit'].progress_apply(lambda x: nltk.FreqDist(y) for y in x), but I cannot get it to work.
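As written, that snippet has unbalanced parentheses and hands apply a generator expression where a callable is expected, so it fails before doing any grouping. A minimal sketch of the idea it seems to be reaching for, assuming nltk is available and every row of tokenized_text is a list of strings:

import nltk
import pandas as pd

# Build one FreqDist per subreddit by iterating the groups and
# flattening each group's token lists into a single token stream
freqs = pd.Series({
    name: nltk.FreqDist(tok for wordlist in group for tok in wordlist)
    for name, group in df.groupby('subreddit')['tokenized_text']
})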

有什么想法吗?

1 Answer:

Answer 0 (score: 1)

If df is structured the way I think it is, this should do it for you:

df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

Runnable example with simulated data

import nltk
import pandas as pd

# Simulated data
df = pd.DataFrame({'subreddit': ['news', 'news', 'art'],
                   'tokenized_text': [['some', 'ex', 'words', 'ex'],
                                      ['news', 'news', 'and', 'more', 'news'],
                                      ['draw', 'paint', 'up', 'up', 'down']]})
df
  subreddit                 tokenized_text
0      news          [some, ex, words, ex]
1      news  [news, news, and, more, news]
2       art    [draw, paint, up, up, down]


# Get pandas to print wider-than-usual columns, up to 800 characters
pd.set_option('display.max_colwidth', 800)

# Group by subreddit and aggregate lists (this likely does not scale well to larger data)
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
                                                             tokenized_text
subreddit
art                             {'draw': 1, 'paint': 1, 'up': 2, 'down': 1}
news       {'some': 1, 'ex': 2, 'words': 1, 'news': 3, 'and': 1, 'more': 1}
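Each cell of that result is an nltk.FreqDist, which subclasses collections.Counter, so the top-n tokens per subreddit fall out of most_common(n) directly. A short sketch along those lines (the freq_df name and the choice of n=2 are illustrative, and tie order can vary):

# FreqDist inherits most_common(n) from collections.Counter:
# it returns the n highest-count (token, count) pairs
freq_df = df.groupby('subreddit').agg(
    lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
freq_df['tokenized_text'].apply(lambda fd: fd.most_common(2))
# subreddit
# art          [('up', 2), ('draw', 1)]
# news        [('news', 3), ('ex', 2)]
# Name: tokenized_text, dtype: object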

Expanding the dictionaries into DataFrame columns

df2 = df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

# Method 1: repeated use of the pd.Series() constructor

df2['tokenized_text'].apply(pd.Series).fillna(0).astype(int)
           and  down  draw  ex  more  news  paint  some  up  words
subreddit
art          0     1     1   0     0     0      1     0   2      0
news         1     0     0   2     1     3      0     1   0      1

# Method 2: pd.DataFrame() + df[col].tolist()

pd.DataFrame(df2['tokenized_text'].tolist(), index=df2.index).fillna(0).astype(int)
           and  down  draw  ex  more  news  paint  some  up  words
subreddit
art          0     1     1   0     0     0      1     0   2      0
news         1     0     0   2     1     3      0     1   0      1
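Of the two, Method 2 is usually the quicker on larger frames, because apply(pd.Series) constructs a brand-new Series object for every row. If relative rather than raw frequencies are wanted, the same wide count matrix can be normalized row-wise; a brief sketch (the wide name is illustrative, not part of the answer above):

# Divide each row by its total token count so frequencies sum to 1.0
wide = pd.DataFrame(df2['tokenized_text'].tolist(), index=df2.index).fillna(0).astype(int)
wide.div(wide.sum(axis=1), axis=0).round(2)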