I have a Pandas DF containing tokenized lists of Reddit comments. I want to group by the 'subreddit' column and get a list of the most common tokens in the 'tokenized_text' column. Here is what the data looks like:
list(df['tokenized_text'].groupby(df['subreddit']))[25:30]
produces this output:
[('15SecondStories',
745124 [honestly, happened, write, fucking, complaint...
997789 [im, sorry, man, first, one, sure]
1013206 [little, bit, stupid, deadass]
1177475 [lol]
1179558 [native, spanish, speaker, school, taught, muc...
1184372 [format, incorrect, please, fix]
1396579 [read, rules, posting, along, announcements, p...
1859785 [lol]
Name: tokenized_text, dtype: object),
('181920', 360480 [pretty, great, body]
Name: tokenized_text, dtype: object),
('182637777', 1628100 [username, created, months, christmas]
1632561 [approximate, value, mass, ratio, real, value,...
1634853 [http]
1665160 [hiw, whi]
Name: tokenized_text, dtype: object),
I want to aggregate by subreddit and get the frequencies of the most common words for each subreddit. The output I'd like is a pandas df with one column holding the subreddit name and another holding a dictionary of the most common words (like the one nltk's FreqDist generates).
I've tried

df['tokenized_text'].groupby(df['subreddit'].progress_apply(lambda x: nltk.FreqDist(y) for y in x)

but can't get it to work.
Any ideas?
Answer 0 (score: 1)
If df is structured the way I think it is, then this should get you there (the inner list comprehension flattens each group's token lists into a single stream of words before counting):
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
import nltk
import pandas as pd

# Simulated data
df = pd.DataFrame({'subreddit': ['news', 'news', 'art'],
                   'tokenized_text': [['some', 'ex', 'words', 'ex'],
                                      ['news', 'news', 'and', 'more', 'news'],
                                      ['draw', 'paint', 'up', 'up', 'down']]})
df
subreddit tokenized_text
0 news [some, ex, words, ex]
1 news [news, news, and, more, news]
2 art [draw, paint, up, up, down]
# Get pandas to print wider-than-usual columns, up to 800 characters
pd.set_option('display.max_colwidth', 800)
# Group by subreddit and aggregate lists (this likely does not scale well to larger data)
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
tokenized_text
subreddit
art {'draw': 1, 'paint': 1, 'up': 2, 'down': 1}
news {'some': 1, 'ex': 2, 'words': 1, 'news': 3, 'and': 1, 'more': 1}
If you'd rather have one column per word instead of a dictionary, save the aggregated frame and expand the dictionaries into columns:

df2 = df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
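Note that each cell in df2['tokenized_text'] is a real nltk.FreqDist (a Counter subclass), so its methods still work. A quick sanity check on the simulated data above (the lookup assumes the 'news' row exists):

df2.loc['news', 'tokenized_text'].most_common(2)
# [('news', 3), ('ex', 2)]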
# Method 1: repeated use of the pd.Series() constructor
df2['tokenized_text'].apply(pd.Series).fillna(0).astype(int)
and down draw ex more news paint some up words
subreddit
art 0 1 1 0 0 0 1 0 2 0
news 1 0 0 2 1 3 0 1 0 1
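Method 1 is concise, but apply(pd.Series) builds a new Series for every row, which can get slow on large frames; Method 2 below builds the whole frame from a list of dicts in one shot and is usually much faster.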
# Method 2: pd.DataFrame() + df[col].tolist()
pd.DataFrame(df2['tokenized_text'].tolist(), index=df2.index).fillna(0).astype(int)
and down draw ex more news paint some up words
subreddit
art 0 1 1 0 0 0 1 0 2 0
news 1 0 0 2 1 3 0 1 0 1
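Finally, since you asked for the most common words specifically: if you only want the top few tokens per subreddit rather than the full distribution, FreqDist.most_common can trim each dictionary. A minimal sketch (keeping 3 tokens is an arbitrary choice):

# Keep only each subreddit's 3 most frequent tokens as a plain dict
df2['tokenized_text'].apply(lambda fd: dict(fd.most_common(3)))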