I am trying to group the 10 most common words by category. I have seen this answer, but I can't quite modify it to get the output I want.
category | sentence
A          cat runs over big dog
A          dog runs over big cat
B          random sentences include words
C          including this one
Desired output:
category | word/frequency
A          runs: 2
           cat: 2
           dog: 2
           over: 2
           big: 2
B          random: 1
C          including: 1
Since my dataframe is very large, I only want to get the top 10 most frequent words. I have also seen this answer:
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
but this method also returns counts of individual letters.
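For reference, the snippets below assume a DataFrame built from the sample table above (a minimal reconstruction; column names are taken from the table):

import pandas as pd

# Minimal reconstruction of the sample data from the question.
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'C'],
    'sentence': ['cat runs over big dog',
                 'dog runs over big cat',
                 'random sentences include words',
                 'including this one'],
})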
Answer 0 (score: 2)
You can join the rows within each group, tokenize the combined text, and apply FreqDist:
df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
Out:
category
A  big          2.0
   cat          2.0
   dog          2.0
   over         2.0
   runs         2.0
B  include      1.0
   random       1.0
   sentences    1.0
   words        1.0
C  including    1.0
   one          1.0
   this         1.0
Name: sentence, dtype: float64
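If only the 10 most frequent words per category are needed, one option (a sketch, assuming the MultiIndexed Series shown above) is to sort the counts and keep the head of each group:

import nltk

# freqs has a (category, word) MultiIndex; sort descending, then take the
# first 10 rows of each category group. word_tokenize needs the NLTK
# 'punkt' tokenizer data to be installed.
freqs = df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
top10 = freqs.sort_values(ascending=False).groupby(level=0).head(10)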
Answer 1 (score: 1)
import pandas as pd

# Split each sentence into separate columns, one word per column
df1 = pd.DataFrame(df.sentence.str.split(' ').tolist())
# Add the category back, since it is not kept by the split
df1['category'] = df['category']
# Melt the word columns so there is one row per (category, word) pair
df1 = pd.melt(df1, id_vars='category', value_vars=df1.columns[:-1].tolist())
# Group by category and word, and count (reset_index creates a column named 0)
df1 = df1.groupby(['category', 'value']).size().reset_index()
# Keep the 10 largest counts
df1 = df1.nlargest(10, 0)
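On pandas 0.25 or newer, the split-and-melt steps above can be collapsed with DataFrame.explode; a rough equivalent sketch:

# One row per (category, word) via explode, then count and keep the top 10.
counts = (df.assign(word=df['sentence'].str.split())
            .explode('word')
            .groupby(['category', 'word'])
            .size()
            .reset_index(name='count'))
top10 = counts.nlargest(10, 'count')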
Answer 2 (score: 1)
Do you want to filter by the frequency of the most common words, along the lines of the following (in this case, the 2 most common words per category)?
from collections import Counter
df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
category
A          [('cat', 2), ('runs', 2)]
B    [('random', 1), ('sentences', 1)]
C      [('including', 1), ('this', 1)]
Name: sentence, dtype: object
Performance-wise:
%timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
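To produce the desired "word: count" layout with the top 10 words, the same Counter approach can be flattened into one row per word (a hypothetical adaptation, using most_common(10)):

from collections import Counter
import pandas as pd

top = df.groupby('category')['sentence'].apply(lambda x: Counter(' '.join(x).split()).most_common(10))
# Explode the lists of (word, count) pairs into one tuple per row, then
# split each tuple into separate word and count columns.
flat = top.explode()
result = pd.DataFrame(flat.tolist(), index=flat.index, columns=['word', 'count'])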