I have a dataset like this:
Type Word
0 N Work
1 N Rock
2 N Rock
3 Adj Rock
4 V Rock
5 N Work
6 V Work
7 V Rock
8 Adj Like
9 N Rock
10 V Love
11 V Like
12 V Rock
13 Adj Blue
14 Adv Work
I want to count the occurrences of each word and get the top 2 types for each word. My expected result is:
Word Top Count
0 Rock N, V 7
1 Work N, Adv 4
2 Like Adj, V 2
3 Blue Adj 1
4 Love V 1
I wrote a few lines of code and got the result I expected. Here is my code:
In [1]:
import pandas as pd
df = pd.DataFrame([
['N','Work'],
['N','Rock'],
['N','Rock'],
['Adj','Rock'],
['V','Rock'],
['N','Work'],
['V','Work'],
['V','Rock'],
['Adj','Like'],
['N','Rock'],
['V','Love'],
['V','Like'],
['V','Rock'],
['Adj','Blue'],
['Adv','Work']], columns=['Type', 'Word'])
In [2]: #Group by column "Word","Type" and count number of each pair
df = df.groupby(["Type", "Word"])["Type"].count().reset_index(name="Count")
In [3]:
df
Type Word Count
0 Adj Blue 1
1 Adj Like 1
2 Adj Rock 1
3 Adv Work 1
4 N Rock 3
5 N Work 2
6 V Like 1
7 V Love 1
8 V Rock 3
9 V Work 1
In [4]: #Group by "Word" and sort by "Count" in each group, get top 2
df1 = df.sort_values(["Word","Count"], ascending=False).groupby("Word").head(2)
df1
Type Word Count
5 N Work 2
3 Adv Work 1
4 N Rock 3
8 V Rock 3
7 V Love 1
1 Adj Like 1
6 V Like 1
0 Adj Blue 1
In [5]: #Groupby "Word" and union "Type" in each group
df1 = df1.groupby('Word')['Type'].apply(', '.join).reset_index(name='Top')
df1
Word Top
0 Blue Adj
1 Like Adj, V
2 Love V
3 Rock N, V
4 Work N, Adv
In [6]: #Compute number of each word, save to a new dataframe
df_sum = df.groupby('Word')['Count'].sum().reset_index()  # sum only "Count"; the string column "Type" is left out
df_sum
Word Count
0 Blue 1
1 Like 2
2 Love 1
3 Rock 7
4 Work 4
In [7]: #Merge to dataframe containing number of each word
df1 = df1.merge(df_sum).sort_values("Count", ascending=False)  # assign the result; merge does not modify df1 in place
df1
Word Top Count
3 Rock N, V 7
4 Work N, Adv 4
1 Like Adj, V 2
0 Blue Adj 1
2 Love V 1
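However, this code does not seem optimal. I use groupby several times and call sort_values twice. If the dataset is really large, that will be a problem. Can you optimize it?
Thanks.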
Answer 0 (score: 2)
df.groupby('Word').agg(dict(
    # the two most frequent types per word, joined into one string
    Type=lambda x: ', '.join(x.value_counts().index[:2]),
    # number of rows per word
    Word='size'
)).rename(columns=dict(Word='Count')).reset_index().sort_values('Count')
Word Type Count
0 Blue Adj 1
2 Love V 1
1 Like V, Adj 2
4 Work N, V 4
3 Rock N, V 7
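A possible variant of the same idea, written with named aggregation so that the grouping column itself does not have to be aggregated and renamed. This is only a sketch and not part of the original answer: the helper name `top_types` and the output column names `Top` and `Count` are assumptions chosen to match the question. When two types are tied (for example V and Adv for Work), which one appears depends on how `value_counts` orders ties, so the result may differ from the question's expected `N, Adv`.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    {
        "Type": ["N", "N", "N", "Adj", "V", "N", "V", "V",
                 "Adj", "N", "V", "V", "V", "Adj", "Adv"],
        "Word": ["Work", "Rock", "Rock", "Rock", "Rock", "Work", "Work", "Rock",
                 "Like", "Rock", "Love", "Like", "Rock", "Blue", "Work"],
    }
)


def top_types(types: pd.Series, n: int = 2) -> str:
    """Join the n most frequent values of one group's Type column (hypothetical helper)."""
    return ", ".join(types.value_counts().index[:n])


summary = (
    df.groupby("Word")
    # One groupby, two aggregations over the same column.
    .agg(Top=("Type", top_types), Count=("Type", "size"))
    .reset_index()
    .sort_values("Count", ascending=False)
)
print(summary)
```

This keeps a single groupby and a single sort, which is what the question asks for; the per-group lambda still runs in Python, so it is not fully vectorized.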
Answer 1 (score: 0)
You can use agg together with Counter to get the most common types, and len to count how many times each word occurs.
import pandas as pd
from collections import Counter
group_df = df.groupby('Word')
df_summary = group_df.agg(
lambda x: {'Type': [', '.join([e[0] for e in Counter(x.Type).most_common(2)]), len(x)]}
)
df_out = df_summary.Type.apply(pd.Series).reset_index().rename(columns={0: 'Top', 1: 'count'})
df_out.sort_values('count', ascending=False) # output
This outputs the following dataframe:
Word Top count
3 Rock N, V 7
4 Work N, V 4
1 Like Adj, V 2
0 Blue Adj 1
2 Love V 1
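The Counter idea can also be sketched with named aggregation, which may be easier to run on recent pandas versions. This is an assumption-laden sketch rather than the answer's own code: the helper name `most_common_types` is hypothetical, and the column names `Top` and `count` simply mirror the output above. `Counter.most_common` orders elements with equal counts by first occurrence, so ties are resolved by row order within each word.

```python
from collections import Counter

import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    {
        "Type": ["N", "N", "N", "Adj", "V", "N", "V", "V",
                 "Adj", "N", "V", "V", "V", "Adj", "Adv"],
        "Word": ["Work", "Rock", "Rock", "Rock", "Rock", "Work", "Work", "Rock",
                 "Like", "Rock", "Love", "Like", "Rock", "Blue", "Work"],
    }
)


def most_common_types(types: pd.Series, n: int = 2) -> str:
    """Join the n most common types of one group (hypothetical helper)."""
    return ", ".join(t for t, _ in Counter(types).most_common(n))


df_out = (
    df.groupby("Word")
    # "Top" uses Counter; "count" is the number of rows per word.
    .agg(Top=("Type", most_common_types), count=("Type", "size"))
    .reset_index()
    .sort_values("count", ascending=False)
)
print(df_out)
```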