Question

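I'm working with pandas and trying to summarize a data frame as a count per category, together with the mean sentiment score for each category.

I have a table of strings, each with a sentiment score, and I want to group by text source and report how many posts each source has as well as the average sentiment of those posts.

My (simplified) data frame looks like this: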

source    text              sent
--------------------------------
bar       some string       0.13
foo       alt string        -0.8
bar       another str       0.7
foo       some text         -0.2
foo       more text         -0.5

The output should look something like this:

source    count     mean_sent
-----------------------------
foo       3         -0.5
bar       2         0.415

I think the answer is somewhere along the lines of:

df['sent'].groupby(df['source']).mean()

However, this only gives each source and its mean, with no column headers.

Thanks in advance!

Answer 1

You can use groupby with aggregate:

df = df.groupby('source') \
       .agg({'text':'size', 'sent':'mean'}) \
       .rename(columns={'text':'count','sent':'mean_sent'}) \
       .reset_index()
print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500

Answer 2

In newer versions of pandas (0.25 and later), you no longer need the rename if you use named aggregation:

df = df.groupby('source') \
       .agg(count=('text', 'size'), mean_sent=('sent', 'mean')) \
       .reset_index()

print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
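
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same named aggregation. The variable name `summary` and the final sort are my own additions; the sort is only there because the desired output in the question lists the source with the most posts first.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "text": ["some string", "alt string", "another str", "some text", "more text"],
    "sent": [0.13, -0.8, 0.7, -0.2, -0.5],
})

summary = (
    df.groupby("source")
      .agg(count=("text", "size"), mean_sent=("sent", "mean"))
      .reset_index()
      # Assumption: order by post count, largest first, as in the question's table
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)
print(summary)
```

Without the sort, groupby returns the groups in sorted key order (bar before foo), which is what the printed output above shows.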

Answer 3

The following should work fine:

df[['source', 'sent']].groupby('source').agg(['count', 'mean'])
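
Two things are worth knowing about this form, sketched below under my own assumptions. First, passing a list of functions produces two-level column labels such as ('sent', 'count'), so you may want to flatten them. Second, 'count' counts only non-missing values, whereas 'size' counts rows. To make the difference visible I have replaced one score in the sample data with NaN; the names `out`, `flat` and `sizes` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    # One 'bar' score is deliberately set to NaN for this illustration
    "sent": [0.13, -0.8, np.nan, -0.2, -0.5],
})

out = df[["source", "sent"]].groupby("source").agg(["count", "mean"])
# out.columns is a MultiIndex: ('sent', 'count') and ('sent', 'mean')

# Select the 'sent' level to drop the outer label, then rename and reset
flat = out["sent"].rename(columns={"mean": "mean_sent"}).reset_index()
print(flat)

# 'size' counts rows per group, including the row with the missing score
sizes = df.groupby("source").size()
print(sizes)
```

If the goal is the number of posts per source regardless of whether a score is present, 'size' is probably the safer choice.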

Answer 4

A shorter version to achieve this is:

df.groupby('source')['sent'].agg(count='size', mean_sent='mean').reset_index()

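The nice thing about this is that it extends to the case where you want the mean of several variables but the count only once. Note that current pandas versions reject a dictionary of new names such as {'count': 'size', 'means': 'mean'}, and also reject selecting several columns as ['sent1', 'sent2'] without double brackets. Instead, use one named aggregation per output column:

df.groupby('source').agg(count=('sent1', 'size'), mean_sent1=('sent1', 'mean'), mean_sent2=('sent2', 'mean')).reset_index()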
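
When there are many score columns, one possible way to take all the means at once and attach a single count is sketched below. The values in `sent1` and `sent2` are made up for illustration, and the `mean_` prefix and the name `result` are my own choices.

```python
import pandas as pd

# Hypothetical data with two score columns
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "sent1": [0.1, -0.8, 0.7, -0.2, -0.5],
    "sent2": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df.groupby("source")

# Mean of every selected column; note the list (double brackets) for selection
means = grouped[["sent1", "sent2"]].mean().add_prefix("mean_")

# The group sizes share the same 'source' index, so they align on insert
means.insert(0, "count", grouped.size())

result = means.reset_index()
print(result)
```

This avoids listing each column by hand, at the cost of a second pass over the groups for the count.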

Answer 5

I think this should provide the output you want:

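# Build the frame from the group sizes, giving the column an explicit name
result = pd.DataFrame({'count': df.groupby('source').size()})

# Use the same variable name as above; the index (source) aligns automatically
result['mean_score'] = df.groupby('source')['sent'].mean()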
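
The same idea of computing the two pieces separately can also be written with pd.concat, which lines the Series up on their shared group index. This is only a sketch using the question's sample data; the names `grouped` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "text": ["some string", "alt string", "another str", "some text", "more text"],
    "sent": [0.13, -0.8, 0.7, -0.2, -0.5],
})

grouped = df.groupby("source")["sent"]

# Each Series is indexed by source, so concat along axis=1 matches them up
result = pd.concat(
    [grouped.size().rename("count"), grouped.mean().rename("mean_score")],
    axis=1,
).reset_index()
print(result)
```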

Pandas Groupby: combining count and mean
