Question

I'm new to pandas. I need to aggregate rows that share the same 'Name', then average 'Rating' and 'NumsHelpful' (ignoring NaN). 'Review' should be concatenated, and 'Weight(Pounds)' should stay unchanged:

col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']

Name             'Brand'                             'Name'
1534             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1535             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1536             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1537             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1538             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1539             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1540             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   

        'NumsHelpful'     'Rating'       'Weight'
1534          NaN            2              4.5   
1535          NaN            2              4.5   
1536          NaN            NaN            4.5   
1537          NaN            NaN            4.5   
1538          2              NaN            4.5   
1539          3              5              4.5   
1540          5              NaN            4.5   

                        'Review'
1534                                     Yummy - Delish  
1535  The best Bloody Mary mix! - The best Bloody Ma...  
1536  Best Taste by far - I've tried several if not ...  
1537  Best bloody mary mix ever - This is also good ...  
1538  Outstanding - Has a small kick to it but very ...  
1539   OMG! So Good! - Spicy, terrific Bloody Mary mix!  
1540                      Good stuff - This is the best

So the output should look like this:

 'Brand'                'Name'                   'NumsHelpful'    'Rating' 
Zing Zang    Zing Zang Bloody Mary Mix, 32 fl oz     3.33             3

 'Weight'               'Review'
   4.5      Review1 / Review2 / ... / ReviewN

How should I go about this? Thanks.

Answer 1

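You can use groupby + agg with a dictionary that maps each column to an aggregation function, so that every column is aggregated in its own way. For example: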

d = {'Rating': 'mean',
     'NumsHelpful': 'mean',
     'Review': ' | '.join,
     'Weight(Pounds)': 'first'}

res = df.groupby('Name').agg(d)
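
The dictionary approach above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: the frame is rebuilt from the sample rows in the question, the review strings are shortened placeholders, and the names `agg_map` and `res` are illustrative. It relies on `mean` skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Sample rows modeled on the question; review texts are shortened placeholders.
df = pd.DataFrame({
    "Brand": ["Zing Zang"] * 7,
    "Name": ["Zing Zang Bloody Mary Mix, 32 fl oz"] * 7,
    "NumsHelpful": [np.nan, np.nan, np.nan, np.nan, 2, 3, 5],
    "Rating": [2, 2, np.nan, np.nan, np.nan, 5, np.nan],
    "Weight(Pounds)": [4.5] * 7,
    "Review": ["Yummy", "The best", "Best taste", "Best ever",
               "Outstanding", "OMG", "Good stuff"],
})

# One aggregation per column; 'mean' ignores NaN by default.
agg_map = {
    "Brand": "first",
    "NumsHelpful": "mean",
    "Rating": "mean",
    "Weight(Pounds)": "first",
    "Review": " / ".join,   # assumes every review is a string (no NaN)
}

# as_index=False keeps 'Name' as a regular column in the result.
res = df.groupby("Name", as_index=False).agg(agg_map)
print(res)
```

Note that a plain `str.join` fails if the 'Review' column contains NaN; in that case the missing reviews would have to be dropped or filled first.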

Answer 2

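Use DataFrameGroupBy.agg with a dictionary of columns and aggregation functions. The Weight and Brand columns are aggregated with first, which takes the first value of each group: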

d = {'NumsHelpful':'mean', 
     'Review':'/'.join, 
     'Weight':'first',
     'Brand':'first', 
     'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print (df)
                                  Name  NumsHelpful  \
0  Zing Zang Bloody Mary Mix, 32 fl oz     3.333333   

                                              Review  Weight      Brand  \
0  Yummy - Delish/The best Bloody Mary mix! - The...     4.5  Zing Zang   

   Rating  
0     3.0

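In pandas 0.23.1 this also produces a warning:

FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version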

The solution is to remove the index name Name:

df.index.name = None

Or:

df = df.rename_axis(None)

Another possible solution is to add these columns to the groupby keys instead of aggregating them with first:

d = {'NumsHelpful':'mean',  'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()

If the values are identical within each group, both solutions return the same output.
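
To see where the two solutions diverge, here is a small sketch with a made-up frame (the data and the names `by_first` and `by_keys` are assumptions for illustration) in which Weight is not constant within a group:

```python
import pandas as pd

# Hypothetical data: one group whose Weight values are not all the same.
df = pd.DataFrame({
    "Name": ["A", "A", "A"],
    "Weight": [4.5, 4.5, 5.0],
    "Rating": [2.0, 4.0, 5.0],
})

# Option 1: aggregate Weight with 'first' -> always one row per Name.
by_first = (df.groupby("Name")
              .agg({"Weight": "first", "Rating": "mean"})
              .reset_index())

# Option 2: use Weight as a grouping key -> one row per (Name, Weight) pair.
by_keys = (df.groupby(["Name", "Weight"])
             .agg({"Rating": "mean"})
             .reset_index())

print(by_first)
print(by_keys)
```

With `first`, the inconsistent value is silently absorbed into a single row; with the extra grouping key, it shows up as a separate row, which can be a useful check on data quality.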

EDIT:

If you need to convert a string (object) column to numeric, first try converting it with astype:

df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)

If that fails, use to_numeric with the parameter errors='coerce' to convert unparseable strings to NaN:

df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
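
The difference between the two conversions can be sketched on a small made-up Series (the values and the names `raw`, `weights`, and `astype_failed` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical weight column read as strings, with one unparseable entry.
raw = pd.Series(["4.5", "12", "n/a"])

# astype(float) raises as soon as it meets a string it cannot parse.
try:
    raw.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' turns unparseable strings into NaN instead.
weights = pd.to_numeric(raw, errors="coerce")
print(astype_failed)
print(weights)
```

Because the coerced values become NaN, a later 'mean' aggregation skips them rather than failing.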

Answer 3

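I have seen this happen because, when the index was created, the columns were also kept in the table (drop=False). Normally a column that becomes the index is removed from the table, so do the following: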

# dataset_A had its index created with the option drop=False
df_dataset_new = dataset_A.copy()
index_df = ['month', 'scop']

# df_dataset_new gets its index recreated with the option drop=True
df_dataset_new.set_index(index_df, drop=True, inplace=True, verify_integrity=True)
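
As a self-contained sketch of the effect of drop, using a small made-up frame (the data and the names `kept`, `dropped`, and `totals` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with the same key names as in the snippet above.
df = pd.DataFrame({
    "month": [1, 1, 2],
    "scop": ["a", "b", "a"],
    "value": [10, 20, 30],
})

# drop=False: 'month' and 'scop' exist both as index levels and as columns,
# which is the ambiguous situation the warning is about.
kept = df.set_index(["month", "scop"], drop=False)

# drop=True: the key columns move into the index and leave the table.
dropped = df.set_index(["month", "scop"], drop=True, verify_integrity=True)

# Grouping by the index level name is now unambiguous.
totals = dropped.groupby("month")["value"].sum()

print(list(kept.columns))
print(list(dropped.columns))
print(totals)
```

Here verify_integrity=True additionally raises an error if the new index would contain duplicate (month, scop) pairs.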
