I'm new to pandas. I need to aggregate the rows that share the same 'Name', then average 'Rating' and 'NumsHelpful' (ignoring NaN). 'Review' should be concatenated, and 'Weight(Pounds)' should stay unchanged:
col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
Name 'Brand' 'Name'
1534 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1535 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1536 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1537 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1538 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1539 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1540 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
'NumsHelpful' 'Rating' 'Weight'
1534 NaN 2 4.5
1535 NaN 2 4.5
1536 NaN NaN 4.5
1537 NaN NaN 4.5
1538 2 NaN 4.5
1539 3 5 4.5
1540 5 NaN 4.5
'Review'
1534 Yummy - Delish
1535 The best Bloody Mary mix! - The best Bloody Ma...
1536 Best Taste by far - I've tried several if not ...
1537 Best bloody mary mix ever - This is also good ...
1538 Outstanding - Has a small kick to it but very ...
1539 OMG! So Good! - Spicy, terrific Bloody Mary mix!
1540 Good stuff - This is the best
So the output should look like this:
'Brand' 'Name' 'NumsHelpful' 'Rating'
Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz 3.33 3
'Weight' 'Review'
4.5 Review1 / Review2 / ... / ReviewN
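How should I go about this? Thanks.
Answer 0 (score: 2)
You can use groupby + agg with a dictionary that maps each column to a function, so every column is aggregated with its own function. For example: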
d = {'Rating': 'mean',
'NumsHelpful': 'mean',
'Review': ' | '.join,
'Weight(Pounds)': 'first'}
res = df.groupby('Name').agg(d)
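To see that 'mean' skips NaN, here is a minimal runnable sketch of the same dictionary on a small frame. The rows are a shortened, hypothetical stand-in for the question's data, not the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modelled on the question (reviews shortened).
name = "Zing Zang Bloody Mary Mix, 32 fl oz"
df = pd.DataFrame({
    "Brand": ["Zing Zang"] * 4,
    "Name": [name] * 4,
    "NumsHelpful": [np.nan, 2, 3, 5],
    "Rating": [2, 2, 5, np.nan],
    "Weight(Pounds)": [4.5] * 4,
    "Review": ["Yummy", "Outstanding", "OMG! So Good!", "Good stuff"],
})

# One aggregation per column; 'mean' ignores NaN by default.
d = {"Rating": "mean",
     "NumsHelpful": "mean",
     "Review": " | ".join,
     "Weight(Pounds)": "first"}
res = df.groupby("Name").agg(d)
print(res)
```

Here 'Rating' averages only the three non-missing values (2, 2, 5), and 'Review' becomes a single string with the reviews separated by " | ".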
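Answer 1 (score: 2)
Use DataFrameGroupBy.agg with a dictionary of columns and aggregation functions. The columns Weight and Brand are aggregated with first, which takes the first value of each group: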
d = {'NumsHelpful':'mean',
'Review':'/'.join,
'Weight':'first',
'Brand':'first',
'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print (df)
Name NumsHelpful \
0 Zing Zang Bloody Mary Mix, 32 fl oz 3.333333
Review Weight Brand \
0 Yummy - Delish/The best Bloody Mary mix! - The... 4.5 Zing Zang
Rating
0 3.0
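In pandas 0.23.1 this also produces:
FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version
The solution is to remove the index name Name: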
df.index.name = None
Or:
df = df.rename_axis(None)
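A minimal sketch of the clash and the fix, on a hypothetical three-row frame. The index name is cleared before grouping, because (as the warning itself announces) later pandas versions may raise an error instead of warning.

```python
import pandas as pd

# Hypothetical frame: the index is named 'Name' and a column is also called 'Name'.
df = pd.DataFrame({"Name": ["a", "a", "b"], "Rating": [2.0, 4.0, 5.0]})
df.index.name = "Name"

# Clear the index name; the data itself is unchanged.
df = df.rename_axis(None)

# 'Name' now refers unambiguously to the column.
out = df.groupby("Name")["Rating"].mean()
print(out)
```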
Another possible solution is to add these columns to groupby instead of aggregating them with first:
d = {'NumsHelpful':'mean', 'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()
Both solutions return the same output if the values are the same within each group.
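The equivalence can be checked on a small example. This is a sketch on hypothetical data in which Weight and Brand are constant within each Name; only the column order differs between the two results, so the columns are aligned before comparing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Weight and Brand are constant within each Name.
df = pd.DataFrame({
    "Name": ["mix", "mix", "mix", "tea"],
    "Brand": ["Zing Zang", "Zing Zang", "Zing Zang", "Acme"],
    "Weight": [4.5, 4.5, 4.5, 1.0],
    "NumsHelpful": [np.nan, 2, 4, 1],
    "Rating": [2, np.nan, 4, 5],
    "Review": ["Yummy", "Outstanding", "Good stuff", "Fine"],
})

# Option 1: group by Name only and keep the constant columns with 'first'.
d1 = {"NumsHelpful": "mean", "Review": "/".join,
      "Weight": "first", "Brand": "first", "Rating": "mean"}
a = df.groupby("Name").agg(d1).reset_index()

# Option 2: make the constant columns part of the group keys.
d2 = {"NumsHelpful": "mean", "Review": "/".join, "Rating": "mean"}
b = df.groupby(["Name", "Weight", "Brand"]).agg(d2).reset_index()

# Align the column order, then compare row by row.
same = a[b.columns].to_dict("records") == b.to_dict("records")
print(same)
```

One difference to keep in mind: by default groupby drops rows whose key is NaN, so a missing Weight or Brand would remove that row in the second variant but not in the first.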
EDIT:
If you need to convert a string (object) column to numeric, first try converting it with astype:
df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)
If that fails, use to_numeric with the parameter errors='coerce' to convert unparseable strings to NaN:
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
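A minimal sketch of the difference between the two conversions, using a hypothetical Series with one value that cannot be parsed:

```python
import pandas as pd

# Hypothetical string column with one unparseable entry.
s = pd.Series(["4.5", "3", "unknown"])

# astype(float) fails as soon as one value cannot be parsed.
try:
    s.astype(float)
    failed = False
except ValueError:
    failed = True

# to_numeric with errors='coerce' replaces the bad value with NaN instead.
converted = pd.to_numeric(s, errors="coerce")
print(failed)
print(converted)
```

Because the mean aggregation skips NaN, the coerced values are simply left out of the average.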
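Answer 2 (score: 0)
I have seen this happen because, when the index was created, the column was kept in the table. Normally a column that becomes the index is dropped from the table, so do the following: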
# dataset_A was created with the option # drop = False
df_dataset_new = dataset_A.copy()
index_df = ['month', 'scop']
# dataset_new will be created with the option # drop = True
df_dataset_new.set_index(index_df, drop=True, inplace=True, verify_integrity=True)
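The effect of the drop option can be seen on a small frame. This is a sketch; the column names month and scop follow the snippet above, while the data and the value column are hypothetical.

```python
import pandas as pd

# Hypothetical data using the column names from the snippet above.
dataset_A = pd.DataFrame({"month": [1, 2], "scop": ["x", "y"], "value": [10, 20]})

# drop=False: 'month' and 'scop' become index levels AND remain columns,
# which is what makes the labels ambiguous later on.
kept = dataset_A.set_index(["month", "scop"], drop=False)

# drop=True (the default): the columns are moved into the index.
dropped = dataset_A.set_index(["month", "scop"], drop=True, verify_integrity=True)

print(list(kept.columns))
print(list(dropped.columns))
```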