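I have been trying to do this for a while. The pandas max() finds the maximum over the entire column. What I need is the maximum of Val1 for each Id.
My input CSV file: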
Desired output:
Id Param1 Param2 Val1
1 -5.00138282776 2.04990620034e-08 1.738e-05
1 -4.80147838593 2.01516989762e-08 1.628e-05
1 -4.60159301758 1.98263165885e-08 1.671e-05
1 -4.40133094788 1.94918392538e-08 1.576e-05
1 -4.20143127441 1.91767686175e-08
2 -5.00141859055 6.88369405921e-09 5.512e-06
2 -4.80152130126 6.77335965093e-09 5.964e-06
2 -4.60163593292 6.65415056389e-09
3 -5.00138044357 1.16316911658e-08 4.008e-06
3 -4.80148792267 1.15515588206e-08 7.347e-06
3 -4.60160970681 1.14048361866e-08 8.446e-06
3 -4.40137386322 1.12357021465e-08
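I am not sure how to select/group the values in the Val1 column that share the same Id and then find their maximum. Also, there are some blanks in the Val1 column, which causes its dtype to be reported as object. I don't know how to handle that. Any help would be most welcome.
Answer 0 (score: 3)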
Use GroupBy.transform with max to add a new column holding the per-group maximum:
df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
print (df)
Id Param1 Param2 Val1 Max_Val1_for_each_Id
0 1 -5.001383 2.049906e-08 0.000017 0.000017
1 1 -4.801478 2.015170e-08 0.000016 0.000017
2 1 -4.601593 1.982632e-08 0.000017 0.000017
3 1 -4.401331 1.949184e-08 0.000016 0.000017
4 1 -4.201431 1.917677e-08 NaN 0.000017
5 2 -5.001419 6.883694e-09 0.000006 0.000006
6 2 -4.801521 6.773360e-09 0.000006 0.000006
7 2 -4.601636 6.654151e-09 NaN 0.000006
8 3 -5.001380 1.163169e-08 0.000004 0.000008
9 3 -4.801488 1.155156e-08 0.000007 0.000008
10 3 -4.601610 1.140484e-08 0.000008 0.000008
11 3 -4.401374 1.123570e-08 NaN 0.000008
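To make the difference between a plain per-group max and transform concrete, here is a minimal sketch. The small frame is made up for illustration (it only mimics the shape of the question's data), and the name per_group is an assumption introduced here.

```python
import pandas as pd

# Made-up frame mimicking the question's layout; None becomes NaN in a float column.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3, 3],
    "Val1": [1.738e-05, 1.628e-05, None, 5.512e-06, 5.964e-06, 4.008e-06, 8.446e-06],
})

# A plain groupby max returns one value per Id (indexed by Id).
per_group = df.groupby("Id")["Val1"].max()

# transform("max") broadcasts that value back to every row of its group,
# so the result aligns with the original index and can be assigned as a column.
df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")

print(per_group)
print(df)
```

Rows whose Val1 is NaN still receive the group maximum, because max skips NaN by default.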
Then, if the value is needed only on the first row of each group, add where with a mask created by duplicated and inverted with ~:
df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
print (df)
Id Param1 Param2 Val1 Max_Val1_for_each_Id
0 1 -5.001383 2.049906e-08 0.000017 0.000017
1 1 -4.801478 2.015170e-08 0.000016 NaN
2 1 -4.601593 1.982632e-08 0.000017 NaN
3 1 -4.401331 1.949184e-08 0.000016 NaN
4 1 -4.201431 1.917677e-08 NaN NaN
5 2 -5.001419 6.883694e-09 0.000006 0.000006
6 2 -4.801521 6.773360e-09 0.000006 NaN
7 2 -4.601636 6.654151e-09 NaN NaN
8 3 -5.001380 1.163169e-08 0.000004 0.000008
9 3 -4.801488 1.155156e-08 0.000007 NaN
10 3 -4.601610 1.140484e-08 0.000008 NaN
11 3 -4.401374 1.123570e-08 NaN NaN
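The mask itself may be easier to follow in isolation. The sketch below uses a made-up frame; the column names Max and Max_first_only and the variable first_row are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up frame: Max already holds the per-group maximum on every row.
df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2, 3],
    "Max": [7.0, 7.0, 5.0, 5.0, 5.0, 9.0],
})

# duplicated() is False for the first occurrence of each Id and True afterwards;
# ~ inverts it, so first_row is True only on the first row of each Id.
first_row = ~df["Id"].duplicated()

# where() keeps values where the mask is True and puts NaN everywhere else.
df["Max_first_only"] = df["Max"].where(first_row)

print(first_row.tolist())
print(df)
```

Because duplicated looks at the whole column, this marks the first occurrence of each Id even if rows with the same Id are not adjacent.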
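EDIT:
If the blanks in Val1 are not NaN values and the solution above raises an error:
TypeError: '>=' not supported between instances of 'float' and 'str'
then the first step is to convert the non-numeric values to NaN: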
df['Val1'] = pd.to_numeric(df['Val1'], errors='coerce')
df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
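As a rough, self-contained illustration of that conversion, the sketch below assumes the blanks arrived as whitespace strings, so the column mixes floats and str. The data is made up, and how the blanks actually look in the original CSV is not known.

```python
import pandas as pd

# Hypothetical frame: blanks are whitespace strings, so Val1 has dtype object.
df = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Val1": pd.Series([1.738e-05, " ", 5.512e-06, " "], dtype=object),
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# which leaves a plain float column that max can work on.
df["Val1"] = pd.to_numeric(df["Val1"], errors="coerce")

df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")

print(df.dtypes)
print(df)
```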
Answer 1 (score: 1)
A fun way with NumPy
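import numpy as np
import pandas as pd

f, u = pd.factorize(df.Id)                    # f: integer group code per row, u: unique Ids
out = np.zeros(len(u))                        # running max per group; starts at 0, so assumes Val1 >= 0
whr = np.ones(len(u), np.int64) * len(f)      # first row position per group, initialised past the end
mask = np.isnan(df.Val1)
np.maximum.at(out, f[~mask], df.Val1[~mask])  # unbuffered per-group max, skipping NaN rows
np.minimum.at(whr, f, np.arange(len(f)))      # smallest row number seen for each group
df.assign(Max_Val1_for_each_Id=pd.Series(out, df.index[whr]))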
Id Param1 Param2 Val1 Max_Val1_for_each_Id
0 1 -5.001383 2.049906e-08 0.000017 0.000017
1 1 -4.801478 2.015170e-08 0.000016 NaN
2 1 -4.601593 1.982632e-08 0.000017 NaN
3 1 -4.401331 1.949184e-08 0.000016 NaN
4 1 -4.201431 1.917677e-08 NaN NaN
5 2 -5.001419 6.883694e-09 0.000006 0.000006
6 2 -4.801521 6.773360e-09 0.000006 NaN
7 2 -4.601636 6.654151e-09 NaN NaN
8 3 -5.001380 1.163169e-08 0.000004 0.000008
9 3 -4.801488 1.155156e-08 0.000007 NaN
10 3 -4.601610 1.140484e-08 0.000008 NaN
11 3 -4.401374 1.123570e-08 NaN NaN
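For readers unfamiliar with ufunc.at, here is a stripped-down sketch of the same idea on plain arrays. The arrays codes and vals are made up, and starting from -inf instead of 0 is a small variation on the answer above, chosen so that negative values would also be handled.

```python
import numpy as np

codes = np.array([0, 0, 1, 1, 2])          # group code per row, like the first result of pd.factorize
vals = np.array([3.0, 7.0, 2.0, 1.0, 5.0])

# One slot per group, starting at -inf so any real value replaces it.
out = np.full(3, -np.inf)

# Unbuffered in-place maximum: every row i updates out[codes[i]],
# even when the same group code appears several times.
np.maximum.at(out, codes, vals)

# First row position of each group, via an unbuffered minimum over row numbers.
first = np.full(3, len(codes))
np.minimum.at(first, codes, np.arange(len(codes)))

print(out, first)
```

Here out ends up holding the per-group maxima and first the position of each group's first row, which is what the answer uses to place the maxima back into the frame.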