Python pandas | Find the maximum value only within specific parts of a column

Time: 2018-06-27 05:03:18

Tags: python python-2.7 pandas dataframe

I have been trying to do this for a while. The pandas max() method finds the maximum across the entire column. What I need is:

My input csv file:


Desired output:

Id  Param1          Param2              Val1
1  -5.00138282776   2.04990620034e-08   1.738e-05
1  -4.80147838593   2.01516989762e-08   1.628e-05
1  -4.60159301758   1.98263165885e-08   1.671e-05
1  -4.40133094788   1.94918392538e-08   1.576e-05
1  -4.20143127441   1.91767686175e-08   
2  -5.00141859055   6.88369405921e-09   5.512e-06
2  -4.80152130126   6.77335965093e-09   5.964e-06
2  -4.60163593292   6.65415056389e-09
3  -5.00138044357   1.16316911658e-08   4.008e-06
3  -4.80148792267   1.15515588206e-08   7.347e-06
3  -4.60160970681   1.14048361866e-08   8.446e-06
3  -4.40137386322   1.12357021465e-08   

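I am not sure how to select/group the values in the Val1 column that share the same Id and then find their maximum. Also, my Val1 column contains some blanks, which causes its dtype to be reported as object. I don't know how to handle that. Any help would be most welcome.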

2 Answers:

Answer 0: (score: 3)

Use GroupBy.transform to create a new column holding the max value of each group:

df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
print (df)
    Id    Param1        Param2      Val1  Max_Val1_for_each_Id
0    1 -5.001383  2.049906e-08  0.000017              0.000017
1    1 -4.801478  2.015170e-08  0.000016              0.000017
2    1 -4.601593  1.982632e-08  0.000017              0.000017
3    1 -4.401331  1.949184e-08  0.000016              0.000017
4    1 -4.201431  1.917677e-08       NaN              0.000017
5    2 -5.001419  6.883694e-09  0.000006              0.000006
6    2 -4.801521  6.773360e-09  0.000006              0.000006
7    2 -4.601636  6.654151e-09       NaN              0.000006
8    3 -5.001380  1.163169e-08  0.000004              0.000008
9    3 -4.801488  1.155156e-08  0.000007              0.000008
10   3 -4.601610  1.140484e-08  0.000008              0.000008
11   3 -4.401374  1.123570e-08       NaN              0.000008
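
To see what transform('max') does in isolation, here is a minimal sketch on a small made-up frame (the values are illustrative assumptions, not the question's data). It also shows how this differs from calling max() on the group, which returns one row per Id instead of one per original row:

```python
import numpy as np
import pandas as pd

# Illustrative data (hypothetical, not the question's CSV)
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Val1": [3.0, 7.0, np.nan, 2.0, 5.0],
})

# transform('max') returns a result aligned with the original rows,
# so every row receives the maximum of its own group; NaN is skipped.
df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")

# For comparison, max() alone returns one value per group.
per_group = df.groupby("Id")["Val1"].max()

print(df)
print(per_group)
```

Note that the row whose Val1 is NaN still receives its group's maximum, which matches the output above.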

Then, if the value is needed only in the first row of each group, add where with a mask created by duplicated, inverted with ~:

df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
print (df)
    Id    Param1        Param2      Val1  Max_Val1_for_each_Id
0    1 -5.001383  2.049906e-08  0.000017              0.000017
1    1 -4.801478  2.015170e-08  0.000016                   NaN
2    1 -4.601593  1.982632e-08  0.000017                   NaN
3    1 -4.401331  1.949184e-08  0.000016                   NaN
4    1 -4.201431  1.917677e-08       NaN                   NaN
5    2 -5.001419  6.883694e-09  0.000006              0.000006
6    2 -4.801521  6.773360e-09  0.000006                   NaN
7    2 -4.601636  6.654151e-09       NaN                   NaN
8    3 -5.001380  1.163169e-08  0.000004              0.000008
9    3 -4.801488  1.155156e-08  0.000007                   NaN
10   3 -4.601610  1.140484e-08  0.000008                   NaN
11   3 -4.401374  1.123570e-08       NaN                   NaN
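
The masking step can be tried on its own. A minimal sketch, assuming two small hypothetical Series (ids and group_max are names made up for illustration):

```python
import pandas as pd

# Hypothetical inputs: an Id per row and the already-computed group maximum
ids = pd.Series([1, 1, 2, 2, 2, 3])
group_max = pd.Series([7.0, 7.0, 5.0, 5.0, 5.0, 9.0])

# duplicated() is False for the first occurrence of each Id and True afterwards
is_repeat = ids.duplicated()

# where() keeps values where the condition is True and sets the rest to NaN,
# so inverting the mask keeps only the first row of each Id
first_only = group_max.where(~is_repeat)

print(first_only)
```

Because duplicated() marks every occurrence after the first, this keeps the value on the first row of each Id even when rows with the same Id are not adjacent.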

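EDIT:

If the blanks in Val1 are not NaN values (for example, they were read in as strings) and the solution above raises an error:

TypeError: '>=' not supported between instances of 'float' and 'str'

then the first step is to convert the non-numeric values to NaN: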

df['Val1'] = pd.to_numeric(df['Val1'], errors='coerce')
df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
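
For an end-to-end check, here is a minimal sketch that assumes the blanks arrive as whitespace strings mixed with floats, which is one way a column ends up with object dtype. The frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Val1 mixes floats with blank strings, so its dtype is object
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Val1": [1.738e-05, 1.628e-05, " ", 5.512e-06, " "],
})
dtype_before = df["Val1"].dtype

# errors='coerce' turns anything that cannot be parsed as a number into NaN
df["Val1"] = pd.to_numeric(df["Val1"], errors="coerce")

# Same two steps as above: per-group max, then keep it on the first row only
df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")
df["Max_Val1_for_each_Id"] = df["Max_Val1_for_each_Id"].where(~df["Id"].duplicated())

print(dtype_before, "->", df["Val1"].dtype)
print(df)
```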

Answer 1: (score: 1)

A fun way with NumPy

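import numpy as np
import pandas as pd

# f: integer group code per row, u: unique Ids
f, u = pd.factorize(df.Id)
# Start from -inf (not 0) so groups with only negative values are handled
out = np.full(len(u), -np.inf)
whr = np.ones(len(u), np.int64) * len(f)

mask = np.isnan(df.Val1)

# Per-group maximum of the non-NaN values
np.maximum.at(out, f[~mask], df.Val1[~mask])
# Position of the first row of each group
np.minimum.at(whr, f, np.arange(len(f)))

# Place each group's max on its first row; all other rows become NaN
df.assign(Max_Val1_for_each_Id=pd.Series(out, df.index[whr]))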

    Id    Param1        Param2      Val1  Max_Val1_for_each_Id
0    1 -5.001383  2.049906e-08  0.000017              0.000017
1    1 -4.801478  2.015170e-08  0.000016                   NaN
2    1 -4.601593  1.982632e-08  0.000017                   NaN
3    1 -4.401331  1.949184e-08  0.000016                   NaN
4    1 -4.201431  1.917677e-08       NaN                   NaN
5    2 -5.001419  6.883694e-09  0.000006              0.000006
6    2 -4.801521  6.773360e-09  0.000006                   NaN
7    2 -4.601636  6.654151e-09       NaN                   NaN
8    3 -5.001380  1.163169e-08  0.000004              0.000008
9    3 -4.801488  1.155156e-08  0.000007                   NaN
10   3 -4.601610  1.140484e-08  0.000008                   NaN
11   3 -4.401374  1.123570e-08       NaN                   NaN
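
The two ufunc.at calls are the core of this approach, so here is a minimal sketch of them on plain arrays. It assumes hand-written group codes in place of the pd.factorize result, and it starts the accumulator at -inf as an illustrative choice so that negative values would also be handled:

```python
import numpy as np

# Hypothetical group code per row (what pd.factorize would return) and values
codes = np.array([0, 0, 1, 1, 1, 2])
vals = np.array([3.0, 7.0, 2.0, np.nan, 5.0, 9.0])
n_groups = codes.max() + 1

# Unbuffered in-place maximum: out[codes[i]] = max(out[codes[i]], vals[i])
out = np.full(n_groups, -np.inf)
valid = ~np.isnan(vals)
np.maximum.at(out, codes[valid], vals[valid])

# Same idea with minimum over row positions gives each group's first row
first = np.full(n_groups, len(codes))
np.minimum.at(first, codes, np.arange(len(codes)))

print(out)
print(first)
```

Unlike plain fancy-index assignment, ufunc.at applies the operation once per repeated index, which is what makes it usable as a grouped reduction.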