Performing an operation on grouped rows in Python

Time: 2018-05-10 00:29:49

Tags: python pandas dataframe

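I have a DataFrame in which pic_code values may repeat. When a pic_code repeats, I want to set the variable "keep" to "t" for the row whose weight is closest to its mpe_wgt.

For example, the second pic_code row has "keep" set to "t" because its "weight" is closest to its corresponding "mpe_wgt". My code leaves "keep" at 'f' for every row and "diff" at 100 for every row.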

df['keep']='f'
df['diff']=100

def cln_df(data):
    if pd.unique(data['mpe_wgt']).shape==(1,):
        data['keep'][0:1]='t'
    elif pd.unique(data['mpe_wgt']).shape!=(1,): 
        data['diff']=abs(data['weight']-(data['mpe_wgt']/100))
        data['keep'][data['diff']==min(data['diff'])]='t'
    return data

df=df.groupby('pic_code').apply(cln_df)

Before:

  pic_code      weight      mpe_wgt    keep    diff
  1234          45          34         f       100
  1234          32          23         f       100
  45344         54          35         f       100
  234           76          98         f       100
  234           65          12         f       100

The df output should be:

  pic_code      weight      mpe_wgt    keep    diff
  1234          45          34         f       11
  1234          32          23         t       9
  45344         54          35         t       100
  234           76          98         t       22
  234           65          12         f       53

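I am very new to Python, so please keep the solution as simple as possible. I would really like to get my own approach working, so please don't get too fancy. Thanks in advance for your help.

4 answers:

Answer 0: (score: 6)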

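Here is one way. Note that I use the booleans True / False instead of the strings "t" and "f". This is simply good practice.

Note also that every operation below is vectorized, whereas groupby.apply with a custom function certainly is not.

Setup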

print(df)

   pic_code  weight  mpe_wgt
0      1234      45       34
1      1234      32       23
2     45344      54       35
3       234      76       98
4       234      65       12

Solution

# calculate difference
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()

# sort by pic_code, then by diff
df = df.sort_values(['pic_code', 'diff'])

# define keep column as True only for non-duplicates by pic_code
df['keep'] = ~df.duplicated('pic_code')

Result

print(df)

   pic_code  weight  mpe_wgt  diff   keep
3       234      76       98    22   True
4       234      65       12    53  False
1      1234      32       23     9   True
0      1234      45       34    11  False
2     45344      54       35    19   True
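The result above comes out ordered by pic_code. If the original row order matters, one option is to sort by the index again afterwards. A minimal end-to-end sketch, assuming the sample data from the question and the default integer index (the final sort_index step is an addition, not part of the answer as posted):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

# Absolute difference between weight and mpe_wgt
df["diff"] = (df["weight"] - df["mpe_wgt"]).abs()

# Sort so that the closest row of each pic_code comes first
df = df.sort_values(["pic_code", "diff"])

# The first row of each pic_code is not a duplicate, so it is the one to keep
df["keep"] = ~df.duplicated("pic_code")

# Assumed extra step: restore the original row order
df = df.sort_index()

print(df)
```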

Answer 1: (score: 4)

Use:

df['keep'] = df.assign(closest=(df['mpe_wgt']-df['weight']).abs())\
               .sort_values('closest').duplicated(subset=['pic_code'])\
               .replace({True:'f',False:'t'})

Output:

   pic_code  weight  mpe_wgt keep
0      1234      45       34    f
1      1234      32       23    t
2     45344      54       35    t
3       234      76       98    t
4       234      65       12    f
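The output keeps the original row order even though the chain sorts by closest. That works because pandas aligns on index labels when a Series is assigned to a column. A small sketch that makes the intermediate step visible, assuming the sample data from the question (the names closest and is_dup are illustrative, and map is swapped in for replace here):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

closest = (df["mpe_wgt"] - df["weight"]).abs()

# Boolean Series in sorted order; it still carries the original index labels
is_dup = (
    df.assign(closest=closest)
      .sort_values("closest")
      .duplicated(subset=["pic_code"])
)
print(is_dup.index.tolist())  # sorted order, not 0..4

# Assignment aligns on index labels, so each flag lands on its own row
df["keep"] = is_dup.map({True: "f", False: "t"})
print(df)
```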

Answer 2: (score: 4)

Maybe you can try cumcount:

df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
df['keep'] = df.sort_values('diff').groupby('pic_code').cumcount().eq(0)
df
   pic_code  weight  mpe_wgt  diff   keep
0      1234      45       34    11  False
1      1234      32       23     9   True
2     45344      54       35    19   True
3       234      76       98    22   True
4       234      65       12    53  False
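A closely related groupby variant, nearer in spirit to the per-group approach in the question, uses idxmin instead of cumcount and writes "t" / "f" through a single .loc assignment. This is only a sketch under the assumption that the sample data and a unique index are used; it is a different technique from the one in this answer, and unlike the expected output in the question it computes diff for single-row groups as well:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

# Absolute difference, computed once for the whole frame
df["diff"] = (df["weight"] - df["mpe_wgt"]).abs()

# For each pic_code, idxmin gives the index label of the row with the smallest diff
closest = df.groupby("pic_code")["diff"].idxmin()

# Default every row to 'f', then flag the closest rows in one .loc assignment
df["keep"] = "f"
df.loc[closest.tolist(), "keep"] = "t"

print(df)
```

Using .loc with explicit labels avoids the chained indexing pattern (data['keep'][...] = 't') in the question, which may end up assigning to a temporary copy.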

Answer 3: (score: 2)

Use assign and eval to perform logic similar to the other answers.
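As a rough sketch of what an assign / eval version might look like (an assumption for illustration, not the answerer's original code), using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

result = (
    # eval computes the signed difference from a string expression
    df.assign(diff=df.eval("weight - mpe_wgt").abs())
      # closest rows first, so the first row seen per pic_code is the one to keep
      .sort_values("diff")
      .assign(keep=lambda d: ~d.duplicated("pic_code"))
      # back to the original row order
      .sort_index()
)

print(result)
```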