Python pandas groupby: find the max of column 2 and get the matching value from column 1

Time: 2015-06-05 08:21:45

Tags: python python-2.7 python-3.x numpy pandas

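I am trying to reproduce some complex SQL operations in Python. To start with, the requirement is to find the EMP_ID of the employee with the highest salary in each department. There are 3 steps:

  1. GROUPBY(DEPT)

  2. Max(salary) for each department

  3. get(Emp_Id) for each department

Sample file.csv: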

    EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
    1,ghk,3,PTBP,23,IME,bhmd
    2,ghk,3,PTBP,23,IME,bhmd
    3,ghk,3,PTBP,23,IME,bhmd
    4,ghk,3,PTBP,23,IME-DATA,bhmd
    5,ghk,3,PTBP,24,IME-DATA,bhmd
    6,ghk,3,PTBP,23,IME,bhmd
    7,ghk,3,PTBP,23,IME,bhmd
    8,ghk,3,PTBP,29,IME-NA,bhmd
    9,ghk,3,PTBP,23,IME,bhmd
    10,ghk,3,PTBP,23,IME-NA,bhmd
    

The code I tried:

    import pandas as pd
    from pandas import *
    import numpy as np
    from numpy import *
    df=pd.read_csv("SAM_JOINS.csv",sep=",")
    go=df["EMP_ID"]+df["AGE"]
    df["SYSTEM_REVENUE"]=go
    print (df)
    b=df.groupby(["DEPT"],as_index=False)
    gb1=b['DEPT'].agg({'Count':np.size})
    print(gb1)
    

But I could not get the maximum salary and the emp_id for each department. Please help me solve this, as I am a newbie to pandas.

1 answer:

Answer 0 (score: 0)

You can use the groupby transform method.

Basically, this line:

df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())

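puts the department's maximum salary on every row; after that, all you have to do is subset on it. I have included an IPython session that runs this on your data. Note that because your sample data has little variation in the SAL field, the example does not look especially clean: several employees tie for the maximum in the IME department, so all of them are returned.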
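As an aside, the reason transform fits here, rather than an ordinary aggregation, can be sketched as follows. This is a minimal illustration assuming Python 3 and a reasonably recent pandas; the small frame reuses four rows of the sample data, and the names per_dept and per_row are illustrative, not part of the original answer.

```python
import pandas as pd

# Four rows taken from the question's sample data (only the relevant columns).
df = pd.DataFrame({
    "EMP_ID": [4, 5, 8, 10],
    "SAL": [23, 24, 29, 23],
    "DEPT": ["IME-DATA", "IME-DATA", "IME-NA", "IME-NA"],
})

# agg collapses each group to a single value: the result is indexed by DEPT.
per_dept = df.groupby("DEPT")["SAL"].agg("max")

# transform broadcasts each group's value back onto the original rows:
# the result shares df's index, so it can be assigned directly as a new column.
per_row = df.groupby("DEPT")["SAL"].transform("max")

print(per_dept)
print(per_row)
```

Passing the string "max" should behave the same as the lambda used in the answer for this data.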

IPython 3.1.0 -- An enhanced Interactive Python.

In [1]: from StringIO import StringIO
   ...: import pandas as pd
   ...: 

In [2]: # Create data set for pandas to read from
   ...: data = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
   ...: 1,ghk,3,PTBP,23,IME,bhmd
   ...: 2,ghk,3,PTBP,23,IME,bhmd
   ...: 3,ghk,3,PTBP,23,IME,bhmd
   ...: 4,ghk,3,PTBP,23,IME-DATA,bhmd
   ...: 5,ghk,3,PTBP,24,IME-DATA,bhmd
   ...: 6,ghk,3,PTBP,23,IME,bhmd
   ...: 7,ghk,3,PTBP,23,IME,bhmd
   ...: 8,ghk,3,PTBP,29,IME-NA,bhmd
   ...: 9,ghk,3,PTBP,23,IME,bhmd
   ...: 10,ghk,3,PTBP,23,IME-NA,bhmd"""
   ...: data = StringIO(data)
   ...: 

In [3]: # Load dataset
   ...: df = pd.read_csv(data)
   ...: print df
   ...: 
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC
0       1  ghk    3    PTBP   23       IME  bhmd
1       2  ghk    3    PTBP   23       IME  bhmd
2       3  ghk    3    PTBP   23       IME  bhmd
3       4  ghk    3    PTBP   23  IME-DATA  bhmd
4       5  ghk    3    PTBP   24  IME-DATA  bhmd
5       6  ghk    3    PTBP   23       IME  bhmd
6       7  ghk    3    PTBP   23       IME  bhmd
7       8  ghk    3    PTBP   29    IME-NA  bhmd
8       9  ghk    3    PTBP   23       IME  bhmd
9      10  ghk    3    PTBP   23    IME-NA  bhmd

In [4]: # Create new column of department max salary
   ...: df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())
   ...: print df
   ...: 
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1  ghk    3    PTBP   23       IME  bhmd            23
1       2  ghk    3    PTBP   23       IME  bhmd            23
2       3  ghk    3    PTBP   23       IME  bhmd            23
3       4  ghk    3    PTBP   23  IME-DATA  bhmd            24
4       5  ghk    3    PTBP   24  IME-DATA  bhmd            24
5       6  ghk    3    PTBP   23       IME  bhmd            23
6       7  ghk    3    PTBP   23       IME  bhmd            23
7       8  ghk    3    PTBP   29    IME-NA  bhmd            29
8       9  ghk    3    PTBP   23       IME  bhmd            23
9      10  ghk    3    PTBP   23    IME-NA  bhmd            29

In [5]: # Subset to show only employees with max salary in department
   ...: print df[df['SAL'] == df['DEPT_MAX_SAL']]
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1  ghk    3    PTBP   23       IME  bhmd            23
1       2  ghk    3    PTBP   23       IME  bhmd            23
2       3  ghk    3    PTBP   23       IME  bhmd            23
4       5  ghk    3    PTBP   24  IME-DATA  bhmd            24
5       6  ghk    3    PTBP   23       IME  bhmd            23
6       7  ghk    3    PTBP   23       IME  bhmd            23
7       8  ghk    3    PTBP   29    IME-NA  bhmd            29
8       9  ghk    3    PTBP   23       IME  bhmd            23
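
The session above is written for Python 2 (StringIO module, print statement). A rough Python 3 equivalent is sketched below, assuming a recent pandas. It also shows one possible way to get exactly one EMP_ID per department using idxmax, which picks the first row when salaries tie; that variant is an addition for illustration, not part of the original answer, and the names top_all and top_one are hypothetical.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question.
data = StringIO("""EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd
""")
df = pd.read_csv(data)

# Every employee whose salary equals the maximum in their department (ties kept).
dept_max = df.groupby("DEPT")["SAL"].transform("max")
top_all = df[df["SAL"] == dept_max]

# Exactly one row per department: idxmax gives the index label of the
# first maximum within each group, and .loc selects those rows.
top_one = df.loc[df.groupby("DEPT")["SAL"].idxmax(), ["DEPT", "EMP_ID", "SAL"]]

print(top_all[["DEPT", "EMP_ID", "SAL"]])
print(top_one)
```

Which variant is appropriate depends on whether tied employees should all be reported or only one per department.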