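I am trying to reproduce some complex SQL operations in Python. To start with, the requirement is to find the EMP_ID that earns the highest salary in each department. There are 3 steps:
GROUPBY(DEPT)
MAX(SAL) - for each department
get(EMP_ID) - within each department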
EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd
The code I tried:
import pandas as pd

# Load the employee data
df = pd.read_csv("SAM_JOINS.csv", sep=",")

# Derived column: EMP_ID + AGE
df["SYSTEM_REVENUE"] = df["EMP_ID"] + df["AGE"]
print(df)

# Count the rows in each department
gb1 = df.groupby("DEPT").size().reset_index(name="Count")
print(gb1)
But I could not get the maximum salary and the matching EMP_ID for each department. Please help me solve this, as I am a newbie to pandas.
Answer 0 (score: 0)
You can use the groupby transform method.
Basically, this line:
df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())
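puts the department's maximum salary on every row; after that, all you have to do is subset on it. I have included an IPython session that implements this on your data. Note that because your sample data does not have much variation in the SAL field, the example does not look particularly clean.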
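To make the "on every row" part concrete, here is a minimal sketch contrasting a plain aggregation with transform. The tiny frame and the names per_dept and per_row are made up for illustration and are not your data.

```python
import pandas as pd

# Hypothetical toy data, only to show the shape of each result
df = pd.DataFrame({
    "DEPT": ["A", "A", "B", "B", "B"],
    "SAL": [10, 30, 20, 50, 40],
})

# Aggregation: one value per department, indexed by DEPT
per_dept = df.groupby("DEPT")["SAL"].max()

# Transform: the same maxima, broadcast back onto the original rows
per_row = df.groupby("DEPT")["SAL"].transform("max")

print(per_dept)
print(per_row.tolist())  # [30, 30, 50, 50, 50]
```

Because the transformed result has the same length and index as df, it can be assigned as a new column and compared against SAL row by row.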
In [1]: from io import StringIO
   ...: import pandas as pd
   ...:

In [2]: # Create data set for pandas to read from
   ...: data = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
   ...: 1,ghk,3,PTBP,23,IME,bhmd
   ...: 2,ghk,3,PTBP,23,IME,bhmd
   ...: 3,ghk,3,PTBP,23,IME,bhmd
   ...: 4,ghk,3,PTBP,23,IME-DATA,bhmd
   ...: 5,ghk,3,PTBP,24,IME-DATA,bhmd
   ...: 6,ghk,3,PTBP,23,IME,bhmd
   ...: 7,ghk,3,PTBP,23,IME,bhmd
   ...: 8,ghk,3,PTBP,29,IME-NA,bhmd
   ...: 9,ghk,3,PTBP,23,IME,bhmd
   ...: 10,ghk,3,PTBP,23,IME-NA,bhmd"""
   ...: data = StringIO(data)
   ...:

In [3]: # Load dataset
   ...: df = pd.read_csv(data)
   ...: print(df)
   ...:
   EMP_ID  NAME  AGE  ADDRESS  SAL      DEPT   LOC
0       1   ghk    3     PTBP   23       IME  bhmd
1       2   ghk    3     PTBP   23       IME  bhmd
2       3   ghk    3     PTBP   23       IME  bhmd
3       4   ghk    3     PTBP   23  IME-DATA  bhmd
4       5   ghk    3     PTBP   24  IME-DATA  bhmd
5       6   ghk    3     PTBP   23       IME  bhmd
6       7   ghk    3     PTBP   23       IME  bhmd
7       8   ghk    3     PTBP   29    IME-NA  bhmd
8       9   ghk    3     PTBP   23       IME  bhmd
9      10   ghk    3     PTBP   23    IME-NA  bhmd

In [4]: # Create new column of department max salary
   ...: df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())
   ...: print(df)
   ...:
   EMP_ID  NAME  AGE  ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1   ghk    3     PTBP   23       IME  bhmd            23
1       2   ghk    3     PTBP   23       IME  bhmd            23
2       3   ghk    3     PTBP   23       IME  bhmd            23
3       4   ghk    3     PTBP   23  IME-DATA  bhmd            24
4       5   ghk    3     PTBP   24  IME-DATA  bhmd            24
5       6   ghk    3     PTBP   23       IME  bhmd            23
6       7   ghk    3     PTBP   23       IME  bhmd            23
7       8   ghk    3     PTBP   29    IME-NA  bhmd            29
8       9   ghk    3     PTBP   23       IME  bhmd            23
9      10   ghk    3     PTBP   23    IME-NA  bhmd            29

In [5]: # Subset to show only employees with max salary in department
   ...: print(df[df['SAL'] == df['DEPT_MAX_SAL']])
   ...:
   EMP_ID  NAME  AGE  ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1   ghk    3     PTBP   23       IME  bhmd            23
1       2   ghk    3     PTBP   23       IME  bhmd            23
2       3   ghk    3     PTBP   23       IME  bhmd            23
4       5   ghk    3     PTBP   24  IME-DATA  bhmd            24
5       6   ghk    3     PTBP   23       IME  bhmd            23
6       7   ghk    3     PTBP   23       IME  bhmd            23
7       8   ghk    3     PTBP   29    IME-NA  bhmd            29
8       9   ghk    3     PTBP   23       IME  bhmd            23
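The subset above keeps every employee tied for the top salary, which is why IME shows six rows. If you would rather have exactly one row per department, or the tied EMP_IDs collected per department, something along these lines might work. It is only a sketch on the same sample data; the names top_idx, top, is_max and tied are my own, and picking the first row among ties is an assumption about what you want.

```python
from io import StringIO
import pandas as pd

csv_text = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd"""
df = pd.read_csv(StringIO(csv_text))

# Option 1: idxmax gives the row label of the first maximum in each group,
# so ties are resolved by taking the earliest row (an assumption here)
top_idx = df.groupby("DEPT")["SAL"].idxmax()
top = df.loc[top_idx, ["DEPT", "EMP_ID", "SAL"]]
print(top)

# Option 2: keep all ties, but collect the EMP_IDs into one list per department
is_max = df["SAL"] == df.groupby("DEPT")["SAL"].transform("max")
tied = df[is_max].groupby("DEPT")["EMP_ID"].apply(list)
print(tied)
```

On the sample data, option 1 picks EMP_ID 1, 5 and 8 for IME, IME-DATA and IME-NA, while option 2 lists all six tied employees for IME.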