什么是T-SQL中与TOP函数等效的python?我希望将我的数据帧过滤到前50K行。我在网上看了,我找不到一个简单的例子。
答案 0 :(得分:9)
更新: - 显示不同的pandas方法,包括:
每组前N行
前N行有偏移量
等效于SQL聚合函数:
ROW_NUMBER() / RANK() OVER(PARTITION BY ... ORDER BY ...)
样本DF:
df = pd.DataFrame({
'dep': np.random.choice(list('ABC'), 20),
'manager_id': np.random.randint(0, 10, 20),
'salary': np.random.randint(5000, 5006, 20)
})
----------------------- Original DF ---------------------- -
In [2]: df
Out[2]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002
5 A 3 5003
6 A 2 5004
7 A 2 5004
8 C 3 5002
9 C 4 5001
10 A 9 5002
11 C 9 5000
12 B 8 5004
13 A 1 5003
14 C 7 5005
15 B 0 5002
16 B 2 5003
17 A 4 5000
18 B 2 5003
19 B 7 5003
------------------前5行(按原始索引排序)------------------- < / p>
In [3]: df.head(5)
Out[3]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002
---前5行(按manager_id
DESC排序,dep
ASC)----
In [4]: df.sort_values(by=['manager_id', 'dep'], ascending=[False,True]).head(5)
Out[4]:
dep manager_id salary
10 A 9 5002
11 C 9 5000
12 B 8 5004
2 C 8 5000
3 A 7 5000
---相当于SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3
---
In [19]: df.nlargest(5+3, columns=['salary']).tail(5)
Out[19]:
dep manager_id salary
7 A 2 5004
12 B 8 5004
5 A 3 5003
13 A 1 5003
16 B 2 5003
----每个部门的前2名工资(无重复)-----
---等同于SQL:row_number() over(partition by DEP order by SALARY desc)
---
In [7]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
...: .groupby(['dep'])
...: .cumcount() + 1)
...: .query('rn < 3')
...: .sort_values(['dep','rn'])
...: )
Out[7]:
dep manager_id salary rn
6 A 2 5004 1
7 A 2 5004 2
0 B 5 5005 1
12 B 8 5004 2
14 C 7 5005 1
8 C 3 5002 2
---每个部门的前2名工资(使用&#34; nlargest&#34;)----
In [15]: df.loc[df.groupby('dep')['salary'].nlargest(2).reset_index()['level_1']]
Out[15]:
dep manager_id salary
6 A 2 5004
7 A 2 5004
0 B 5 5005
12 B 8 5004
14 C 7 5005
8 C 3 5002
---每个部门的第二和第三高薪---
In [16]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
....: .groupby(['dep'])
....: .cumcount() + 1)
....: .query('rn >= 2 and rn <= 3')
....: .sort_values(['dep','rn'])
....: )
Out[16]:
dep manager_id salary rn
7 A 2 5004 2
13 A 1 5003 3
12 B 8 5004 2
18 B 2 5003 3
8 C 3 5002 2
9 C 4 5001 3
---每个部门的前两名工资(有重复)----
---等同于SQL:rank() over(partition by DEP order by SALARY desc)
---
In [18]: (df.assign(rnk=df.groupby(['dep'])['salary']
....: .rank(method='min', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['dep','rnk'])
....: )
Out[18]:
dep manager_id salary rnk
6 A 2 5004 1.0
7 A 2 5004 1.0
0 B 5 5005 1.0
12 B 8 5004 2.0
14 C 7 5005 1.0
8 C 3 5002 2.0