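This is a very common SQL query: select the rows with the maximum value in column X, grouped by group_id. The result has one row per group_id (the first such row), namely the row whose column X value is the largest within that group.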
I have a 2D NumPy array with many columns, but let's simplify it to (ID, X, Y):
import numpy as np
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])
I want to get:
[[1 22 1236]
 [2 23 1111]]
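I was able to do it with a cumbersome loop, for example:
row_grouped_with_max = []
max_row = rows[0]
for row in rows[1:]:  # assumes rows are already ordered by group ID
    if row[0] != max_row[0]:
        # A new group starts: store the best row of the previous group
        row_grouped_with_max.append(max_row)
        max_row = row
    elif row[1] > max_row[1]:
        # Strictly greater, so the first row wins on ties
        max_row = row
# Store the best row of the last group
row_grouped_with_max.append(max_row)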
How can I do this in a clean NumPy way?
Answer 0 (score: 4)
An alternative using the pandas library (it is much easier to manipulate ndarrays there, IMO).
In [1]: import numpy as np
   ...: import pandas as pd
In [2]: rows = np.array([[1,22,1236],
   ...:                  [1,11,1563],
   ...:                  [2,13,1234],
   ...:                  [2,10,1224],
   ...:                  [2,23,1111],
   ...:                  [2,23,1250]])
   ...: print(rows)
[[   1   22 1236]
 [   1   11 1563]
 [   2   13 1234]
 [   2   10 1224]
 [   2   23 1111]
 [   2   23 1250]]
In [3]: df = pd.DataFrame(rows)
   ...: print(df)
   0   1     2
0  1  22  1236
1  1  11  1563
2  2  13  1234
3  2  10  1224
4  2  23  1111
5  2  23  1250
In [4]: g = df.groupby(0)[1].transform('max')  # per-row maximum of X within its group
   ...: print(g)
0    22
1    22
2    23
3    23
4    23
5    23
dtype: int32
In [5]: df2 = df[df[1] == g]  # keep only rows whose X equals the group maximum
   ...: print(df2)
   0   1     2
0  1  22  1236
4  2  23  1111
5  2  23  1250
In [6]: df3 = df2.drop_duplicates([0])  # keep the first row per group ID (column 0)
   ...: print(df3)
   0   1     2
0  1  22  1236
4  2  23  1111
In [7]: mtx = df3.to_numpy()  # as_matrix() was removed in pandas 1.0
   ...: print(mtx)
[[   1   22 1236]
 [   2   23 1111]]
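The transform / filter / drop_duplicates steps above can probably be collapsed into a single lookup with idxmax, which returns the index label of the first maximum in each group. This is only a minimal sketch, assuming the question's array and the default integer column labels 0, 1, 2; the variable names best and result are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question: columns are (ID, X, Y)
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

df = pd.DataFrame(rows)

# For each ID (column 0), find the index label of the first row
# where X (column 1) reaches its maximum within that group
best = df.loc[df.groupby(0)[1].idxmax()]

# Convert the selected rows back to a plain ndarray
result = best.to_numpy()
print(result)
```

Because idxmax reports the first occurrence of the maximum, ties inside a group resolve to the earliest row, which matches the "first row" requirement in the question.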
Answer 1 (score: 2)
It may not be the cleanest, but here is a vectorized way to solve it:
# Sort the rows by ID (stable, so rows within a group keep their original order)
sorted_rows = rows[np.argsort(rows[:, 0], kind='stable')]
# Count the rows for each ID
_, count = np.unique(sorted_rows[:, 0], return_counts=True)
# Build a mask marking the valid slots of a padded (n_groups, max_count) grid
N1 = count.max()
N2 = len(count)
mask = np.arange(N1) < count[:, None]
# Fill a 2D array with one row per unique ID holding that group's X values,
# padded with -inf so the padding never wins the argmax (np.Inf is gone in NumPy 2.0)
ID_2Darray = np.full((N2, N1), -np.inf)
ID_2Darray[mask] = sorted_rows[:, 1]
# Position of each group's first maximum, offset by where the group starts in sorted_rows
grp_max_idx = np.argmax(ID_2Darray, axis=1) + np.append([0], count.cumsum()[:-1])
# Finally, pick the rows with the maximum X in each group
out = sorted_rows[grp_max_idx]
Sample input and output:
In [101]: rows
Out[101]:
array([[   2,   13, 1234],
       [   1,   22, 1236],
       [   2,   23, 1250],
       [   6,   12, 1345],
       [   4,   10,  290],
       [   2,   10, 1224],
       [   2,   23, 1111],
       [   4,   45,   99],
       [   1,   11, 1563],
       [   4,   23,   89]])
In [102]: out
Out[102]:
array([[   1,   22, 1236],
       [   2,   23, 1250],
       [   4,   45,   99],
       [   6,   12, 1345]])
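If the padded 2D grid is a concern (it can get large when one ID has far more rows than the others), the same per-group maximum can likely be obtained with ufunc.reduceat instead. The following is a sketch of that swapped-in technique, not the method of the answer above; the function name first_max_rows and its key_col / val_col parameters are hypothetical, and it assumes the X column contains no NaN values.

```python
import numpy as np

def first_max_rows(rows, key_col=0, val_col=1):
    """Return, for each key, the first row holding the group's maximum value."""
    # Stable sort by key so rows inside a group keep their original order
    s = rows[np.argsort(rows[:, key_col], kind="stable")]
    # Start offset and size of each group in the sorted array
    _, starts, counts = np.unique(s[:, key_col],
                                  return_index=True, return_counts=True)
    # Maximum of the value column within each group
    group_max = np.maximum.reduceat(s[:, val_col], starts)
    # Mark the rows that reach their own group's maximum
    is_max = s[:, val_col] == np.repeat(group_max, counts)
    # Smallest position per group among the marked rows
    # (unmarked rows get the sentinel len(s), which can never win the minimum)
    pos = np.where(is_max, np.arange(len(s)), len(s))
    first = np.minimum.reduceat(pos, starts)
    return s[first]

# Sample data from the question: columns are (ID, X, Y)
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])
print(first_max_rows(rows))
```

Memory use here stays proportional to the number of rows, at the cost of a second reduceat pass to locate the first maximum in each group.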
Answer 2 (score: 1)
This can be solved quite elegantly using the numpy_indexed package (disclaimer: I am its author):
import numpy as np
import numpy_indexed as npi
# sort rows by the 2nd column (X)
rows = rows[np.argsort(rows[:, 1])]
# group_by is stable, so the last item in each group is the one we are after
print(npi.group_by(rows[:, 0]).last(rows))
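For readers who would rather not add a dependency, the same sort-then-pick idea can be sketched with plain NumPy. This is not how numpy_indexed is implemented, just a rough equivalent under stated assumptions: the function name max_rows_lexsort is hypothetical, X is a signed numeric column (so it can be negated), and the sort is arranged so that the first row in each group, rather than the last, is the one with the maximum.

```python
import numpy as np

def max_rows_lexsort(rows):
    """One row per ID (column 0): the first row with the largest X (column 1)."""
    # np.lexsort uses the LAST key as the primary one:
    # sort by ID ascending, then by X descending (via negation).
    # lexsort is stable, so equal (ID, X) pairs keep their original order.
    order = np.lexsort((-rows[:, 1], rows[:, 0]))
    s = rows[order]
    # The first row of each ID block now holds that group's maximum X
    _, first = np.unique(s[:, 0], return_index=True)
    return s[first]

# Sample data from the question: columns are (ID, X, Y)
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])
print(max_rows_lexsort(rows))
```

Sorting X in descending order and taking the first row of each block means ties resolve to the earliest row in the input, as the question asks.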
Answer 3 (score: 0)
Assuming you have n columns:
Use a.max along the first axis and unpack the values:
x1max, x2max, ..., xnmax = a.max(axis=0)
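As a quick check of what this call returns, here is a small sketch on the question's array (the names a, col_max, id_max, x_max and y_max are illustrative). Note that max(axis=0) yields one maximum per column over the whole array, so the values are not tied to any particular ID group or to a single row.

```python
import numpy as np

# Sample data from the question: columns are (ID, X, Y)
a = np.array([[1, 22, 1236],
              [1, 11, 1563],
              [2, 13, 1234],
              [2, 10, 1224],
              [2, 23, 1111],
              [2, 23, 1250]])

# Reduce along the first axis: one maximum per column
col_max = a.max(axis=0)

# Unpack the per-column maxima
id_max, x_max, y_max = col_max
print(id_max, x_max, y_max)
```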