Question

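This is a very common SQL query:

Select the rows with the maximum value in column X, grouped by group_id.

The result is one row per group_id (the first one), where the value of column X is the largest within the group.

I have a 2D NumPy array with many columns, but let's simplify it to (ID, X, Y):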

import numpy as np
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

I would like to get:

[[1 22 1236]
 [2 23 1111]]

I was able to do it with a cumbersome loop, for example:

# Assumes rows are already ordered so that equal IDs are adjacent.
row_grouped_with_max = []

max_row = rows[0]
for row in rows[1:]:
    if row[0] != max_row[0]:
        # A new group starts: store the previous group's best row.
        row_grouped_with_max.append(max_row)
        max_row = row
    elif row[1] > max_row[1]:
        # Strict comparison keeps the first row when X values tie.
        max_row = row
row_grouped_with_max.append(max_row)

How can I do this in a clean NumPy way?

Answer 1

An alternative using the pandas library (it is easier to manipulate ndarrays there, IMO).

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: rows = np.array([[1,22,1236],
   ...:                  [1,11,1563],
   ...:                  [2,13,1234],
   ...:                  [2,10,1224],
   ...:                  [2,23,1111],
   ...:                  [2,23,1250]])
   ...: print(rows)
[[   1   22 1236]
 [   1   11 1563]
 [   2   13 1234]
 [   2   10 1224]
 [   2   23 1111]
 [   2   23 1250]]

In [3]: df = pd.DataFrame(rows)
   ...: print(df)
   0   1     2
0  1  22  1236
1  1  11  1563
2  2  13  1234
3  2  10  1224
4  2  23  1111
5  2  23  1250

In [4]: g = df.groupby([0])[1].transform("max")
   ...: print(g)
0    22
1    22
2    23
3    23
4    23
5    23
dtype: int32

In [5]: df2 = df[df[1] == g]
   ...: print(df2)
   0   1     2
0  1  22  1236
4  2  23  1111
5  2  23  1250

In [6]: df3 = df2.drop_duplicates([0])  # keep the first max row per group ID
   ...: print(df3)
   0   1     2
0  1  22  1236
4  2  23  1111

In [7]: mtx = df3.to_numpy()  # as_matrix() was removed in pandas 1.0
   ...: print(mtx)
[[   1   22 1236]
 [   2   23 1111]]
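
The same steps can probably be condensed with groupby and idxmax. The following is only a sketch of that idea, not part of the original answer; the column names ID, X and Y are hypothetical labels added for readability.

```python
import numpy as np
import pandas as pd

rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

# Hypothetical column names, chosen only for readability.
df = pd.DataFrame(rows, columns=["ID", "X", "Y"])

# idxmax gives the index label of the first maximum of X in each group,
# so ties resolve to the earliest row.
first_max_labels = df.groupby("ID")["X"].idxmax()

# Select those rows and convert back to a plain ndarray.
result = df.loc[first_max_labels].to_numpy()
print(result)
```

Because idxmax already picks a single label per group, no separate de-duplication step should be needed here.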

Answer 2

It may not be clean, but here is a vectorized way to solve it:

# Get "rows" sorted by ID (stable sort keeps the original order within each ID)
sorted_rows = rows[np.argsort(rows[:,0], kind="stable")]

# Get count of elements for each ID
_,count = np.unique(sorted_rows[:,0],return_counts=True)

# Form mask to fill elements from X-column
N1 = count.max()
N2 = len(count)
mask = np.arange(N1) < count[:,None]

# Form a 2D array of X values, one row per unique ID, padded with -inf
ID_2Darray = np.empty((N2,N1))
ID_2Darray.fill(-np.inf)
ID_2Darray[mask] = sorted_rows[:,1]

# Get ID based max indices
grp_max_idx = np.argmax(ID_2Darray,axis=1) + np.append([0],count.cumsum()[:-1])

# Finally, get the "maxed"-X rows
out = sorted_rows[grp_max_idx]

Sample input and output:

In [101]: rows
Out[101]: 
array([[   2,   13, 1234],
       [   1,   22, 1236],
       [   2,   23, 1250],
       [   6,   12, 1345],
       [   4,   10,  290],
       [   2,   10, 1224],
       [   2,   23, 1111],
       [   4,   45,   99],
       [   1,   11, 1563],
       [   4,   23,   89]])

In [102]: out
Out[102]: 
array([[   1,   22, 1236],
       [   2,   23, 1250],
       [   4,   45,   99],
       [   6,   12, 1345]])
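
A shorter variant of the same sort-based idea might use np.lexsort, which avoids building the padded 2D array. This is a sketch and not the answer author's code; it assumes X is a signed numeric column so that negating it is safe.

```python
import numpy as np

rows = np.array([[2, 13, 1234],
                 [1, 22, 1236],
                 [2, 23, 1250],
                 [6, 12, 1345],
                 [4, 10, 290],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [4, 45, 99],
                 [1, 11, 1563],
                 [4, 23, 89]])

# np.lexsort treats the LAST key as the primary one: sort by ID ascending,
# then by X descending (negated; assumes signed numeric X).
order = np.lexsort((-rows[:, 1], rows[:, 0]))
sorted_rows = rows[order]

# A row starts a new group when its ID differs from the previous row's ID.
starts = np.concatenate(([True], sorted_rows[1:, 0] != sorted_rows[:-1, 0]))

# The first row of each group now holds that group's largest X.
out = sorted_rows[starts]
print(out)
```

Since np.lexsort is a stable sort, rows that tie on both ID and X keep their original relative order, so the earliest tied row is the one selected.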

Answer 3

This can be solved quite elegantly using the numpy_indexed package (disclaimer: I am its author):

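import numpy as np
import numpy_indexed as npi
# sort rows by 2nd column (stable, so tied X values keep their original order)
rows = rows[np.argsort(rows[:, 1], kind="stable")]
# group by is stable, so last item in each group is the one we are after
# (when X values tie, this is the last tied row rather than the first)
print(npi.group_by(rows[:, 0]).last(rows))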
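
For readers without numpy_indexed installed, the "sort by X, then take the last row of each group" idea can be approximated in plain NumPy. This is a sketch of the idea only, not the package's implementation; it relies on np.unique returning the index of the first occurrence of each value.

```python
import numpy as np

rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

# Stable sort by X ascending, so tied X values keep their original order.
by_x = rows[np.argsort(rows[:, 1], kind="stable")]

# Reverse so that the last row of each group (largest X) comes first.
reversed_rows = by_x[::-1]

# return_index gives the first occurrence of each ID in the reversed
# array, which is the last occurrence in the X-sorted array.
_, first_idx = np.unique(reversed_rows[:, 0], return_index=True)

out = reversed_rows[first_idx]
print(out)
# [[   1   22 1236]
#  [   2   23 1250]]
```

Note that for ID 2 this selects [2, 23, 1250], the last of the two tied rows, whereas the question asks for the first one ([2, 23, 1111]).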

Answer 4

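Assuming you have n columns:

Use a.max along the first axis and unpack the values: x1max, x2max, ..., xnmax = a.max(axis=0). Note that this gives the maximum of each column over the whole array, not one row per group.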
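
To make the behavior of a.max(axis=0) concrete, here is a small sketch on the question's array. The variable names id_max, x_max and y_max are illustrative only.

```python
import numpy as np

a = np.array([[1, 22, 1236],
              [1, 11, 1563],
              [2, 13, 1234],
              [2, 10, 1224],
              [2, 23, 1111],
              [2, 23, 1250]])

# Reduce along axis 0 (down the rows): one maximum per column,
# computed over the whole array with no grouping by ID.
col_max = a.max(axis=0)

# Unpack the per-column maxima into separate names.
id_max, x_max, y_max = col_max
print(id_max, x_max, y_max)  # 2 23 1563
```

The three values come from different rows, so the result is not a row of the original array.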

How do I get the rows of a 2D NumPy array where one column's value is the maximum within each group of another column?
