Question

I have the following DataFrame of daily time-series data:

time-orig   00:15:00    00:30:00    00:45:00    01:00:00
date                
2010-01-04  1164.3  1163.5  1162.8  1161.8
2010-01-05  1186.3  1185.8  1185.6  1185.0
2010-01-06  1181.5  1181.5  1182.7  1182.3
2010-01-07  1202.1  1201.9  1201.7  1200.8

Now I want to count, for each column, how many times it holds the row maximum:

'00:15:00' : 3
'00:30:00' : 0
'00:45:00' : 1
'01:00:00' : 0

(That is, column '00:15:00' holds the row-wise maximum in 3 rows.)

I know I could transpose the DataFrame, loop over its columns, and use idxmax(), but is there a vectorized or otherwise better way to do this?
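For reference, the loop-based approach mentioned above might look roughly like the following. This is only a sketch of one possible reading of it; it assumes `date` is the index, and the names `transposed` and `counts` are illustrative rather than taken from the question.

```python
import pandas as pd

# Sample data from the question, with the dates as the index (an assumption).
df = pd.DataFrame(
    {
        "00:15:00": [1164.3, 1186.3, 1181.5, 1202.1],
        "00:30:00": [1163.5, 1185.8, 1181.5, 1201.9],
        "00:45:00": [1162.8, 1185.6, 1182.7, 1201.7],
        "01:00:00": [1161.8, 1185.0, 1182.3, 1200.8],
    },
    index=pd.Index(
        ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"], name="date"
    ),
)

transposed = df.T  # after transposing, each column is one day
counts = {col: 0 for col in df.columns}
for day in transposed.columns:
    # idxmax() returns the time label that holds that day's maximum
    counts[transposed[day].idxmax()] += 1

print(counts)
```

This works, but it does one Python-level iteration per row of the original frame, which is what the vectorized answers below avoid.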

Answer 1

One approach is to use np.argmax on the underlying array data, then use np.bincount to count how often each column position is the row maximum:

np.bincount(df.iloc[:,1:].values.argmax(1), minlength=df.shape[1]-1)

Sample run:

In [141]: df
Out[141]: 
    time-orig  00:15:00  00:30:00  00:45:00  01:00:00
0  2010-01-04    1164.3    1163.5    1162.8    1161.8
1  2010-01-05    1186.3    1185.8    1185.6    1185.0
2  2010-01-06    1181.5    1181.5    1182.7    1182.3
3  2010-01-07    1202.1    1201.9    1201.7    1200.8

In [142]: c = np.bincount(df.iloc[:,1:].values.argmax(1), minlength=df.shape[1]-1)

In [143]: c
Out[143]: array([3, 0, 1, 0])

In [144]: np.c_[df.columns[1:], c]
Out[144]: 
array([['00:15:00', 3],
       ['00:30:00', 0],
       ['00:45:00', 1],
       ['01:00:00', 0]], dtype=object)
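The session above can be reproduced as a standalone script along these lines. It is a minimal sketch that assumes, as in the sample run, that the dates sit in the first regular column (`time-orig`); the variable names `values`, `winners`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; 'time-orig' holds the dates as a regular column.
df = pd.DataFrame({
    "time-orig": ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"],
    "00:15:00": [1164.3, 1186.3, 1181.5, 1202.1],
    "00:30:00": [1163.5, 1185.8, 1181.5, 1201.9],
    "00:45:00": [1162.8, 1185.6, 1182.7, 1201.7],
    "01:00:00": [1161.8, 1185.0, 1182.3, 1200.8],
})

values = df.iloc[:, 1:].to_numpy()   # numeric block without the date column
winners = values.argmax(axis=1)      # position of each row's maximum
# minlength keeps a zero entry for columns that never hold a maximum
counts = np.bincount(winners, minlength=values.shape[1])

# Pair each column label with its count
result = dict(zip(df.columns[1:], counts.tolist()))
print(result)
```

Without `minlength`, `np.bincount` would stop at the largest position that actually occurs, so trailing columns with no maxima would be dropped from the result.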

Answer 2

This assumes date is the index. You can use df.idxmax followed by Series.value_counts:

print(df) 
time-orig   00:15:00  00:30:00  00:45:00  01:00:00
date                                              
2010-01-04    1164.3    1163.5    1162.8    1161.8
2010-01-05    1186.3    1185.8    1185.6    1185.0
2010-01-06    1181.5    1181.5    1182.7    1182.3
2010-01-07    1202.1    1201.9    1201.7    1200.8

s = df.idxmax(1).value_counts().reindex(df.columns, fill_value=0)
print(s)

time-orig
00:15:00    3
00:30:00    0
00:45:00    1
01:00:00    0
dtype: int64
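For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The construction of `df` is an assumption about how the frame was built (dates as an index named `date`, columns named `time-orig`); the intermediate name `row_winner` is illustrative.

```python
import pandas as pd

# Sample data from the question, with the dates as the index.
df = pd.DataFrame(
    {
        "00:15:00": [1164.3, 1186.3, 1181.5, 1202.1],
        "00:30:00": [1163.5, 1185.8, 1181.5, 1201.9],
        "00:45:00": [1162.8, 1185.6, 1182.7, 1201.7],
        "01:00:00": [1161.8, 1185.0, 1182.3, 1200.8],
    },
    index=pd.Index(
        ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"], name="date"
    ),
)
df.columns.name = "time-orig"

row_winner = df.idxmax(axis=1)  # column label of the maximum in each row
# value_counts() only lists labels that occur, so reindex restores the
# columns that never win and fills them with 0
s = row_winner.value_counts().reindex(df.columns, fill_value=0)
print(s.to_dict())
```

The `reindex` step also restores the original column order, since `value_counts` sorts by frequency.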

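If you want a NumPy array, Divakar's solution is very fast. For your exact data, his answer needs a small modification: because date is the index rather than the first column, there is no leading column to slice off: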

val = np.bincount(df.values.argmax(1), minlength=df.shape[1])
s = pd.Series(val, df.columns)
print(s)

time-orig
00:15:00    3
00:30:00    0
00:45:00    1
01:00:00    0
dtype: int64
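One detail worth knowing: both `argmax` and `idxmax` report only the first maximum in a row, so a row with a tie is credited to its leftmost maximal column. The sample data contains no ties, so this does not affect the results above. If tied columns should each be counted, one possible variant (not part of either answer) is sketched below on hypothetical data.

```python
import pandas as pd

# Hypothetical data (not from the question): the first row has a tie between a and b.
df = pd.DataFrame(
    {"a": [5.0, 1.0, 3.0], "b": [5.0, 2.0, 0.0], "c": [4.0, 0.0, 1.0]}
)

# First maximum only: the tied row is credited to 'a'
first_only = df.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)

# Every column that equals its row's maximum is counted
all_ties = df.eq(df.max(axis=1), axis=0).sum()

print(first_only.to_dict())
print(all_ties.to_dict())
```

Here `df.eq(..., axis=0)` compares each row against its own maximum, and summing the resulting boolean frame counts the matches per column.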

Get the number of maximum values per column in pandas
