Question

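How can I compare the values in col b of the first row and the last row of each group in col a, without using the groupby function? The groupby function is very slow for large datasets.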

a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3] 
b = [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1]

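Return two lists: one containing the group names from col a where the last value is larger than or equal to the first value, and one containing the group names where it is smaller.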

larger_or_equal = [1,3]
smaller = [2]

Answer 1

All numpy

import numpy as np

a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
b = np.array([1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1])

w = np.where(a[1:] != a[:-1])[0]  # find the edges
e = np.append(w, len(a) - 1)  # define the end pos
s = np.append(0, w + 1)  # define start pos

# slice end pos with boolean array.  then slice groups with end positions.
# I could also have used start positions.
print(a[e[b[e] >= b[s]]])
print(a[e[b[e] < b[s]]])

[1 3]
[2]
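
To make the edge-detection approach easier to reuse, here is a minimal sketch that wraps it in a function. The name compare_first_last is hypothetical (it is not from the answer), and the sketch assumes that a is non-empty and that equal labels sit in contiguous runs, as in the question's data.

```python
import numpy as np

def compare_first_last(a, b):
    """Hypothetical helper: split group labels by comparing last vs first b."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Indices i where a[i] != a[i + 1], i.e. the last row of every group
    # except the final one. Assumes equal labels are contiguous.
    w = np.flatnonzero(a[1:] != a[:-1])
    e = np.append(w, len(a) - 1)  # end position of each group
    s = np.append(0, w + 1)       # start position of each group
    mask = b[e] >= b[s]           # one boolean per group
    return a[e[mask]].tolist(), a[e[~mask]].tolist()

a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1]

larger_or_equal, smaller = compare_first_last(a, b)
print(larger_or_equal)  # [1, 3]
print(smaller)          # [2]
```

If the same label can reappear later in a, each run is treated as its own group, so sort by a first if that is not what you want.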

Answer 2

Here is a solution without groupby. The idea is to shift column a to detect where the group changes:

df[df['a'].shift() != df['a']]

    a  b
0   1  1
7   2  8
14  3  1

df[df['a'].shift(-1) != df['a']]

    a  b
6   1  7
13  2  4
20  3  1

We will compare column b in these two dataframes. We just need to reset the index so that pandas can align the two for the comparison:

first = df[df['a'].shift() != df['a']].reset_index(drop=True)
last = df[df['a'].shift(-1) != df['a']].reset_index(drop=True)
first.loc[last['b'] >= first['b'], 'a'].values

array([1, 3])

Then do the same with < to get the other groups. Or take a set difference.
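
Putting the shift-based steps together, a minimal end-to-end sketch could look like this. It assumes df is built from the question's two lists, that equal values of a are contiguous, and that b has no missing values, so the inverted mask is the same as comparing with <. The variable names is_first, is_last and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3],
    "b": [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1],
})

# A row starts a group when the previous row has a different label,
# and ends a group when the next row has a different label.
is_first = df["a"].shift() != df["a"]
is_last = df["a"].shift(-1) != df["a"]

# Reset the index so both frames are aligned group by group.
first = df[is_first].reset_index(drop=True)
last = df[is_last].reset_index(drop=True)

mask = last["b"] >= first["b"]
larger_or_equal = first.loc[mask, "a"].tolist()
smaller = first.loc[~mask, "a"].tolist()  # same as '<' when b has no NaN

print(larger_or_equal)  # [1, 3]
print(smaller)          # [2]
```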

As I wrote in the comments, groupby(sort=False) may be faster, depending on your dataset.
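
For comparison, a groupby(sort=False) version might look like the sketch below. The answer does not give the exact call, so the aggregation with 'first' and 'last' is an assumption about what was meant. Whether it beats the shift-based approach depends on the data, so time both on your own dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3],
    "b": [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1],
})

# sort=False skips sorting the group keys; groups keep order of appearance.
# Note: 'first' and 'last' return the first/last non-null value per group.
agg = df.groupby("a", sort=False)["b"].agg(["first", "last"])

larger_or_equal = agg[agg["last"] >= agg["first"]].index.tolist()
smaller = agg[agg["last"] < agg["first"]].index.tolist()

print(larger_or_equal)  # [1, 3]
print(smaller)          # [2]
```

Unlike the shift-based version, this does not require equal labels to be contiguous.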

pandas dataframe: compare the first and last row of each group
