Question

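How can I get the rows of a dataframe in which every sequence of 5 consecutive columns contains the number 1 at least 3 times? The dataframe is filled with 1s and 0s (no missing values).

Example:

Also, since millions of rows and tens of columns need to be checked, a fast solution would be useful.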

Answer 1

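Compute a rolling sum with a window of width 5 across the columns, then look at every column from the fifth to the last (the earlier columns only cover incomplete windows) and select the rows where the sum is always 3 or greater: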

rolling_sum = df.rolling(5, min_periods=1, axis=1).sum()
select = (rolling_sum.iloc[:, 4:] >= 3).all(axis=1)

In [92]: df
Out[92]: 
   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  0  0  0  0  0  0
1  0  1  0  0  1  0  1  1  0  0
2  0  1  0  1  1  0  0  1  0  0
3  0  1  1  1  0  1  1  1  1  1
4  0  1  0  1  1  1  0  0  1  1
5  0  0  1  1  1  0  1  1  1  0

In [94]: (df.rolling(5, min_periods=1, axis=1).sum().iloc[:, 4:] >= 3).all(axis=1)
Out[94]: 
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool
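
As far as I know, newer pandas releases deprecate the `axis` argument of `rolling`, so the one-liner above may emit a warning or fail depending on the installed version. A minimal sketch of the same idea that rolls over the transposed frame instead is shown here; the names `window`, `min_ones` and `selected_rows` are illustrative and not part of the original answer.

```python
import pandas as pd

# Data copied from the session above.
df = pd.DataFrame([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
])

window, min_ones = 5, 3  # illustrative names

# Rolling down the rows of the transpose is the same as rolling
# across the columns of the original frame.
rolling_sum = df.T.rolling(window).sum().T

# The first window - 1 columns are NaN (incomplete windows), so skip them.
select = (rolling_sum.iloc[:, window - 1:] >= min_ones).all(axis=1)

selected_rows = df[select]
print(selected_rows)
```

Indexing with the boolean Series `select` keeps only the rows that satisfy the condition in every full window.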

Answer 2

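Reshape the underlying array data to 3D so that the last axis has 5 elements, one entry for each column in a block of 5. Then sum along that axis to get the number of ones in each block, and finally reduce with any along the second axis, which leaves one value per row of the original dataframe: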

df['result'] = (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)

For better performance, you might want to work with a boolean array: df.values==1 instead of df.values.

Sample run:

In [41]: df
Out[41]: 
   0  1  2  3  4  5  6  7  8  9
0  0  1  1  0  0  1  0  0  0  1
1  0  0  0  0  0  0  1  0  1  1
2  0  1  1  0  0  1  1  0  0  1
3  1  1  1  1  0  0  0  1  0  1
4  0  1  1  1  0  1  1  1  1  0
5  0  0  0  0  1  0  0  1  1  1
6  0  0  1  0  1  1  0  0  0  1

In [42]: df['result'] = (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)

In [43]: df
Out[43]: 
   0  1  2  3  4  5  6  7  8  9  result
0  0  1  1  0  0  1  0  0  0  1   False
1  0  0  0  0  0  0  1  0  1  1    True
2  0  1  1  0  0  1  1  0  0  1    True
3  1  1  1  1  0  0  0  1  0  1    True
4  0  1  1  1  0  1  1  1  1  0    True
5  0  0  0  0  1  0  0  1  1  1    True
6  0  0  1  0  1  1  0  0  0  1   False
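
To make the reshape easier to follow, here is a step-by-step sketch of the same one-liner on the sample data, using the boolean-array variant. The intermediate names (`blocks`, `block_sums`, `any_block`, `all_blocks`) are illustrative. Note that the reshape produces non-overlapping blocks of 5 columns, and `any` flags a row when at least one block reaches the threshold; the `all` line is an assumption about what one would write to require every block to reach it.

```python
import numpy as np
import pandas as pd

# Data copied from the sample run above (before the result column is added).
df = pd.DataFrame([
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
])

block = 5
a = df.to_numpy() == 1                    # boolean array, shape (7, 10)
blocks = a.reshape(len(df), -1, block)    # shape (7, 2, 5): rows, blocks, columns per block
block_sums = blocks.sum(axis=2)           # shape (7, 2): number of ones per block

any_block = (block_sums >= 3).any(axis=1)   # at least one block has 3+ ones
all_blocks = (block_sums >= 3).all(axis=1)  # every block has 3+ ones

print(block_sums)
print(any_block)
print(all_blocks)
```

`any_block` reproduces the `result` column shown above.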

If the number of columns is not a multiple of 5, we can use np.add.reduceat:

idx = np.arange(0,df.shape[1],5)
df['result'] = (np.add.reduceat(df.values, idx, axis=1)>=3).any(1)
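
A small sketch of how `np.add.reduceat` behaves when the column count is not a multiple of 5. The 7-column array is made up for illustration. Each index in `idx` starts a new block, so the last block here covers only the two leftover columns and can never reach a sum of 3 on its own.

```python
import numpy as np

# Made-up data with 7 columns (not a multiple of 5).
a = np.array([
    [1, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 1],
])

idx = np.arange(0, a.shape[1], 5)             # block start positions: [0, 5]
block_sums = np.add.reduceat(a, idx, axis=1)  # sums of columns 0-4 and 5-6
result = (block_sums >= 3).any(axis=1)

print(idx)
print(block_sums)
print(result)
```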

Timings on one million rows and 20 columns:

In [99]: np.random.seed(0)
    ...: a = (np.random.rand(1000000,20)>0.6).astype(int)
    ...: df = pd.DataFrame(a)

# Solution from this post
In [101]: %timeit (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)
10 loops, best of 3: 65.3 ms per loop

# @w-m's soln
In [102]: %timeit (df.rolling(5, min_periods=1, axis=1).sum().iloc[:, 4:] >= 3).all(axis=1)
1 loop, best of 3: 8.04 s per loop
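
The two timed expressions do not compute the same thing: the reshape checks non-overlapping blocks with `any`, while the rolling sum checks every overlapping window with `all`. A possible NumPy-only sketch of the overlapping-window check is shown below. It is not taken from either answer: it assumes NumPy 1.20 or later for `sliding_window_view`, the function name `rows_with_min_ones` is hypothetical, and no timing is claimed for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rows_with_min_ones(a, window=5, min_ones=3):
    """Return a boolean mask of rows where every window of `window`
    consecutive columns contains at least `min_ones` ones.

    Hypothetical helper for illustration only.
    """
    # Shape (n_rows, n_cols - window + 1, window); a view, no copy of the data.
    windows = sliding_window_view(a, window_shape=window, axis=1)
    return (windows.sum(axis=-1) >= min_ones).all(axis=1)


# Data from the rolling-sum answer, as a plain array.
a = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
])

mask = rows_with_min_ones(a)
print(mask)
```

On this data the mask agrees with the output of the rolling-sum answer.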

Get dataframe rows where every sequence of T columns contains N ones
