I have a df like this:
Count
1
0
1
1
0
0
1
1
1
0
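If 1 appears two or more times in a row in Count, I want to return 1 in a new column, and 0 otherwise. So each row in the new column gets a 1 or a 0 depending on whether that condition is met in the Count column. My desired output is: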
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
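I think I may need to use itertools, but I have been reading about it and have not come across what I need yet. I would like to be able to use this method for any number of consecutive occurrences, not just 2. For example, sometimes I need to detect 10 consecutive occurrences; I only use 2 in the example here.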
Answer 0 (score: 10)
You can do:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
which gives:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
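To see why the one-liner works, it may help to split it into its intermediate steps. The sketch below rebuilds the example frame from the question; the names starts, run_id and run_len are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

# True wherever the value differs from the previous row, i.e. a new run starts.
# The first row is compared with NaN, so it always starts a run.
starts = df["Count"] != df["Count"].shift()

# Running total of run starts: all rows of the same run share one id.
run_id = starts.cumsum()

# Length of the run that each row belongs to (runs of 0s included).
run_len = df["Count"].groupby(run_id).transform("size")

# Multiplying by Count zeroes out the rows that belong to runs of 0s.
consecutive = run_len * df["Count"]

print(pd.DataFrame({"Count": df["Count"], "run_id": run_id,
                    "run_len": run_len, "consecutive": consecutive}))
```

The last column is the same as the consecutive column shown above.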
From here, for any threshold, you can do:
threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)
which gives:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
Or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
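Since the question asks for an arbitrary number of consecutive occurrences, the single-step expression could be wrapped in a small helper. This is only a sketch: flag_runs is a made-up name, and it assumes Count holds only 0s and 1s and that threshold is at least 1.

```python
import pandas as pd

def flag_runs(counts: pd.Series, threshold: int = 2) -> pd.Series:
    """Return 1 for rows inside a run of at least `threshold` consecutive 1s, else 0.

    Hypothetical helper; assumes `counts` contains only 0 and 1 and threshold >= 1.
    """
    # Label each run of equal values with its own id.
    run_id = (counts != counts.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = counts.groupby(run_id).transform("size")
    # Runs of 0s are zeroed out by the multiplication, so they never reach the threshold.
    return ((run_len * counts) >= threshold).astype(int)

df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df["New_Value"] = flag_runs(df["Count"], threshold=2)
df["runs_of_3"] = flag_runs(df["Count"], threshold=3)
print(df)
```

With threshold=3 only rows 6 to 8 are flagged, because the run at rows 2 and 3 has length 2.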
In terms of efficiency, using pandas methods gives a significant speedup as the problem size grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared with:
%%timeit
# assumes `from itertools import groupby` has already been run
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
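Outside IPython, where the %timeit magics are not available, a similar comparison could be run with the standard timeit module. The following is a rough sketch rather than the benchmark used above: with_pandas and with_itertools are illustrative names, the loop uses list.extend instead of list concatenation, and the timings will differ from machine to machine.

```python
import timeit
from itertools import groupby

import pandas as pd

base = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
# Repeat the example frame 1000 times, as in the benchmark above.
df = pd.concat([base] * 1000, ignore_index=True)
threshold = 2

def with_pandas():
    c = df["Count"]
    run_len = c.groupby((c != c.shift()).cumsum()).transform("size")
    return (run_len * c >= threshold).astype(int)

def with_itertools():
    flags = []
    for k, g in groupby(df["Count"]):
        size = sum(1 for _ in g)
        flags.extend([1 if k == 1 and size >= threshold else 0] * size)
    return pd.Series(flags)

# Both approaches should produce the same flags before timing them.
assert with_pandas().tolist() == with_itertools().tolist()

print("pandas:   ", timeit.timeit(with_pandas, number=20))
print("itertools:", timeit.timeit(with_itertools, number=20))
```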
Answer 1 (score: 1)
Not sure whether this is optimized, but you can give it a try:
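from itertools import groupby
import pandas as pd
l = []
# groupby yields one (value, iterator) pair per run of equal consecutive values
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)  # length of the current run
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
df['new_Value'] = pd.Series(l)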
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
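For other run lengths, the same itertools loop could be wrapped in a function that takes the threshold as a parameter. This is a sketch under a few assumptions: flag_runs_itertools is a made-up name, Count holds only 0s and 1s, and the list is grown with extend rather than repeated concatenation. Passing index=df.index keeps the new column aligned even when the frame does not have the default 0..n-1 index.

```python
from itertools import groupby

import pandas as pd

def flag_runs_itertools(counts, threshold=2):
    """Return a list with 1 for values inside a run of >= threshold 1s, else 0.

    Hypothetical helper; assumes `counts` is an iterable of 0s and 1s.
    """
    flags = []
    for value, group in groupby(counts):
        size = sum(1 for _ in group)  # length of the current run
        flag = 1 if value == 1 and size >= threshold else 0
        flags.extend([flag] * size)
    return flags

df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df["new_Value"] = pd.Series(flag_runs_itertools(df["Count"], threshold=2),
                            index=df.index)
print(df)
```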