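I have a Series whose values are NaN and True. I want another Series that holds a number sequence: wherever there is a NaN the value should be 0, and between two NaN rows I need a cumcount.
That is,
Input: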
colA
NaN
True
True
True
True
NaN
True
NaN
NaN
True
True
True
True
True
Output:
ColA Sequence
NaN 0
True 0
True 1
True 2
True 3
NaN 0
True 0
NaN 0
NaN 0
True 0
True 1
True 2
True 3
True 4
How can I do this in pandas?
Answer 0 (score: 11)
If performance is important, don't use groupby to count consecutive True values:
a = df['colA'].notnull()
b = a.cumsum()
df['Sequence'] = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
print (df)
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Explanation:
df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,
True,np.nan,np.nan,True,True,True,True,True]})
a = df['colA'].notnull()
# cumulative sum; each True counts as 1
b = a.cumsum()
# replace the values of b with NaN where a is True
c = b.mask(a)
# add 1 so that counting starts from 0
d = b.mask(a).add(1)
# forward fill NaNs, replace any leading NaNs with 0 and cast to int
e = b.mask(a).add(1).ffill().fillna(0).astype(int)
# subtract from b to get the counts
f = b-b.mask(a).add(1).ffill().fillna(0).astype(int)
# replace -1 with 0 using mask a
g = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
# all together
df = pd.concat([a,b,c,d,e,f,g], axis=1, keys=list('abcdefg'))
print (df)
a b c d e f g
0 False 0 0.0 1.0 1 -1 0
1 True 1 NaN NaN 1 0 0
2 True 2 NaN NaN 1 1 1
3 True 3 NaN NaN 1 2 2
4 True 4 NaN NaN 1 3 3
5 False 4 4.0 5.0 5 -1 0
6 True 5 NaN NaN 5 0 0
7 False 5 5.0 6.0 6 -1 0
8 False 5 5.0 6.0 6 -1 0
9 True 6 NaN NaN 6 0 0
10 True 7 NaN NaN 6 1 1
11 True 8 NaN NaN 6 2 2
12 True 9 NaN NaN 6 3 3
13 True 10 NaN NaN 6 4 4
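The steps above can be wrapped into a reusable function. This is only a sketch: the name `count_since_nan` is hypothetical, and it departs from the answer in one place by using `fillna(1)` instead of `fillna(0)`. With `fillna(0)`, a series that starts with a non-NaN value would count its first run from 1 rather than 0; on the question's data, which starts with NaN, both give the same result.

```python
import numpy as np
import pandas as pd


def count_since_nan(s):
    """Hypothetical helper: 0 on NaN rows, 0, 1, 2, ... within each run of non-NaN values."""
    a = s.notna()        # True where a value is present
    b = a.cumsum()       # running total of non-NaN rows
    # On NaN rows keep "running total + 1", then carry it forward over the run that follows.
    # fillna(1) covers a leading run that has no NaN before it (assumption, see lead-in).
    start = b.mask(a).add(1).ffill().fillna(1).astype(int)
    # NaN rows come out as -1 after the subtraction, so reset them to 0.
    return (b - start).where(a, 0)


df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})
df['Sequence'] = count_since_nan(df['colA'])
print(df)
```

Keeping the logic in one function also makes it easy to test edge cases such as a series that begins with True.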
Answer 1 (score: 8)
You can use groupby + cumcount + mask here:
m = df.colA.isnull()
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
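To see why the `sub(1)` and the `mask` are needed, the one-liner above can be unpacked into its intermediate columns. This is a minimal sketch on the question's data; the names `group_id`, `raw`, and `steps` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

m = df['colA'].isna()
group_id = m.cumsum()                   # goes up by 1 at every NaN row
raw = df.groupby(group_id).cumcount()   # position in the group; the NaN row is position 0

steps = pd.DataFrame({
    'is_nan': m,
    'group_id': group_id,
    'cumcount': raw,
    'minus_1': raw.sub(1),              # True rows now start at 0, NaN rows sit at -1
    'Sequence': raw.sub(1).mask(m, 0),  # NaN rows reset to 0
})
print(steps)
```

Each group opens with its NaN row, so the True rows that follow are numbered 1, 2, ... until 1 is subtracted.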
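Alternatively, clip at 0 in the last step so that you don't have to cache m beforehand. The original answer used clip_lower(0), which was deprecated in pandas 0.24 and removed in 1.0; clip(lower=0) is the equivalent:
df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip(lower=0)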
df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Timings
df = pd.concat([df] * 10000, ignore_index=True)
# Timing the alternatives in this answer
%%timeit
m = df.colA.isnull()
df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
23.3 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
24.1 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @user2314737's solution
%%timeit
df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
29.8 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @jezrael's solution
%%timeit
a = df['colA'].isnull()
b = a.cumsum()
(b-b.where(~a).add(1).ffill().fillna(0).astype(int)).clip_lower(0)
11.5 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note that your mileage may vary depending on the data.
Answer 2 (score: 3)
Try this:
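`%%timeit` is an IPython cell magic, so it only works in IPython or Jupyter. In a plain Python script, a rough equivalent with the standard-library `timeit` module might look like the sketch below. The function names `with_groupby` and `with_cumsum` are hypothetical wrappers around two of the approaches in this thread, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                              np.nan, np.nan, True, True, True, True, True]})
big = pd.concat([base] * 10000, ignore_index=True)


def with_groupby(df):
    m = df['colA'].isna()
    return df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)


def with_cumsum(df):
    a = df['colA'].notna()
    b = a.cumsum()
    return (b - b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)


# Check that both approaches agree before comparing their speed.
same = np.array_equal(with_groupby(big).to_numpy(), with_cumsum(big).to_numpy())
print('same result:', same)

for fn in (with_groupby, with_cumsum):
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```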
df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
Full example:
>>> df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,True,np.nan,np.nan,True,True,True,True,True]})
>>> df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
>>> df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
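This works because NaN never compares equal to anything, including another NaN. Every NaN row is therefore flagged as a change and becomes a group of its own, while consecutive True rows share one group. A small sketch that exposes the group keys (the names `changed` and `run_id` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

# True wherever a row differs from the previous one; NaN != NaN is also True.
changed = df['colA'] != df['colA'].shift(1)
run_id = changed.cumsum()   # one id per run; each NaN row gets its own id

df['Sequence'] = df.groupby(run_id).cumcount()
df['run_id'] = run_id
print(df)
```

Note that any other value in the column, such as False, would start its own run and be counted as well.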
Answer 3 (score: 2)
Late to the party, but here is a numpy solution wrapped in a function:
import pandas as pd, numpy as np
df = pd.DataFrame({'ColA': [np.nan, True, True, True, True, np.nan, True,
np.nan, np.nan, True, True, True, True, True]})
def return_cumsum(df):
v = np.array(df.ColA, dtype=float)
n = np.isnan(v)
v[n] = -np.diff(np.concatenate(([0.], np.cumsum(~n)[n])))
df['Sequence'] = np.array(np.maximum(0, np.cumsum(v)-1), dtype=int)
return df
df = return_cumsum(df)
# ColA Sequence
# 0 NaN 0
# 1 True 0
# 2 True 1
# 3 True 2
# 4 True 3
# 5 NaN 0
# 6 True 0
# 7 NaN 0
# 8 NaN 0
# 9 True 0
# 10 True 1
# 11 True 2
# 12 True 3
# 13 True 4
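The trick in the function above is that each NaN is replaced by the negative length of the run that just ended, so the running sum drops back to 0 at every NaN. A sketch of the same idea as a pure NumPy function that does not modify a DataFrame; the names `sequence_from_values` and `totals_at_nan` are hypothetical:

```python
import numpy as np


def sequence_from_values(values):
    v = np.array(values, dtype=float)   # True becomes 1.0, NaN stays NaN
    n = np.isnan(v)
    # Running count of non-NaN rows, sampled at the NaN positions.
    totals_at_nan = np.cumsum(~n)[n]
    # Differences between consecutive samples are the run lengths; negate them
    # so that the cumulative sum returns to 0 at each NaN.
    v[n] = -np.diff(np.concatenate(([0.0], totals_at_nan)))
    # The cumulative sum counts from 1 inside a run, so shift by 1 and floor at 0.
    return np.maximum(0, np.cumsum(v) - 1).astype(int)


seq = sequence_from_values([np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True])
print(seq)
```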