I want to find runs of consecutive NaNs in the columns of my DataFrame, replacing each NaN with the length of the run it belongs to, for example:
>>> df = pd.DataFrame([[np.nan, 2, np.nan],
... [3, 4, np.nan],
... [np.nan, np.nan, np.nan],
... [np.nan, 3, np.nan]],
... columns=list('ABC'))
>>> df
A B C
0 NaN 2.0 NaN
1 3.0 4.0 NaN
2 NaN NaN NaN
3 NaN 3.0 NaN
should give
>>> df
A B C
0 1.0 NaN 4.0
1 NaN NaN 4.0
2 2.0 1.0 4.0
3 2.0 NaN 4.0
Answer 0 (score: 2)
Use:
a = df.isnull()
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print (b)
A B C
0 1.0 NaN 4
1 NaN NaN 4
2 2.0 1.0 4
3 2.0 NaN 4
Details:
#unique consecutive values
print (a.ne(a.shift()).cumsum())
A B C
0 1 1 1
1 2 1 1
2 3 2 1
3 3 3 1
#count values per columns and map
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())))
A B C
0 1 2 4
1 1 2 4
2 2 1 4
3 2 1 4
#add NaNs by mask a
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a))
A B C
0 1.0 NaN 4
1 NaN NaN 4
2 2.0 1.0 4
3 2.0 NaN 4
Alternative for a single column:
a = df['A'].isnull()
b = a.ne(a.shift()).cumsum()
c = b.map(b.value_counts()).where(a)
print (c)
0 1.0
1 NaN
2 2.0
3 2.0
Name: A, dtype: float64
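The approach above can be checked end to end with a small self-contained script (a minimal runnable sketch that rebuilds the question's frame):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

a = df.isnull()
# Label each run of equal values, map every label to its run length,
# then keep the lengths only where the original value was NaN
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print(b)
```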
Answer 1 (score: 1)
IIUC, a combination of groupby + mask + isnull:
df.apply(lambda x :x.groupby(x.isnull().diff().ne(0).cumsum()).transform(len).mask(~x.isnull()))
Out[751]:
A B C
0 1.0 NaN 4.0
1 NaN NaN 4.0
2 2.0 1.0 4.0
3 2.0 NaN 4.0
For a single column:
df.A.groupby(df.A.isnull().diff().ne(0).cumsum()).transform(len).mask(~df.A.isnull())
Out[756]:
0 1.0
1 NaN
2 2.0
3 2.0
Name: A, dtype: float64
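The groupby idea above can be sketched as a self-contained script. Note two deliberate substitutions in this sketch: run labels are built with `ne(shift())` rather than `diff().ne(0)`, since `diff` on boolean Series behaves inconsistently across pandas versions (the resulting grouping is identical), and `transform('size')` stands in for `transform(len)`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

def consecutive_nan_counts(x):
    null = x.isnull()
    # Label each run of equal null/non-null values
    runs = null.ne(null.shift()).cumsum()
    # Size of each run, kept only at the NaN positions
    return x.groupby(runs).transform('size').mask(~null)

out = df.apply(consecutive_nan_counts)
print(out)
```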
Answer 2 (score: 1)
Not sure whether this is elegant, but here is how I would do it:
import numpy as np
import pandas as pd

def f(ds):
    ds = ds.isnull()
    splits = np.split(ds, np.where(ds == False)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count})
                      for split, count in zip(splits, counts)])

df.apply(lambda x: f(x))
Explanation:
# Binarize the array
ds = ds.isnull()
# Split the array where we have False (former nan values)
splits = np.split(ds, np.where(ds == False)[0])
# Now just count the number of True values
counts = [np.sum(v) for v in splits]
# Concatenate the Series that contain the requested values
pd.concat([pd.Series(split).replace({False: np.nan, True: count})
           for split, count in zip(splits, counts)])