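I have a pandas Series in which some missing data is represented by NaNs, and I would like a rough idea of how long each stretch of missing data lasts and how many such stretches there are.
An example: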
10:01 1.23
10:02 2.23
10:03 nan
10:04 nan
10:05 nan
10:06 6.23
10:07 nan
10:08 nan
10:09 9.23
The expected output would then be a list like this:
missing = [[10:03,10:05], [10:07,10:08]]
N_missing = 2
Answer 0 (score: 3)
Use:
# move the time index into a regular column and name the columns 'A' (time) and 'B' (value)
df = df.reset_index()
df.columns = ['A','B']
# boolean mask of the non-NaN values, stored in a variable for reuse
m = df['B'].notnull()
# use the cumulative sum of the mask as the index, so each run of consecutive NaNs shares one group label
df.index = m.cumsum()
# keep only the NaN rows, take the first and last time in each group, and convert to a nested list
missing = df[~m.values].groupby(level=0)['A'].agg(['first','last']).values.tolist()
print (missing)
[['10:03', '10:05'], ['10:07', '10:08']]
# the number of missing runs is the length of the outer list
N_missing = len(missing)
print (N_missing)
2
Details:
print (df[~m.values])
A B
B
2 10:03 NaN
2 10:04 NaN
2 10:05 NaN
3 10:07 NaN
3 10:08 NaN
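To see where the group labels 2 and 3 in this output come from, here is a minimal sketch that rebuilds the sample data from the question. It assumes the times are plain strings (the question does not say whether the index is a string or datetime index), and the names `group_ids` and `nan_group_ids` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; string timestamps are an assumption.
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=['10:01', '10:02', '10:03', '10:04', '10:05',
           '10:06', '10:07', '10:08', '10:09'],
)

m = s.notna()           # True where a value is present
group_ids = m.cumsum()  # running count of valid values seen so far

# The counter only increases on non-NaN rows, so every NaN in a run
# carries the count of the last valid value before it.
nan_group_ids = group_ids[~m].tolist()
print(nan_group_ids)  # [2, 2, 2, 3, 3]
```

Because the label stays constant across a run of NaNs and changes as soon as a valid value appears, grouping the NaN rows by this label yields one group per gap.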
A similar solution for a Series:
m = s.notnull()
cum = m.cumsum()
missing = s[~m.values].index.to_series().groupby(cum).agg(['first','last']).values.tolist()
print (missing)
[['10:03', '10:05'], ['10:07', '10:08']]
N_missing = len(missing)
print (N_missing)
2
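For reference, a self-contained variant of the Series approach is sketched below. It rebuilds the sample data from the question (string timestamps are an assumption) and also reports how many rows each gap spans, which the question asks about but the code above does not compute. The names `runs` and `run_lengths` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question; string timestamps are an assumption.
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=['10:01', '10:02', '10:03', '10:04', '10:05',
           '10:06', '10:07', '10:08', '10:09'],
)

m = s.notna()
cum = m.cumsum()

# Group the index labels of the NaN rows by their run label and record
# the first label, the last label, and the number of rows in each run.
runs = (
    s[~m].index.to_series()
    .groupby(cum[~m])
    .agg(['first', 'last', 'count'])
)

missing = runs[['first', 'last']].values.tolist()
N_missing = len(missing)
run_lengths = runs['count'].tolist()

print(missing)      # [['10:03', '10:05'], ['10:07', '10:08']]
print(N_missing)    # 2
print(run_lengths)  # [3, 2]
```

With a real datetime index, the duration of each gap could instead be taken as the difference between the `last` and `first` columns; that would need the index converted to datetimes first.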
Answer 1 (score: 0)