Sampling a pandas DataFrame while taking NaN values into account

Time: 2019-05-07 05:27:09

Tags: python pandas dataframe

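I have a data frame like the one below. I want to sample it with '3S', but in some cases NaN values are present. What I expect is that the data frame is sampled with '3S' and, if any NaN is found in between, sampling stops there and restarts from that index. I tried to achieve this with the dataframe.apply method, but it looks very complex. Is there a shorter way to achieve it?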

df.sample(n=3)

Code to generate the input:

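import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='min')
series = pd.DataFrame(range(13), index=index)
print(series)

# Use a real missing value (np.nan), not the string 'NaN'
series.iloc[4] = np.nan
series.iloc[10] = np.nan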

I tried sampling, but I have no idea how to proceed after that.

2015-01-01 00:00:00    0.0
2015-01-01 01:00:00    1.0
2015-01-01 02:00:00    2.0
2015-01-01 03:00:00    2.0
2015-01-01 04:00:00    NaN
2015-01-01 05:00:00    3.0
2015-01-01 06:00:00    4.0
2015-01-01 07:00:00    4.0
2015-01-01 08:00:00    4.0
2015-01-01 09:00:00    NaN
2015-01-01 10:00:00    3.0
2015-01-01 11:00:00    4.0
2015-01-01 12:00:00    4.0

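The new data frame should be sampled based on '3S' and should also take any NaN into account, restarting the sampling from the position where a NaN record is found.

Expected output: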

2015-01-01 02:00:00    2.0 -- sampled after 3S
2015-01-01 03:00:00    2.0 -- printed because the next record is NaN
2015-01-01 04:00:00    NaN -- NaN record is printed
2015-01-01 07:00:00    4.0 -- sampled after 3S
2015-01-01 08:00:00    4.0 -- printed because the next record is NaN
2015-01-01 09:00:00    NaN -- NaN record is printed
2015-01-01 12:00:00    4.0 -- sampled after 3S

2 Answers:

Answer 0: (score: 1)

One way is to fill the NaN values with 0:

df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)

Then resample the series (if datetime is your index).
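
The two steps of this answer can be sketched as follows. This is only a minimal illustration: the hourly input, the 3-hour bin width and the "last value per bin" aggregation are assumptions made for the example, not choices stated in the answer. Note that a uniform resample like this does not restart at NaN positions, so on its own it does not reproduce the expected output from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical input: hourly index with two missing values, as in the question.
index = pd.date_range("2000-01-01", periods=13, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Col_of_Interest": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

# Step 1: replace missing values with 0 (the NaN markers are lost after this).
filled = df["Col_of_Interest"].fillna(0)

# Step 2: resample on the DatetimeIndex. The bin width and the aggregation
# are assumptions for this sketch.
resampled = filled.resample(pd.Timedelta(hours=3)).last()
print(resampled)
```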

Answer 1: (score: 1)

Use:

import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='h')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan

print (df)
                      col
2000-01-01 00:00:00   0.0
2000-01-01 01:00:00   1.0
2000-01-01 02:00:00   2.0
2000-01-01 03:00:00   3.0
2000-01-01 04:00:00   NaN
2000-01-01 05:00:00   5.0
2000-01-01 06:00:00   6.0
2000-01-01 07:00:00   7.0
2000-01-01 08:00:00   8.0
2000-01-01 09:00:00   NaN
2000-01-01 10:00:00  10.0
2000-01-01 11:00:00  11.0
2000-01-01 12:00:00  12.0

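# True where the value is missing
m = df['col'].isna()
# Label each run of consecutive NaN / non-NaN rows
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(hours=2)
# Keep rows at least t after the first timestamp of their run
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t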

df1 = df[mask | m]
print (df1)
                      col
2000-01-01 02:00:00   2.0
2000-01-01 03:00:00   3.0
2000-01-01 04:00:00   NaN
2000-01-01 07:00:00   7.0
2000-01-01 08:00:00   8.0
2000-01-01 09:00:00   NaN
2000-01-01 12:00:00  12.0
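
To make the selection rule easier to follow, here is a plain-loop sketch that should pick the same rows on this regularly spaced input. It counts rows rather than comparing timestamps, and the `skip` value of 2 is an assumption derived from the 2-hour offset on hourly data; it is a readability aid, not a replacement for the vectorized version.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=13, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"col": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

skip = 2   # rows dropped at the start of each run (assumed: 2 hours on hourly data)
seen = 0   # non-NaN rows seen since the last NaN (or since the start)
keep = []
for value in df["col"]:
    if np.isnan(value):
        keep.append(True)   # NaN rows are always kept
        seen = 0            # restart counting after the gap
    else:
        keep.append(seen >= skip)
        seen += 1

df_loop = df[np.array(keep)]
print(df_loop)
```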

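Explanation:

  1. Create a mask of missing values with Series.isna
  2. Create groups of consecutive values by comparing the mask with its shifted version using Series.ne (!=), followed by a cumulative sum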

print (s1)
2000-01-01 00:00:00    1
2000-01-01 01:00:00    1
2000-01-01 02:00:00    1
2000-01-01 03:00:00    1
2000-01-01 04:00:00    2
2000-01-01 05:00:00    3
2000-01-01 06:00:00    3
2000-01-01 07:00:00    3
2000-01-01 08:00:00    3
2000-01-01 09:00:00    4
2000-01-01 10:00:00    5
2000-01-01 11:00:00    5
2000-01-01 12:00:00    5
Freq: H, Name: col, dtype: int32
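  3. Get the first index value of each group, add a timedelta (2 hours here, to match the expected output) and compare the result with the DatetimeIndex
  4. Finally, chain both masks with | (bitwise OR) and filter by boolean indexing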
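
The steps above can be wrapped in a small helper. This is a sketch under assumptions: the function name `sample_after_gap` and its parameters are hypothetical, and it takes the start of each run from the index via `transform("first")` instead of the lambda used in the answer.

```python
import numpy as np
import pandas as pd


def sample_after_gap(df, col, offset):
    """Keep NaN rows plus rows at least `offset` after the start of their run.

    Hypothetical helper; `col` is the column checked for NaN and `offset`
    is a pd.Timedelta.
    """
    is_nan = df[col].isna()
    # A new run id starts whenever the NaN flag changes
    run_id = is_nan.ne(is_nan.shift()).cumsum()
    stamps = df.index.to_series()
    # First timestamp of each run, broadcast to every row of that run
    run_start = stamps.groupby(run_id).transform("first")
    keep = stamps >= run_start + offset
    return df[keep | is_nan]


index = pd.date_range("2000-01-01", periods=13, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"col": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

result = sample_after_gap(df, "col", pd.Timedelta(hours=2))
print(result)
```

Passing the offset as a parameter makes it easy to try other spacings, for example `pd.Timedelta(seconds=3)` on second-level data.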