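I have an irregularly spaced time series that I resample to a higher frequency (`1min` in this case) so that I can run some calculations on it. One column, called `minor` in the example, is sometimes delayed by a few rows and sometimes aligned correctly. I need a way to align the end of each non-zero block in `minor` with the end of the corresponding non-zero block in `major`, as shown in this example: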

    import numpy as np
    import pandas as pd

    major = [0,0,0,0,0,0,0,0,4,4,4,4,4,5,6,7,0,0,0,0,4,3,5,6,4,0,0,0]
    minor = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,.9,0]
    # correctly aligned minor row:
    minor_aligned = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,.9,0,0,0]
    df = pd.DataFrame(
        data={'major': major, 'minor': minor, 'minor_aligned': minor_aligned})
    df.index.name = 'index'

Expected output: the values in `minor` should line up with those in `minor_aligned`.
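Short explanation: the last non-zero value of each consecutive block of non-zero values in `minor` must be aligned with the last non-zero value of the corresponding block in `major`, as shown in `minor_aligned`. The following additional constraints apply:

- `minor` is exactly 1 (or 0) about 95% of the time; the remaining values lie somewhere in between.
- `minor` can only be > 0 where `major` > 0.
- A non-zero block in `minor` can be at most as long as the corresponding block in `major`, never longer. It is usually shorter than the block in `major`.
- [...] `minor` must be 0. (I have not come across this case, so it is optional.)

What I have tried so far: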
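I copied the block counting method from [this post]. I also tried some masking and experimented with `cumcount`, `cumsum` and the like, but could not arrive at a solution.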

    df['mask_mult'] = np.where(  # mask where shifted rows exist
        (df.minor != 0.) & (df.major == 0.), 1 * df.minor, 0)
    # block counting method:
    df['block'] = (df.minor.shift(1) != df.minor).astype(int).cumsum()
    # set zero-blocks to zero (single .loc call avoids chained assignment)
    df.loc[df['minor'] == 0, 'block'] = 0

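Using `groupby`, categories and aggregation (I am not sure how to make full use of them), I tried to put the mask and the blocks to some use, but without success: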

    # make block counting categories:
    df_cat = df.set_index(pd.cut(df.block, np.arange(-1, df.block.max() + 1)))
    # groupby blocks and use mask as amount of shift indices:
    df_grpd = df.groupby('block').sum()

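I think I could either iterate over all categories in `df_cat` to get the shift indices, or iterate over the grouped blocks in `df_grpd` to do the same (using the summed mask as the number of rows to shift). In both cases, however, the 0.9 value kept me from getting a correct result.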
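To illustrate why a value like 0.9 gets in the way, here is a minimal sketch on a made-up series (the data and the names `by_value` and `by_mask` are assumptions for this example only). Counting blocks on the raw values opens a new block whenever the value changes, so a run such as 1, 0.9 is split in two. Counting on a zero/non-zero mask keeps the run together.

```python
import pandas as pd

# Hypothetical toy data: the last non-zero run is [1, 0.9].
minor = pd.Series([0, 0, 1, 1, 0, 0, 1, 0.9, 0])

# Block counting on the raw values: every change of value starts a new
# block, so 1 -> 0.9 is treated as a block boundary.
by_value = (minor.shift(1) != minor).astype(int).cumsum()

# Block counting on a 0/1 mask: only zero/non-zero transitions start a
# new block, so [1, 0.9] stays in one block.
nonzero = (minor != 0).astype(int)
by_mask = (nonzero.shift(1) != nonzero).astype(int).cumsum()

print(by_value.tolist())  # [1, 1, 2, 2, 3, 3, 4, 5, 6]
print(by_mask.tolist())   # [1, 1, 2, 2, 3, 3, 4, 4, 5]
```

With the mask-based labels, the rows holding 1 and 0.9 share one block id, so a per-block shift moves them together.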
Any ideas on how to handle values like 0.9, ideally without a full loop?

Thanks in advance!
Answer 0 (score: 0)
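Once I understood how `agg`/`aggregate` works, I found a solution. I could not avoid the loop entirely, but at least it only loops over the aggregated blocks, while everything else is vectorized.

This solution should work for most inputs and types, as long as the general shape and index requirements for concatenating along axis 1 are met.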

    def align_non_overlapping(ser_maj, ser_min, full_output=False):
        if isinstance(ser_min, pd.DataFrame):  # accept a single-column frame
            ser_min = ser_min.iloc[:, 0]
        # back up the name of minor if available, else fall back to 'minor'
        minor_colname = getattr(ser_min, 'name', None)
        if minor_colname is None:
            minor_colname = 'minor'
        # merge both series in df
        df = pd.DataFrame({'major': ser_maj, 'minor': ser_min})
        df_idx = df.index  # backup and drop index for easy general shifting
        df.reset_index(drop=True, inplace=True)
        # make mask where values are not overlapping
        df['mask_no'] = np.where((df.minor != 0.) & (df.major == 0.), 1, 0)
        # make mask where minor values are not zero
        df['mask_nz'] = (df.minor != 0).astype(int)
        # get blocks of consecutive non-zero values in minor
        df['block'] = (
            df['mask_nz'].shift(1) != df['mask_nz']).astype(int).cumsum()
        # set block to zero where minor is zero
        df.loc[df['minor'] == 0., 'block'] = 0
        # generate shifting information. summing of mask_no gives amount of non
        # overlapping rows (n rows to shift), index min and max gives start and
        # end indices of blocks of non zero blocks. reset index to be able to
        # apply aggregate function on index. named aggregation
        # (pandas >= 0.25) replaces the nested-dict form removed in pandas 1.0
        shifter = df.reset_index().groupby('block').agg(
            n_shift=('mask_no', 'sum'),
            index_start=('index', 'min'),
            index_stop=('index', 'max'))
        # loop over blocks and shift each block by aggregated values
        for row in shifter.itertuples():
            if row.n_shift == 0:  # skip all zero blocks/shifts
                continue
            # window from n_shift rows before the block up to its last row
            window = slice(row.index_start - row.n_shift, row.index_stop)
            df.loc[window, 'minor'] = np.roll(  # roll window
                df.loc[window, 'minor'].to_numpy(), -row.n_shift)
        df.set_index(df_idx, inplace=True)  # set backup index
        df.rename(columns={'minor': minor_colname}, inplace=True)  # set old name
        if not full_output:  # return full output with information only if required
            return df[minor_colname]
        return df

Test against the expected result:

    ser_maj = pd.Series([0,0,0,0,0,0,0,0,4,4,4,4,4,5,6,7,0,0,0,0,4,3,5,6,4,0,0,0], name='major')
    ser_min = pd.Series([0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,.9,0], name='minor')
    minor_aligned = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,.9,0,0,0]
    print((align_non_overlapping(ser_maj, ser_min, full_output=False) == minor_aligned).all())
    # Out: True

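For comparison, the per-block loop can probably be avoided altogether. The following is only a sketch of an alternative, not part of the solution above: `align_vectorized` is a hypothetical name, and the code assumes that each non-zero block in `minor` is delayed (never early) and that the rows it moves onto are zero. It counts, per block, how many non-zero `minor` rows sit on `major == 0`, then writes each non-zero value that many rows earlier.

```python
import numpy as np

def align_vectorized(major, minor):
    """Shift each non-zero block of minor back by its overhang (sketch)."""
    major = np.asarray(major, dtype=float)
    minor = np.asarray(minor, dtype=float)
    nz = minor != 0
    # Block id per row: the counter increases at the start of each non-zero run.
    block = np.cumsum(np.diff(nz.astype(int), prepend=0) == 1)
    # Rows where minor is non-zero although major is zero.
    overhang = nz & (major == 0)
    # Number of overhanging rows per block = rows to shift that block.
    n_shift = np.bincount(
        block[nz], weights=overhang[nz].astype(float),
        minlength=block.max() + 1).astype(int)
    src = np.flatnonzero(nz)           # current positions of non-zero values
    dst = src - n_shift[block[src]]    # target positions (assumed >= 0)
    out = np.zeros_like(minor)
    out[dst] = minor[src]
    return out

major = [0,0,0,0,0,0,0,0,4,4,4,4,4,5,6,7,0,0,0,0,4,3,5,6,4,0,0,0]
minor = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,.9,0]
minor_aligned = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,.9,0,0,0]

result = align_vectorized(major, minor)
print(np.array_equal(result, minor_aligned))  # True
```

The sketch works on plain NumPy arrays and returns an array, so an index would have to be reattached by the caller. If the assumptions do not hold, for example when a block would have to move past the start of the data, the result is not meaningful.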
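`align_non_overlapping` currently ignores these conditions:

- `minor` can only be > 0 where `major` > 0
- [...]

Both can, however, be implemented easily with `df[df.major == 0] = 0`.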
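As a small illustration of that masking step, here is a sketch on made-up data (the series and the names `violations` and `minor_masked` are assumptions for this example). Unlike `df[df.major == 0] = 0`, which zeroes every column in the affected rows, this variant only touches `minor`.

```python
import pandas as pd

# Hypothetical toy data: minor has one stray value (0.5) where major is 0.
major = pd.Series([0, 4, 4, 0, 0, 3, 0])
minor = pd.Series([0, 1, 1, 0.5, 0, 1, 0])

# Rows that violate "minor can only be > 0 where major > 0".
violations = (minor != 0) & (major == 0)

# Keep minor where major is non-zero, set it to 0 everywhere else.
minor_masked = minor.where(major != 0, 0)

print(int(violations.sum()))   # 1
print(minor_masked.tolist())   # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

Whether stray values should be dropped like this or treated as an error depends on the data; counting `violations` first makes that decision visible.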