Shift non-overlapping columns so that they overlap/align

Time: 2019-01-10 21:18:30

Tags: python pandas time-series aggregate-functions pandas-groupby

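I have unevenly spaced time series that I resample to a higher frequency (1 min in this case) so that I can run some calculations. One column, called minor in the example, is sometimes delayed by a few rows and sometimes correctly aligned. I need a way to align the ends of the non-zero blocks in minor with the ends of the non-zero blocks in major, as shown in the example: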

import numpy as np
import pandas as pd

major = [0,0,0,0,0,0,0,0,4,4,4,4,4,5,6,7,0,0,0,0,4,3,5,6,4,0,0,0]
minor = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,.9,0]
# correctly aligned minor row:
minor_aligned = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,.9,0,0,0]
df = pd.DataFrame(
data={'major': major, 'minor': minor, 'minor_aligned': minor_aligned})
df.index.name = 'index'

Expected output:
The values in minor should line up with the values in minor_aligned.

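Short explanation:
The last non-zero value of each consecutive block of non-zero values in minor must be aligned with the last non-zero value of the corresponding block in major, as shown in minor_aligned. The following additional constraints apply:

  • minor is exactly 1 (or 0) about 95% of the time; the remaining values lie in between.
  • minor may only be > 0 where major > 0.
  • A non-zero block in minor can be at most as long as the corresponding block in major, never longer. It is usually much shorter than the block in major.
  • If there is no corresponding block, minor must be 0. (I have not come across this case, so this is optional.)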
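
The second constraint above can be checked mechanically. The following is only a minimal sketch, assuming both columns have the same length; the helper name check_alignment is hypothetical and not part of the original post.

```python
import pandas as pd


def check_alignment(major, minor):
    """Return True if minor is non-zero only where major is > 0."""
    major = pd.Series(major)
    minor = pd.Series(minor)
    # every row must either have minor == 0 or major > 0
    return bool(((minor == 0) | (major > 0)).all())


major = [0, 4, 4, 5, 0, 0]
ok = check_alignment(major, [0, 0, 1, 1, 0, 0])   # minor sits inside major
bad = check_alignment(major, [0, 0, 0, 1, 1, 0])  # minor trails past major
print(ok, bad)
```

A check like this could be run on a candidate result before comparing it with minor_aligned.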

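What I have tried so far:
I copied the block counting method from [this post]; in addition, I tried some masking and experimented with cumcount, cumsum and so on, but I could not find a solution.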

df['mask_mult'] = np.where(  # mask where shifted rows exist
    (df.minor != 0.) & (df.major == 0.), df.minor, 0)
# block counting method:
df['block'] = (df.minor.shift(1) != df.minor).astype(int).cumsum()
df.loc[df['minor'] == 0, 'block'] = 0  # set zero-blocks to zero
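
One possible reason a value like 0.9 causes trouble: comparing minor with its shifted self starts a new block whenever the value changes, so 1 followed by 0.9 ends up in two different blocks. A small sketch of counting blocks on a boolean non-zero flag instead, using made-up data (the variable names are illustrative only):

```python
import pandas as pd

minor = pd.Series([0, 0, 1, 1, 0, 1, 0.9, 0])
nonzero = minor != 0
# a new run starts wherever the non-zero flag flips
run_id = (nonzero != nonzero.shift(fill_value=False)).cumsum()
# keep the run label for non-zero rows, 0 everywhere else
block = run_id.where(nonzero, 0)
print(block.tolist())  # [0, 0, 1, 1, 0, 3, 3, 0]
```

Here 1 and 0.9 share the label 3 because only the zero/non-zero state is compared.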

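Using groupby, categories and aggregation (I do not know how to make full use of it), I tried to put the mask/blocks to some use, but without success: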

# make block counting categories:
df_cat = df.set_index(pd.cut(df.block, np.arange(-1, df.block.max() + 1)))
# groupby blocks and use mask as amount of shift indices:
df_grpd = df.groupby('block').sum()

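I think I could either iterate over all categories in df_cat to get the shift indices, or iterate over the grouped blocks in df_grpd to do the same (using the summed mask as the number of rows), but in both cases I cannot get a correct result because of the 0.9 value.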

Any ideas on how to handle values like 0.9 and avoid loops as far as possible?
Thanks in advance!

1 answer:

Answer 0: (score: 0)

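After understanding how agg/aggregate works, I found a solution. I could not avoid a loop entirely, but at least it only loops over the aggregated blocks, while everything else is vectorized.
This solution works with most inputs/types, as long as the general shape and index requirements for concatenating along axis 1 are met.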

import numpy as np
import pandas as pd


def align_non_overlapping(ser_maj, ser_min, full_output=False):
    minor_colname = (  # backup name of minor if available
        ser_min.name
        if isinstance(ser_min, pd.Series) and ser_min.name is not None
        else 'minor')
    # merge both series in df
    df = pd.DataFrame({'major': ser_maj, 'minor': ser_min})
    df_idx = df.index  # backup and drop index for easy general shifting
    df.reset_index(drop=True, inplace=True)
    # make mask where values are not overlapping
    df['mask_no'] = np.where((df.minor != 0.) & (df.major == 0.), 1, 0)
    # make mask where minor values are not zero
    df['mask_nz'] = (df.minor != 0).astype(int)
    # get blocks of consecutive non-zero values in minor
    df['block'] = (
        df['mask_nz'].shift(1) != df['mask_nz']).astype(int).cumsum()
    # set block to zero where minor is zero
    df.loc[df['minor'] == 0., 'block'] = 0
    # generate shifting information. summing of mask_no gives amount of non
    # overlapping rows (n rows to shift), index min and max gives start and
    # end indices of blocks of non zero blocks. reset index to be able to apply
    # aggregate function on index. named aggregation yields flat column
    # names, so no MultiIndex level has to be dropped
    shifter = df.reset_index().groupby('block').agg(
        n_shift=('mask_no', 'sum'),
        index_start=('index', 'min'),
        index_stop=('index', 'max'))
    # loop over blocks and shift each block by aggregated values
    for blck in shifter.index:
        if shifter.loc[blck, 'n_shift'] == 0:  # skip all zero blocks/shifts
            continue
        # extract shifting bounds as plain ints
        n_shift, istart, iend = (int(v) for v in shifter.loc[blck])
        df.loc[istart - n_shift:iend, 'minor'] = np.roll(  # roll window
            df.loc[istart - n_shift:iend, 'minor'], -n_shift)
    df.set_index(df_idx, inplace=True)  # set backup index
    df.rename(columns={'minor': minor_colname}, inplace=True)  # set old name
    if not full_output:  # return full output with information only if required
        return df[minor_colname]
    else:
        return df
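
The core step in the loop is the np.roll call on a window that starts n_shift rows before the late block. A tiny illustration with made-up numbers (window and n_shift here are example values, not taken from the function's internals):

```python
import numpy as np

# two rows that should be occupied, followed by the late block [1, 0.9]
window = np.array([0, 0, 1, 0.9])
n_shift = 2  # number of rows the block trails behind major
# a negative shift moves values toward the front; the zeros wrap to the end
rolled = np.roll(window, -n_shift)
print(rolled.tolist())  # [1.0, 0.9, 0.0, 0.0]
```

Because the window begins exactly n_shift rows early, the values that wrap around are the leading zeros, so the block keeps its order and length.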

Test against the expected result:

ser_maj = pd.Series([0,0,0,0,0,0,0,0,4,4,4,4,4,5,6,7,0,0,0,0,4,3,5,6,4,0,0,0], name='major')
ser_min = pd.Series([0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,.9,0], name='minor')
minor_aligned = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,.9,0,0,0]
print((align_non_overlapping(ser_maj, ser_min, full_output=False) == minor_aligned).all())
# Out: True

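These conditions are currently ignored:

  • minor may only be > 0 where major > 0
  • A non-zero block in minor can be at most as long as the corresponding block in major, never longer. It will be much shorter than the block in major

But both can easily be implemented with df[df.major == 0] = 0.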
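
As a hedged variant of that last step, the sketch below zeroes only the minor column wherever major is not positive, using Series.where. Note that boolean-mask assignment on the whole frame sets every column in the selected rows to 0, which may or may not be desired. The data is made up for illustration.

```python
import pandas as pd

major = pd.Series([0, 4, 4, 0, 0])
minor = pd.Series([0, 0, 1, 1, 0])
# keep minor where major > 0, otherwise replace it with 0
clipped = minor.where(major > 0, 0)
print(clipped.tolist())  # [0, 0, 1, 0, 0]
```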