Question

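I have a time series, i.e. a pandas.DataFrame with one column (containing the values) and an index (containing timestamps). Many of the values are 0, and I want to check for consecutive zeros. If too many zeros follow one another, I want to delete the excess zeros.

For example, if I only allow zeros to last for 5 seconds, then every stretch of zeros spanning more than 5 seconds should be reduced to its first 5 seconds: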

              value
time
12:01:01.001  1
12:01:01.002  0
12:01:01.004  6
12:01:01.010  4
12:01:03.010  0
12:01:05.010  0
12:01:08.010  0
12:01:10.010  0
12:01:10.510  0
12:01:11.101  3
12:01:12.101  3
12:01:15.101  0

should become

              value
time
12:01:01.001  1
12:01:01.002  0
12:01:01.004  6
12:01:01.010  4
12:01:03.010  0
12:01:05.010  0
12:01:08.010  0
12:01:11.101  3
12:01:12.101  3
12:01:15.101  0

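Possible solution

One possible solution is to loop over the DataFrame with two variables: the first remembers the first 0 after a non-zero value, and the second iterates further until the allowed time (e.g. 5 seconds) is exceeded. The first variable is then set to the position of the second, and the second moves on until it reaches a non-zero value. All zeros between the first and the second variable are deleted.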
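
A minimal sketch of such a loop, for illustration only. The function name `drop_long_zero_runs_loop` and its parameters are hypothetical, and it tracks only the start of the current run of zeros instead of two explicit positions:

```python
import pandas as pd


def drop_long_zero_runs_loop(frame, column="value", max_seconds=5):
    """Keep only the first `max_seconds` of every run of consecutive zeros."""
    keep = []
    run_start = None  # timestamp of the first zero in the current run
    for stamp, value in frame[column].items():
        if value != 0:
            run_start = None  # a non-zero value ends the current run
            keep.append(True)
            continue
        if run_start is None:
            run_start = stamp  # first zero after a non-zero value
        elapsed = (stamp - run_start).total_seconds()
        keep.append(elapsed <= max_seconds)
    return frame.loc[keep]


# Sample data from the question (the date part defaults to 1900-01-01)
times = pd.to_datetime(
    ["12:01:01.001", "12:01:01.002", "12:01:01.004", "12:01:01.010",
     "12:01:03.010", "12:01:05.010", "12:01:08.010", "12:01:10.010",
     "12:01:10.510", "12:01:11.101", "12:01:12.101", "12:01:15.101"],
    format="%H:%M:%S.%f",
)
sample = pd.DataFrame({"value": [1, 0, 6, 4, 0, 0, 0, 0, 0, 3, 3, 0]}, index=times)
result = drop_long_zero_runs_loop(sample)
```

On the sample data this drops the rows at 12:01:10.010 and 12:01:10.510, which lie 7 and 7.5 seconds after the start of their run of zeros.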

This would probably be very efficient in C, but in Python it is probably faster to use a library. How can I do this elegantly with Python libraries?

Answer 1

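Here is a solution using pandas groupby. The answer has been updated to show how to apply the filter based on one column of a dataframe.

Import the data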

from io import StringIO
import pandas as pd
import numpy as np

inp_str = u"""
time value
12:01:01.001 1
12:01:01.002 0
12:01:01.004 6
12:01:01.010 4
12:01:03.010 0
12:01:05.010 0
12:01:08.010 0
12:01:10.010 0
12:01:10.510 0
12:01:11.101 3
12:01:12.101 3
12:01:15.101 0
"""
frame = pd.read_csv(StringIO(inp_str), sep = " ").set_index('time')

# make sure we have a datetime index
frame.index = pd.to_datetime(frame.index)

# EDIT: add another column
frame = frame.assign(other = range(len(frame)))

# EDIT: replace ts with the relevant column
ts = frame['value']

# Everything else remains unchanged!

# Group by consecutive values `ts != ts.shift()`
out = ts.groupby([(ts != ts.shift()).cumsum(), ts])

# for all sequences of zeros, identify where more than 5 seconds passed from beginning of sequence

def seconds_elapsed(group):
    # seconds elapsed since the first timestamp of the group
    return (group.index - group.index[0]).total_seconds()

# key is (run id, value); keep only the timestamps of zero runs that lie
# more than 5 seconds after the start of their run, as a flat list
to_drop = [stamp
           for key, group in out if key[1] == 0
           for stamp in group.index[seconds_elapsed(group) > 5]]
# Remove from dataframe (drop returns a new DataFrame)
result = frame.drop(to_drop)
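
The same idea can also be written without a Python-level loop over the groups. The following is a sketch under assumptions, not part of the original answer: the function name `drop_long_zero_runs_vectorized` and its parameters are hypothetical, and it uses `groupby(...).transform("min")` to find the start of each run.

```python
import pandas as pd


def drop_long_zero_runs_vectorized(frame, column="value", max_seconds=5):
    """Drop zero rows lying more than `max_seconds` after the start of their run."""
    values = frame[column]
    # Give every run of consecutive equal values its own integer id
    run_id = (values != values.shift()).cumsum().to_numpy()
    # For each row, the earliest timestamp of the run it belongs to
    stamps = frame.index.to_series()
    run_start = stamps.groupby(run_id).transform("min")
    elapsed = (stamps - run_start).dt.total_seconds()
    too_long = (values == 0) & (elapsed > max_seconds)
    return frame[~too_long]


# Sample data from the question (the date part defaults to 1900-01-01)
times = pd.to_datetime(
    ["12:01:01.001", "12:01:01.002", "12:01:01.004", "12:01:01.010",
     "12:01:03.010", "12:01:05.010", "12:01:08.010", "12:01:10.010",
     "12:01:10.510", "12:01:11.101", "12:01:12.101", "12:01:15.101"],
    format="%H:%M:%S.%f",
)
sample = pd.DataFrame({"value": [1, 0, 6, 4, 0, 0, 0, 0, 0, 3, 3, 0]}, index=times)
filtered = drop_long_zero_runs_vectorized(sample)
```

Because the mask is computed on the whole frame, any other columns are kept alongside the filtered column.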

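To apply several filters, there are two possible cases:

Apply the filters based on the values in the original dataframe: for each filter column, apply the procedure above without overwriting the original dataframe, always creating a new dataframe. To get the final result, take the inner join of the dataframes obtained by filtering one column at a time.
Apply the filters consecutively: use the method above for each filter column, overwriting the original dataframe each time (order matters!).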
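
The two cases can be sketched as follows. This is only an illustration under assumptions: the helper names (`filter_zero_runs`, `filter_independently`, `filter_sequentially`) and the toy data are hypothetical, the index is assumed to hold unique timestamps, and the inner join is expressed as an intersection of the surviving indexes.

```python
import pandas as pd


def filter_zero_runs(frame, column, max_seconds=5):
    """Drop zero rows of `column` lying more than `max_seconds` into their run."""
    values = frame[column]
    run_id = (values != values.shift()).cumsum().to_numpy()
    stamps = frame.index.to_series()
    run_start = stamps.groupby(run_id).transform("min")
    elapsed = (stamps - run_start).dt.total_seconds()
    return frame[~((values == 0) & (elapsed > max_seconds))]


def filter_independently(frame, columns, max_seconds=5):
    # Every column is filtered against the ORIGINAL frame; a row survives
    # only if all filters kept it (an inner join on the index).
    kept = frame.index
    for column in columns:
        kept = kept.intersection(filter_zero_runs(frame, column, max_seconds).index)
    return frame.loc[kept]


def filter_sequentially(frame, columns, max_seconds=5):
    # Each filter sees the output of the previous one, so order matters.
    for column in columns:
        frame = filter_zero_runs(frame, column, max_seconds)
    return frame


# Toy data: one row per second, two columns to filter on
index = pd.Timestamp("2020-01-01") + pd.to_timedelta(list(range(8)), unit="s")
toy = pd.DataFrame({"a": [0, 0, 0, 0, 0, 1, 1, 1],
                    "b": [1, 1, 0, 5, 5, 0, 0, 0]}, index=index)

independent = filter_independently(toy, ["a", "b"], max_seconds=2)
a_then_b = filter_sequentially(toy, ["a", "b"], max_seconds=2)
b_then_a = filter_sequentially(toy, ["b", "a"], max_seconds=2)
```

In this toy data, filtering on `a` first removes the two rows that separated the zeros in `b`, so the zeros in `b` merge into one longer run and more rows are dropped than when `b` is filtered first.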
