Pandas: resample data to seconds, grouping roughly every 10 seconds

Time: 2019-09-16 03:36:54

Tags: python pandas resampling

Say I have the following DataFrame:

>>> df
                       a
2019-04-05 00:00:00  2.0                
2019-04-05 00:00:01  1.0
2019-04-05 00:00:02  NaN
2019-04-05 00:00:03  NaN
2019-04-05 00:00:04  NaN
2019-04-05 00:00:05  NaN
2019-04-05 00:00:06  NaN
2019-04-05 00:00:07  NaN
2019-04-05 00:00:08  3.0
2019-04-05 00:00:09  4.0
2019-04-05 00:00:10  NaN
2019-04-05 00:00:11  NaN
2019-04-05 00:00:12  NaN
2019-04-05 00:00:13  NaN
2019-04-05 00:00:14  NaN
2019-04-05 00:00:15  NaN
2019-04-05 00:00:16  NaN
2019-04-05 00:00:17  NaN
2019-04-05 00:00:18  NaN
2019-04-05 00:00:19  NaN
2019-04-05 00:00:20  4.0
2019-04-05 00:00:21  5.0
2019-04-05 00:00:22  NaN
2019-04-05 00:00:23  NaN
2019-04-05 00:00:24  NaN
2019-04-05 00:00:25  NaN
2019-04-05 00:00:26  6.0
2019-04-05 00:00:27  NaN
2019-04-05 00:00:28  4.0
2019-04-05 00:00:29  NaN
2019-04-05 00:00:30  NaN
2019-04-05 00:00:31  NaN

I would like at most one value every 7 seconds (when a value exists, otherwise NaN), so a DataFrame that looks like this:

>>> df
                       a
2019-04-05 00:00:00  2.0                
2019-04-05 00:00:01  NaN
2019-04-05 00:00:02  NaN
2019-04-05 00:00:03  NaN
2019-04-05 00:00:04  NaN
2019-04-05 00:00:05  NaN
2019-04-05 00:00:06  NaN
2019-04-05 00:00:07  NaN
2019-04-05 00:00:08  3.0
2019-04-05 00:00:09  NaN
2019-04-05 00:00:10  NaN
2019-04-05 00:00:11  NaN
2019-04-05 00:00:12  NaN
2019-04-05 00:00:13  NaN
2019-04-05 00:00:14  NaN
2019-04-05 00:00:15  NaN
2019-04-05 00:00:16  NaN
2019-04-05 00:00:17  NaN
2019-04-05 00:00:18  NaN
2019-04-05 00:00:19  NaN
2019-04-05 00:00:20  4.0
2019-04-05 00:00:21  NaN
2019-04-05 00:00:22  NaN
2019-04-05 00:00:23  NaN
2019-04-05 00:00:24  NaN
2019-04-05 00:00:25  NaN
2019-04-05 00:00:26  NaN
2019-04-05 00:00:27  NaN
2019-04-05 00:00:28  4.0
2019-04-05 00:00:29  NaN
2019-04-05 00:00:30  NaN
2019-04-05 00:00:31  NaN

The 7 seconds is arbitrary; in practice I will be taking a value about once per minute. Here is what I have tried so far:

df = df.resample('7s').first()

But that produces the following DataFrame:

                       a
2019-04-05 00:00:00  2.0
2019-04-05 00:00:07  3.0
2019-04-05 00:00:14  4.0
2019-04-05 00:00:21  5.0
2019-04-05 00:00:28  4.0

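Note: the missing NaNs between these points do not bother me. What I am unhappy with is the timing: it forces a value onto every 7-second boundary, whereas I only want to disallow values that are within 7 seconds of each other, without forcing one every 7 seconds.

Edit for clarity:

The DataFrame I don't want: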

                       a
2019-04-05 00:00:00  2.0
2019-04-05 00:00:07  3.0
2019-04-05 00:00:14  4.0
2019-04-05 00:00:21  5.0
2019-04-05 00:00:28  4.0

The DataFrame I do want:

>>> df
                       a
2019-04-05 00:00:00  2.0                
2019-04-05 00:00:01  NaN
2019-04-05 00:00:02  NaN
2019-04-05 00:00:03  NaN
2019-04-05 00:00:04  NaN
2019-04-05 00:00:05  NaN
2019-04-05 00:00:06  NaN
2019-04-05 00:00:07  NaN
2019-04-05 00:00:08  3.0
2019-04-05 00:00:09  NaN
2019-04-05 00:00:10  NaN
2019-04-05 00:00:11  NaN
2019-04-05 00:00:12  NaN
2019-04-05 00:00:13  NaN
2019-04-05 00:00:14  NaN
2019-04-05 00:00:15  NaN
2019-04-05 00:00:16  NaN
2019-04-05 00:00:17  NaN
2019-04-05 00:00:18  NaN
2019-04-05 00:00:19  NaN
2019-04-05 00:00:20  4.0
2019-04-05 00:00:21  NaN
2019-04-05 00:00:22  NaN
2019-04-05 00:00:23  NaN
2019-04-05 00:00:24  NaN
2019-04-05 00:00:25  NaN
2019-04-05 00:00:26  NaN
2019-04-05 00:00:27  NaN
2019-04-05 00:00:28  4.0
2019-04-05 00:00:29  NaN
2019-04-05 00:00:30  NaN
2019-04-05 00:00:31  NaN

OR:

>>> df
                       a
2019-04-05 00:00:00  2.0
2019-04-05 00:00:08  3.0
2019-04-05 00:00:20  4.0
2019-04-05 00:00:28  4.0

4 answers:

Answer 0 (score: 1)

This does not strictly use pandas methods, but it gets the job done.

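import numpy as np

c = 8  # rows since the last kept value; starts above 7 so the first value is kept
for index, row in df.iterrows():
    c += 1
    if c > 7 and not np.isnan(row.iloc[0]):
        c = 0  # keep this value and restart the counter
    else:
        # iterrows() yields copies, so assign to df rather than to row
        df.loc[index, df.columns[0]] = np.nan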

Once applied to df, this yields the desired DataFrame.

Edit:

For a DataFrame with n columns, keeping one value per x rows:
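For reference, here is a self-contained sketch of the counter approach, run on the sample data from the question. The helper name `thin_by_rows` and its `min_rows` parameter are illustrative assumptions, not pandas API, and the helper returns a new Series instead of modifying `df` in place.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample: 32 one-second rows, mostly NaN.
idx = pd.date_range("2019-04-05 00:00:00", periods=32, freq="s")
a = np.full(32, np.nan)
for pos, val in {0: 2.0, 1: 1.0, 8: 3.0, 9: 4.0, 20: 4.0, 21: 5.0, 26: 6.0, 28: 4.0}.items():
    a[pos] = val
df = pd.DataFrame({"a": a}, index=idx)


def thin_by_rows(s, min_rows=7):
    """Hypothetical helper: keep a value only if more than `min_rows` rows
    have passed since the last kept value; everything else becomes NaN."""
    out = s.to_numpy(dtype=float, copy=True)
    since_last = min_rows + 1  # start above the threshold so the first value is kept
    for pos in range(len(out)):
        since_last += 1
        if since_last > min_rows and not np.isnan(out[pos]):
            since_last = 0  # keep this value and restart the counter
        else:
            out[pos] = np.nan
    return pd.Series(out, index=s.index, name=s.name)


result = thin_by_rows(df["a"])
print(result.dropna())  # values kept at seconds 0, 8, 20 and 28
```

Because the counter works in rows, this only matches "7 seconds" when the index has exactly one row per second.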

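# x is the minimum number of rows between kept values, e.g. x = 7
c = [x + 1 for _ in range(df.shape[1])]  # one counter per column

for index, row in df.iterrows():
    c = [i + 1 for i in c]
    for i in range(len(c)):
        if c[i] > x and not np.isnan(row.iloc[i]):
            c[i] = 0
        else:
            # assign to df, since iterrows() yields copies
            df.loc[index, df.columns[i]] = np.nan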

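Second edit:

The above assumes there is a row (NaN if empty) for every time step. The following also works when the DataFrame has gaps in its index: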

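import datetime as dt

# Time of the last kept value per column; starts far in the past
c = [dt.datetime(1, 1, 1) for _ in range(df.shape[1])]

for index, row in df.iterrows():
    for i in range(len(c)):
        now = index.to_pydatetime()
        if now - c[i] > dt.timedelta(seconds=x) and not np.isnan(row.iloc[i]):
            c[i] = now
        else:
            # assign to df, since iterrows() yields copies
            df.loc[index, df.columns[i]] = np.nan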
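The timestamp-based variant can be sketched for a single column as follows. The names `thin_by_time` and `min_gap` are hypothetical, and the sample Series is the question's non-NaN values with the empty seconds left out of the index. Like the loop above, it uses a strict `>` comparison, so two values exactly `min_gap` apart are not both kept.

```python
import pandas as pd


def thin_by_time(s, min_gap="7s"):
    """Hypothetical helper: drop NaNs, then greedily keep a value only if it
    comes more than `min_gap` after the previously kept one."""
    gap = pd.Timedelta(min_gap)
    last_kept = None
    keep = []
    for ts in s.dropna().index:
        if last_kept is None or ts - last_kept > gap:
            keep.append(ts)
            last_kept = ts
    return s.loc[keep]


# An index with gaps: only observed timestamps are present, no NaN rows.
idx = pd.to_datetime([
    "2019-04-05 00:00:00", "2019-04-05 00:00:01", "2019-04-05 00:00:08",
    "2019-04-05 00:00:09", "2019-04-05 00:00:20", "2019-04-05 00:00:21",
    "2019-04-05 00:00:26", "2019-04-05 00:00:28",
])
s = pd.Series([2.0, 1.0, 3.0, 4.0, 4.0, 5.0, 6.0, 4.0], index=idx, name="a")

compact = thin_by_time(s)
print(compact)  # keeps the values at seconds 0, 8, 20 and 28
```

Comparing timestamps rather than counting rows is what makes this independent of how regular the index is.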

Answer 1 (score: 0)

You were very close; you can upsample the DataFrame afterwards:

df = df.resample('7s').first()
df = df.resample(rule='1s').asfreq()

This produces a DataFrame with NaN in the rows newly inserted for the added seconds.

Answer 2 (score: 0)

What about filling the NA values before resampling?
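As a small illustration of the upsampling step: calling `asfreq()` on the resampler returns the data at the new frequency and leaves the newly created rows as NaN. The three-row frame below is made-up sample data, and the names `compact` and `per_second` are only for this sketch.

```python
import pandas as pd

# Made-up compact frame with irregularly spaced timestamps.
compact = pd.DataFrame(
    {"a": [2.0, 3.0, 4.0]},
    index=pd.to_datetime([
        "2019-04-05 00:00:00",
        "2019-04-05 00:00:08",
        "2019-04-05 00:00:20",
    ]),
)

# Upsample to one row per second; seconds without a value become NaN.
per_second = compact.resample("1s").asfreq()
print(per_second)
```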

df = df.fillna('something').resample('7s').first()

Then the values will not be forced:

                    a
2019-04-05 00:00:00 2
2019-04-05 00:00:07 something
2019-04-05 00:00:14 something
2019-04-05 00:00:21 5
2019-04-05 00:00:28 4

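Note that filling NA with a string such as 'something' converts the whole column to object instead of float. If you want to keep the dtype, you can use df.fillna(0) instead.

Answer 3 (score: 0)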
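The dtype change is easy to check on a tiny Series. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0], name="a")

as_text = s.fillna("something")  # mixes floats with a string
as_zero = s.fillna(0)            # stays numeric

print(as_text.dtype)
print(as_zero.dtype)
```

One caveat with the numeric placeholder: a fill value of 0 cannot be told apart from a genuine 0 in the data afterwards.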

df.loc[df.resample("7s").apply(lambda s: s.first_valid_index()).a]

If you want to fill the values in between with NaN, then:

df1 = df.loc[df.resample("7s").apply(lambda s: s.first_valid_index()).a]
df1.resample("1s").apply(lambda s: None if s.empty else s)

Edit:

Based on the clarification, here we go:

df[df.rolling(window="7s", closed='neither').sum().isna()]

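Use the upsampling code shown above to fill in the NaNs.

Edit 2

We have to loop over the rows, because whether a value is emitted depends on the values emitted before it: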

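import pandas as pd

def f():
    skip = 0  # rows still to skip after the last emitted value
    for row in df.itertuples():
        if skip == 0:
            if pd.notna(row.a):
                yield row
                skip = 7
        else:
            skip = skip - 1

# itertuples() stores the timestamp in the "Index" field
pd.DataFrame(f()).set_index("Index")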