Question

I have data in the following format.

index  timestamps(s)    Bytes
0       0.0               0
1       0.1               9
2       0.2               10
3       0.3               8
4       0.4               8
5       0.5               9
6       0.6               7
7       0.7               8     
8       0.8               7
9       0.9               6

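It is in a pandas DataFrame (but the format does not matter). I want to split the data into smaller parts (called windows). Each part should have a fixed duration (0.3 seconds), and I then want to compute the average of the bytes in each window. I want the start and end row indices of each window, like this: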

win_start_ind = [1 4 7]
win_end_ind   = [3 6 9]

I intend to use these indices to compute the average number of bytes per window.

Python code would be appreciated.

Answer 1

John Galt suggested a simple alternative that works for your problem.

g = df.groupby(df['timestamps(s)']//0.3*0.3).Bytes.mean().reset_index()
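As a minimal sketch of the same grouping idea on the sample data from the question, the following also recovers the first and last row index of each window. The column name `win` and the conversion to integer milliseconds are my own assumptions, added so the window number is computed with integers rather than with floating-point `//`.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps(s)": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "Bytes": [0, 9, 10, 8, 8, 9, 7, 8, 7, 6],
})

# Convert to whole milliseconds so the window number is computed with integers.
ms = (df["timestamps(s)"] * 1000).round().astype("int64")
df["win"] = ms // 300  # window number: 0, 0, 0, 1, 1, 1, ...

# Mean of Bytes per 0.3 s window.
means = df.groupby("win")["Bytes"].mean()

# First and last row index of each window.
idx = df.index.to_series().groupby(df["win"]).agg(["first", "last"])
win_start_ind = idx["first"].tolist()
win_end_ind = idx["last"].tolist()
```

Because the windows here are anchored at time 0, the start indices come out as 0, 3, 6, 9 rather than the 1, 4, 7 shown in the question, which skips the zero-byte row.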

A more general solution that works for any datetime data involves pd.to_datetime and pd.Grouper.

df['timestamps(s)'] = pd.to_datetime(df['timestamps(s)'], format='%S.%f')  # 1
g = df.groupby(pd.Grouper(key='timestamps(s)', freq='0.3S')).Bytes\
                                                   .mean().reset_index()   # 2
g['timestamps(s)'] = g['timestamps(s)']\
                        .dt.strftime('%S.%f').astype(float) # 3

g    
   timestamps(s)     Bytes
0            0.0  6.333333
1            0.3  8.333333
2            0.6  7.333333
3            0.9  6.000000    

g.Bytes.values
array([ 6.33333333,  8.33333333,  7.33333333,  6.        ])
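For reference, here is a self-contained sketch of the pd.Grouper approach, assuming the timestamps are stored as floats in seconds rather than as strings. The helper column `t` and the detour through integer milliseconds are assumptions of this sketch, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps(s)": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "Bytes": [0, 9, 10, 8, 8, 9, 7, 8, 7, 6],
})

# Build exact datetimes from integer milliseconds since the epoch.
ms = (df["timestamps(s)"] * 1000).round().astype("int64")
df["t"] = pd.to_datetime(ms, unit="ms")

# Group into 300 ms bins and average Bytes in each bin.
g = df.groupby(pd.Grouper(key="t", freq="300ms"))["Bytes"].mean().reset_index()

# Convert the bin labels back to seconds as floats.
g["timestamps(s)"] = (g["t"] - pd.Timestamp(0)).dt.total_seconds()
```

The frequency is written as `300ms` instead of `0.3S` only to keep the bin width an integer; the resulting bins are the same four windows.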

Answer 2

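Well, here is a possible solution that does not rely on pandas and produces the two index lists, assuming your data can be accessed as a two-dimensional array whose first dimension is the rows: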

win_start_ind = []
win_end_ind = []
last_nonzerobyte_idx = first_ts = None
for i, ts, byt in data:  # (1)
    if not byt:
        continue
    if first_ts is None:
        first_ts = ts
    # Round before dividing: in floating point 0.7 - 0.4 is slightly less than 0.3.
    win_num = int(round((ts - first_ts) * 10, 6) // 3)  # (2)
    if win_num >= 1 or not win_start_ind:
        if win_start_ind:
            win_end_ind.append(last_nonzerobyte_idx)
        win_start_ind.append(i)
        first_ts = ts
    last_nonzerobyte_idx = i
win_end_ind.append(last_nonzerobyte_idx)

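(1) This line simply iterates over your array and assigns the contents of each row to variables; you will have to adapt it to your case. You could also loop over the array assigning the full row to a single variable, and then extract the data you need into separate variables on the next line. See (dataframe docs - N-Dimensional arrays - Indexing in NumPy) to tailor this code to your needs.
(2) This line tells us when a new time window starts. If the result is 0 we are still in the same time window; if it is 1 or more, it is time to:
1. append the index of the last row with non-zero bytes to win_end_ind
2. append the current index to win_start_ind
3. set first_ts to the current timestamp, so that ts-first_ts gives the time elapsed since the start of this time window.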
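The steps above can be sketched as a small self-contained function run on the sample data from the question. The function name `window_indices`, the `width_ms` parameter, and the use of integer milliseconds are assumptions of this sketch, chosen so that the window boundary test does not depend on floating-point subtraction.

```python
def window_indices(rows, width_ms=300):
    """Return (win_start_ind, win_end_ind) for rows of (index, seconds, bytes)."""
    win_start_ind, win_end_ind = [], []
    first_ms = last_idx = None
    for i, ts, byt in rows:
        if not byt:            # skip rows with zero bytes, as the loop above does
            continue
        ms = round(ts * 1000)  # integer milliseconds avoid float rounding issues
        if first_ms is None or ms - first_ms >= width_ms:
            if win_start_ind:
                win_end_ind.append(last_idx)  # close the previous window
            win_start_ind.append(i)           # open a new window here
            first_ms = ms
        last_idx = i
    if last_idx is not None:
        win_end_ind.append(last_idx)          # close the final window
    return win_start_ind, win_end_ind


data = [(0, 0.0, 0), (1, 0.1, 9), (2, 0.2, 10), (3, 0.3, 8), (4, 0.4, 8),
        (5, 0.5, 9), (6, 0.6, 7), (7, 0.7, 8), (8, 0.8, 7), (9, 0.9, 6)]
starts, ends = window_indices(data)

# Mean of Bytes per window; this slicing assumes each row's index equals
# its position in the list.
means = [sum(row[2] for row in data[s:e + 1]) / (e - s + 1)
         for s, e in zip(starts, ends)]
```

On this data `starts` and `ends` match the `win_start_ind` and `win_end_ind` lists requested in the question.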

Answer 3

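I found an answer to my question using pandas built-in functions, as follows:

As I mentioned, I wanted to divide the data into windows (or bins) of fixed duration. Note that I have only tested this with Unix timestamps. (The timestamp values in my question above were made up for simplicity.)

The solution was copied from Link and is as follows: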

import pandas as pd
import datetime
import numpy as np

# Create an empty dataframe
df = pd.DataFrame()

# Create a column from the timestamps series
df['timestamps'] = timestamps

# Convert that column into a datetime datatype
df['timestamps'] = pd.to_datetime(df['timestamps'])

# Set the datetime column as the index
df.index = df['timestamps']

# Create a column from the numeric Bytes series
df['Bytes'] = Bytes


# Now for my original data
# Downsample the series into 30S bins and sum the values of the Bytes
# falling into a bin.

window = df.Bytes.resample('30S').sum()

My output:

1970-01-01 00:00:00    10815752
1970-01-01 00:00:30     6159960
1970-01-01 00:01:00       40270
1970-01-01 00:01:30       44196
1970-01-01 00:02:00       48084
1970-01-01 00:02:30       47147
1970-01-01 00:03:00       45279
1970-01-01 00:03:30       40574

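In the output:

First column ==> the 30-second time window
Second column ==> the sum of all Bytes in that 30-second bin

You can also try other aggregation options such as mean, last, etc. For details, please read the Documentation.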
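As a minimal runnable sketch of the resample approach, the example below uses made-up Unix timestamps and byte counts (both hypothetical, not the data behind the output above). It assumes the timestamps are numbers of seconds since the epoch, which is why `unit="s"` is passed to pd.to_datetime.

```python
import pandas as pd

# Hypothetical Unix timestamps (seconds since the epoch) and byte counts.
timestamps = [0, 10, 29, 30, 45, 61]
Bytes = [5, 7, 9, 4, 6, 8]

# unit="s" tells pandas the numbers are seconds rather than nanoseconds.
index = pd.to_datetime(timestamps, unit="s")
series = pd.Series(Bytes, index=index, name="Bytes")

# Aggregate into 30-second bins; swap .sum() for .mean(), .last(), etc.
window_sum = series.resample("30s").sum()
window_mean = series.resample("30s").mean()
```

Here the three bins start at 0, 30 and 60 seconds after the epoch, and `window_mean` gives the per-window average that the question asks for.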
