Processing time-series data in Python: summing series and aggregating over a time period

Asked: 2019-05-17 16:11:31

Tags: python pandas time-series

I am trying to figure out how to visualize some sensor data. I collect data every five minutes for multiple devices, stored in a JSON structure that looks like this (note that I have no control over the data structure):

[
  {
    "group": { "id": "01234" },
    "measures": {
      "measures": {
        "...device 1 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 1],
              ["2019-04-17T14:35:00+00:00", 300, 2],
              ...
            ]
          }
        },
        "...device 2 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 0],
              ["2019-04-17T14:35:00+00:00", 300, 1],
              ...
            ]
          }
        }
      }
    }
  }
]

Each tuple of the form ["2019-04-17T14:30:00+00:00", 300, 0] is [timestamp, granularity, value]. Devices are grouped by project id. Within any given group, I want to take the data for multiple devices and add them together. For the sample data above, for example, I want the final series to look like this:

["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 3],

The series are not necessarily all the same length.

Finally, I want to aggregate these measures into hourly samples (for the sample above, both rows would collapse into a single 14:00 bucket with value 4).

I can get a single series like this:

import pandas as pd

# metric and aggregate select which series to read; for the sample
# data above these would be:
metric = 'metric.name.here'
aggregate = 'mean'

with open('data.json') as fd:
    data = pd.read_json(fd)

for i, group in enumerate(data.group):
    project = group['id']
    instances = data.measures[i]['measures']
    series_for_group = []
    for instance in instances.keys():
        measures = instances[instance][metric][aggregate]

        # build an index from the timestamps
        index = pd.DatetimeIndex([measure[0] for measure in measures])

        # extract values from the data and link them to the index
        series = pd.Series([measure[2] for measure in measures],
                           index=index)

        series_for_group.append(series)

At the bottom of the outer for loop I have an array of pandas.core.series.Series objects representing the different sets of measures associated with the current group. I had hoped I could simply add them together, as in total = sum(series_for_group), but that produces invalid data.
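
(I suspect index alignment is the culprit: when pandas adds series, it aligns them on their indexes, and any label present in one series but missing from another comes back as NaN. A minimal illustration with made-up values:)

import pandas as pd

a = pd.Series([1, 2], index=pd.to_datetime(['2019-04-17T14:30:00+00:00',
                                            '2019-04-17T14:35:00+00:00']))
b = pd.Series([1], index=pd.to_datetime(['2019-04-17T14:30:00+00:00']))

print(a + b)
# 2019-04-17 14:30:00+00:00    2.0
# 2019-04-17 14:35:00+00:00    NaN   <- label missing from b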

  1. Am I even reading this data correctly? This is my first time working with pandas, and I'm not sure whether (a) creating an index and then (b) populating the data is the right procedure here.

  2. How would I successfully add these series together?

  3. How do I resample this data into 1-hour intervals? Looking at this question, it seems that the .groupby and .agg methods may be of interest, but it isn't clear from that example how to specify the interval size.

Update 1

Maybe I can use concat plus groupby? E.g.:

final = pd.concat(all_series).groupby(level=0).sum()
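
(A quick sanity check on made-up series of different lengths seems to produce exactly the totals I want:)

all_series = [
    pd.Series([1, 2], index=pd.to_datetime(['2019-04-17T14:30:00+00:00',
                                            '2019-04-17T14:35:00+00:00'])),
    pd.Series([1], index=pd.to_datetime(['2019-04-17T14:35:00+00:00'])),
]
pd.concat(all_series).groupby(level=0).sum()
# 2019-04-17 14:30:00+00:00    1
# 2019-04-17 14:35:00+00:00    3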

3 Answers:

Answer 0 (score: 1):

I suggested doing the following in the comments:

# accumulate every (device, metric, aggregate) series into one flat table
result = pd.DataFrame({}, columns=['timestamp', 'granularity', 'value',
                                   'project', 'uuid', 'metric', 'agg'])

for i, group in enumerate(data.group):
    project = group['id']
    instances = data.measures[i]['measures']

    for device, measures in instances.items():
        for metric, aggs in measures.items():
            for agg, lst in aggs.items():
                sub_df = pd.DataFrame(lst, columns=['timestamp', 'granularity', 'value'])
                sub_df['project'] = project
                sub_df['uuid'] = device
                sub_df['metric'] = metric
                sub_df['agg'] = agg

                result = pd.concat((result, sub_df), sort=True)

# parse date:
result['timestamp'] = pd.to_datetime(result['timestamp'])

The resulting data looks like this:

   agg   granularity  metric            project  timestamp            uuid                 value
0  mean  300          metric.name.here  01234    2019-04-17 14:30:00  ...device 1 uuid...  1
1  mean  300          metric.name.here  01234    2019-04-17 14:35:00  ...device 1 uuid...  2
0  mean  300          metric.name.here  01234    2019-04-17 14:30:00  ...device 2 uuid...  0
1  mean  300          metric.name.here  01234    2019-04-17 14:35:00  ...device 2 uuid...  1

Then you can do an overall aggregation:

result.resample('H', on='timestamp').value.sum()

which gives:

timestamp
2019-04-17 14:00:00    4
Freq: H, Name: value, dtype: int64

or a grouped aggregation:

result.groupby('uuid').resample('H', on='timestamp').value.sum()

which gives:

uuid                 timestamp          
...device 1 uuid...  2019-04-17 14:00:00    3
...device 2 uuid...  2019-04-17 14:00:00    1
Name: value, dtype: int64
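
If you prefer a single groupby, the same per-device hourly sum can also be written with pd.Grouper (a sketch against the same result frame as above):

result.groupby(['uuid', pd.Grouper(key='timestamp', freq='H')]).value.sum()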

Answer 1 (score: 0):

To construct a dataframe (df) from series of different lengths (say s1, s2, s3), you can try:

df = pd.concat([s1, s2, s3], ignore_index=True, axis=1).fillna(0)  # fill gaps with 0 so the sums below stay numeric

Once the dataframe is built:

  1. Make sure all dates are stored as timestamp objects:

    df['Date'] = pd.to_datetime(df['Date'])

Then add another column to extract the hour from the date column:

df['Hour'] = df['Date'].dt.hour

Then group by hour and sum the values:

df.groupby('Hour').sum()
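
Note that dt.hour pools the same hour of day across all dates, so data spanning more than one day collapses into at most 24 buckets. If each calendar hour should stay separate, one option is to group on a truncated timestamp instead:

df.groupby(df['Date'].dt.floor('H')).sum()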

Answer 2 (score: 0):

Building on the code in the question, I eventually arrived at a working solution. On my system this takes about 6 seconds to process roughly 85MB of input data. For comparison, I killed Quang's code after 5 minutes.

I don't know whether this is the right way to process this data, but it produces apparently correct results. I noticed that building up a list of series and then making a single pd.concat call, as this solution does, is substantially more performant than calling pd.concat inside the loop, since the in-loop version re-copies all previously accumulated data on every iteration.

#!/usr/bin/python3

import click
import matplotlib.pyplot as plt
import pandas as pd


@click.command()
@click.option('-a', '--aggregate', default='mean')
@click.option('-p', '--projects')
@click.option('-r', '--resample')
@click.option('-o', '--output')
@click.argument('metric')
@click.argument('datafile', type=click.File(mode='rb'))
def plot_metric(aggregate, projects, output, resample, metric, datafile):

    # Read in a list of project id -> project name mappings, then
    # convert it to a dictionary.
    if projects:
        _projects = pd.read_json(projects)
        projects = {_projects.ID[n]: _projects.Name[n].lstrip('_')
                    for n in range(len(_projects))}
    else:
        projects = {}

    data = pd.read_json(datafile)
    df = pd.DataFrame()

    for i, group in enumerate(data.group):
        project = group['project_id']
        project = projects.get(project, project)

        devices = data.measures[i]['measures']
        all_series = []
        for device, measures in devices.items():
            samples = measures[metric][aggregate]
            index = pd.DatetimeIndex(sample[0] for sample in samples)
            series = pd.Series((sample[2] for sample in samples),
                               index=index)
            all_series.append(series)

        # concatenate all the measurements for this project, then
        # group them using the timestamp and sum the values.
        final = pd.concat(all_series).groupby(level=0).sum()

        # resample the data if requested
        if resample:
            final = final.resample(resample).sum()

        # add series to dataframe
        df[project] = final

    fig, ax = plt.subplots()
    df.plot(ax=ax, figsize=(11, 8.5))
    ax.legend(frameon=False, loc='upper right', ncol=3)

    if output:
        plt.savefig(output)
        plt.close()
    else:
        plt.show()


if __name__ == '__main__':
    plot_metric()
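
For reference, a typical invocation might look like this (file names here are hypothetical; -r accepts any pandas resample rule):

./plot_metric.py -r 1H -o out.png metric.name.here data.json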