Question

I have CPU utilization and other metric data for my AWS EC2 instances, provided to me in the following CSV format:

Date,Time,CPU_Utilization,Unit
2016-10-17,09:25:00,22.5,Percent
2016-10-17,09:30:00,6.534,Percent
2016-10-17,09:35:00,19.256,Percent
2016-10-17,09:40:00,43.032,Percent
2016-10-17,09:45:00,58.954,Percent
2016-10-17,09:50:00,56.628,Percent
2016-10-17,09:55:00,25.866,Percent
2016-10-17,10:00:00,17.742,Percent
2016-10-17,10:05:00,34.22,Percent
2016-10-17,10:10:00,26.07,Percent
2016-10-17,10:15:00,20.066,Percent
2016-10-17,10:20:00,15.466,Percent
2016-10-17,10:25:00,16.2,Percent
2016-10-17,10:30:00,14.27,Percent
2016-10-17,10:35:00,5.666,Percent
2016-10-17,10:40:00,4.534,Percent
2016-10-17,10:45:00,4.6,Percent
2016-10-17,10:50:00,4.266,Percent
2016-10-17,10:55:00,4.2,Percent
2016-10-17,11:00:00,4.334,Percent
2016-10-17,11:05:00,4.334,Percent
2016-10-17,11:10:00,4.532,Percent
2016-10-17,11:15:00,4.266,Percent
2016-10-17,11:20:00,4.266,Percent
2016-10-17,11:25:00,4.334,Percent

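As you can see, a value is reported every 5 minutes. I don't have access to aws-cli. I need to process this data and report the average utilization over every 15 minutes for visualization. That is, for each hour, I need the average of the values in the first 15 minutes, then in the next 15 minutes, and so on, so I will report 4 values per hour.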

Sample output would be:

Date,Time,CPU_Utilization,Unit
2016-10-17,09:30:00,14.517,Percent
2016-10-17,09:45:00,40.414,Percent
2016-10-17,10:00:00,33.412,Percent
2016-10-17,10:15:00,26.785,Percent
...

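One approach is to read the whole file (which has more than 10,000 lines), then for each date find the values that belong to one 15-minute window, compute their average, and repeat for all values. This does not seem like the best or most efficient approach. Is there a better way? Thanks.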

Answer 1

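Since your input data is actually quite small, I suggest reading it in all at once with np.genfromtxt. You can then find the appropriate range by checking where the first full quarter hour is reached, and determine the end by counting how many complete quarter hours remain. After that, use np.reshape to turn the array into a form with one row per quarter hour, and average over each row: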

import numpy as np

# Read in the data:
data = np.genfromtxt("data.dat", skip_header=1,
                     dtype=[("date", "|S10"),
                            ("time", "|S8"),
                            ("cpu_usage", "f8")],
                     delimiter=',', usecols=(0, 1, 2))

# Find the first full quarter:
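firstQuarterHour = 0
while int(data[firstQuarterHour]["time"][3:5]) % 15 != 0:
    firstQuarterHour += 1
# Integer division: only complete groups of three 5-minute samples are kept
# (a plain "/" would give a float under Python 3 and break the reshape).
noOfQuarterHours = data[firstQuarterHour:].shape[0] // 3

# Create a reshaped array with one row of three samples per quarter hour.
# The slice must hold exactly 3*noOfQuarterHours entries for reshape to work.
reshaped = data[firstQuarterHour:firstQuarterHour + 3*noOfQuarterHours].reshape(
    (noOfQuarterHours, 3))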

# Average over cpu_usage and take the appropriate dates and times:
cpu_usage = reshaped["cpu_usage"].mean(axis=1)
dates = reshaped["date"][:, 0]
times = reshaped["time"][:, 0]

You can now use these arrays, for example to save them to another text file with np.savetxt.
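As a rough sketch of that last step, the three arrays could be stacked into one table of strings and written with np.savetxt. The helper name write_quarter_hour_csv, the three-decimal formatting, and the constant "Percent" column are assumptions made for illustration; they are not part of the code above. The sample values in the usage part are what the grouping above would give for the first two quarter hours of the question's data (09:30 to 09:40 and 09:45 to 09:55).

```python
import io

import numpy as np


def write_quarter_hour_csv(dates, times, cpu_usage, out):
    """Write one CSV row per quarter hour to `out` (a file name or file object).

    Hypothetical helper: `dates` and `times` may be byte strings (as read
    with the "|S10" / "|S8" dtypes) or ordinary strings.
    """
    rows = np.column_stack([
        np.asarray(dates).astype(str),           # bytes -> str if needed
        np.asarray(times).astype(str),
        ["%.3f" % value for value in cpu_usage],  # assumed precision
        np.full(len(dates), "Percent"),           # assumed constant unit
    ])
    # comments="" stops savetxt from prefixing the header line with "# ".
    np.savetxt(out, rows, fmt="%s", delimiter=",",
               header="Date,Time,CPU_Utilization,Unit", comments="")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_quarter_hour_csv([b"2016-10-17", b"2016-10-17"],
                           [b"09:30:00", b"09:45:00"],
                           [22.9407, 47.1493],
                           buffer)
    print(buffer.getvalue())
```

Passing a file name such as "quarters.csv" instead of the StringIO buffer would write the same rows to disk.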
