Using pandas Series.rolling with DateOffset

Date: 2017-04-03 03:24:33

Tags: python-3.x pandas analysis

Python, pandas, data analysis.

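What I'm trying to do is find the busiest 60-minute interval in a large set of Apache server logs. I have extracted the timestamps from the logs into a list.

time_received is a list containing values like: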
[
1995-07-01T00:01:18-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:46-04:00,
1995-07-01T00:13:47-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:50-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:14:11-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:18-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:23-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:26-04:00,
1995-07-01T00:14:27-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:31-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:36-04:00,
]

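My goal is to walk along this list of timestamps and count the requests in the 60-minute interval starting at any of these points. Once I get the rolling window working, I think I can handle the rest.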

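In the pandas documentation at

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html I found the following entry for the window parameter: "window : int, or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If it's an offset, then this will be the time period of each window. Each window will be variable sized based on the observations included in the time period. This is only valid for datetimelike indexes. This is new in 0.19.0."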
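To make the quoted distinction concrete, here is a minimal sketch contrasting an integer window with a time-based one. The four timestamps and the 2-second period are made up for illustration, and the time period is written as the string "2s", which is an assumption about how to spell the offset rather than something the quoted entry states.

```python
import pandas as pd

# Hypothetical, unevenly spaced timestamps used as the index.
index = pd.to_datetime([
    "1995-07-01 00:00:00",
    "1995-07-01 00:00:01",
    "1995-07-01 00:00:05",
    "1995-07-01 00:00:06",
])
s = pd.Series([1, 1, 1, 1], index=index)

# Integer window: always the last 2 observations, however far apart they are.
fixed = s.rolling(2).sum()

# Time-based window: every observation in the trailing 2 seconds,
# so the number of rows per window varies.
timed = s.rolling("2s").sum()

print(fixed)
print(timed)
```

With the integer window each result covers exactly two rows; with the time-based window the row at 00:00:05 stands alone because nothing else falls in the 2 seconds before it.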

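I am using pandas 0.19.2. The option of variable-sized windows based on the observations within a time period sounds like exactly what I want, so I tried to implement it: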

import pandas as pd
from pandas.tseries.offsets import DateOffset
def busiest_timeframe(data,timeframe = 60):    
    time_window = DateOffset(minutes = 60)
    print (type(time_window))
    series = pd.Series(data)
    series.rolling(time_window).count()
    return series  

busiest_tf = busiest_timeframe(time_received)    

I get the following error: raise ValueError("window must be an integer")

ValueError: window must be an integer

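Am I supposed to use some other offset object? Does this pandas feature not work? Am I misreading the documentation?

Thanks in advance for your help and suggestions!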

2 answers:

Answer 0: (score: 0)

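Sadly, I don't know how to use series.rolling either. It looks like you didn't set the timestamps as the index, which is why it didn't work. But even after doing that I still got errors, so here is an alternative (perhaps a very ugly one). If someone else has a better way, you'd be better off listening to them.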

So yes, it uses boolean indexing. Play with the code (add lots of print statements) and maybe change >= and <= to > and < if you like.

liste=[
"1995-07-01T00:01:18-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:46-04:00",
"1995-07-01T00:13:47-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:50-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:14:11-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:18-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:23-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:26-04:00",
"1995-07-01T00:14:27-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:31-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:36-04:00"
]
import pandas as pd

def busiest_timeframe(data, timeframe=1):
    # Parse the ISO 8601 strings (UTC offset included) into timestamps.
    series = pd.to_datetime(pd.Series(data))
    df = series.to_frame(name="time")
    window = pd.Timedelta(seconds=timeframe)  # change seconds to minutes or whatever you want
    # For each timestamp x, count the rows that fall inside [x, x + window].
    df["count"] = [((df["time"] >= x) & (df["time"] <= x + window)).sum() for x in df["time"]]
    highest_index = df["count"].idxmax()
    start = df.loc[highest_index, "time"]  # .loc replaces .ix, which was removed in pandas 1.0
    # Return the rows inside the busiest window.
    df2 = df[(df["time"] >= start) & (df["time"] <= start + window)]
    return df2

print(busiest_timeframe(liste))
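
The boolean-indexing approach above rescans the whole frame for every timestamp, which may get slow on a large log. As a possible alternative (a different technique, not the one this answer uses), the same forward-looking counts can be sketched with a binary search over the sorted timestamps via numpy.searchsorted. The function name busiest_window_start and its minutes parameter are hypothetical, and normalising everything to UTC is an assumption made here to keep the comparison simple.

```python
import numpy as np
import pandas as pd

def busiest_window_start(timestamps, minutes=60):
    """Return (start, count) for the busiest window [start, start + minutes]."""
    # Parse the strings, normalise to UTC, and sort chronologically.
    times = pd.to_datetime(pd.Series(timestamps), utc=True).sort_values()
    # Drop the timezone to get a plain datetime64 array (values stay in UTC).
    values = times.dt.tz_localize(None).to_numpy()
    width = np.timedelta64(minutes, "m")
    # First position of each timestamp, so duplicates share the same start.
    first = np.searchsorted(values, values, side="left")
    # One past the last position whose timestamp is <= start + width.
    last = np.searchsorted(values, values + width, side="right")
    counts = last - first
    best = int(counts.argmax())
    return times.iloc[best], int(counts[best])

sample = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
]
start, count = busiest_window_start(sample, minutes=1)
print(start, count)
```

Both ends of the window are inclusive here, matching the >= and <= comparisons in the code above; the returned start is expressed in UTC.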

Answer 1: (score: 0)

Try using an offset alias instead of DateOffset:

Example from the docs:

import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index = [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:00:02'),
                           pd.Timestamp('20130101 09:00:03'),
                           pd.Timestamp('20130101 09:00:05'),
                           pd.Timestamp('20130101 09:00:06')])

print(df.rolling('2s').count())

Output:

                       B
2013-01-01 09:00:00  1.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  1.0
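
Applied to the timestamp list from the question, the same idea might look like the sketch below. It assumes the timestamps are ISO 8601 strings, puts them in a DatetimeIndex (time-based windows only work on a datetime-like index), and uses the alias "60min". The reworked busiest_timeframe signature and the column of ones are illustrative choices, not part of the docs example. Note that each window here ends at a timestamp and looks back 60 minutes, rather than starting at it.

```python
import pandas as pd

def busiest_timeframe(timestamps, window="60min"):
    """Return (window_end, hits) for the busiest trailing window."""
    # The timestamps must be the index for a time-based window to be accepted.
    index = pd.DatetimeIndex(pd.to_datetime(timestamps, utc=True)).sort_values()
    # One row per request, each worth 1, so a rolling sum counts requests.
    hits = pd.Series(1, index=index)
    per_window = hits.rolling(window).sum()
    # Use a positional argmax because the index may contain duplicate timestamps.
    best = int(per_window.to_numpy().argmax())
    return index[best], int(per_window.iloc[best])

sample = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T02:00:00-04:00",
]
window_end, hits = busiest_timeframe(sample)
print(window_end, hits)
```

The returned timestamp is in UTC and marks the end of the busiest window; subtracting pd.Timedelta(window) from it gives the approximate start.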