Convert an irregular time series to hourly data in Python with a normal distribution

Time: 2018-07-18 00:49:26

Tags: python python-3.x pandas datetime

I have a dataframe like the following:

Date          Time          Entry     Exist
2013-01-07    05:00:00       29.0      12.0
2013-01-07    10:00:00       98.0      83.0
2013-01-07    15:00:00      404.0     131.0
2013-01-07    20:00:00     2340.0     229.0
2013-01-08    05:00:00     3443.0     629.0
2013-01-08    10:00:00     6713.0    1629.0
2013-01-08    15:00:00     9547.0    2965.0
2013-01-08    20:00:00    10440.0    4589.0

I want to convert and normalize it so that it shows the hourly consumption over a period of time.

DateTime                 Entry    Exist
2013-01-07 00:00:00        2.0      1.0
2013-01-07 01:00:00        9.0      4.0
2013-01-07 02:00:00       16.0      6.0
2013-01-07 03:00:00       23.0      9.0
2013-01-07 04:00:00       26.0     10.0
2013-01-07 05:00:00       29.0     12.0
2013-01-07 06:00:00       37.0     19.0
2013-01-07 07:00:00       56.0     32.0
2013-01-07 08:00:00       62.0     57.0
2013-01-07 09:00:00       77.0     63.0
2013-01-07 10:00:00       98.0     83.0
2013-01-07 11:00:00      104.0     95.0
.......

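I would like to first combine Date and Time into a single DateTime column and then produce the result above.

I'm new to Python, so any help would be much appreciated. Thanks.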

1 Answer:

Answer 0: (score: 0)

The quick answer is that you can use

DataFrame.resample().mean().interpolate()

for at least the interpolation part of your post.

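Note that your post also involves "out of range" extrapolation, because you are predicting outside the range of the input data. That is, the time series starts at 5:00 AM on 1/7, but your oversampled data starts 5 hours earlier. Interpolation is strictly an in-domain method, but I suspect extrapolation is what you want.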
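To make the in-domain point concrete, here is a minimal sketch. The names `s`, `hourly`, `filled` and `padded` are mine, and only the first two rows of the sample are used. It puts the data on an hourly grid that starts at midnight and shows that plain linear interpolation leaves the five hours before the first observation empty. Passing `limit_direction="both"` only repeats the first known value, which is padding rather than real extrapolation.

```python
import pandas as pd

# First two rows of the question's sample, indexed by timestamp.
s = pd.Series(
    [29.0, 98.0],
    index=pd.to_datetime(["2013-01-07 05:00:00", "2013-01-07 10:00:00"]),
    name="Entry",
)

# Hourly grid starting at midnight, five hours before the first observation.
hourly = pd.date_range("2013-01-07 00:00:00", "2013-01-07 10:00:00", freq="h")

# Default behaviour: only gaps after the first known value are filled.
filled = s.reindex(hourly).interpolate("linear")

# limit_direction="both" also fills the leading gap, but with the first
# known value (a constant), not with an extrapolated trend.
padded = s.reindex(hourly).interpolate("linear", limit_direction="both")

print(pd.DataFrame({"filled": filled, "padded": padded}))
```

If the hours before 05:00 really need a trend, that has to come from a model of your choosing (for example a fitted line or curve), since the data itself says nothing about them.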

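Here is the interpolation step.

First, it would help if you could post a self-contained example with code that either generates the test data or provides some way to reproduce it.

With reference to these two excellent posts: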

Combine Date and Time columns using python pandas

How to create a Pandas DataFrame from a string

Here is how I would do it:

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

# copied and pasted from your post :)
data = StringIO("""
Date             Time         Entry       Exist
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

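# Read the whitespace-separated text, then combine the separate Date and Time
# strings into a single datetime column (the "after the fact" approach from the
# first link). sep=r"\s+" replaces delim_whitespace=True, and the explicit
# pd.to_datetime call replaces the dict form of parse_dates; both older options
# are deprecated in recent pandas releases.
df = pd.read_csv(data, sep=r"\s+")
df["date_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])
df = df[["date_time", "Entry", "Exist"]]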

Now make the data a time series, resample it, apply a function (the mean in this case), and interpolate both data columns at the same time.

# 'h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas releases
df_rs = df.set_index('date_time').resample('h').mean().interpolate('linear')
df_rs

It looks like this:

[image: the hourly resampled and interpolated table]
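To see what each step in that chain contributes, here is a minimal sketch on the first three rows of the sample. The names `raw`, `binned` and `filled` are mine. Resampling creates one bin per hour, the mean of an empty bin is NaN, and the interpolation then fills those NaN rows.

```python
import pandas as pd

# First three rows of the question's sample, indexed by timestamp.
raw = pd.DataFrame(
    {"Entry": [29.0, 98.0, 404.0], "Exist": [12.0, 83.0, 131.0]},
    index=pd.to_datetime(
        ["2013-01-07 05:00:00", "2013-01-07 10:00:00", "2013-01-07 15:00:00"]
    ),
)

# Step 1: one bin per hour; hours without an observation come out as NaN.
binned = raw.resample("h").mean()

# Step 2: fill the NaN rows on a straight line between the known points.
filled = binned.interpolate("linear")

print(binned.head(6))
print(filled.head(6))
```

The mean matters only when several observations fall into the same hour. With at most one reading per hour, as here, it just passes the value through.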

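These values do not look exactly like the ones in your post, but it is not clear which kind of interpolation you used. Linear? Cubic? Something else?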
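As a rough check of how far apart the two are, this sketch compares linear interpolation between the 05:00 and 10:00 readings with the Entry values listed in the question for the same hours. The `expected` series is copied from the question's desired output; the other names are mine.

```python
import pandas as pd

# The two known Entry readings that bracket the interval.
known = pd.Series(
    [29.0, 98.0],
    index=pd.to_datetime(["2013-01-07 05:00:00", "2013-01-07 10:00:00"]),
)
linear = known.resample("h").mean().interpolate("linear")

# Hourly Entry values the question lists for 05:00 through 10:00.
expected = pd.Series([29.0, 37.0, 56.0, 62.0, 77.0, 98.0], index=linear.index)

comparison = pd.DataFrame({"linear": linear, "question": expected})
comparison["diff"] = comparison["linear"] - comparison["question"]
print(comparison)
```

The two columns agree at the known points and differ in between, so the question's values were not produced by straight-line interpolation between those two readings.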

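So, for fun, let's plot the data with bokeh. The big red dots are the original data, and the blue dots (and the line connecting them) are the interpolated data.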

output_notebook()

p = figure(x_axis_type="datetime", width=800, height=500)

# df_rs was built with linear interpolation, so label the plot accordingly
p.title.text = "Entry vs. Date Time (linear interpolated to 1H)"
p.xaxis.axis_label = 'Date Time (linear interpolated to 1H)'
p.yaxis.axis_label = 'Entry'

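# orig data (scatter draws circle markers by default; p.circle with a size
# argument is deprecated in recent Bokeh releases)
p.scatter(df['date_time'], df['Entry'], color='red', size=10)

# oversampled data
p.scatter(df_rs.index, df_rs['Entry'])
p.line(df_rs.index, df_rs['Entry'])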

show(p)

It looks like this:

[image: Entry vs. date time, linear interpolation to 1H]

Or, with cubic interpolation, you get a smoother curve:

[image: Entry vs. date time, cubic interpolation to 1H]
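Switching between the two is a one-word change in the interpolate call. A minimal sketch, assuming SciPy is installed (pandas hands `method="cubic"` to `scipy.interpolate.interp1d`); the frame is rebuilt here so the snippet runs on its own, and the names `hourly`, `linear` and `cubic` are mine:

```python
import pandas as pd

# The question's sample, with Date and Time already combined into timestamps.
df = pd.DataFrame(
    {
        "Entry": [29.0, 98.0, 404.0, 2340.0, 3443.0, 6713.0, 9547.0, 10440.0],
        "Exist": [12.0, 83.0, 131.0, 229.0, 629.0, 1629.0, 2965.0, 4589.0],
    },
    index=pd.to_datetime(
        [
            "2013-01-07 05:00:00", "2013-01-07 10:00:00",
            "2013-01-07 15:00:00", "2013-01-07 20:00:00",
            "2013-01-08 05:00:00", "2013-01-08 10:00:00",
            "2013-01-08 15:00:00", "2013-01-08 20:00:00",
        ]
    ),
)

hourly = df.resample("h").mean()

linear = hourly.interpolate("linear")
# method="cubic" requires SciPy and uses the timestamps as x-values.
cubic = hourly.interpolate("cubic")

# Largest gap between the two methods, per column.
print((cubic - linear).abs().max())
```

Both results pass through the original readings. A cubic curve can overshoot between points, so it is worth checking that a counter that should only increase does not dip.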

Full code

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

output_notebook()

# copied and pasted from your post :)
data = StringIO("""
Date            Time        ENTRIES       EXITS
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

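# read in the data, then combine the separate date and time strings into a single datetime column.
# sep=r"\s+" and pd.to_datetime replace delim_whitespace=True and the dict form of parse_dates,
# both of which are deprecated in recent pandas releases.
original_data = pd.read_csv(data, sep=r"\s+")
original_data["DATETIME"] = pd.to_datetime(original_data["Date"] + " " + original_data["Time"])
original_data = original_data.drop(columns=["Date", "Time"])

# make it a time series, resample to a higher freq, apply mean, interpolate and round
# ('h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas releases)
inter_data = original_data.set_index(['DATETIME']).resample('h').mean().interpolate('linear').round(1) 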

# No need to drop the index to select a slice.  You can slice on the index
# I see you are starting at 1/1 (jan 1st),  yet your data starts at 1/7 (Jan 7th?)
inter_data[inter_data.index >= '2013-01-01 00:00:00'].head(20) 
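Slicing on the index can also be written with `.loc` and date strings, which tends to read more clearly than a boolean mask. A minimal sketch follows; the frame is a stand-in with placeholder values, not the interpolated results, and the names `morning` and `second_day` are mine:

```python
import pandas as pd

# Stand-in for inter_data: same hourly index, placeholder values.
idx = pd.date_range("2013-01-07 05:00:00", "2013-01-08 20:00:00", freq="h")
inter_data = pd.DataFrame({"ENTRIES": range(len(idx))}, index=idx)

# Label slicing with strings on a DatetimeIndex includes both endpoints.
morning = inter_data.loc["2013-01-07 05:00:00":"2013-01-07 10:00:00"]

# A partial date string selects every row that falls on that day.
second_day = inter_data.loc["2013-01-08"]

print(len(morning), len(second_day))
```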

[image: the first 20 rows of inter_data]