I have a dataframe like this:
Date Time Entry Exist
2013-01-07 05:00:00 29.0 12.0
2013-01-07 10:00:00 98.0 83.0
2013-01-07 15:00:00 404.0 131.0
2013-01-07 20:00:00 2340.0 229.0
2013-01-08 05:00:00 3443.0 629.0
2013-01-08 10:00:00 6713.0 1629.0
2013-01-08 15:00:00 9547.0 2965.0
2013-01-08 20:00:00 10440.0 4589.0
I would like to transform and normalize it so that it shows the hourly consumption over a period of time.
DateTime Entry Exist
2013-01-07 00:00:00 2.0 1.0
2013-01-07 01:00:00 9.0 4.0
2013-01-07 02:00:00 16.0 6.0
2013-01-07 03:00:00 23.0 9.0
2013-01-07 04:00:00 26.0 10.0
2013-01-07 05:00:00 29.0 12.0
2013-01-07 06:00:00 37.0 19.0
2013-01-07 07:00:00 56.0 32.0
2013-01-07 08:00:00 62.0 57.0
2013-01-07 09:00:00 77.0 63.0
2013-01-07 10:00:00 98.0 83.0
2013-01-07 11:00:00 104.0 95.0
.......
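I would like to first combine the date and time into a single DateTime column, and then produce the result above.
I am new to Python, so any help would be greatly appreciated. Thanks.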
Answer 0 (score: 0)
The quick answer is that you can use
DataFrame.resample().mean().interpolate()
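at least for the interpolation part of your post.
Note that your post includes "out of bounds" extrapolation, because you are predicting outside the range of the input data. That is, the time series starts at 5:00 AM on 1/7, but your oversampled data starts 5 hours earlier. Interpolation is an in-domain method only, though I suspect it is what you want.
Here is the interpolation step.
First, it would help if you could post a self-contained example with code that either generates the data for testing or gives some way to reproduce it.
Referring to these two excellent posts: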
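To illustrate the point about extrapolation, here is a minimal sketch. It assumes a hypothetical hourly grid that starts at midnight (five hours before the first sample) and uses only the first day's rows from the question. The names `grid` and `filled` are made up for this illustration.

```python
import pandas as pd

# First-day rows copied from the question.
stamps = pd.to_datetime([
    "2013-01-07 05:00:00", "2013-01-07 10:00:00",
    "2013-01-07 15:00:00", "2013-01-07 20:00:00",
])
df = pd.DataFrame({"Entry": [29.0, 98.0, 404.0, 2340.0],
                   "Exist": [12.0, 83.0, 131.0, 229.0]}, index=stamps)

# Hypothetical target grid: hourly from midnight, 5 hours before the first sample.
# "60min" is used because the spelling of the hourly alias differs across pandas versions.
grid = pd.date_range("2013-01-07 00:00:00", "2013-01-07 20:00:00", freq="60min")

# Put the samples on the grid, then fill the gaps between them.
filled = df.reindex(grid).interpolate("linear")

# Rows before 05:00 have no earlier sample to interpolate from, so they stay NaN.
print(filled.head(7))
```

With the default settings, the hours before the first sample remain empty, which is why the values shown for 00:00 to 04:00 in the question cannot come from interpolation alone.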
Combine Date and Time columns using python pandas
How to create a Pandas DataFrame from a string
Here is how I did it:
import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show
# copied and pasted from your post :)
data = StringIO("""
Date Time Entry Exist
2013-01-07 05:00:00 29.0 12.0
2013-01-07 10:00:00 98.0 83.0
2013-01-07 15:00:00 404.0 131.0
2013-01-07 20:00:00 2340.0 229.0
2013-01-08 05:00:00 3443.0 629.0
2013-01-08 10:00:00 6713.0 1629.0
2013-01-08 15:00:00 9547.0 2965.0
2013-01-08 20:00:00 10440.0 4589.0""")
# read in the data, converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data has separate date and time columns
df = pd.read_csv(data,
parse_dates={"date_time": ['Date', 'Time']},
delim_whitespace=True)
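If the date and time are already sitting in separate columns of a DataFrame, they can also be combined "after the fact". The sketch below is one possible way, not necessarily what the linked posts do. The names `raw` and `frame` are illustrative. It also avoids `delim_whitespace` and the dict form of `parse_dates`, which, as far as I know, newer pandas releases deprecate.

```python
import pandas as pd
from io import StringIO

# Two rows from the question, enough to show the idea.
raw = StringIO("""Date Time Entry Exist
2013-01-07 05:00:00 29.0 12.0
2013-01-07 10:00:00 98.0 83.0
""")

# sep=r"\s+" splits on runs of whitespace.
frame = pd.read_csv(raw, sep=r"\s+")

# Date and Time are read as plain strings; join them and parse once.
frame["date_time"] = pd.to_datetime(frame["Date"] + " " + frame["Time"])

# Drop the now-redundant columns and use the timestamp as the index.
frame = frame.drop(columns=["Date", "Time"]).set_index("date_time")
print(frame)
```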
Now make the data a time series, resample it, apply a function (mean in this case), and interpolate both data columns at the same time.
df_rs = df.set_index('date_time').resample('H').mean().interpolate('linear')
df_rs
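It looks like this:
These values do not look exactly like the ones in your post, but it is not clear which kind of interpolation was used there. Linear, cubic? Something else?
So, for fun, let's plot the data with bokeh. The big red dots are the original data, and the blue dots (and connecting line) are the interpolated data.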
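As a small worked check of what linear interpolation does here, the sketch below uses only the first two `Entry` samples from the question. The variable names are illustrative.

```python
import pandas as pd

stamps = pd.to_datetime(["2013-01-07 05:00:00", "2013-01-07 10:00:00"])
entry = pd.Series([29.0, 98.0], index=stamps, name="Entry")

# Upsampling creates the missing hourly slots; mean() leaves the empty ones as NaN.
hourly = entry.resample("60min").mean()
print(hourly.isna().sum())  # the four in-between hours have no data yet

# Linear interpolation spreads the rise from 29 to 98 evenly: 69 / 5 = 13.8 per hour.
linear = hourly.interpolate("linear")
print(linear)
```

The linear result steps through 42.8, 56.6, 70.4 and 84.2 between the two samples, whereas the question lists 37, 56, 62 and 77 for the same hours, so the expected output was presumably not produced by plain linear interpolation.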
output_notebook()
p = figure(x_axis_type="datetime", width=800, height=500)
p.title.text = "Entry vs. Date Time (linear interpolated to 1H)"
p.xaxis.axis_label = 'Date Time (linear interpolated to 1H)'
p.yaxis.axis_label = 'Entry'
# orig data
p.circle(df['date_time'], df['Entry'], color='red', size=10)
# oversampled data
p.circle(df_rs.index, df_rs['Entry'])
p.line(df_rs.index, df_rs['Entry'])
show(p)
It looks like this:
Or, with cubic interpolation, you get a smoother curve:
Full code
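A sketch of how the cubic variant might be produced, for comparison with the linear one. It assumes SciPy is installed, since pandas delegates `interpolate("cubic")` to it. The names `linear`, `cubic` and `comparison` are illustrative.

```python
import pandas as pd

# Entry column copied from the question.
stamps = pd.to_datetime([
    "2013-01-07 05:00:00", "2013-01-07 10:00:00",
    "2013-01-07 15:00:00", "2013-01-07 20:00:00",
    "2013-01-08 05:00:00", "2013-01-08 10:00:00",
    "2013-01-08 15:00:00", "2013-01-08 20:00:00",
])
entry = pd.Series([29.0, 98.0, 404.0, 2340.0, 3443.0, 6713.0, 9547.0, 10440.0],
                  index=stamps, name="Entry")

# Upsample to an hourly grid; the new slots are NaN until interpolated.
hourly = entry.resample("60min").mean()

linear = hourly.interpolate("linear")
cubic = hourly.interpolate("cubic")  # assumption: SciPy is available

# Side-by-side view of the two methods.
comparison = pd.DataFrame({"linear": linear, "cubic": cubic})
print(comparison.head(12).round(1))
```

Both methods keep the original samples unchanged and only fill the gaps. A cubic curve can overshoot between samples, so it is worth checking the filled values before relying on them.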
import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show
output_notebook()
# copied and pasted from your post :)
data = StringIO("""
Date Time ENTRIES EXITS
2013-01-07 05:00:00 29.0 12.0
2013-01-07 10:00:00 98.0 83.0
2013-01-07 15:00:00 404.0 131.0
2013-01-07 20:00:00 2340.0 229.0
2013-01-08 05:00:00 3443.0 629.0
2013-01-08 10:00:00 6713.0 1629.0
2013-01-08 15:00:00 9547.0 2965.0
2013-01-08 20:00:00 10440.0 4589.0""")
# read in the data, converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data has separate date and time columns
original_data = pd.read_csv(data,
parse_dates={"DATETIME": ['Date', 'Time']},
delim_whitespace=True)
# make it a time series, resample to a higher freq, apply mean, interpolate and round
inter_data = original_data.set_index(['DATETIME']).resample('H').mean().interpolate('linear').round(1)
# No need to drop the index to select a slice. You can slice on the index
# I see you are starting at 1/1 (jan 1st), yet your data starts at 1/7 (Jan 7th?)
inter_data[inter_data.index >= '2013-01-01 00:00:00'].head(20)
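As a side note on slicing by the index, the sketch below shows two other ways to select rows from a frame indexed by timestamps. It uses placeholder values rather than the interpolated data, and the name `demo` is made up for this illustration.

```python
import pandas as pd

# Hourly index covering the same span as the sample data; values are placeholders.
idx = pd.date_range("2013-01-07 05:00:00", "2013-01-08 20:00:00", freq="60min")
demo = pd.DataFrame({"ENTRIES": range(len(idx))}, index=idx)

# Partial-string selection: every row that falls on Jan 7.
first_day = demo.loc["2013-01-07"]

# Label-based slice on the index; both endpoints are included.
morning = demo.loc["2013-01-08 05:00:00":"2013-01-08 10:00:00"]

print(len(first_day), len(morning))
```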