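I'm new to pandas in Python and need to implement the following logic. I know how to write it as a SQL query, but I need to know how to do it in pandas.
The output I get from my query looks like this: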
startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54
I need to split each datetime range into 15-minute intervals, keeping the same value, for example:
startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-26 23:15:00.000,37.86
2019-03-26 23:15:00.000,2019-03-26 23:30:00.000,37.86
2019-03-26 23:30:00.000,2019-03-26 23:45:00.000,37.86
2019-03-26 23:45:00.000,2019-03-27 00:00:00.000,37.86
2019-03-27 00:00:00.000,2019-03-27 00:15:00.000,37.86
2019-03-27 00:15:00.000,2019-03-27 00:30:00.000,37.86
2019-03-27 00:30:00.000,2019-03-27 00:45:00.000,37.86
2019-03-27 00:45:00.000,2019-03-27 01:00:00.000,37.86
Answer 0 (score: 1)
There are many ways to do this; here is just my take.
First, let's recreate the data:
import pandas as pd
df = pd.DataFrame([
    # values are stored as floats so the column is numeric rather than text
    ('2019-03-26 23:00:00.000','2019-03-27 01:00:00.000',37.86),
    ('2019-03-27 01:00:00.000','2019-03-27 03:00:00.000',37.91),
    ('2019-03-27 03:00:00.000','2019-03-27 05:00:00.000',34.54)
], columns=['startdatetime','enddatetime','value'])
df['startdatetime'] = pd.to_datetime(df['startdatetime'])
df['enddatetime'] = pd.to_datetime(df['enddatetime'])
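Intuitively, I would take one of two approaches:
Apply: we split each row into its own group of 15-minute rows. This feels intuitive to me, but apply is usually not very fast.
Join: we create the time intervals and join the values onto them. This is closer to the SQL style, and it is the approach coded below.
Join
We create the range of 15-minute timestamps and join onto it with the flexible merge_asof function. This is not a strict (exact-key) merge: each timestamp is matched to the most recent startdatetime at or before it, which makes joining on ranges possible. It works well for your example; if the real data differs (for example, gaps between intervals), some adjustments may be needed.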
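# 15-minute grid of interval starts; drop the last point so no interval starts at the final end time
starts = pd.date_range(start=df.startdatetime.min(), end=df.enddatetime.max(), freq='15min')[:-1]
df_range = pd.DataFrame({'startdatetime': starts})
# each grid point takes the most recent original row starting at or before it
result = pd.merge_asof(df_range, df, on='startdatetime')
result['enddatetime'] = result['startdatetime'] + pd.Timedelta(minutes=15)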
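The Apply-style approach described above has no code in this answer. The following is only a minimal sketch of how it might look, assuming every interval is a whole multiple of 15 minutes; the names split_row and result_apply are made up for illustration, and a plain loop over rows stands in for DataFrame.apply.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("2019-03-26 23:00:00", "2019-03-27 01:00:00", 37.86),
        ("2019-03-27 01:00:00", "2019-03-27 03:00:00", 37.91),
        ("2019-03-27 03:00:00", "2019-03-27 05:00:00", 34.54),
    ],
    columns=["startdatetime", "enddatetime", "value"],
)
df["startdatetime"] = pd.to_datetime(df["startdatetime"])
df["enddatetime"] = pd.to_datetime(df["enddatetime"])

step = pd.Timedelta(minutes=15)


def split_row(row):
    """Expand one source row into its 15-minute slices (hypothetical helper)."""
    # date_range includes the row's end time, so [:-1] keeps only the slice starts
    starts = pd.date_range(row.startdatetime, row.enddatetime, freq=step)[:-1]
    return pd.DataFrame(
        {"startdatetime": starts, "enddatetime": starts + step, "value": row.value}
    )


# Process the rows one by one and stack the pieces back together
result_apply = pd.concat(
    [split_row(row) for row in df.itertuples(index=False)], ignore_index=True
)
print(result_apply)
```

Because each row is expanded on its own, this version does not care whether the source intervals touch each other, at the cost of a Python-level loop.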
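Answer 1 (score: 0)
Use Index.repeat with the difference between the two datetimes converted to a number of 15-minute steps, then add 15-minute timedeltas, created with GroupBy.cumcount and to_timedelta, to startdatetime. For endatetime, just shift the values and replace the last NaN with the original value: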
df['startdatetime'] = pd.to_datetime(df['startdatetime'])
df['endatetime'] = pd.to_datetime(df['endatetime'])
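# number of whole 15-minute slices per row; Index.repeat needs integer counts
v = ((df['endatetime'] - df['startdatetime']).dt.total_seconds() / (60 * 15)).astype(int)
df = df.loc[df.index.repeat(v)]
# cumcount numbers the copies of each original row 0, 1, 2, ... so each copy is offset by 15 minutes
df['startdatetime'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s') * 15 * 60
# each end is the next start; the final row keeps its original end (assumes contiguous intervals)
df['endatetime'] = df['startdatetime'].shift(-1).fillna(df['endatetime'])
df = df.reset_index(drop=True)
print(df)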
startdatetime endatetime value
0 2019-03-26 23:00:00 2019-03-26 23:15:00 37.86
1 2019-03-26 23:15:00 2019-03-26 23:30:00 37.86
2 2019-03-26 23:30:00 2019-03-26 23:45:00 37.86
3 2019-03-26 23:45:00 2019-03-27 00:00:00 37.86
4 2019-03-27 00:00:00 2019-03-27 00:15:00 37.86
5 2019-03-27 00:15:00 2019-03-27 00:30:00 37.86
6 2019-03-27 00:30:00 2019-03-27 00:45:00 37.86
7 2019-03-27 00:45:00 2019-03-27 01:00:00 37.86
8 2019-03-27 01:00:00 2019-03-27 01:15:00 37.91
9 2019-03-27 01:15:00 2019-03-27 01:30:00 37.91
10 2019-03-27 01:30:00 2019-03-27 01:45:00 37.91
11 2019-03-27 01:45:00 2019-03-27 02:00:00 37.91
12 2019-03-27 02:00:00 2019-03-27 02:15:00 37.91
13 2019-03-27 02:15:00 2019-03-27 02:30:00 37.91
14 2019-03-27 02:30:00 2019-03-27 02:45:00 37.91
15 2019-03-27 02:45:00 2019-03-27 03:00:00 37.91
16 2019-03-27 03:00:00 2019-03-27 03:15:00 34.54
17 2019-03-27 03:15:00 2019-03-27 03:30:00 34.54
18 2019-03-27 03:30:00 2019-03-27 03:45:00 34.54
19 2019-03-27 03:45:00 2019-03-27 04:00:00 34.54
20 2019-03-27 04:00:00 2019-03-27 04:15:00 34.54
21 2019-03-27 04:15:00 2019-03-27 04:30:00 34.54
22 2019-03-27 04:30:00 2019-03-27 04:45:00 34.54
23 2019-03-27 04:45:00 2019-03-27 05:00:00 34.54
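The shift-based end time above relies on each interval starting exactly where the previous one ends. As a hedged sketch only, the variant below computes each end as start plus 15 minutes, which should also cope with gaps between intervals; the two-row sample data with a gap is invented for illustration, as are the names step, counts and out.

```python
import pandas as pd

# Hypothetical sample with a gap between the two source intervals
df = pd.DataFrame(
    {
        "startdatetime": pd.to_datetime(["2019-03-26 23:00", "2019-03-27 02:00"]),
        "endatetime": pd.to_datetime(["2019-03-26 23:30", "2019-03-27 02:45"]),
        "value": [37.86, 37.91],
    }
)

step = pd.Timedelta(minutes=15)

# Whole number of 15-minute slices per row (repeat needs integer counts)
counts = ((df["endatetime"] - df["startdatetime"]) // step).astype(int)

# Repeat each row once per slice
out = df.loc[df.index.repeat(counts)].copy()

# cumcount numbers the copies of each original row 0, 1, 2, ...
out["startdatetime"] += out.groupby(level=0).cumcount() * step

# Derive the end from the start instead of shifting, so gaps do not leak in
out["endatetime"] = out["startdatetime"] + step
out = out.reset_index(drop=True)
print(out)
```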
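Answer 2 (score: 0)
This looks like time series data, which means there will be problems in the source data. Sooner or later, relying on the source data being free of errors becomes a problem in a real system.
Resampling is therefore a reasonable way to process this data and be prepared for the inevitable jitter.
In addition, each stage gives you an opportunity to inspect the data and act on it.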
import pandas as pd
from io import StringIO
csvdata = StringIO("""startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54""")
# parse_dates=True parses the index column; infer_datetime_format is deprecated in pandas 2.0 and not needed
df = pd.read_csv(csvdata, sep=",", index_col="startdatetime", parse_dates=True)
# flexibility to statistically pick resampled values should the index
# not be on a 15-minute boundary
# note: the resampled grid stops at the last startdatetime in the source
df = df.resample('15min').last()
df = df.reset_index()
# now that the DataFrame has a 15-minute freq index, use it to make the end interval
# (Series.append was removed in pandas 2.0; on a regular grid each end is simply start + 15 minutes)
df['endatetime'] = df['startdatetime'] + pd.Timedelta(minutes=15)
# flexibility to fill missing values
df['value'] = df['value'].ffill()
# results
print(df)
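One thing to watch with resample here: the grid it builds stops at the last startdatetime, so the final source interval (03:00 to 05:00) is not fully expanded. The following is a sketch of one possible workaround, not part of the original answer: it swaps resample for reindex with forward fill on an explicit grid that runs up to the last end time. The names src, grid, filled and resampled are made up for illustration.

```python
from io import StringIO

import pandas as pd

csvdata = StringIO("""startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54""")
src = pd.read_csv(csvdata, parse_dates=["startdatetime", "endatetime"])

step = pd.Timedelta(minutes=15)
values = src.set_index("startdatetime")["value"]

# Interval starts from the first start up to (but not including) the last end
grid = pd.date_range(values.index.min(), src["endatetime"].max() - step, freq=step)

# Each grid point takes the value of the most recent source row at or before it
filled = values.reindex(grid, method="ffill")

resampled = pd.DataFrame(
    {"startdatetime": grid, "endatetime": grid + step, "value": filled.to_numpy()}
)
print(resampled)
```

Unlike resample, this does no aggregation, so it assumes the source start times already fall on the 15-minute grid or that carrying the last known value forward is acceptable.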