我有不规则间隔的时间序列数据。我有总能量消耗和能量使用的持续时间。
build
我想计算15分钟窗口内所有负载曲线的总和。我可以在必要时进行回合(例如,最接近1分钟)。我不能立即使用重新采样,因为它会将使用量平均到下一个时间戳,在第一个条目1/3 12:28 PM的情况下,它将需要6.23 kWH并均匀分布直到下午4:55 ,这是不准确的。 6.23千瓦时应该分散到下午12:28 + 2.23小时〜=下午2:42。
答案 0 :(得分:3)
这是一个简单的实现,它只是设置一个系列,
result
,其索引具有分钟频率,然后循环遍历行
df
(使用df.itertuples
)并为每个人添加适当的电量
相关区间中的行:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Duration (Hours)': [2.233333333, 1.8999999999999999, 7.2166666670000001, 3.4500000000000002, 1.6000000000000001, 1.6000000000000001], 'Start Date': ['1/3/2016', '1/3/2016', '1/4/2016', '1/4/2016', '1/4/2016', '1/4/2016'], 'Start Time': ['12:28:00 PM', '4:55:00 PM', '6:47:00 PM', '7:00:00 AM', '7:26:00 AM', '7:32:00 AM'], 'Usage(kWh)': [6.2300000000000004, 11.449999999999999, 11.93, 9.4499999999999993, 7.3300000000000001, 4.54]} )
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['start_date'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Start Date', 'Start Time', 'Duration (Hours)'], axis=1)
result = pd.Series(0,
index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
result.loc[row.start_date:row.end_date] += row[power_idx]
# The sum of the usage over 15 minute windows is computed using the `resample/sum` method:
usage = result.resample('15T').sum()
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()
关于效果的说明:循环df
行并不是很好
特别是如果len(df)
很大的话。为了获得更好的性能,您可能需要一个
more clever method,处理
所有行以“矢量化”方式“一次”:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Here is an example using a larger DataFrame
N = 10**3
dates = pd.date_range('2016-1-1', periods=N*10, freq='H')
df = pd.DataFrame({'Duration (Hours)': np.random.uniform(1, 10, size=N),
'start_date': np.random.choice(dates, replace=False, size=N),
'Usage(kWh)': np.random.uniform(1,20, size=N)})
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
def using_loop(df):
result = pd.Series(0,
index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
result.loc[row.start_date:row.end_date] += row[power_idx]
usage = result.resample('15T').sum()
return usage
def using_cumsum(df):
result = pd.melt(df[['power (kW/min)','start_date','end_date']],
id_vars=['power (kW/min)'], var_name='usage', value_name='date')
result['usage'] = result['usage'].map({'start_date':1, 'end_date':-1})
result['usage'] *= result['power (kW/min)']
result = result.set_index('date')
result = result[['usage']].resample('T').sum().fillna(0).cumsum()
usage = result.resample('15T').sum()
return usage
usage = using_cumsum(df)
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()
len(df)
等于1000,using_cumsum
比using_loop
快10倍以上:
In [117]: %timeit using_loop(df)
1 loop, best of 3: 545 ms per loop
In [118]: %timeit using_cumsum(df)
10 loops, best of 3: 52.7 ms per loop
答案 1 :(得分:1)
我在下面使用的解决方案是itertuples方法。请注意使用numpy的.sum函数对我不起作用。我改为使用pandas resample关键字“how”并将其设置为sum。
我还重命名了文件中的列,以便更轻松地导入。
我没有时间/资源限制,所以我使用了itertuples方法,因为我很容易实现。
Itertuples代码
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#load data
df = pd.read_excel(r'C:\input_file.xlsx', sheetname='sheet1')
#convert columns
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Duration (Hours)'], axis=1)
#create result df with timestamps
result = pd.Series(0, index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))
#iterate through to calculate total energy at each minute
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
result.loc[row.start_date:row.end_date] += row[power_idx]
# The sum of the usage over 15 minute windows is computed using the `resample/sum` method
usage = result.resample('15T', how='sum')
#plot
plt.plot(usage)
plt.show()
#write to file
usage.to_csv(r'C:\output_folder\output_file.csv')