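I'm working with a dataset that has a datetime column and a variable I'm interested in. I want to group the data into 15-minute bins, so I wrote the code below. It computes a lower and an upper date bound and builds a list of datetime objects spaced 15 minutes apart, then sums the variable of interest between each pair of consecutive datetimes and puts the totals in a new DataFrame. It runs extremely slowly, though (about five hours for 75,000 rows), and I can't figure out why. Can anyone point out what is wrong with the code?
If you want to test the code yourself, here is a small sample of the data.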
import pandas as pd
from tqdm import tnrange  # notebook progress bar used in the loop below


def create_sales_with_intervals(df, tank_id_col='tank_id'):
    tank_id = df.iloc[0][tank_id_col]
    tank_dates = get_date_range(df)
    tank_sales = []
    for idx in tnrange(len(tank_dates) - 1):
        t1 = tank_dates[idx]
        t2 = tank_dates[idx + 1]
        sales = get_sales_between(df, t1, t2)
        row = {}
        row['start_date'] = t1
        row['end_date'] = t2
        row['total_sale'] = sales
        row['tank_id'] = tank_id
        tank_sales.append(row)
    return pd.DataFrame(tank_sales, columns=['tank_id', 'start_date', 'end_date', 'total_sale'])


def get_date_range(df_tank, date_col='date_time', freq='15MIN'):
    start_date = df_tank.iloc[0][date_col]
    end_date = df_tank.iloc[-1][date_col]
    lower_bound = find_interval(start_date, 'lower')
    upper_bound = find_interval(end_date, 'upper')
    start_date_rounded = round_time(start_date, lower_bound)  # Rounds the minute portion of the datetime object to nearest lower bound (0, 15, 30, 45)
    end_date_rounded = round_time(end_date, upper_bound)  # Rounds the minute portion of the datetime object to nearest upper bound (0, 15, 30, 45)
    tank_dates = pd.date_range(start_date_rounded, end_date_rounded, freq=freq)
    return tank_dates


def get_sales_between(df, t1, t2, date_col='date_time', sale_col='sold'):
    cond1 = df[df[date_col] > t1]
    cond2 = df[df[date_col] < t2]
    idx = cond1.index & cond2.index
    total_sale = df.loc[idx.values][sale_col].sum()
    return total_sale
Answer 0 (score: 1)
If you have a DatetimeIndex, you can use the pd.DataFrame.resample() method as follows:
import pandas as pd

# your sample dataframe
df = pd.DataFrame(
    {
        'date_time': {0: '2015-01-02 23:18:00',
                      1: '2015-01-03 01:00:00',
                      2: '2015-01-03 02:42:00',
                      3: '2015-01-03 04:24:00',
                      4: '2015-01-03 06:06:00',
                      5: '2015-01-03 07:48:00',
                      6: '2015-01-03 09:30:00',
                      7: '2015-01-03 11:12:00',
                      8: '2015-01-03 12:54:00',
                      9: '2015-01-03 14:36:00'},
        'sold': {0: 78.3,
                 1: 0.0,
                 2: 112.9,
                 3: 13.8,
                 4: 32.0,
                 5: 95.1,
                 6: 56.4,
                 7: 28.3,
                 8: 0.0,
                 9: 0.0},
        'tank_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
    })
df
Out[3]:
date_time sold tank_id
0 2015-01-02 23:18:00 78.3 1
1 2015-01-03 01:00:00 0 1
2 2015-01-03 02:42:00 112.9 1
3 2015-01-03 04:24:00 13.8 1
4 2015-01-03 06:06:00 32 1
5 2015-01-03 07:48:00 95.1 1
6 2015-01-03 09:30:00 56.4 1
7 2015-01-03 11:12:00 28.3 1
8 2015-01-03 12:54:00 0 1
9 2015-01-03 14:36:00 0 1
# convert your timestamps to `pd.Timestamp` objects
df['date_time'] = pd.to_datetime(df['date_time'])
# give the dataframe a `DatetimeIndex`
df.set_index('date_time', inplace=True)
df
Out[6]:
sold tank_id
date_time
2015-01-02 23:18:00 78.3 1
2015-01-03 01:00:00 0 1
2015-01-03 02:42:00 112.9 1
2015-01-03 04:24:00 13.8 1
2015-01-03 06:06:00 32 1
2015-01-03 07:48:00 95.1 1
2015-01-03 09:30:00 56.4 1
2015-01-03 11:12:00 28.3 1
2015-01-03 12:54:00 0 1
2015-01-03 14:36:00 0 1
# resample the `sold` column in 15 minuTe chunks and then sum each chunk
df['sold'].resample('15T').sum()
Out[8]:
date_time
2015-01-02 23:15:00 78.3
2015-01-02 23:30:00 0
2015-01-02 23:45:00 0
2015-01-03 00:00:00 0
2015-01-03 00:15:00 0
2015-01-03 00:30:00 0
2015-01-03 00:45:00 0
2015-01-03 01:00:00 0
2015-01-03 01:15:00 0
2015-01-03 01:30:00 0
2015-01-03 01:45:00 0
2015-01-03 02:00:00 0
2015-01-03 02:15:00 0
2015-01-03 02:30:00 112.9
2015-01-03 02:45:00 0
2015-01-03 03:00:00 0
2015-01-03 03:15:00 0
2015-01-03 03:30:00 0
2015-01-03 03:45:00 0
2015-01-03 04:00:00 0
2015-01-03 04:15:00 13.8
2015-01-03 04:30:00 0
2015-01-03 04:45:00 0
2015-01-03 05:00:00 0
2015-01-03 05:15:00 0
2015-01-03 05:30:00 0
2015-01-03 05:45:00 0
2015-01-03 06:00:00 32
2015-01-03 06:15:00 0
2015-01-03 06:30:00 0
...
2015-01-03 07:15:00 0
2015-01-03 07:30:00 0
2015-01-03 07:45:00 95.1
2015-01-03 08:00:00 0
2015-01-03 08:15:00 0
2015-01-03 08:30:00 0
2015-01-03 08:45:00 0
2015-01-03 09:00:00 0
2015-01-03 09:15:00 0
2015-01-03 09:30:00 56.4
2015-01-03 09:45:00 0
2015-01-03 10:00:00 0
2015-01-03 10:15:00 0
2015-01-03 10:30:00 0
2015-01-03 10:45:00 0
2015-01-03 11:00:00 28.3
2015-01-03 11:15:00 0
2015-01-03 11:30:00 0
2015-01-03 11:45:00 0
2015-01-03 12:00:00 0
2015-01-03 12:15:00 0
2015-01-03 12:30:00 0
2015-01-03 12:45:00 0
2015-01-03 13:00:00 0
2015-01-03 13:15:00 0
2015-01-03 13:30:00 0
2015-01-03 13:45:00 0
2015-01-03 14:00:00 0
2015-01-03 14:15:00 0
2015-01-03 14:30:00 0
Freq: 15T, Name: sold, Length: 62, dtype: float64
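To get back the tank_id / start_date / end_date / total_sale layout that the question's function returns, the resampled totals can be reshaped as in the sketch below. This is only one possible arrangement: the function name sales_per_interval and its parameters are made up for illustration, it assumes the frame holds a single tank, and it uses the '15min' alias because newer pandas releases deprecate '15T'. Note also that resample() bins are closed on the left by default, so a row that falls exactly on a boundary is counted, whereas the strict > and < comparisons in the question drop it.

```python
import pandas as pd


def sales_per_interval(df, date_col="date_time", sale_col="sold",
                       tank_id_col="tank_id", freq="15min"):
    """Sum `sale_col` over fixed-width time bins for a single tank.

    Hypothetical helper: the name and the parameters are illustrative only.
    """
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    # `on=` resamples by a column, so no DatetimeIndex has to be set first.
    totals = df.resample(freq, on=date_col)[sale_col].sum()

    out = totals.rename("total_sale").reset_index()
    out = out.rename(columns={date_col: "start_date"})
    # Each bin starts at its label and spans one `freq`.
    out["end_date"] = out["start_date"] + pd.Timedelta(freq)
    # Assumes every row belongs to the same tank.
    out["tank_id"] = df[tank_id_col].iloc[0]
    return out[["tank_id", "start_date", "end_date", "total_sale"]]


sample = pd.DataFrame({
    "date_time": ["2015-01-02 23:18:00", "2015-01-03 01:00:00", "2015-01-03 02:42:00"],
    "sold": [78.3, 0.0, 112.9],
    "tank_id": [1, 1, 1],
})
result = sales_per_interval(sample)
print(result.head())
```

Because the binning happens in a single vectorized pass, there is no Python-level loop that rescans the whole frame for every interval.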
You can find more information in the pandas documentation here.
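The question's function takes a tank_id_col argument, which suggests the full dataset may hold several tanks. If so, one way to bin every tank at once is to group by the tank id together with a pd.Grouper on the datetime column. The sketch below uses made-up readings, not the question's data, and the names readings and per_tank are illustrative. Unlike a plain resample(), this grouping keeps only the intervals in which a tank actually has rows, so empty 15-minute slots are not filled with zeros.

```python
import pandas as pd

# Made-up readings for two tanks, for illustration only.
readings = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2015-01-02 23:18:00", "2015-01-02 23:20:00",
        "2015-01-02 23:40:00", "2015-01-02 23:18:00",
    ]),
    "sold": [10.0, 5.0, 2.5, 7.0],
    "tank_id": [1, 1, 1, 2],
})

# Group by tank and by 15-minute bin in one pass, then sum within each group.
per_tank = (
    readings
    .groupby(["tank_id", pd.Grouper(key="date_time", freq="15min")])["sold"]
    .sum()
    .reset_index(name="total_sale")
)
print(per_tank)
```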