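I have two DataFrames:

trips_df: 1,048,568 entries in total
weather_df: 2,654 entries in total

I am trying to compute total_precipitation for each trip and append it as a column. To do that, I need to look up each trip's start_timestamp and end_timestamp from trips_df among the datetimes in weather_df, sum the precipitation_amount that falls between those times, and append that value to trips_df under a new column.

The code I use to do this: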

    import datetime as dt

    def sum_precipitation(datetime1, datetime2, weather_data):
        # Round the trip start down to the hour and the trip end up to the next hour.
        time1_rd = datetime1.replace(minute=0, second=0)
        time2_ru = datetime2.replace(minute=0, second=0) + dt.timedelta(hours=1)
        if time1_rd in set(weather_data['start_precipitation_datetime']):
            start_idx = weather_data.start_precipitation_datetime[
                weather_data.start_precipitation_datetime==time1_rd].index[0]
            if time2_ru in set(weather_data['end_precipitation_datetime']):
                end_idx = weather_data.end_precipitation_datetime[
                    weather_data.end_precipitation_datetime==time2_ru].index[0]
                # Sum column 7 over the matching weather rows (inclusive).
                precipitation_sum = weather_data.iloc[start_idx:end_idx+1, 7].sum()
            else:
                precipitation_sum = 0
        else:
            precipitation_sum = 0
        return round(precipitation_sum, 3)

    def join_weather_to_trips(trips_data, weather_data):
        trips_weather_df = trips_data.copy()
        # Row-wise apply: one call to sum_precipitation per trip.
        fn = lambda row : sum_precipitation(row.start_timestamp, row.end_timestamp, weather_data)
        col = trips_data.apply(fn, axis=1)
        trips_weather_df = trips_weather_df.assign(total_precipitation=col.values)
        return trips_weather_df

    trip_weather_df = join_weather_to_trips(trips_df, weather_df)

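I ran the code on a subset of 65 entries and it took about 1.3 seconds (CPU times: user 1.27 s, sys: 8.77 ms, total: 1.28 s, Wall time: 1.28 s). Extrapolating that to my full data, it would take (1.3 * 1048568) / 65 = 20971.36 seconds, or about 5.8 hours.

Could someone with more experience tell me whether I am doing this correctly, where I could speed this code up, or whether there are alternatives (for example, a faster implementation)?
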
Answer 0 (score: 2)

This may not be the fastest approach, but you can try:

    trips_df['precipitation_amount'] = 0
    for s,e,p in zip(weather_df['start_precipitation_datetime'],
                     weather_df['end_precipitation_datetime'],
                     weather_df.precipitation_amount):
        masks = trips_df.start_timestamp.between(s,e) | trips_df.end_timestamp.between(s,e)
        trips_df.loc[masks, 'precipitation_amount'] += p

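On my computer, processing 1 million trips against 260 weather records took 10 seconds, so the real data should take roughly 100 s.

Update: I did try it with 1 million trips and 2,600 weather records: Wall time: 1min 36s

Note: if a trip starts exactly on the hour, you may need to reduce weather_df['end_precipitation_datetime'] by one minute to avoid double counting.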
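
To illustrate the note about double counting, here is a minimal sketch on made-up data. The function name add_precipitation and the shrink_end flag are hypothetical helpers introduced only for this example; the loop itself is the same as above. It relies on Series.between being inclusive on both ends by default, so a trip starting exactly at 01:00 matches both the 00:00-01:00 and the 01:00-02:00 weather rows unless the end times are shortened.

```python
import pandas as pd

def add_precipitation(trips, weather, shrink_end=True):
    """Hypothetical wrapper around the loop above; returns a copy of trips."""
    out = trips.copy()
    out["precipitation_amount"] = 0.0
    ends = weather["end_precipitation_datetime"]
    if shrink_end:
        # Shorten each weather interval by one minute so that a timestamp
        # exactly on the hour belongs to only one interval.
        ends = ends - pd.Timedelta(minutes=1)
    for s, e, p in zip(weather["start_precipitation_datetime"], ends,
                       weather["precipitation_amount"]):
        mask = (out["start_timestamp"].between(s, e)
                | out["end_timestamp"].between(s, e))
        out.loc[mask, "precipitation_amount"] += p
    return out

# Made-up data: two hourly weather rows and one trip starting exactly at 01:00.
weather = pd.DataFrame({
    "start_precipitation_datetime": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]),
    "end_precipitation_datetime": pd.to_datetime(["2020-01-01 01:00", "2020-01-01 02:00"]),
    "precipitation_amount": [1.0, 2.0],
})
trips = pd.DataFrame({
    "start_timestamp": pd.to_datetime(["2020-01-01 01:00"]),
    "end_timestamp": pd.to_datetime(["2020-01-01 01:30"]),
})

raw = add_precipitation(trips, weather, shrink_end=False)
adjusted = add_precipitation(trips, weather, shrink_end=True)
```

Without the adjustment the trip picks up both rows; with it, only the 01:00-02:00 row is counted.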

Answer 1 (score: 0)

I suggest using DateTimeRange (pip install DateTimeRange):

    import datetime
    import datetimerange

    start_1 = datetime.datetime(2016, 3, 16, 20, 30)
    end_1 = datetime.datetime(2016, 3, 17, 20, 30)
    start_2 = datetime.datetime(2016, 3, 14, 20, 30)
    end_2 = datetime.datetime(2016, 3, 17, 22, 30)
    dtr1 = datetimerange.DateTimeRange(start_1, end_1)
    dtr2 = datetimerange.DateTimeRange(start_2, end_2)

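Then, if you want to check whether dtr1 is contained in dtr2:

    >>> dtr1.start_datetime in dtr2
    True
    >>> dtr1.end_datetime in dtr2
    True

That way you can save yourself a lot of if/else branches.

By the way, I am not sure you should be using set at all. Why not just use:

    weather_data['start_precipitation_datetime'].values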
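
Building on the remark about set(...): the question's function rebuilds a set from the weather column for every trip. One possible way around that is to do all lookups at once with numpy.searchsorted and a running sum. This is only a sketch under assumptions not stated in the question: weather_df is sorted by start time, has a default RangeIndex, and its rows are contiguous hourly intervals. The function name total_precipitation_vectorized is hypothetical, and dt.floor is swapped in for replace(minute=0, second=0), so sub-second parts are also dropped.

```python
import numpy as np
import pandas as pd

def total_precipitation_vectorized(trips, weather):
    """Sketch: per-trip precipitation sum without a Python-level loop.

    Assumes weather is sorted by start time and covers contiguous hours.
    """
    starts = weather["start_precipitation_datetime"].to_numpy()
    ends = weather["end_precipitation_datetime"].to_numpy()
    # csum[k] holds the sum of the first k precipitation amounts.
    amounts = weather["precipitation_amount"].to_numpy(dtype=float)
    csum = np.concatenate(([0.0], amounts.cumsum()))

    # Round trip start down to the hour, trip end up to the next hour.
    t1 = trips["start_timestamp"].dt.floor("h").to_numpy()
    t2 = (trips["end_timestamp"].dt.floor("h") + pd.Timedelta(hours=1)).to_numpy()

    # Candidate positions of the first and last weather rows for each trip.
    i = np.minimum(np.searchsorted(starts, t1), len(starts) - 1)
    j = np.minimum(np.searchsorted(ends, t2), len(ends) - 1)

    # Keep only exact matches, mirroring the `in set(...)` checks; else 0.
    found = (starts[i] == t1) & (ends[j] == t2)
    totals = np.where(found, csum[j + 1] - csum[i], 0.0)
    return np.round(totals, 3)

# Made-up data: three hourly weather rows and three trips.
weather = pd.DataFrame({
    "start_precipitation_datetime": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]),
    "end_precipitation_datetime": pd.to_datetime(
        ["2020-01-01 01:00", "2020-01-01 02:00", "2020-01-01 03:00"]),
    "precipitation_amount": [1.0, 2.0, 4.0],
})
trips = pd.DataFrame({
    "start_timestamp": pd.to_datetime(
        ["2020-01-01 00:10", "2020-01-01 02:05", "2020-01-01 05:00"]),
    "end_timestamp": pd.to_datetime(
        ["2020-01-01 01:20", "2020-01-01 02:50", "2020-01-01 05:30"]),
})

result = total_precipitation_vectorized(trips, weather)
```

The result could then be attached with trips.assign(total_precipitation=result). The third trip falls outside the weather data and gets 0, as in the original function.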