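I'm very new to pandas and just wanted to post here to make sure I'm using it in the most efficient way. The following code works but takes a very long time to run...
The code first sets up specific interval lengths and, for each one, takes a rolling average of power within each workout. Once it has all of those averages, it saves the row with the maximum for that interval to a new dataframe and moves on to the next interval. This is done for all workouts in the data set ('best_interval_df') as well as for the most recent workout in the data set ('latest_df'), so that the most recent workout can be compared against the all-time bests.
Is there a different approach I could take to speed up the processing time?
A sample of the data can be found here: https://www.dropbox.com/s/so4fo99q8ttkh24/samples.csv?dl=0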
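To make the computation described above concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and only stand in for the real samples. Unlike the full code, it uses `min_periods=window` so that only complete windows count.

```python
import pandas as pd

# Hypothetical toy data: two workouts, one power sample per row.
toy = pd.DataFrame({
    "workoutId": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "power": [100.0, 200.0, 300.0, 100.0, 150.0, 150.0, 400.0, 500.0],
})

window = 2  # interval length in samples

# Rolling mean within each workout, so a window never spans two workouts.
# reset_index drops the workoutId level so the result aligns with toy's index.
rolling_mean = (
    toy.groupby("workoutId")["power"]
    .rolling(window, min_periods=window)
    .mean()
    .reset_index(level=0, drop=True)
)

# Best effort for this interval length across all workouts.
best_power = rolling_mean.max()
best_row = toy.loc[rolling_mean.idxmax()]
```

The full code repeats this step once per interval length, which means one grouped rolling pass over the whole data set for every entry in the interval list.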
import math

import pandas as pd

# df_samples is assumed to be loaded already, indexed by sample timestamp

# 1 second intervals from 0-60 seconds
interval_lengths = list(range(1, 61))
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += list(range(75, 301, 15))
# 30 second intervals for 5:00 - 10:00
interval_lengths += list(range(330, 601, 30))
# 1 minute intervals for everything after 10 mins
interval_lengths += list(range(660, df_samples['seconds_since_pedaling_start'].apply(
    lambda x: int(math.ceil(x / 10.0)) * 10).max() + 1, 60))

intervals = df_samples.sort_index(ascending=True)
intervals['power'] = intervals['power'].interpolate()  # Used to fill in missing gaps in data

# Most recent workout: the workout that owns the latest index value
latest_df = intervals[intervals['workoutId'] == intervals.loc[intervals.index.max()]['workoutId']].copy()
latest_df_length = latest_df['seconds_since_pedaling_start'].max()

best_interval_df = pd.DataFrame()
latest_interval_df = pd.DataFrame()

for i in interval_lengths:
    # Get intervals for all time
    temp_df = intervals
    temp_df['best_power'] = (
        intervals.groupby(['workoutId'])['power']
        .rolling(int(i), min_periods=i - 1)
        .mean()
        .reset_index(0, drop=True)
    )
    temp_df['interval'] = i
    # Note: DataFrame.append was removed in pandas 2.0, so this version needs pandas < 2.0
    best_interval_df = best_interval_df.append(temp_df.loc[temp_df['best_power'].idxmax()])

    # Don't insert intervals for periods longer than the latest workout
    if i <= latest_df_length:
        latest_temp_df = latest_df
        latest_temp_df['best_power'] = (
            latest_df.groupby(['workoutId'])['power']
            .rolling(int(i), min_periods=i - 1)
            .mean()
            .reset_index(0, drop=True)
        )
        latest_temp_df['interval'] = i
        latest_interval_df = latest_interval_df.append(
            latest_temp_df.loc[latest_temp_df['best_power'].idxmax()])

# Keep the timestamp of each best row as a column and index the results by interval length
best_interval_df['datetime'] = best_interval_df.index
best_interval_df = best_interval_df.set_index('interval')
latest_interval_df['datetime'] = latest_interval_df.index
latest_interval_df = latest_interval_df.set_index('interval')
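I've now also tried building lists of the results and combining them at the end, instead of appending to the df inside the loop... still slow: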
import math

import pandas as pd

# df_samples is assumed to be loaded already, indexed by sample timestamp

# 1 second intervals from 0-60 seconds
interval_lengths = list(range(1, 61))
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += list(range(75, 301, 15))
# 30 second intervals for 5:00 - 10:00
interval_lengths += list(range(330, 601, 30))
# 1 minute intervals for everything after 10 mins
interval_lengths += list(range(660, df_samples['seconds_since_pedaling_start'].apply(
    lambda x: int(math.ceil(x / 10.0)) * 10).max() + 1, 60))

intervals = df_samples.sort_index(ascending=True)
intervals['power'] = intervals['power'].interpolate()  # Used to fill in missing gaps in data

# Most recent workout: the workout that owns the latest index value
latest_df = intervals[intervals['workoutId'] == intervals.loc[intervals.index.max()]['workoutId']].copy()
latest_df_length = latest_df['seconds_since_pedaling_start'].max()

best_interval_df_list = []
latest_interval_df_list = []

for i in interval_lengths:
    # Get intervals for all time
    temp_df = intervals
    temp_df['best_power'] = (
        intervals.groupby(['workoutId'])['power']
        .rolling(int(i), min_periods=i - 1)
        .mean()
        .reset_index(0, drop=True)
    )
    temp_df['interval'] = i
    # Append list with best power record for given interval
    best_interval_df_list.append(temp_df.loc[temp_df['best_power'].idxmax()])

    # Don't insert intervals for periods longer than the latest workout
    if i <= latest_df_length:
        latest_temp_df = latest_df
        latest_temp_df['best_power'] = (
            latest_df.groupby(['workoutId'])['power']
            .rolling(int(i), min_periods=i - 1)
            .mean()
            .reset_index(0, drop=True)
        )
        latest_temp_df['interval'] = i
        # Append list with best power record for given interval
        latest_interval_df_list.append(latest_temp_df.loc[latest_temp_df['best_power'].idxmax()])

# Merge lists of series into df
latest_interval_df = pd.DataFrame(latest_interval_df_list)
best_interval_df = pd.DataFrame(best_interval_df_list)

# Keep the timestamp of each best row as a column and index the results by interval length
best_interval_df['datetime'] = best_interval_df.index
best_interval_df = best_interval_df.set_index('interval')
latest_interval_df['datetime'] = latest_interval_df.index
latest_interval_df = latest_interval_df.set_index('interval')
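One possible direction for cutting the run time, offered as a sketch rather than a tested answer: build a cumulative sum of power once per workout, then obtain every window mean as a difference of two prefix sums instead of running a fresh grouped `rolling` pass per interval length. The function name `best_by_interval` and the toy data are hypothetical. The sketch assumes one sample per row at a fixed rate, no NaN values left after interpolation, and that only complete windows count (which differs from `min_periods=i - 1` above).

```python
import numpy as np
import pandas as pd


def best_by_interval(df, interval_lengths):
    """Best mean power per interval length across all workouts (full windows only)."""
    rows = []
    for workout_id, group in df.groupby("workoutId"):
        values = group["power"].to_numpy(dtype=float)
        # prefix[k] is the sum of the first k samples, computed once per workout.
        prefix = np.concatenate(([0.0], np.cumsum(values)))
        for window in interval_lengths:
            if window > len(values):
                continue  # workout is shorter than this interval
            # means[s] is the mean of values[s : s + window]
            means = (prefix[window:] - prefix[:-window]) / window
            start = int(np.argmax(means))
            rows.append({
                "interval": window,
                "workoutId": workout_id,
                "best_power": float(means[start]),
                # Index label of the window's last sample (a timestamp in the real data).
                "datetime": group.index[start + window - 1],
            })
    per_workout = pd.DataFrame(rows)
    # Keep the single best row per interval length across all workouts.
    best = per_workout.loc[per_workout.groupby("interval")["best_power"].idxmax()]
    return best.set_index("interval")


# Hypothetical toy data: two workouts of four samples each.
toy = pd.DataFrame({
    "workoutId": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "power": [100.0, 200.0, 300.0, 100.0, 150.0, 150.0, 400.0, 500.0],
})
result = best_by_interval(toy, [1, 2, 4, 5])
```

Calling the same function on only the rows of the most recent workout would give the counterpart of 'latest_interval_df'. Interval lengths longer than every workout simply produce no row, which replaces the explicit `latest_df_length` check.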