Optimizing a pandas groupby with a rolling mean

Asked: 2018-05-14 14:15:45

Tags: python pandas

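I'm very new to pandas and just wanted to post here to make sure I'm using it in the most efficient way. The code below works, but it takes a very long time to run...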

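The code first builds a list of interval lengths. For each length, it groups the data by workout and takes a rolling mean of power within each group. Once it has all of those means, it saves the row with the maximum for that interval to a new DataFrame and moves on to the next interval. This is done for all workouts in the dataset ('best_interval_df') as well as for the most recent workout in the dataset ('latest_df'), so that the latest workout can be compared against the all-time highs across all workouts.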
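To make the per-interval step concrete, here is a minimal sketch of a single iteration on made-up data. The toy frame, its values, and the variable names `toy`, `window`, and `best_row` are assumptions for illustration only; the column names follow the code in the question. Unlike the question's code, this sketch requires full windows (`min_periods=window`).

```python
import pandas as pd

# Toy data (made up): two workouts, one sample per second, datetime index.
idx = pd.date_range("2018-05-01 10:00:00", periods=8, freq=pd.Timedelta(seconds=1))
toy = pd.DataFrame(
    {
        "workoutId": ["a"] * 4 + ["b"] * 4,
        "power": [100.0, 200.0, 300.0, 100.0, 150.0, 150.0, 400.0, 500.0],
    },
    index=idx,
)

window = 2  # interval length, in samples

# Rolling mean computed separately inside each workout. Dropping the
# group level realigns the result with the original datetime index.
rolling_mean = (
    toy.groupby("workoutId")["power"]
    .rolling(window, min_periods=window)
    .mean()
    .reset_index(level=0, drop=True)
)

# Timestamp at which the rolling mean peaks across all workouts.
best_time = rolling_mean.idxmax()

# The "best" record for this interval length.
best_row = toy.loc[best_time].copy()
best_row["best_power"] = rolling_mean.loc[best_time]
best_row["interval"] = window
```

Here the best 2-sample window is the last two samples of workout "b", so `best_row` carries that workout's id together with the peak rolling mean.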

Is there a different approach I could take to speed up the processing time?

A sample of the data can be found here: https://www.dropbox.com/s/so4fo99q8ttkh24/samples.csv?dl=0

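import math

import pandas as pd

# df_samples is assumed to be already loaded from the sample CSV,
# with a datetime index.

# 1 second intervals from 0-60 seconds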
interval_lengths = [i for i in range(1, 61)]
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += [i for i in range(75, 301, 15)]
# 30 second intervals for 5:00 - 10:00
interval_lengths += [i for i in range(330, 601, 30)]
# 1 minute intervals for everything after 10 mins
interval_lengths += [i for i in range(660, df_samples['seconds_since_pedaling_start'].apply(
    lambda x: int(math.ceil(x / 10.0)) * 10).max() + 1, 60)]

intervals = df_samples.sort_index(ascending=True)
intervals['power'] = intervals['power'].interpolate()  # Used to fill in missing gaps in data
# index.max must be called; .copy() avoids modifying a slice of `intervals` later on
latest_df = intervals[intervals['workoutId'] == intervals.loc[intervals.index.max()]['workoutId']].copy()
latest_df_length = latest_df['seconds_since_pedaling_start'].max()

best_interval_df = pd.DataFrame()
latest_interval_df = pd.DataFrame()
for i in interval_lengths:
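    # Get intervals for all time
    temp_df = intervals  # an alias, not a copy: the columns below are added to `intervals` itself
    temp_df['best_power'] = (
        intervals.groupby(['workoutId'])['power']
        .rolling(int(i), min_periods=i - 1)
        .mean()
        .reset_index(0, drop=True)
    )
    temp_df['interval'] = i
    # DataFrame.append was removed in pandas 2.0; on newer versions, collect the rows in a list instead
    best_interval_df = best_interval_df.append(temp_df.loc[temp_df['best_power'].idxmax()])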

    # Don't insert intervals for periods longer than the latest workout
    if i <= latest_df_length:
        latest_temp_df = latest_df
        latest_temp_df['best_power'] = (
            latest_df.groupby(['workoutId'])['power']
            .rolling(int(i), min_periods=i - 1)
            .mean()
            .reset_index(0, drop=True)
        )
        latest_temp_df['interval'] = i
        latest_interval_df = latest_interval_df.append(latest_temp_df.loc[latest_temp_df['best_power'].idxmax()])

best_interval_df['datetime'] = best_interval_df.index
best_interval_df = best_interval_df.set_index('interval')
latest_interval_df['datetime'] = latest_interval_df.index
latest_interval_df = latest_interval_df.set_index('interval')

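I have now also tried building lists of the data and then merging them, instead of appending to a DataFrame inside the loop... it is still slow: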

# 1 second intervals from 0-60 seconds
interval_lengths = [i for i in range(1, 61)]
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += [i for i in range(75, 301, 15)]
# 30 second intervals for 5:00 - 10:00
interval_lengths += [i for i in range(330, 601, 30)]
# 1 minute intervals for everything after 10 mins
interval_lengths += [i for i in range(660, df_samples['seconds_since_pedaling_start'].apply(
    lambda x: int(math.ceil(x / 10.0)) * 10).max() + 1, 60)]

intervals = df_samples.sort_index(ascending=True)
intervals['power'] = intervals['power'].interpolate()  # Used to fill in missing gaps in data
# index.max must be called; .copy() avoids modifying a slice of `intervals` later on
latest_df = intervals[intervals['workoutId'] == intervals.loc[intervals.index.max()]['workoutId']].copy()
latest_df_length = latest_df['seconds_since_pedaling_start'].max()

best_interval_df_list = []
latest_interval_df_list = []
for i in interval_lengths:
    # Get intervals for all time
    temp_df = intervals
    temp_df['best_power'] = (
        intervals.groupby(['workoutId'])['power']
        .rolling(int(i), min_periods=i - 1)
        .mean()
        .reset_index(0, drop=True)
    )
    temp_df['interval'] = i
    # Append list with best power record for given interval
    best_interval_df_list.append(temp_df.loc[temp_df['best_power'].idxmax()])

    # Don't insert intervals for periods longer than the latest workout
    if i <= latest_df_length:
        latest_temp_df = latest_df
        latest_temp_df['best_power'] = (
            latest_df.groupby(['workoutId'])['power']
            .rolling(int(i), min_periods=i - 1)
            .mean()
            .reset_index(0, drop=True)
        )
        latest_temp_df['interval'] = i
        # Append list with best power record for given interval
        latest_interval_df_list.append(latest_temp_df.loc[latest_temp_df['best_power'].idxmax()])

# Merge lists of series into df
latest_interval_df = pd.DataFrame(latest_interval_df_list)
best_interval_df = pd.DataFrame(best_interval_df_list)

best_interval_df['datetime'] = best_interval_df.index
best_interval_df = best_interval_df.set_index('interval')
latest_interval_df['datetime'] = latest_interval_df.index
latest_interval_df = latest_interval_df.set_index('interval')
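One direction that might reduce the per-iteration work, sketched here under assumptions and not benchmarked: build the groupby object once, skip writing the rolling mean back into the full DataFrame, and record only the winning position for each interval length. The function name `best_intervals`, its parameters, and the toy data are hypothetical; the `min_periods=n - 1` setting mirrors the code above.

```python
import pandas as pd


def best_intervals(samples, interval_lengths, group_col="workoutId", value_col="power"):
    """Hypothetical helper: for each interval length, find where the
    per-workout rolling mean of `value_col` is highest."""
    # Built once and reused for every window length.
    grouped = samples.groupby(group_col)[value_col]
    records = []
    for n in interval_lengths:
        # The result is indexed by (group key, original index).
        means = grouped.rolling(int(n), min_periods=n - 1).mean()
        if not means.notna().any():
            continue  # no workout is long enough for this window
        group_key, timestamp = means.idxmax()
        records.append(
            {
                "interval": n,
                group_col: group_key,
                "datetime": timestamp,
                "best_power": means.loc[(group_key, timestamp)],
            }
        )
    columns = ["interval", group_col, "datetime", "best_power"]
    return pd.DataFrame(records, columns=columns).set_index("interval")


# Toy data (made up): two workouts, one sample per second.
idx = pd.date_range("2018-05-01 10:00:00", periods=8, freq=pd.Timedelta(seconds=1))
toy = pd.DataFrame(
    {
        "workoutId": ["a"] * 4 + ["b"] * 4,
        "power": [100.0, 200.0, 300.0, 100.0, 150.0, 150.0, 400.0, 500.0],
    },
    index=idx,
)

result = best_intervals(toy, [2, 3])
```

The same helper could be called on the rows of the latest workout alone to build the equivalent of `latest_interval_df`. Whether this is actually faster on the real data would need to be measured.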

0 answers:

No answers