Question

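I have data in a pandas DataFrame that needs extensive cleaning by functions applied to each 'ID' group of the DataFrame. How can I apply an arbitrary function to manipulate the groups of a pandas DataFrame? A simplified example of the DataFrame follows: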

import pandas as pd
import numpy as np

waypoint_time_string = ['0.5&3.0&6.0' for x in range(10)]
moving_string = ['0 0 0&0 0.1 0&1 1 1.2' for x in range(10)]

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,2,2], 'time':[1,2,3,4,5,1,2,3,4,5],
         'X':[0,0,0,0,0,1,1,1,1,1],'Y':[0,0,0,0,0,1,1,1,1,1],'Z':[0,0,0,0,0,1,1,1,1,1],
         'waypoint_times':waypoint_time_string,
         'moving':moving_string})

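I want to apply the function set_group_positions (defined below) to each 'ID' group of df. So far I have only managed this by looping over the DataFrame. It seems there must be a more idiomatic pandas groupby way to do it. Here is an example of the implementation I want to replace: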

sub_frames = [] 
unique_IDs = df['ID'].unique()
for unique_ID in unique_IDs:
    working_df = df.loc[df['ID']==unique_ID]
    working_df = set_group_positions(working_df)
    sub_frames.append(working_df)

final_df = pd.concat(sub_frames)

To complete the working example, here are the helper functions:

def set_x_vel(row):
    return(row['X'] + row['x_movement'])
def set_y_vel(row):
    return(row['Y'] + row['y_movement'])
def set_z_vel(row):
    return(row['Z'] + row['z_movement'])

output_time_list = df['time'].unique().tolist()

#main function to apply to each ID group in the data frame:
def set_group_positions(df): #pass the combined df here
    working_df = df.copy() #work on a copy so the frame passed in is not modified
    times_string = working_df['waypoint_times'].iloc[0]
    times_list = times_string.split('&')
    times_list = [float(x) for x in times_list]
    points_string = working_df['moving']
    points_string = points_string.iloc[0]
    points_list = points_string.split('&')
    points_x = []
    points_y = []
    points_z = []
    for point in points_list:
        point_list = point.split(' ')
        points_x.append(point_list[0])
        points_y.append(point_list[1])
        points_z.append(point_list[2])

    #get corresponding positions for HPAC times,
    #since there could be mismatches

    points_x = np.cumsum([float(x) for x in points_x])
    points_y = np.cumsum([float(y) for y in points_y])
    points_z = np.cumsum([float(z) for z in points_z])

    x_interp = np.interp(output_time_list,times_list,points_x).tolist()
    y_interp = np.interp(output_time_list,times_list,points_y).tolist()
    z_interp = np.interp(output_time_list,times_list,points_z).tolist()

    working_df.loc[:,('x_movement')] = x_interp
    working_df.loc[:,('y_movement')] = y_interp
    working_df.loc[:,('z_movement')] = z_interp

    working_df.loc[:,'x_pos'] = working_df.apply(set_x_vel, axis = 1)
    working_df.loc[:,'y_pos'] = working_df.apply(set_y_vel, axis = 1)
    working_df.loc[:,'z_pos'] = working_df.apply(set_z_vel, axis = 1)

    return(working_df)

My current implementation works, but on my real dataset it takes about 20 minutes to run, whereas a simple groupby.apply lambda call on my DataFrame takes only a few seconds to a minute.
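
For reference, the row-wise apply(..., axis=1) calls in set_group_positions each compute a plain sum of two columns, and the same values can be obtained with column arithmetic, which usually avoids the per-row Python overhead. The sketch below uses a small made-up frame (the name frame and its values are assumptions for illustration, not the real dataset) and only checks that both forms agree; it makes no claim about how much of the 20 minutes this accounts for.

```python
import pandas as pd

# Made-up data for illustration; only the column names come from the question.
frame = pd.DataFrame({'X': [0.0, 0.0, 1.0],
                      'x_movement': [0.5, 1.0, 1.5]})

def set_x_vel(row):
    # Same helper as in the question: adds two values from one row.
    return row['X'] + row['x_movement']

# Row-wise version: calls set_x_vel once per row.
row_wise = frame.apply(set_x_vel, axis=1)

# Column-wise version: adds the two columns in one operation.
vectorized = frame['X'] + frame['x_movement']

print(row_wise.tolist())
print(vectorized.tolist())
```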

Answer 1

Instead of looping, you can use apply together with groupby and a function call:

df = df.groupby('ID').apply(set_group_positions)
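
A minimal, self-contained sketch of that pattern follows. It is not the asker's set_group_positions; add_cumulative_x and the six-row frame are hypothetical stand-ins chosen so the result is easy to check by hand. The sketch compares the explicit loop-and-concat approach with groupby.apply. Depending on the pandas version, the grouping column may or may not be passed to the applied function (and newer versions may emit a warning about it), so the stand-in function deliberately does not read 'ID'.

```python
import pandas as pd

# Made-up data; column names mirror the question.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'X':    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

def add_cumulative_x(group):
    # Hypothetical per-group function: returns a copy of the group
    # with one extra column computed from that group's rows only.
    out = group.copy()
    out['x_pos'] = out['X'] + out['time'].cumsum()
    return out

# Explicit loop over the groups, then concatenate the pieces.
looped = pd.concat([add_cumulative_x(g) for _, g in df.groupby('ID')])

# Same computation via groupby.apply; group_keys=False keeps the
# original row index instead of adding 'ID' as an extra index level.
applied = df.groupby('ID', group_keys=False).apply(add_cumulative_x)

print(applied['x_pos'].tolist())
```

Passing group_keys=False is a design choice here: it keeps the output indexed like the input, which matches what the loop-and-concat version produces.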
