A better way to stack pandas rows?

Asked: 2018-06-29 20:35:19

Tags: python python-3.x pandas numpy

I have the following table:

import numpy as np
import pandas as pd

data = {'weekday': ["Monday", "Monday", "Monday",
         "Thursday", "Thursday", "Thursday", "Thursday"],
        'Person 1': [12, 6, 5, 8, 11, 6, 4],
        'Person 2': [10, 6, 11, 5, 8, 9, 12],
        'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data, columns=['weekday',
        'Person 1', 'Person 2', 'Person 3'])

df

    weekday  Person 1  Person 2  Person 3
0    Monday        12        10         8
1    Monday         6         6         5
2    Monday         5        11         7
3  Thursday         8         5         3
4  Thursday        11         8         7
5  Thursday         6         9        11
6  Thursday         4        12        15

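For each unique value in the 'weekday' column, I want to flatten that group's rows into one long array, padding shorter groups with zeros, to get the following output: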

array([[12., 10.,  8.,  6.,  6.,  5.,  5., 11.,  7.,  0.,  0.,  0.],
   [ 8.,  5.,  3., 11.,  8.,  7.,  6.,  9., 11.,  4., 12., 15.]])

My current solution:

def getting_numpy_array(df, colum_name='weekday'):
    # Every column except the grouping column holds values to stack
    colum_name_to_use = [i for i in df.columns if i != colum_name]

    # Padded row length: largest group size times the number of value columns
    max_seq_to_pad = df[colum_name].value_counts().max() * df[colum_name_to_use].shape[1]
    uniq_items = df[colum_name].unique()
    stack_np = np.zeros(max_seq_to_pad)  # placeholder first row, dropped on return

    for item in uniq_items:
        # .values is used here because get_values() was removed in pandas 1.0
        group_value = df.loc[df[colum_name] == item][colum_name_to_use].values.reshape(-1)

        if group_value.shape[0] == max_seq_to_pad:
            stack_np = np.vstack((stack_np, group_value))

        else:
            # Right-pad shorter groups with zeros
            group_value = np.pad(group_value, [0, max_seq_to_pad - group_value.shape[0]], mode='constant')
            stack_np = np.vstack((stack_np, group_value))

    return stack_np[1:]

getting_numpy_array(df, colum_name='weekday')

array([[12., 10.,  8.,  6.,  6.,  5.,  5., 11.,  7.,  0.,  0.,  0.],
      [ 8.,  5.,  3., 11.,  8.,  7.,  6.,  9., 11.,  4., 12., 15.]])

Is there a better way to solve this?

1 Answer:

Answer 0 (score: 1)

You can shorten this considerably with groupby and a list comprehension:

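# Group once, then flatten each group's value columns (everything after 'weekday')
s = df.groupby('weekday')
L = [s.get_group(i).values[:,1:].ravel() for i in s.groups]

# Right-pad every flattened group with zeros up to the length of the longest one
padding = max(len(l) for l in L)
L = np.array([np.pad(i, [0, padding-len(i)], mode='constant') for i in L])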

array([[12, 10, 8, 6, 6, 5, 5, 11, 7, 0, 0, 0],
       [8, 5, 3, 11, 8, 7, 6, 9, 11, 4, 12, 15]], dtype=object)
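The result above has dtype=object because .values is taken on the whole frame, which still contains the string 'weekday' column. A minimal sketch of a variant that selects only the value columns first, and so returns a float array like the one in the question, might look like this. The function name stack_groups is hypothetical, and .to_numpy() assumes pandas 0.24 or later:

```python
import numpy as np
import pandas as pd


def stack_groups(df, key="weekday"):
    """Flatten each group's rows into one 1-D array, zero-padded to equal width."""
    # Use only the value columns so the string key never enters the array
    value_cols = [c for c in df.columns if c != key]
    # sort=False keeps groups in order of first appearance, like unique()
    rows = [
        group[value_cols].to_numpy().ravel()
        for _, group in df.groupby(key, sort=False)
    ]
    width = max(len(r) for r in rows)
    # Pre-allocate zeros and copy each group in, so padding comes for free
    out = np.zeros((len(rows), width), dtype=float)
    for i, r in enumerate(rows):
        out[i, : len(r)] = r
    return out


df = pd.DataFrame({
    "weekday": ["Monday"] * 3 + ["Thursday"] * 4,
    "Person 1": [12, 6, 5, 8, 11, 6, 4],
    "Person 2": [10, 6, 11, 5, 8, 9, 12],
    "Person 3": [8, 5, 7, 3, 7, 11, 15],
})
result = stack_groups(df)
print(result)
```

Filling a pre-allocated array avoids both the repeated np.vstack calls and the per-group np.pad call; whether that matters in practice depends on the number of groups.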

This also gives a large speedup on bigger data, roughly 15x on a 700,000-row frame in the timings below:

In [58]: len(df)
Out[58]: 700000

In [59]: %timeit getting_numpy_array(df)
7.43 s ± 260 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [60]: %timeit user3483203(df)
499 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
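The answer does not show how the 700,000-row frame or the timed function were defined. One way to set up a comparable measurement is sketched below, assuming the frame was built by repeating the 7-row example 100,000 times and that the timed function simply wraps the groupby snippet. The name groupby_pad is hypothetical, and timings will differ by machine and pandas version:

```python
import time

import numpy as np
import pandas as pd


def groupby_pad(df, key="weekday"):
    """Assumed wrapper around the groupby approach from the answer."""
    grouped = df.groupby(key)
    # Drop the key column before flattening so the arrays stay numeric
    flat = [
        grouped.get_group(k).drop(columns=key).to_numpy().ravel()
        for k in grouped.groups
    ]
    width = max(len(a) for a in flat)
    return np.array([np.pad(a, (0, width - len(a)), mode="constant") for a in flat])


small = pd.DataFrame({
    "weekday": ["Monday"] * 3 + ["Thursday"] * 4,
    "Person 1": [12, 6, 5, 8, 11, 6, 4],
    "Person 2": [10, 6, 11, 5, 8, 9, 12],
    "Person 3": [8, 5, 7, 3, 7, 11, 15],
})

# Assumption: 100,000 copies of the 7-row example give 700,000 rows
big = pd.concat([small] * 100_000, ignore_index=True)

start = time.perf_counter()
result = groupby_pad(big)
elapsed = time.perf_counter() - start
print(f"{len(big)} rows -> shape {result.shape} in {elapsed:.3f} s")
```

In IPython, `%timeit groupby_pad(big)` would give the mean and standard deviation over several runs, in the same format as the figures above.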