I have the following table:
import numpy as np
import pandas as pd

data = {'weekday': ["Monday", "Monday", "Monday",
                    "Thursday", "Thursday", "Thursday", "Thursday"],
        'Person 1': [12, 6, 5, 8, 11, 6, 4],
        'Person 2': [10, 6, 11, 5, 8, 9, 12],
        'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data, columns=['weekday',
                                 'Person 1', 'Person 2', 'Person 3'])
df
    weekday  Person 1  Person 2  Person 3
0    Monday        12        10         8
1    Monday         6         6         5
2    Monday         5        11         7
3  Thursday         8         5         3
4  Thursday        11         8         7
5  Thursday         6         9        11
6  Thursday         4        12        15
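For each unique item in the 'weekday' column, I want to flatten its rows into one long array, padded with zeros to the length of the longest group, to get the following output: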
array([[12., 10., 8., 6., 6., 5., 5., 11., 7., 0., 0., 0.],
[ 8., 5., 3., 11., 8., 7., 6., 9., 11., 4., 12., 15.]])
My current solution:
def getting_numpy_array(df, colum_name='weekday'):
    # Every column except the grouping column holds values to flatten.
    colum_name_to_use = [i for i in df.columns if i != colum_name]
    # Padded row length: size of the largest group times the number of value columns.
    max_seq_to_pad = df[colum_name].value_counts().max() * df[colum_name_to_use].shape[1]
    uniq_items = df[colum_name].unique()
    stack_np = np.zeros(max_seq_to_pad)  # placeholder first row, dropped on return
    for item in uniq_items:
        # .values replaces get_values(), which was removed in pandas 1.0
        group_value = df.loc[df[colum_name] == item][colum_name_to_use].values.reshape(-1)
        if group_value.shape[0] == max_seq_to_pad:
            stack_np = np.vstack((stack_np, group_value))
        else:
            group_value = np.pad(group_value, [0, max_seq_to_pad - group_value.shape[0]], mode='constant')
            stack_np = np.vstack((stack_np, group_value))
    return stack_np[1:]
getting_numpy_array(df,colum_name ='weekday')
array([[12., 10., 8., 6., 6., 5., 5., 11., 7., 0., 0., 0.],
[ 8., 5., 3., 11., 8., 7., 6., 9., 11., 4., 12., 15.]])
Is there a better way to solve this?
Answer 0 (score: 1)
You can shorten this considerably with groupby and a list comprehension:
s = df.groupby('weekday')
L = [s.get_group(i).values[:,1:].ravel() for i in s.groups]
padding = max(len(l) for l in L)
L = np.array([np.pad(i, [0, padding-len(i)], mode='constant') for i in L])
array([[12, 10, 8, 6, 6, 5, 5, 11, 7, 0, 0, 0],
[8, 5, 3, 11, 8, 7, 6, 9, 11, 4, 12, 15]], dtype=object)
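The output above has dtype=object because .values on the whole group also picks up the string 'weekday' column before it is sliced away. As a minimal sketch of the same groupby-and-pad idea, the version below selects only the value columns first, so the result should stay numeric. The function name pad_groups and its key parameter are hypothetical names chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def pad_groups(df, key="weekday"):
    """Flatten each group's value columns into one row, zero-padded on the right.

    pad_groups and key are illustrative names, not from the original answer.
    """
    value_cols = [c for c in df.columns if c != key]
    # Take only the value columns so the string key column never enters the array.
    rows = [g[value_cols].to_numpy().ravel() for _, g in df.groupby(key)]
    # Pad every flattened group to the length of the longest one.
    width = max(len(r) for r in rows)
    return np.array([np.pad(r, (0, width - len(r)), mode="constant") for r in rows])


data = {"weekday": ["Monday", "Monday", "Monday",
                    "Thursday", "Thursday", "Thursday", "Thursday"],
        "Person 1": [12, 6, 5, 8, 11, 6, 4],
        "Person 2": [10, 6, 11, 5, 8, 9, 12],
        "Person 3": [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data)

result = pad_groups(df)
print(result)
```

Note that groupby sorts the group keys by default, whereas the loop over unique() in the question keeps the order of first appearance. The two orders happen to agree for this data; passing sort=False to groupby should preserve appearance order if that matters.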
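This is also much faster on larger data. In the timing below, on 700,000 rows and with the snippet above wrapped in a function named user3483203, it takes 499 ms versus 7.43 s, roughly a 15x speed-up: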
In [58]: len(df)
Out[58]: 700000
In [59]: %timeit getting_numpy_array(df)
7.43 s ± 260 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [60]: %timeit user3483203(df)
499 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)