Pandas groupby and widen a DataFrame using an order column

Time: 2020-10-01 05:11:13

Tags: python-3.x pandas pivot

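I have a long-format DataFrame that contains multiple samples and time points for each subject. The number of samples and time points can vary, and so can the number of days between time points: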

import pandas as pd

test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
                    "sample":["A", "B", "C", "D", "E", "F"],
                    "timepoint":[19,11,8,6,2,12],
                    "time_order":[3,2,1,2,1,1]
})

   subject_id   sample  timepoint   time_order
0    1            A        19           3
1    1            B        11           2
2    1            C         8           1
3    2            D         6           2
4    2            E         2           1
5    3            F        12           1

I need a way to group this DataFrame by subject_id and then place all of a subject's samples and time points on a single row, in time order.

Desired output:

    subject_id  sample1  timepoint1  sample2  timepoint2  sample3  timepoint3
0   1           C        8           B        11          A        19
1   2           E        2           D        6           null     null
5   3           F        12          null     null        null     null

Pivoting gets me close, but I'm stuck on how to proceed from there:

test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')


2 answers:

Answer 0 (score: 1)

Use DataFrame.set_index together with DataFrame.unstack to pivot, sort the column MultiIndex, flatten it, and finally convert subject_id back into a column:

df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
   subject_id sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
0           1       C         8.0       B        11.0       A        19.0
1           2       E         2.0       D         6.0     NaN         NaN
2           3       F        12.0     NaN         NaN     NaN         NaN
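
The timepoint columns in the output above come back as floats because NaN forces an upcast. As a minimal sketch, assuming integer-looking output with missing values is wanted (as in the question's desired output), the same result can be cast to pandas' nullable Int64 dtype. The names wide and tp_cols are illustrative only and not part of the original answer:

```python
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})

# Same reshaping steps as in the answer above
wide = (test_df.set_index(['subject_id', 'time_order'])
               .unstack()
               .sort_index(level=[1, 0], axis=1))
wide.columns = [f'{name}{order}' for name, order in wide.columns]
wide = wide.reset_index()

# Cast every timepoint column to the nullable integer dtype,
# so missing entries become <NA> instead of forcing floats
tp_cols = [c for c in wide.columns if c.startswith('timepoint')]
wide = wide.astype({c: 'Int64' for c in tp_cols})
print(wide)
```

The sample columns are left untouched and still hold NaN where a subject has fewer samples.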

Answer 1 (score: 1)

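# Take the last, second-to-last and third-to-last row of each subject.
# Assumes rows within each subject are already sorted by descending time_order.
# Note: in pandas >= 2.0, GroupBy.nth keeps the original row index, so b and c no longer align on subject_id.
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)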

            sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
subject_id                                                            
1                C           8       B        11.0       A        19.0
2                E           2       D         6.0     NaN         NaN
3                F          12     NaN         NaN     NaN         NaN
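
Both answers depend on an ordering being available, either as the time_order column or as the physical row order. As a hedged sketch, assuming time_order is not given and has to be derived from timepoint, the rank can be computed per subject with groupby(...).cumcount() and then fed to DataFrame.pivot with a list of value columns. The names ordered and wide are illustrative and not from the original post:

```python
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12]})

# Sort by time within each subject, then number the rows 1, 2, 3, ...
ordered = test_df.sort_values(["subject_id", "timepoint"])
ordered["time_order"] = ordered.groupby("subject_id").cumcount() + 1

# Pivot both value columns at once; columns become a (name, order) MultiIndex
wide = ordered.pivot(index="subject_id", columns="time_order",
                     values=["sample", "timepoint"])

# Interleave sample/timepoint per order, flatten the names, restore subject_id
wide = wide.sort_index(level=[1, 0], axis=1)
wide.columns = [f"{name}{order}" for name, order in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the order is computed from the data, this does not rely on how the input rows happen to be arranged, and it handles any number of time points per subject without hardcoding positions.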