Reshaping a GroupBy in pandas, filling with NaN where values are missing

Time: 2017-07-06 14:30:06

Tags: python pandas reshape

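Given a data frame in which each group (defined by some groupby variable) has a different number of elements, I need to reshape it into a matrix with a predefined number of columns. For example: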

    summary_x  participant_id_x response_date cuts
0         3.0                11    2016-05-05    a
1         3.0                11    2016-05-06    a
2         4.0                11    2016-05-07    a
3         4.0                11    2016-05-08    a
4         3.0                11    2016-05-09    a
5         3.0                11    2016-05-10    a
6         3.0                11    2016-05-11    a
7         3.0                11    2016-05-12    a
8         3.0                11    2016-05-13    a
9         3.0                11    2016-05-14    a
13        4.0                11    2016-05-22    b
14        4.0                11    2016-05-23    b
15        3.0                11    2016-05-24    b
16        3.0                11    2016-05-25    b
17        3.0                11    2016-05-26    b
18        3.0                11    2016-05-27    b
19        3.0                11    2016-05-28    b
20        3.0                11    2016-06-02    c
21        3.0                11    2016-06-03    c
22        3.0                11    2016-06-04    c
23        3.0                11    2016-06-05    c
24        3.0                11    2016-06-06    c
25        3.0                11    2016-06-07    c
26        3.0                11    2016-06-08    c
27        3.0                11    2016-06-09    c
28        3.0                11    2016-06-10    c
29        5.0                11    2016-06-11    c

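Each group (by 'cuts') contains 10 elements, except group 'b', which contains only 7. I would like to reshape 'summary_x' into a (3, 10) matrix, with the missing values filled with NaN: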

pd.DataFrame(df.summary_x.values.reshape((-1,10)))

      0    1    2    3    4    5    6    7    8    9
0   3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1   nan  nan  nan  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2   3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
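
For reference, here is a minimal sketch of why the plain reshape call above cannot give this result on its own: the column holds 10 + 7 + 10 = 27 values, and NumPy cannot split 27 values into rows of 10. The frame below is a stand-in rebuilt from the 'summary_x' and 'cuts' columns shown above; the other columns are left out, and the `reshape_failed` flag is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame above: only 'summary_x' and 'cuts' are rebuilt.
df = pd.DataFrame({
    "summary_x": [3, 3, 4, 4, 3, 3, 3, 3, 3, 3,      # group a (10 values)
                  4, 4, 3, 3, 3, 3, 3,               # group b (7 values)
                  3, 3, 3, 3, 3, 3, 3, 3, 3, 5.0],   # group c (10 values)
    "cuts": list("a" * 10 + "b" * 7 + "c" * 10),
})

try:
    df.summary_x.values.reshape((-1, 10))
    reshape_failed = False
except ValueError:
    # 27 values cannot be split into rows of 10
    reshape_failed = True

print(len(df), reshape_failed)
```

So the group boundaries have to be taken into account before the values can be laid out row by row.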

Any solutions?

1 answer:

Answer 0 (score: 1)

You can use cumcount (counting in descending order) together with [::-1] to reverse the order of the columns:

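import numpy as np
import pandas as pd

# Count down within each group so the last row of every group gets 0
g = df.groupby('cuts').cumcount(ascending=False)
# Pivot to one row per group, then reverse the columns to put the NaNs first
out = (df.set_index(['cuts', g])['summary_x'].unstack()
         .iloc[:, ::-1].reset_index(drop=True))
out.columns = np.arange(len(out.columns))
print (out)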
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
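
To see why the descending count right-aligns the shorter group, it may help to look at what cumcount returns in each direction. The sketch below uses a small hypothetical frame (`small`, with a made-up `val` column) rather than the data from the question.

```python
import pandas as pd

# Small hypothetical frame: group 'a' has 3 rows, group 'b' has 2.
small = pd.DataFrame({"cuts": list("aaabb"), "val": [1.0, 2.0, 3.0, 4.0, 5.0]})

forward = small.groupby("cuts").cumcount()                  # counts up from the first row
backward = small.groupby("cuts").cumcount(ascending=False)  # counts down to the last row

print(forward.tolist())   # [0, 1, 2, 0, 1]
print(backward.tolist())  # [2, 1, 0, 1, 0]
```

With the descending count, the last row of every group is labelled 0, so after pivoting on that label and reversing the columns, the last elements of all groups share the final column and the gap of a shorter group falls on the left.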

Another solution:

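# Reverse the rows so each group's list comes out last-element-first
L = df[::-1].groupby('cuts')['summary_x'].apply(list).values.tolist()
# Short lists are padded with NaN on the right; reversing the columns moves the padding left
out = pd.DataFrame(L).iloc[:, ::-1]
out.columns = np.arange(len(out.columns))
print (out)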
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
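
This second approach relies on the DataFrame constructor padding ragged lists with NaN on the right. A minimal sketch of that behaviour, using hypothetical lists (`rows`) rather than the data from the question:

```python
import pandas as pd

# Hypothetical ragged input: the second list is one element short.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0]]

# Built directly, the short row is padded at the end.
padded_right = pd.DataFrame(rows)

# Reversing each list first and the columns afterwards moves the padding to the front.
padded_left = pd.DataFrame([r[::-1] for r in rows]).iloc[:, ::-1]
padded_left.columns = range(padded_left.shape[1])

print(padded_right)
print(padded_left)
```

The double reversal is the same trick as `df[::-1]` followed by `.iloc[:, ::-1]` in the answer, applied to plain lists.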

But if it is also acceptable for the NaNs to be at the end:

# Count up within each group, so shorter groups run out of positions first
g = df.groupby('cuts').cumcount()
out = df.set_index(['cuts', g])['summary_x'].unstack().reset_index(drop=True)

print (out)
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  4.0  4.0  3.0  3.0  3.0  3.0  3.0  NaN  NaN  NaN
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
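
The question asks for a predefined number of columns, whereas the snippets above take the width from the largest group. One possible way to fix the width is to reindex the columns after pivoting. The sketch below is only an illustration under that assumption: `reshape_groups`, its `n_cols` and `pad` parameters, and the `demo` frame are hypothetical names, not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

def reshape_groups(df, key, value, n_cols, pad="left"):
    """Hypothetical helper: one row per group, exactly n_cols columns."""
    # Position of each row inside its group; counting down right-aligns the values.
    pos = df.groupby(key).cumcount(ascending=(pad != "left"))
    wide = df.set_index([key, pos])[value].unstack()
    # Force the requested width: absent positions become NaN,
    # positions >= n_cols are dropped (longer groups are truncated).
    wide = wide.reindex(columns=range(n_cols))
    if pad == "left":
        wide = wide.iloc[:, ::-1]
    wide.columns = range(n_cols)
    return wide.reset_index(drop=True)

# Made-up data: group 'a' has 3 values, group 'b' has 2, requested width is 4.
demo = pd.DataFrame({"cuts": list("aaabb"),
                     "summary_x": [1.0, 2.0, 3.0, 4.0, 5.0]})
out = reshape_groups(demo, "cuts", "summary_x", n_cols=4)
print(out)
```

With `pad="left"` the NaNs come first, as in the desired output of the question; any other value of `pad` leaves them at the end.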