pandas: How do I stack data correctly?

Time: 2017-04-12 22:15:57

Tags: python pandas

I have a dataframe that, when first loaded from a list of lists, looks like this:

              0       1       2  3       4       5       6       7       8   \
0        Segment  Nov-12  Dec-12     Jan-13  Feb-13  Mar-13  Apr-13  May-13   
1           A                        N/A     N/A     N/A     N/A     N/A   
2           B                        N/A     N/A     N/A     N/A     N/A   
3           C                        N/A     N/A     N/A     N/A     N/A   
4           D                        N/A     N/A     N/A     N/A     N/A   
5           Total                    N/A     N/A     N/A     N/A     N/A   

The values under each month will be floats. I want to pivot the dataframe so that I end up with:

  Segment Month Value
0 A       month value
1 A       month value
2 B       month value
3 B       month value
etc...

What is the best way to do this?

2 answers:

Answer 0 (score: 2)

v = df.values[1:, 1:].astype(float)

mux = pd.MultiIndex.from_product(
    [df.iloc[1:, 0], df.iloc[0, 1:]],
    names=['Segment', 'Month']
)

d1 = pd.Series(v.ravel(), mux).reset_index(name='Value')
print(d1)
   Segment   Month  Value
0        A  Nov-12    NaN
1        A  Dec-12    NaN
2        A  Jan-13    NaN
3        A  Feb-13    NaN
4        A  Mar-13    NaN
5        A  Apr-13    NaN
6        A  May-13    NaN
7        B  Nov-12    NaN
8        B  Dec-12    NaN
9        B  Jan-13    NaN
10       B  Feb-13    NaN
11       B  Mar-13    NaN
12       B  Apr-13    NaN
13       B  May-13    NaN
14       C  Nov-12    NaN
15       C  Dec-12    NaN
16       C  Jan-13    NaN
17       C  Feb-13    NaN
18       C  Mar-13    NaN
19       C  Apr-13    NaN
20       C  May-13    NaN
21       D  Nov-12    NaN
22       D  Dec-12    NaN
23       D  Jan-13    NaN
24       D  Feb-13    NaN
25       D  Mar-13    NaN
26       D  Apr-13    NaN
27       D  May-13    NaN
28   Total  Nov-12    NaN
29   Total  Dec-12    NaN
30   Total  Jan-13    NaN
31   Total  Feb-13    NaN
32   Total  Mar-13    NaN
33   Total  Apr-13    NaN
34   Total  May-13    NaN

Explanation

# Your data evidently has its index in the first column
# and its column headers in the first row.
# I grab the underlying `numpy` array
# from the 2nd row and 2nd column onward
# and convert it to float
v = df.values[1:, 1:].astype(float)

# I'm going to create a `pd.MultiIndex` to label the values
# of the `pd.Series` I'll create.
# The first level of the index will be that first column,
# which was evidently the index.
# The second level will be the first row, which was
# evidently the column headers.
# The trick here is that I use `from_product`,
# which gives me every combination of those arrays.
# `ravel` unwinds or flattens the matrix row by row, so it now
# lines up with this `pd.MultiIndex` that has every combination
# of row and column labels
mux = pd.MultiIndex.from_product(
    [df.iloc[1:, 0], df.iloc[0, 1:]],
    names=['Segment', 'Month']
)

# I construct the `pd.Series` from the flattened values and the index.
# `reset_index` takes those levels of the index and pushes them out
# into the dataframe's data columns. `name='Value'` just makes sure the
# values of the series get a column name
d1 = pd.Series(v.ravel(), mux).reset_index(name='Value')
print(d1)
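
To try the approach above end to end, here is a minimal self-contained sketch. The values in `rows` are made up for illustration, since the question does not show real numbers. It also assumes the empty cells and 'N/A' arrive as strings, so it converts with `pd.to_numeric(..., errors='coerce')` instead of `astype(float)`, which would raise on the string 'N/A'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the question's list of lists.
rows = [
    ["Segment", "Nov-12", "Dec-12", "Jan-13"],
    ["A", "1.5", "N/A", "2.0"],
    ["B", "", "3.0", "4.5"],
]
df = pd.DataFrame(rows)

# Skip the header row and the label column, then coerce text to floats;
# anything that is not a number ('N/A', '') becomes NaN.
body = df.iloc[1:, 1:]
v = body.apply(pd.to_numeric, errors="coerce").to_numpy()

# Every (segment, month) pair, in the same row-major order as v.ravel().
mux = pd.MultiIndex.from_product(
    [df.iloc[1:, 0], df.iloc[0, 1:]],
    names=["Segment", "Month"],
)

d1 = pd.Series(v.ravel(), index=mux).reset_index(name="Value")
print(d1)
```

The result has one row per segment and month pair, with NaN wherever the source cell was empty or 'N/A'. Chain `.dropna()` onto `d1` if those rows are not wanted.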

Answer 1 (score: 0)

I eventually found a solution, but please let me know how it could be improved.

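import pandas as pd

# `vals` is the list of lists from the question
cac_df = pd.DataFrame(data=vals)
# Use the first column (segment names) as the index, then drop that column
cac_df.rename(index=cac_df[0], inplace=True)
del cac_df[0]
# Promote the 'Segment' row to column headers and remove it from the data
cac_df = cac_df.rename(columns=cac_df.loc['Segment']).drop('Segment')
# Treat empty cells and 'N/A' as missing (`not x` also catches 0)
cac_df = cac_df.applymap(lambda x: None if not x or x == 'N/A' else x)
# Drop all-empty columns, stack to long form, wrap the Series in a DataFrame
cac_df = pd.DataFrame(
    cac_df.dropna(axis=1, how='all').stack()
)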

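`stack` threw me for a loop because it returned a Series rather than a DataFrame. The documentation does note that this happens when the columns have only a single level.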
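
For anyone hitting the same surprise, here is a small sketch of that behaviour and one way to get from the Series to the three-column layout the question asks for. The frame `wide` and its values are hypothetical. The explicit `dropna()` is there because whether `stack` drops missing values by itself depends on the pandas version.

```python
import pandas as pd

# Hypothetical wide frame: segments as the index, months as columns.
wide = pd.DataFrame(
    {"Nov-12": [1.5, None], "Dec-12": [None, 3.0]},
    index=["A", "B"],
)
wide.index.name = "Segment"
wide.columns.name = "Month"

# With a single level of columns, stack() returns a Series
# indexed by (Segment, Month) rather than a DataFrame.
stacked = wide.stack().dropna()
print(type(stacked).__name__)

# Name the values and move both index levels out into columns.
long_df = stacked.reset_index(name="Value")
print(long_df)
```

Naming the index and columns before stacking means `reset_index` produces the `Segment` and `Month` column names directly, so no later rename is needed.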