Extracting a pandas MultiIndex containing NaT from a DataFrame

Time: 2016-02-25 05:10:47

Tags: python excel pandas dataframe multi-index

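I am using pandas to parse an Excel spreadsheet. The spreadsheet has several sheets, each of which looks like the one below. Note that each value column is paired with its own column of dates, and that the columns have different lengths: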

Excel spreadsheet

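For whatever reason, when pandas parses the spreadsheet, the first sheet gets its first column of dates parsed as the index (even though the index_col parameter is specified as None). That is still manageable.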

In the other sheets, however, it parses the index as a MultiIndex:

Screenshot of multiindex

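What I ultimately want is to rebuild the DataFrames so that they share a common date index, with NaN filled in for any date that has no value. However, I can't seem to extract the dates from the MultiIndex to even begin that process.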

I tried calling reset_index() on the DataFrame for levels 0 and 1, but it complains with IndexError: cannot do a non-empty take from an empty axes. I also tried unstack(), but that complains with ValueError: Index contains duplicate entries, cannot reshape.

1 Answer:

Answer 0: (score: 0)

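I think you can use the read_excel parameters parse_cols (called usecols in newer pandas versions), header, and index_col. Then use iloc to build a Series from each date/value pair, and finally concat them into one: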

import pandas as pd

# The original answer passed parse_cols=...; newer pandas versions call this usecols.
df = pd.read_excel('f_name.xlsx', usecols=[0, 1, 3, 4, 7, 8], index_col=0, header=0)
# Optional: replace NaT in the index (not necessary for what follows)
# df.index = df.index.to_series().fillna(0)
print(df)
            Column_val1 Unnamed: 1  Column_val2 Unnamed: 3  Column_val3
1999-01-01            4 2000-01-01            5 2000-01-01            5
1999-01-02            1 2000-01-02            7 2000-01-02            7
1999-01-03            2 2000-01-03            8 2000-01-03            8
1999-01-04            3 2000-01-04            3 2000-01-04            3
1999-01-05            3 2000-01-05            6 2000-01-05            6
1999-01-06            3 2000-01-06            9 2000-01-06            9
1999-01-07            4 2000-01-07            1 2000-01-07            1
1999-01-08            6 2000-01-08            5 2000-01-08            5
1999-01-09            8 2000-01-09            2 2000-01-09            2
1999-01-10            2 2000-01-10            3 2000-01-10            3
1999-01-11            4 2000-01-11           47 2000-01-11           47
1999-01-12            5 2000-01-12            2 2000-01-12            2
NaT                 NaN 2000-01-13            8 2000-01-13            8
NaT                 NaN 2000-01-14            2 2000-01-14            2
NaT                 NaN 2000-01-15           87 2000-01-15           87
NaT                 NaN 2000-01-16            6 2000-01-16            6
NaT                 NaN 2000-01-17           89 2000-01-17           89
NaT                 NaN        NaT          NaN 2000-01-18            7
NaT                 NaN        NaT          NaN 2000-01-19            8
print(df['Column_val1'])
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
Name: Column_val1, dtype: float64
print(df.set_index(df.iloc[:, 1])['Column_val2'])
Unnamed: 1
2000-01-01     5
2000-01-02     7
2000-01-03     8
2000-01-04     3
2000-01-05     6
2000-01-06     9
2000-01-07     1
2000-01-08     5
2000-01-09     2
2000-01-10     3
2000-01-11    47
2000-01-12     2
2000-01-13     8
2000-01-14     2
2000-01-15    87
2000-01-16     6
2000-01-17    89
NaT          NaN
NaT          NaN
Name: Column_val2, dtype: float64
print(df.set_index(df.iloc[:, 3])['Column_val3'])
Unnamed: 3
2000-01-01     5
2000-01-02     7
2000-01-03     8
2000-01-04     3
2000-01-05     6
2000-01-06     9
2000-01-07     1
2000-01-08     5
2000-01-09     2
2000-01-10     3
2000-01-11    47
2000-01-12     2
2000-01-13     8
2000-01-14     2
2000-01-15    87
2000-01-16     6
2000-01-17    89
2000-01-18     7
2000-01-19     8
Name: Column_val3, dtype: int64
# Stack the three series end to end into one long Series
df = pd.concat([df['Column_val1'],
                df.set_index(df.iloc[:, 1])['Column_val2'],
                df.set_index(df.iloc[:, 3])['Column_val3']])

# Sorting the index afterwards is recommended
df = df.sort_index()
print(df)
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
2000-01-01     5
2000-01-01     5
2000-01-02     7
2000-01-02     7
2000-01-03     8
2000-01-03     8
2000-01-04     3
2000-01-04     3
2000-01-05     6
2000-01-05     6
2000-01-06     9
2000-01-06     9
2000-01-07     1
2000-01-07     1
2000-01-08     5
2000-01-08     5
2000-01-09     2
2000-01-09     2
2000-01-10     3
2000-01-10     3
2000-01-11    47
2000-01-11    47
2000-01-12     2
2000-01-12     2
2000-01-13     8
2000-01-13     8
2000-01-14     2
2000-01-14     2
2000-01-15    87
2000-01-15    87
2000-01-16     6
2000-01-16     6
2000-01-17    89
2000-01-17    89
2000-01-18     7
2000-01-19     8
dtype: float64
# If you need to remove the rows where the index is NaT
print(df[pd.notnull(df.index)])
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
2000-01-01     5
2000-01-01     5
2000-01-02     7
2000-01-02     7
2000-01-03     8
2000-01-03     8
2000-01-04     3
2000-01-04     3
2000-01-05     6
2000-01-05     6
2000-01-06     9
2000-01-06     9
2000-01-07     1
2000-01-07     1
2000-01-08     5
2000-01-08     5
2000-01-09     2
2000-01-09     2
2000-01-10     3
2000-01-10     3
2000-01-11    47
2000-01-11    47
2000-01-12     2
2000-01-12     2
2000-01-13     8
2000-01-13     8
2000-01-14     2
2000-01-14     2
2000-01-15    87
2000-01-15    87
2000-01-16     6
2000-01-16     6
2000-01-17    89
2000-01-17    89
2000-01-18     7
2000-01-19     8
dtype: float64
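
The result above is one long Series in which a date can appear more than once. If the goal is instead a single table with one shared date index and one column per value series, with NaN where a series has no value for that date, a possible variation is to concatenate along axis=1. The following is only a minimal sketch under assumptions: the frame `raw`, its column names `date1` and `date2`, and the helper `align_pairs` are hypothetical stand-ins for a parsed sheet, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one parsed sheet: date/value column pairs of
# different lengths, padded with NaT/NaN at the bottom.
raw = pd.DataFrame({
    "date1": pd.to_datetime(["1999-01-01", "1999-01-02", None]),
    "Column_val1": [4.0, 1.0, np.nan],
    "date2": pd.to_datetime(["1999-01-02", "1999-01-03", "1999-01-04"]),
    "Column_val2": [5.0, 7.0, 8.0],
})


def align_pairs(frame, pairs):
    """Return one DataFrame indexed by the union of all dates in `pairs`."""
    pieces = []
    for date_col, value_col in pairs:
        # Index the value column by its own date column
        s = frame.set_index(date_col)[value_col]
        # Drop the padding rows whose date is NaT
        s = s[s.index.notna()]
        pieces.append(s)
    # axis=1 aligns the series on the union of their indexes;
    # dates missing from a series are filled with NaN
    return pd.concat(pieces, axis=1).sort_index()


aligned = align_pairs(raw, [("date1", "Column_val1"), ("date2", "Column_val2")])
print(aligned)
```

This relies on each date column holding unique dates once the NaT padding is removed; with duplicate dates inside one series, the column-wise alignment would need a deduplication step first.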