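I am using pandas to parse an Excel spreadsheet. The spreadsheet has several worksheets, and each worksheet looks like the one shown below. Note that every column has values for its own set of dates, and the columns have different lengths:
For whatever reason, when pandas parses the spreadsheet, the first worksheet gets its first column of dates parsed as the index (even though the index_col argument was set to None). That is still manageable.
In the other worksheets, however, it parses the index as a MultiIndex:
What I ultimately want is to rebuild the dataframes so that they share a common date index, with NaN filled in for any date that has no value. However, I cannot seem to extract the dates from the MultiIndex to even begin that process.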
I tried calling reset_index() on the dataframe at levels 0 and 1, but it complains: IndexError: cannot do a non-empty take from an empty axes.
I also tried unstack(), but it complains: ValueError: Index contains duplicate entries, cannot reshape
Answer 0 (score: 0)
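I think you can use read_excel with the parameters parse_cols, header, and index_col. Then select each date/value pair by position with iloc, build a date-indexed Series from each pair, and finally concat them into one: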
import pandas as pd
# parse_cols selects the columns to read; later pandas releases call this parameter usecols
df = pd.read_excel('f_name.xlsx', parse_cols=[0, 1, 3, 4, 7, 8], index_col=0, header=0)
# optional: replace NaT labels in the index (not required)
#df.index = df.index.to_series().fillna(0)
print(df)
            Column_val1  Unnamed: 1  Column_val2  Unnamed: 3  Column_val3
1999-01-01            4  2000-01-01            5  2000-01-01            5
1999-01-02            1  2000-01-02            7  2000-01-02            7
1999-01-03            2  2000-01-03            8  2000-01-03            8
1999-01-04            3  2000-01-04            3  2000-01-04            3
1999-01-05            3  2000-01-05            6  2000-01-05            6
1999-01-06            3  2000-01-06            9  2000-01-06            9
1999-01-07            4  2000-01-07            1  2000-01-07            1
1999-01-08            6  2000-01-08            5  2000-01-08            5
1999-01-09            8  2000-01-09            2  2000-01-09            2
1999-01-10            2  2000-01-10            3  2000-01-10            3
1999-01-11            4  2000-01-11           47  2000-01-11           47
1999-01-12            5  2000-01-12            2  2000-01-12            2
NaT                 NaN  2000-01-13            8  2000-01-13            8
NaT                 NaN  2000-01-14            2  2000-01-14            2
NaT                 NaN  2000-01-15           87  2000-01-15           87
NaT                 NaN  2000-01-16            6  2000-01-16            6
NaT                 NaN  2000-01-17           89  2000-01-17           89
NaT                 NaN         NaT          NaN  2000-01-18            7
NaT                 NaN         NaT          NaN  2000-01-19            8
print(df['Column_val1'])
1999-01-01 4
1999-01-02 1
1999-01-03 2
1999-01-04 3
1999-01-05 3
1999-01-06 3
1999-01-07 4
1999-01-08 6
1999-01-09 8
1999-01-10 2
1999-01-11 4
1999-01-12 5
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
Name: Column_val1, dtype: float64
print(df.set_index(df.iloc[:, 1])['Column_val2'])
Unnamed: 1
2000-01-01 5
2000-01-02 7
2000-01-03 8
2000-01-04 3
2000-01-05 6
2000-01-06 9
2000-01-07 1
2000-01-08 5
2000-01-09 2
2000-01-10 3
2000-01-11 47
2000-01-12 2
2000-01-13 8
2000-01-14 2
2000-01-15 87
2000-01-16 6
2000-01-17 89
NaT NaN
NaT NaN
Name: Column_val2, dtype: float64
print(df.set_index(df.iloc[:, 3])['Column_val3'])
Unnamed: 3
2000-01-01 5
2000-01-02 7
2000-01-03 8
2000-01-04 3
2000-01-05 6
2000-01-06 9
2000-01-07 1
2000-01-08 5
2000-01-09 2
2000-01-10 3
2000-01-11 47
2000-01-12 2
2000-01-13 8
2000-01-14 2
2000-01-15 87
2000-01-16 6
2000-01-17 89
2000-01-18 7
2000-01-19 8
Name: Column_val3, dtype: int64
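The three selections above follow the same pattern, so they could be generalized. The following is only a minimal sketch, assuming the sheet was read without an index column so that date and value columns simply alternate and have unique names. The helper name split_pairs and the small sample frame are hypothetical and are not part of the original answer:

```python
import pandas as pd

def split_pairs(raw):
    """Turn alternating (date, value) columns into a list of date-indexed Series."""
    series = []
    for i in range(0, raw.shape[1] - 1, 2):
        # take one date column and the value column right after it
        pair = raw.iloc[:, [i, i + 1]]
        # drop padding rows where both the date and the value are missing
        pair = pair.dropna(how="all")
        date_col, value_col = pair.columns
        series.append(pair.set_index(date_col)[value_col].rename_axis(None))
    return series

# hypothetical sample data standing in for one worksheet
raw = pd.DataFrame({
    "date1": pd.to_datetime(["1999-01-01", "1999-01-02", None]),
    "Column_val1": [4.0, 1.0, None],
    "date2": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
    "Column_val2": [5, 7, 8],
})
parts = split_pairs(raw)
for s in parts:
    print(s)
```

Dropping the all-missing rows up front means the shorter columns do not carry NaT labels into later steps.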
# stack the three date-indexed Series end to end into one Series
df = pd.concat([df['Column_val1'],
                df.set_index(df.iloc[:, 1])['Column_val2'],
                df.set_index(df.iloc[:, 3])['Column_val3']])
# sorting the index makes the result easier to read
df = df.sort_index()
print(df)
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
NaT NaN
1999-01-01 4
1999-01-02 1
1999-01-03 2
1999-01-04 3
1999-01-05 3
1999-01-06 3
1999-01-07 4
1999-01-08 6
1999-01-09 8
1999-01-10 2
1999-01-11 4
1999-01-12 5
2000-01-01 5
2000-01-01 5
2000-01-02 7
2000-01-02 7
2000-01-03 8
2000-01-03 8
2000-01-04 3
2000-01-04 3
2000-01-05 6
2000-01-05 6
2000-01-06 9
2000-01-06 9
2000-01-07 1
2000-01-07 1
2000-01-08 5
2000-01-08 5
2000-01-09 2
2000-01-09 2
2000-01-10 3
2000-01-10 3
2000-01-11 47
2000-01-11 47
2000-01-12 2
2000-01-12 2
2000-01-13 8
2000-01-13 8
2000-01-14 2
2000-01-14 2
2000-01-15 87
2000-01-15 87
2000-01-16 6
2000-01-16 6
2000-01-17 89
2000-01-17 89
2000-01-18 7
2000-01-19 8
dtype: float64
# to drop the rows whose index label is NaT
print(df[pd.notnull(df.index)])
1999-01-01 4
1999-01-02 1
1999-01-03 2
1999-01-04 3
1999-01-05 3
1999-01-06 3
1999-01-07 4
1999-01-08 6
1999-01-09 8
1999-01-10 2
1999-01-11 4
1999-01-12 5
2000-01-01 5
2000-01-01 5
2000-01-02 7
2000-01-02 7
2000-01-03 8
2000-01-03 8
2000-01-04 3
2000-01-04 3
2000-01-05 6
2000-01-05 6
2000-01-06 9
2000-01-06 9
2000-01-07 1
2000-01-07 1
2000-01-08 5
2000-01-08 5
2000-01-09 2
2000-01-09 2
2000-01-10 3
2000-01-10 3
2000-01-11 47
2000-01-11 47
2000-01-12 2
2000-01-12 2
2000-01-13 8
2000-01-13 8
2000-01-14 2
2000-01-14 2
2000-01-15 87
2000-01-15 87
2000-01-16 6
2000-01-16 6
2000-01-17 89
2000-01-17 89
2000-01-18 7
2000-01-19 8
dtype: float64
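The question asks for a frame in which every column shares one date index and dates without a value show up as NaN. The result above stacks everything into a single Series instead, so here is a sketch of the column-wise variant. It assumes each column has already been turned into a date-indexed Series with no NaT labels and no repeated dates inside a single Series; the two short sample Series are made up for illustration. Passing axis=1 to concat aligns the inputs on the union of their indexes:

```python
import pandas as pd

# hypothetical date-indexed Series, one per original value column
s1 = pd.Series(
    [4, 1, 2],
    index=pd.to_datetime(["1999-01-01", "1999-01-02", "1999-01-03"]),
    name="Column_val1",
)
s2 = pd.Series(
    [5, 7],
    index=pd.to_datetime(["1999-01-02", "2000-01-01"]),
    name="Column_val2",
)

# axis=1 places each Series in its own column and uses an outer join on
# the index, so a date missing from one Series becomes NaN in that column
combined = pd.concat([s1, s2], axis=1).sort_index()
print(combined)
```

Because the missing cells become NaN, integer columns are upcast to float in the combined frame.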