Error reading a pandas DataFrame from a file

Time: 2015-01-28 10:20:26

Tags: python pandas

I am trying to read a file with DataFrame.from_csv() in python pandas. The file contains the following values.

TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-06-29 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-07-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-08-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500

However, when I read it into a DataFrame, the resulting frame looks like this.

                       date  ASKHI    PRC  BIDLO        PortfolioDate  \
TICKER                                                                  
MSFT    2012-06-29 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
MSFT    2012-07-31 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
MSFT    2012-08-31 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
MSFT    2012-09-28 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
MSFT    2012-10-31 00:00:00  28.88  28.54   28.5  2010-12-31 00:00:00   

       PortfolioName  
TICKER                
MSFT          SAP500  
MSFT          SAP500  
MSFT          SAP500  
MSFT          SAP500  
MSFT          SAP500  

When I select the column 'date' with frame['date'], the result is:

TICKER
MSFT      2012-06-29 00:00:00
MSFT      2012-07-31 00:00:00
MSFT      2012-08-31 00:00:00
MSFT      2012-09-28 00:00:00
MSFT      2012-10-31 00:00:00

My code is:

frame = DataFrame.from_csv('/home/raghu/log.txt',sep=',');

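I am new to this. Is there something I am missing? Why does the first column come out like this?

Edit: pandas version: '0.14.1'

1 Answer:

Answer 0: (score: 3)

Don't use from_csv; it is no longer maintained. Use read_csv instead: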

In [112]:
import io
import pandas as pd
temp="""TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-06-29 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-07-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-08-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""
df = pd.read_csv(io.StringIO(temp))

df
Out[112]:
  TICKER                 date  ASKHI    PRC  BIDLO        PortfolioDate  \
0   MSFT  2012-06-29 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
1   MSFT  2012-07-31 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
2   MSFT  2012-08-31 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
3   MSFT  2012-09-28 00:00:00    NaN    NaN    NaN  2010-12-31 00:00:00   
4   MSFT  2012-10-31 00:00:00  28.88  28.54   28.5  2010-12-31 00:00:00   

  PortfolioName  
0        SAP500  
1        SAP500  
2        SAP500  
3        SAP500  
4        SAP500  
In [113]:

df['date']
Out[113]:
0    2012-06-29 00:00:00
1    2012-07-31 00:00:00
2    2012-08-31 00:00:00
3    2012-09-28 00:00:00
4    2012-10-31 00:00:00
Name: date, dtype: object

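The reason the first column looks strange to you is that from_csv treats the first column as the index by default (the default value of index_col is 0), whereas read_csv does not (the default value of index_col is None).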
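To make the difference in defaults concrete, here is a minimal sketch that reproduces both behaviours with read_csv alone. The names csv_text, indexed and plain are made up for this illustration, and the data is a shortened copy of the file from the question; passing index_col=0 to read_csv is used here only to mimic what from_csv does by default.

```python
import io
import pandas as pd

# Shortened copy of the sample file from the question (illustrative only).
csv_text = """TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""

# index_col=0 mimics the from_csv default: TICKER becomes the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The read_csv default (index_col=None) keeps TICKER as an ordinary column
# and numbers the rows 0, 1, ...
plain = pd.read_csv(io.StringIO(csv_text))

print(indexed.index.name)            # name of the index in the first frame
print("TICKER" in indexed.columns)   # is TICKER still a column there?
print("TICKER" in plain.columns)     # is TICKER a column in the second frame?
```

In the first frame TICKER is only reachable through indexed.index, which is why it is printed on its own header row in the question's output.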

Edit

To fix the problem without upgrading, just pass index_col=None to from_csv:

In [115]:

df = pd.DataFrame.from_csv(io.StringIO(temp), index_col=None)
df['date']
Out[115]:
0    2012-06-29 00:00:00
1    2012-07-31 00:00:00
2    2012-08-31 00:00:00
3    2012-09-28 00:00:00
4    2012-10-31 00:00:00
Name: date, dtype: object
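If the frame has already been loaded with TICKER as its index, another option is to move the index back into a regular column with reset_index(). This is only a sketch: the names csv_text, frame and flat are hypothetical, and read_csv with index_col=0 stands in for the frame that from_csv produced in the question.

```python
import io
import pandas as pd

# Shortened copy of the sample file from the question (illustrative only).
csv_text = """TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""

# Stand-in for the question's frame: TICKER ends up as the index.
frame = pd.read_csv(io.StringIO(csv_text), index_col=0)

# reset_index() turns the index back into a column named TICKER
# and replaces it with a default integer index.
flat = frame.reset_index()

print(flat.columns.tolist())
print(flat["TICKER"].tolist())
```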