read_html is creating a DataFrame with twice the number of columns

Asked: 2016-10-04 19:51:27

Tags: pandas io

I created a DataFrame from a table on a website.

import pandas as pd

df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header=0)[4].set_index('Date')

Then I wrote it to an HTML file whose name contains today's date.

import datetime as dt

today_date = dt.date.today().isoformat()
html_name = 'Insider Trading/{}_buys.html'.format(today_date)
df.to_html(html_name)

When I open the HTML file it looks like this (but with borders around the rows and columns). Very clean, no errors.

        Ticker                         Owner    Relationship    Transaction     Cost    #Shares     Value ($)   #Shares Total   SEC Form 4
Date                                    
Sep 30  PIH     Fundamental Global Investors,       10% Owner          Buy       6.25        700         4375         352202    Sep 30 06:28 PM
Sep 28  PIH     Fundamental Global Investors,       10% Owner          Buy       6.05       36400       220220        351502    Sep 30 06:28 PM
Sep 30  FSTR                     Vizi Bradley        Director          Buy      12.00       14419       173028        801209    Sep 30 05:21 PM
Sep 29  FSTR                     Vizi Bradley        Director          Buy      12.00       11292       135504        786790    Sep 30 05:21 PM
Sep 28  FSTR                   Vizi Bradley          Director          Buy      11.83        9500       112385        775498    Sep 30 05:21 PM
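One detail that may matter here: the display above puts the index name `Date` on its own line under the column names, which suggests `to_html` writes two header rows when the index has a name. The following is a minimal sketch for checking that, using a made-up two-row frame rather than the real finviz data; `header_row_count` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy stand-in for the real table: two columns and an index named 'Date'.
df = pd.DataFrame(
    {"Ticker": ["PIH", "FSTR"], "Cost": [6.25, 12.00]},
    index=pd.Index(["Sep 30", "Sep 30"], name="Date"),
)

def header_row_count(html):
    """Count the <tr> elements that appear inside the <thead> section."""
    thead = html.split("</thead>")[0]
    return thead.count("<tr")

# Default output: the named index is written as part of the table header.
named_index_rows = header_row_count(df.to_html())

# Alternative: turn the index into an ordinary column and omit the index.
flat_rows = header_row_count(df.reset_index().to_html(index=False))

print(named_index_rows, flat_rows)
```

If the first count comes out larger than the second, writing the file with `df.reset_index().to_html(html_name, index=False)` and calling `set_index('Date')` after reading it back might sidestep the extra header row. I have not confirmed that this is the cause of the problem described below.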

Now, when I try to read the HTML file back into a DataFrame like this:

import pandas as pd

df = pd.read_html('Insider Trading/2016-09-30_buys.html')[0]

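(Reading the HTML file returns only one DataFrame, which is why I use [0].)

I get twice as many columns as expected: 20 instead of 10. The 10 extra columns contain only NaN, and all but one of them have names like "Unnamed: 11".

So my output looks like this: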

  Unnamed: 0 Ticker                          Owner Relationship  Transaction  \
0     Sep 30    PIH  Fundamental Global Investors,    10% Owner         Buy   
1     Sep 28    PIH  Fundamental Global Investors,    10% Owner         Buy   
2     Sep 30   FSTR                   Vizi Bradley     Director         Buy   
3     Sep 29   FSTR                   Vizi Bradley     Director         Buy   
4     Sep 28   FSTR                   Vizi Bradley     Director         Buy   

    Cost  #Shares  Value ($)  #Shares Total       SEC Form 4  Date  \
0   6.25      700       4375         352202  Sep 30 06:28 PM   NaN   
1   6.05    36400     220220         351502  Sep 30 06:28 PM   NaN   
2  12.00    14419     173028         801209  Sep 30 05:21 PM   NaN   
3  12.00    11292     135504         786790  Sep 30 05:21 PM   NaN   
4  11.83     9500     112385         775498  Sep 30 05:21 PM   NaN   

   Unnamed: 11  Unnamed: 12  Unnamed: 13  Unnamed: 14  Unnamed: 15  \
0          NaN          NaN          NaN          NaN          NaN   
1          NaN          NaN          NaN          NaN          NaN   
2          NaN          NaN          NaN          NaN          NaN   
3          NaN          NaN          NaN          NaN          NaN   
4          NaN          NaN          NaN          NaN          NaN   

   Unnamed: 16  Unnamed: 17  Unnamed: 18  Unnamed: 19  
0          NaN          NaN          NaN          NaN  
1          NaN          NaN          NaN          NaN  
2          NaN          NaN          NaN          NaN  
3          NaN          NaN          NaN          NaN  
4          NaN          NaN          NaN          NaN  
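As a stopgap, the frame above can be cleaned up after the fact. This is only a sketch under assumptions: `raw` is a hand-built miniature of the output above (a few columns, two rows), `tidy` is a hypothetical helper, and it assumes the first column always holds the original index values. It does not address why the extra columns appear in the first place.

```python
import numpy as np
import pandas as pd

# Hand-built miniature of the frame shown above (first two rows, fewer columns).
raw = pd.DataFrame({
    "Unnamed: 0": ["Sep 30", "Sep 28"],
    "Ticker": ["PIH", "PIH"],
    "Cost": [6.25, 6.05],
    "Date": [np.nan, np.nan],
    "Unnamed: 11": [np.nan, np.nan],
})

def tidy(frame, index_name="Date"):
    """Drop all-NaN columns and restore the first column as the named index."""
    # Remove every column whose values are all NaN (the spurious ones).
    out = frame.dropna(axis=1, how="all")
    # Assumption: the first remaining column carries the original index values.
    out = out.rename(columns={out.columns[0]: index_name})
    return out.set_index(index_name)

cleaned = tidy(raw)
print(cleaned)
```

Note that `dropna(axis=1, how="all")` would also discard a legitimate column that happens to be entirely empty, so this is not safe for every table.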

What might I be doing wrong?

Following a suggestion, I also tried this code:

df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header=0, attrs={'class': 'body-table'})[0].set_index('SEC Form 4')

But it seems to run into the same problem.

0 answers:

No answers yet.