I created a DataFrame from a table on a website.
df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header = 0)[4].set_index('Date')
Then I wrote it to an HTML file whose name contains today's date.
import datetime as dt
today_date = dt.date.today().isoformat()
html_name = 'Insider Trading/{}_buys.html'.format(today_date)
df.to_html(html_name)
When I open the HTML file, it looks like this (but with borders around the rows and columns). It is very clean, with no errors.
Ticker Owner Relationship Transaction Cost #Shares Value ($) #Shares Total SEC Form 4
Date
Sep 30 PIH Fundamental Global Investors, 10% Owner Buy 6.25 700 4375 352202 Sep 30 06:28 PM
Sep 28 PIH Fundamental Global Investors, 10% Owner Buy 6.05 36400 220220 351502 Sep 30 06:28 PM
Sep 30 FSTR Vizi Bradley Director Buy 12.00 14419 173028 801209 Sep 30 05:21 PM
Sep 29 FSTR Vizi Bradley Director Buy 12.00 11292 135504 786790 Sep 30 05:21 PM
Sep 28 FSTR Vizi Bradley Director Buy 11.83 9500 112385 775498 Sep 30 05:21 PM
Now, when I try to read the HTML file back into a DataFrame, like this:
import pandas as pd
df = pd.read_html('Insider Trading/2016-09-30_buys.html')[0]
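(read_html returns only one DataFrame for this file, which is why I use [0].)
I get twice as many columns as expected: 20 instead of 10, and the extra columns have names of the form "Unnamed: N".
My output looks like this: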
Unnamed: 0 Ticker Owner Relationship Transaction \
0 Sep 30 PIH Fundamental Global Investors, 10% Owner Buy
1 Sep 28 PIH Fundamental Global Investors, 10% Owner Buy
2 Sep 30 FSTR Vizi Bradley Director Buy
3 Sep 29 FSTR Vizi Bradley Director Buy
4 Sep 28 FSTR Vizi Bradley Director Buy
Cost #Shares Value ($) #Shares Total SEC Form 4 Date \
0 6.25 700 4375 352202 Sep 30 06:28 PM NaN
1 6.05 36400 220220 351502 Sep 30 06:28 PM NaN
2 12.00 14419 173028 801209 Sep 30 05:21 PM NaN
3 12.00 11292 135504 786790 Sep 30 05:21 PM NaN
4 11.83 9500 112385 775498 Sep 30 05:21 PM NaN
Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
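As a stopgap, the extra columns in this output contain only NaN, so they can be dropped after reading. The following is a minimal sketch, not a fix for the underlying cause: the `raw` frame is a hand-built, shortened stand-in for what read_html returned, and `drop_empty_columns` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for the frame returned by read_html:
# real columns, then a stray "Date" column and an "Unnamed: N" column
# that hold nothing but NaN.
raw = pd.DataFrame({
    "Unnamed: 0": ["Sep 30", "Sep 28"],
    "Ticker": ["PIH", "PIH"],
    "Cost": [6.25, 6.05],
    "Date": [np.nan, np.nan],
    "Unnamed: 11": [np.nan, np.nan],
})


def drop_empty_columns(frame, index_col="Unnamed: 0", index_name="Date"):
    """Drop columns that are entirely NaN, then restore the index (hypothetical helper)."""
    cleaned = frame.dropna(axis=1, how="all")  # removes only all-NaN columns
    cleaned = cleaned.set_index(index_col)     # first column held the dates
    cleaned.index.name = index_name            # give the index its original name back
    return cleaned


cleaned = drop_empty_columns(raw)
print(cleaned)
```

Note that this would also drop a genuine column that happens to be entirely NaN, so it only suits tables where that cannot occur.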
What might I be doing wrong?
I also tried this code, as was suggested:
df = pd.read_html('http://finviz.com/insidertrading.ashx?tc=1', header = 0, attrs = {'class': 'body-table'})[0].set_index('SEC Form 4')
But I seem to run into the same problem.
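Another thing that could be tried is on the writing side. This sketch rests on an unconfirmed assumption: that the extra columns come from the additional header row `to_html` writes when the index has a name. Moving "Date" back into an ordinary column and passing `index=False` produces a single header row. The small `df` below is a hypothetical miniature of the scraped table.

```python
import pandas as pd

# Hypothetical miniature of the scraped table, indexed by "Date" as in the question.
df = pd.DataFrame(
    {"Date": ["Sep 30", "Sep 28"], "Ticker": ["PIH", "PIH"], "Cost": [6.25, 6.05]}
).set_index("Date")

# Turn the index back into a regular column and do not write the index,
# so the table header is one row with one cell per column.
flat_html = df.reset_index().to_html(index=False)

# Inspect only the <thead> section of the generated markup.
header_block = flat_html.split("<thead>")[1].split("</thead>")[0]
print(header_block.count("<tr"), "header row(s),", header_block.count("<th"), "header cell(s)")
```

A file written this way could then be read back with `pd.read_html(path)[0].set_index('Date')`, which needs an HTML parser such as lxml to be installed.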