Question

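Q1. Is there a way to extract data from a table while still keeping track of the axis titles? Q2. Which is the better way to extract data from an HTML table: HTMLParser, BeautifulSoup, or something else?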

I am trying to extract this income statement: http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN

I would like to get:

"Currency in Millions of British Pounds","2009","2010","2011","2012"

"Revenues","53,898.0","56,910.0","60,455.0","64,539.0"

"TOTAL REVENUES","53,898.0","56,910.0","60,455.0","64,539.0"

At the same time, I want to be able to tell that "56,910.0" is the Revenues figure for 2010.

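But I ran into two problems:

HTMLParser.HTMLParseError: malformed start tag, at line 1148, column 47, or HTMLParser.HTMLParseError: bad end tag: "", at line 225, column 104
I cannot keep track of the axis titles.

Thanks a lot.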

Answer 1

I do a lot of scraping, and BeautifulSoup rarely disappoints.


# Written for Python 3 with BeautifulSoup 4 (the bs4 package)
from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN"
html = urlopen(URL)
soup = BeautifulSoup(html, "html.parser")
# The income statement is the table with class "financialStatement"
statement = soup.find("table", {"class": "financialStatement"})
rows = statement.find_all("tr")

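At this point I think you will find that rows has length 25, that its first item is the header row, and that its last item is the last row of the table you want.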
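To keep the axis titles attached to each value, one option is to key every cell by its row label and column header. The following is only a minimal sketch, assuming the first row holds the column headers and the first cell of every later row is the row label. `cell_texts` and `index_by_axes` are hypothetical helper names, and the sample rows are the ones quoted in the question, not live data from the page.

```python
def cell_texts(rows):
    """Turn BeautifulSoup <tr> tags into lists of stripped cell strings."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in rows
    ]


def index_by_axes(table):
    """Map (row label, column header) -> cell value.

    Assumes table[0] is the header row and each later row starts with its label.
    """
    columns = table[0][1:]  # column headers, e.g. ["2009", "2010", ...]
    lookup = {}
    for row in table[1:]:
        if not row:
            continue  # skip rows that contain no cells
        label, values = row[0], row[1:]
        for column, value in zip(columns, values):
            lookup[(label, column)] = value
    return lookup


# Sample data copied from the question, standing in for cell_texts(rows).
sample = [
    ["Currency in Millions of British Pounds", "2009", "2010", "2011", "2012"],
    ["Revenues", "53,898.0", "56,910.0", "60,455.0", "64,539.0"],
    ["TOTAL REVENUES", "53,898.0", "56,910.0", "60,455.0", "64,539.0"],
]
lookup = index_by_axes(sample)
print(lookup[("Revenues", "2010")])  # 56,910.0
```

With the real page, `index_by_axes(cell_texts(rows))` should build the same kind of mapping from `rows`. If two rows share a label, the later one overwrites the earlier one in this sketch, so a list of (label, column, value) tuples may be a better fit in that case.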

python - Extract HTML table without losing axis titles
