Scraping one specific HTML table among several on the same page with bs4

Asked: 2017-05-13 20:07:18

Tags: python-3.x beautifulsoup

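I want to scrape the table titled "Salaries" from this page: http://www.baseball-reference.com/players/a/alberma01.shtml

import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.baseball-reference.com/players/a/alberma01.shtml'
r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly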

I have tried both of these:

div = soup.find('div', id='all_br-salaries')

div  = soup.find('div', attrs={'id': 'all_br-salaries'})

When I print div I can see the table's data, but when I try this:

div.find('thead')
div.find('tbody')

I get nothing back. How do I select the table correctly, so that I can iterate over the tr / td / th tags and extract the data?

1 answer:

Answer 0: (score: 1)

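The reason? The table's HTML is (don't ask me why) inside a comment. BeautifulSoup keeps a comment as a single string rather than parsing it into tags, which is why printing div shows the data while div.find('thead') and div.find('tbody') return None. So dig the HTML out of the comment, turn that into soup, and mine the soup in the usual way.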
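A minimal sketch of that idea, using bs4's Comment class to locate the comment instead of slicing the raw page string. The inline html sample is a made-up stand-in for the downloaded page (an assumption, so the snippet runs offline), and table_from_comment is a hypothetical helper name. The ids all_br-salaries and br-salaries come from the question and the answer.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside <!-- ... -->
html = """
<div id="all_br-salaries">
<!--
<table class="sortable stats_table" id="br-salaries">
<thead><tr><th>Year</th><th>Salary</th></tr></thead>
<tbody><tr><th>2008</th><td>$395,000</td></tr></tbody>
</table>
-->
</div>
"""

def table_from_comment(html, div_id, table_id):
    """Return the <table> hidden in a comment inside the given div, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id=div_id)
    if div is None:
        return None
    # Comments are string nodes, so filter the div's strings by type
    for comment in div.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment text as its own document
        inner = BeautifulSoup(comment, 'html.parser')
        table = inner.find('table', id=table_id)
        if table is not None:
            return table
    return None

table = table_from_comment(html, 'all_br-salaries', 'br-salaries')
print(table.find('tbody') is not None)
```

Unlike string slicing, this does not depend on the exact attribute order in the opening table tag, though it does assume the comment sits inside that div.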

>>> import requests
>>> page = requests.get('http://www.baseball-reference.com/players/a/alberma01.shtml').text
>>> from bs4 import BeautifulSoup
>>> table_code = page[page.find('<table class="sortable stats_table" id="br-salaries"'):]
>>> soup = BeautifulSoup(table_code, 'lxml')
>>> rows = soup.find_all('tr')
>>> len(rows)
14
>>> for row in rows[1:]:
...     row.text
...     
'200825Baltimore\xa0Orioles$395,000? '
'200926Baltimore\xa0Orioles$410,000? '
'201027Baltimore\xa0Orioles$680,0002.141 '
'201128Boston\xa0Red\xa0Sox$875,0003.141 '
'201229Boston\xa0Red\xa0Sox$1,075,0004.141contracts '
'201330Cleveland\xa0Indians$1,750,0005.141contracts '
'201431Houston\xa0Astros$2,250,0006.141contracts '
'201532Chicago\xa0White\xa0Sox$1,500,0007.141contracts '
'201532Houston\xa0Astros$200,000Buyout of contract option'
'201633Chicago\xa0White\xa0Sox$2,000,0008.141 '
'201734Chicago\xa0White\xa0Sox$250,000Buyout of contract option'
'2017 StatusSigned thru 2017, Earliest Free Agent: 2018'
'Career to date (may be incomplete)$11,385,000'
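Each row.text above runs the cells together. One possible way to keep them apart, which is what the question asks for with tr / td / th, is to collect the cell text row by row. The markup below is a small stand-in in the shape of the salaries table, with values copied from the output above; the header names and the helper name rows_as_lists are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in sample (assumed column headers), so the snippet runs without the real page
table_code = """
<table id="br-salaries">
<thead><tr><th>Year</th><th>Team</th><th>Salary</th></tr></thead>
<tbody>
<tr><th>2008</th><td>Baltimore&nbsp;Orioles</td><td>$395,000</td></tr>
<tr><th>2009</th><td>Baltimore&nbsp;Orioles</td><td>$410,000</td></tr>
</tbody>
</table>
"""

def rows_as_lists(table_html):
    """Return one list of cell strings per <tr>, header row included."""
    soup = BeautifulSoup(table_html, 'html.parser')
    result = []
    for row in soup.find_all('tr'):
        # A row may mix <th> and <td>, so ask for both in document order
        cells = row.find_all(['th', 'td'])
        # Swap non-breaking spaces for plain spaces so names compare cleanly
        result.append([c.get_text(strip=True).replace('\xa0', ' ') for c in cells])
    return result

for cells in rows_as_lists(table_code):
    print(cells)
```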

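Edit: I found out that the table is inside a comment by opening the page's HTML in Chrome and scrolling down to the table I wanted. This is what I found. Note the opening <!-- marker.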

[Image: screenshot of the page's HTML]
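The same check can be roughly approximated in code instead of by eye. This is only a heuristic sketch: it looks at the raw text and asks whether the nearest <!-- before the table is still unclosed, so it can be fooled by comment markers inside scripts or strings. The name inside_comment and the sample string are hypothetical.

```python
def inside_comment(page, marker):
    """Return True if `marker` first occurs in `page` after an unclosed <!--."""
    pos = page.find(marker)
    if pos == -1:
        return False
    last_open = page.rfind('<!--', 0, pos)
    last_close = page.rfind('-->', 0, pos)
    # Inside a comment when the nearest preceding <!-- has not been closed yet
    return last_open != -1 and last_open > last_close

# Made-up sample in the shape described above
sample = '<div><!--\n<table id="br-salaries"></table>\n--></div>'
print(inside_comment(sample, '<table id="br-salaries"'))
```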