I want to scrape the table titled "Salaries" from this page: http://www.baseball-reference.com/players/a/alberma01.shtml
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.baseball-reference.com/players/a/alberma01.shtml'
r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
I have already tried
div = soup.find('div', id='all_br-salaries')
and
div = soup.find('div', attrs={'id': 'all_br-salaries'})
When I print div, I can see the table's data, but when I try something like this:
div.find('thead')
div.find('tbody')
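I get nothing back (both calls return None). My question is: how do I select the table correctly so that I can iterate over the tr, td and th tags and extract the data?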
Answer 0 (score: 1)
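The reason? The table's HTML is (don't ask me why) inside an HTML comment, so BeautifulSoup treats it as one block of comment text rather than as tags. Pull the HTML out of the comment, turn that into soup, and search that soup in the usual way.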
>>> import requests
>>> page = requests.get('http://www.baseball-reference.com/players/a/alberma01.shtml').text
>>> from bs4 import BeautifulSoup
>>> table_code = page[page.find('<table class="sortable stats_table" id="br-salaries"'):]
>>> soup = BeautifulSoup(table_code, 'lxml')
>>> rows = soup.find_all('tr')
>>> len(rows)
14
>>> for row in rows[1:]:
... row.text
...
'200825Baltimore\xa0Orioles$395,000? '
'200926Baltimore\xa0Orioles$410,000? '
'201027Baltimore\xa0Orioles$680,0002.141 '
'201128Boston\xa0Red\xa0Sox$875,0003.141 '
'201229Boston\xa0Red\xa0Sox$1,075,0004.141contracts '
'201330Cleveland\xa0Indians$1,750,0005.141contracts '
'201431Houston\xa0Astros$2,250,0006.141contracts '
'201532Chicago\xa0White\xa0Sox$1,500,0007.141contracts '
'201532Houston\xa0Astros$200,000Buyout of contract option'
'201633Chicago\xa0White\xa0Sox$2,000,0008.141 '
'201734Chicago\xa0White\xa0Sox$250,000Buyout of contract option'
'2017 StatusSigned thru 2017, Earliest Free Agent: 2018'
'Career to date (may be incomplete)$11,385,000'
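Edit: I found out that the table is inside a comment by opening the page's HTML source in Chrome and scrolling down to the table I wanted. This is what I found. Note the opening <!--.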
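As an alternative to slicing the page text by hand, the same idea can be sketched with BeautifulSoup's Comment class: find the comment nodes, parse each one as its own soup, and read the rows cell by cell. This is only a sketch under stated assumptions. SAMPLE_HTML is a made-up stand-in for the downloaded page (two rows, simplified markup), extract_commented_table is a hypothetical helper name, and the real page may be laid out differently or may have changed since the question was asked.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table markup sits
# inside an HTML comment, as described in the answer above.
SAMPLE_HTML = """
<div id="all_br-salaries">
  <!--
  <table class="sortable stats_table" id="br-salaries">
    <thead><tr><th>Year</th><th>Age</th><th>Team</th><th>Salary</th></tr></thead>
    <tbody>
      <tr><th>2008</th><td>25</td><td>Baltimore Orioles</td><td>$395,000</td></tr>
      <tr><th>2009</th><td>26</td><td>Baltimore Orioles</td><td>$410,000</td></tr>
    </tbody>
  </table>
  -->
</div>
"""


def extract_commented_table(html, table_id):
    """Return the rows of a table hidden in an HTML comment as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings, so they are matched with string=, not by tag name.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment body so that its markup becomes real tags.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")
            ]
    return []  # no comment contained a table with that id


# Searching the outer soup for tags inside the comment finds nothing,
# which matches the behaviour described in the question.
outer = BeautifulSoup(SAMPLE_HTML, "html.parser")
tbody_in_outer = outer.find("tbody")

rows = extract_commented_table(SAMPLE_HTML, "br-salaries")
for row in rows:
    print(row)
```

Keeping each cell as a separate list item avoids the run-together strings that row.text produces, and for the real page the html argument would be the text fetched with requests or urllib.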