I'm having trouble getting soup to return all of the links that are bold and have a URL. Right now it only returns the first one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p><a href="/players/d/d'acqjo01.shtml">John D'Acquisto</a> (1973-1982)</p>
<p><a href="/players/d/d'amije01.shtml">Jeff D'Amico</a> (1996-2004)</p>
<p><a href="/players/d/d'amije02.shtml">Jeff D'Amico</a> (2000-2000)</p>
<p><a href="/players/d/dantoja01.shtml">Jamie D'Antona</a> (2008-2008)</p>
<p><a href="/players/d/dorseje02.shtml">Jerry D'Arcy</a> (1911-1911)</p>
<p><b><a href="/players/d/darnach01.shtml">Chase d'Arnaud</a> (2011-2016)</b></p>
<p><b><a href="/players/d/darnatr01.shtml">Travis d'Arnaud</a> (2013-2016)</b></p>
<p><a href="/players/d/daalom01.shtml">Omar Daal</a> (1993-2003)</p>
<p><a href="/players/d/dadepa01.shtml">Paul Dade</a> (1975-1980)</p>
<p><a href="/players/d/dagenjo01.shtml">John Dagenhard</a> (1943-1943)</p>
<p><a href="/players/d/daglipe01.shtml">Pete Daglia</a> (1932-1932)</p>
<p><a href="/players/d/dagrean01.shtml">Angelo Dagres</a> (1955-1955)</p>
<p><b><a href="/players/d/dahlda01.shtml">David Dahl</a> (2016-2016)</b></p>
<p><a href="/players/d/dahlja01.shtml">Jay Dahl</a> (1963-1963)</p>
<p><a href="/players/d/dahlebi01.shtml">Bill Dahlen</a> (1891-1911)</p>
<p><a href="/players/d/dahlgba01.shtml">Babe Dahlgren</a> (1935-1946)</p>
Here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re

url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print('http://www.baseball-reference.com' + player_url['href'])
One other wrinkle: there are other div ids with similar lists that I don't care about. I only want to grab the URLs that are inside <b> tags within this one div. The <b> tag indicates an active player, and that is exactly what I want to capture.
Answer 0 (score: 1)
Let BeautifulSoup do the "selecting" work for you and drill down into your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

bolds = soup.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
Now, if you only want the div with id=div_players_, you can add an extra filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
Answer 1 (score: 0)
This is what I ended up doing:
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id': 'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print('http://www.baseball-reference.com' + player_href['href'])
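To check the nested loops without hitting the live site, the same logic can be run on a trimmed-down version of the sample HTML from the question. Note that calling a tag like `player_div('b')` is BeautifulSoup shorthand for `player_div.find_all('b')`:

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="section_wrapper" id="all_players_">
<p><a href="/players/d/daalom01.shtml">Omar Daal</a> (1993-2003)</p>
<p><b><a href="/players/d/dahlda01.shtml">David Dahl</a> (2016-2016)</b></p>
</div>
"""

soup = bs(html, "html.parser")
urls = []
for player_div in soup.find_all('div', {'id': 'all_players_'}):
    for player_bold in player_div('b'):       # shorthand for find_all('b')
        for player_href in player_bold('a'):  # shorthand for find_all('a')
            urls.append('http://www.baseball-reference.com' + player_href['href'])
print(urls)
```

Only the bold David Dahl link survives the filter; the plain Omar Daal link is dropped.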