BS4: Problem finding the href inside two tags

Date: 2017-02-24 16:11:13

Tags: beautifulsoup

I'm having trouble getting the soup to return all of the links that are bold and have a URL. Right now it only returns the first one on the page.

Here is part of the source:

<div class="section_wrapper" id="all_players_">
<div class="section_heading">
  <span class="section_anchor" id="players__link" data-label="925 Players"></span>
    <h2>925 Players</h2>    <div class="section_heading_text">
      <ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
      </ul>
    </div>      
</div>    <div class="section_content" id="div_players_">
<p><a href="/players/d/d'acqjo01.shtml">John D'Acquisto</a>  (1973-1982)</p>
<p><a href="/players/d/d'amije01.shtml">Jeff D'Amico</a>  (1996-2004)</p>
<p><a href="/players/d/d'amije02.shtml">Jeff D'Amico</a>  (2000-2000)</p>
<p><a href="/players/d/dantoja01.shtml">Jamie D'Antona</a>  (2008-2008)</p>
<p><a href="/players/d/dorseje02.shtml">Jerry D'Arcy</a>  (1911-1911)</p>
<p><b><a href="/players/d/darnach01.shtml">Chase d'Arnaud</a>  (2011-2016)</b></p>
<p><b><a href="/players/d/darnatr01.shtml">Travis d'Arnaud</a>  (2013-2016)</b></p>
<p><a href="/players/d/daalom01.shtml">Omar Daal</a>  (1993-2003)</p>
<p><a href="/players/d/dadepa01.shtml">Paul Dade</a>  (1975-1980)</p>
<p><a href="/players/d/dagenjo01.shtml">John Dagenhard</a>  (1943-1943)</p>
<p><a href="/players/d/daglipe01.shtml">Pete Daglia</a>  (1932-1932)</p>
<p><a href="/players/d/dagrean01.shtml">Angelo Dagres</a>  (1955-1955)</p>
<p><b><a href="/players/d/dahlda01.shtml">David Dahl</a>  (2016-2016)</b></p>
<p><a href="/players/d/dahlja01.shtml">Jay Dahl</a>  (1963-1963)</p>
<p><a href="/players/d/dahlebi01.shtml">Bill Dahlen</a>  (1891-1911)</p>
<p><a href="/players/d/dahlgba01.shtml">Babe Dahlgren</a>  (1935-1946)</p>**strong text**

Here is my script:

import urllib.request
from bs4 import BeautifulSoup as bs
import re

url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print('http://www.baseball-reference.com' + player_url['href'])

The other part is that there are other div ids with similar lists that I don't care about. I only want to grab the URLs wrapped in a <b> tag from this one div. The <b> tag indicates they are active players, and that is exactly what I want to capture.

2 Answers:

Answer 0 (score: 1):

Let BeautifulSoup do the "selecting" work and drill down into your data:

url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

bolds = soup.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)

Now, if you only want the one div with id=div_players_, you can add an extra filter:

url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
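
As a side note, recent versions of BeautifulSoup also understand CSS selectors, so the same "bold links inside div_players_" filter can be written in a single pass. This is a minimal sketch, assuming the same imports (urllib.request and BeautifulSoup as bs) as the original script:

url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")

# select only the <a> tags that sit inside a <b> within the div_players_ section
for player_link in soup.select('div#div_players_ b a[href]'):
    print('http://www.baseball-reference.com' + player_link['href'])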

Answer 1 (score: 0):

This is what I ended up doing:

url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')

for player_div in soup.find_all('div', {'id': 'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print('http://www.baseball-reference.com' + player_href['href'])