Parsing 'a' tags based on attributes with Python and BeautifulSoup

Time: 2013-12-06 20:04:50

Tags: python html python-2.7 web-scraping beautifulsoup

Given this HTML:

    <td align="left">
     <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
      Russell, Addison
     </a>
     SS OAK  - Won at $0
     <br>
      <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
       Vargas, Jason
      </a>
      SP LAA
      <span title="Angels interested in bringing back Jason Vargas">
       <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update">
        <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
       </a>
      </span>
      - Dropped
     </br>
    </td>

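I want to display only the blocks that do not have subtab="Update", but I haven't figured out how to reference subtab inside a Python loop with BeautifulSoup. This is what I tried: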

        soup = BeautifulSoup(html)
        pl = soup.findAll('a',{'class': 'playerLink'})
        for a in pl:
            if a.subtab == "Update":
                print "UPDATE"
            else:
                print "Player Name: " + a.text

I also tried referencing the subtype in the findAll call:

        pl = soup.findAll('a',{'class': 'playerLink'}, {'subtype':0})

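Neither approach worked. My problem is that the class is 'playerLink' in every case, so the subtype is the only way I can tell the tags apart. I'm very new to BS, so I'm not good at handling tags and attributes yet. The second example would probably work if I only wanted subtype = Update, but I want every tag where the subtype does not exist.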

5 Answers:

Answer 0: (score: 2)

a.attrs returns the attributes of an <a> tag as a dictionary. You can check that an <a> tag has no subtab attribute with 'subtab' not in a.attrs:

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink')
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links) if 'subtab' not in a.attrs]
print(names)
# -> ['Russell, Addison', 'Vargas, Jason']

I couldn't find where the documentation mentions it, but specifying subtab=False also seems to exclude any tag that has a subtab attribute:

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links)]
print(names)

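If the soup contains nothing but the tags matched by player_links (as it does here, because of parse_only), you can omit the .find_all(player_links) call: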

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip() for a in soup]
print(names)
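
For readers who want to run the check end to end, here is a minimal self-contained sketch. It assumes the question's markup is stored in a string named html (trimmed slightly here) and uses the standard-library "html.parser" backend; both are choices made for this illustration, not part of the answer above. It tests for the attribute with has_attr, and the last lines show an equivalent CSS selector, which assumes the soupsieve package that recent beautifulsoup4 releases install alongside bs4.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed representative).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
  Russell, Addison
 </a>
 SS OAK - Won at $0
 <br>
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
  Vargas, Jason
 </a>
 SP LAA
 <span title="Angels interested in bringing back Jason Vargas">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update">
   <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
  </a>
 </span>
 - Dropped
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the playerLink anchors that carry no subtab attribute at all.
names = [
    a.get_text(strip=True)
    for a in soup.find_all("a", class_="playerLink")
    if not a.has_attr("subtab")
]
print(names)

# Same filter expressed as a CSS selector (relies on soupsieve being installed).
css_names = [a.get_text(strip=True)
             for a in soup.select("a.playerLink:not([subtab])")]
print(css_names)
```

Both lists should contain the two player names and skip the anchor that wraps the news icon.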

Answer 1: (score: 1)

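You can use the tag's get() method to check whether an element has an attribute. (The built-in getattr() does not help here: on a tag it looks up a child tag with that name, not an attribute, which is why that line is commented out.)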

from bs4 import BeautifulSoup
import sys

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html')

for a in soup.find_all('a', attrs={'class': 'playerLink'}):
    #if getattr(a, 'subtab'): continue
    if a.get('subtab'): continue
    print(a.get_text("", strip=True))

Run it like this:

python3 script.py htmlfile

It yields:

Russell, Addison
Vargas, Jason
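
The difference between dotted access and get() is probably also why the question's `a.subtab == "Update"` test never fired. The sketch below illustrates it on a hypothetical one-line snippet reduced from the question's markup (the snippet and the "html.parser" choice are assumptions made for this example):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, reduced from the question's markup.
snippet = '<a class="playerLink" href="#" subtab="Update"><img src="x.gif"/></a>'
a = BeautifulSoup(snippet, "html.parser").a

# Dotted access (and getattr) searches for a child *tag* with that name.
print(a.subtab)              # None: there is no <subtab> element inside <a>
print(getattr(a, "subtab"))  # None, for the same reason
print(a.img is not None)     # True: <img> really is a child tag

# HTML attributes are reached through the dictionary-style interface.
print(a.get("subtab"))       # Update
print(a.has_attr("subtab"))  # True
print(a.get("missing"))      # None: get() returns None for absent attributes
```

Note that `a.get('subtab')` is falsy both when the attribute is absent and when it is present with an empty value; has_attr distinguishes the two cases.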

Answer 2: (score: 1)

A simple, though not particularly elegant, solution is to search each element for the string 'subtab':

for a in pl:
    if 'subtab' in a.prettify():
        print "UPDATE"
    else:
        print "Player Name: " + a.text

Answer 3: (score: 0)

After experimenting with attrs, I found that this works:

if str(a.attrs).find('subtab') > 0:

It may not be the cleanest approach, but it does work.
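
One caveat with the string-search approaches (this one and the prettify() variant in the previous answer) is that they match the word anywhere in the tag, not only in attribute names. A small sketch, using a hypothetical link whose URL happens to contain "subtab", shows the false positive and how a membership test on a.attrs avoids it:

```python
from bs4 import BeautifulSoup

# Hypothetical link: no subtab attribute, but the URL contains the word.
snippet = '<a class="playerLink" href="/players?subtab=news">Doe, John</a>'
a = BeautifulSoup(snippet, "html.parser").a

in_markup = "subtab" in a.prettify()               # searches the whole tag markup
in_attr_repr = str(a.attrs).find("subtab") > 0     # searches the dict's text form
is_attribute = "subtab" in a.attrs                 # checks attribute names only

print(in_markup, in_attr_repr, is_attribute)
```

Only the last test reports that the tag has no subtab attribute.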

Answer 4: (score: 0)

You can try this:

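containers = page_soup.findAll("a", {"class": "playerLink"})
for container in containers:
    # Build an HTML link from each anchor's href and its visible text
    # (container.a would look for a nested <a> tag and return None here).
    url = "<a href='%s'>%s</a>" % (container.get("href"), container.get_text(strip=True))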