Given this HTML:
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
Russell, Addison
</a>
SS OAK - Won at $0
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
Vargas, Jason
</a>
SP LAA
<span title="Angels interested in bringing back Jason Vargas">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Dropped
</br>
</td>
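I want to display only the blocks that do not have subtab="Update". However, I haven't figured out how to reference subtab inside a Python loop with BeautifulSoup. This is what I tried: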
soup = BeautifulSoup(html)
pl = soup.findAll('a',{'class': 'playerLink'})
for a in pl:
    if a.subtab == "Update":
        print "UPDATE"
    else:
        print "Player Name: " + a.text
I also tried referencing the subtype in the findAll call:
pl = soup.findAll('a',{'class': 'playerLink'}, {'subtype':0})
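Neither approach works. My problem is that the class is 'playerLink' in every case, so the subtype is the only way I can tell the links apart. I'm very new to BS, so I'm not yet good at handling tags and attributes. The second example might work if I only wanted subtype = Update, but I want every tag where the subtype is absent.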
Answer 0 (score: 2)
a.attrs returns the attributes of the <a> tag as a dictionary. You can use 'subtab' not in a.attrs to check whether an <a> tag has no subtab attribute:
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
player_links = SoupStrainer('a', 'playerLink')
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links) if 'subtab' not in a.attrs]
print(names)
# -> ['Russell, Addison', 'Vargas, Jason']
I can't find where the documentation mentions it, but specifying subtab=False also seems to exclude any tag that has a subtab attribute:
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links)]
print(names)
If there are no other tags to find (nothing beyond player_links), the .find_all(player_links) call can be omitted:
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip() for a in soup]
print(names)
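For comparison, the same filter can also be written as a CSS selector. This is only a minimal sketch: it assumes Beautiful Soup 4.7 or later (where select() is backed by the soupsieve package), uses html.parser, and runs on a trimmed copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (whitespace condensed).
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">Russell, Addison</a>
SS OAK - Won at $0
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">Vargas, Jason</a>
SP LAA
<span title="Angels interested in bringing back Jason Vargas">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update"><img src="http://sports.cbsimg.net/images/news-note-recent.gif"/></a>
</span>
- Dropped
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# a.playerLink    -> <a> tags whose class list contains "playerLink"
# :not([subtab])  -> exclude any tag carrying a subtab attribute, whatever its value
names = [a.get_text(strip=True) for a in soup.select("a.playerLink:not([subtab])")]
print(names)
```

The selector keeps the whole condition in one string, which can be easier to read than a separate attribute test in the comprehension.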
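Answer 1 (score: 1)
You can use the tag's get() method to check whether an element has an attribute (getattr() does not work here, because attribute-style access on a BeautifulSoup tag looks up child tags rather than HTML attributes):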
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html')
for a in soup.find_all('a', attrs={'class': 'playerLink'}):
    #if getattr(a, 'subtab'): continue
    if a.get('subtab'): continue
    print(a.get_text("", strip=True))
Run it like this:
python3 script.py htmlfile
It produces:
Russell, Addison
Vargas, Jason
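A short sketch of why a.subtab in the question never equals "Update": on a BeautifulSoup tag, attribute-style access (and therefore getattr()) searches for a child tag with that name, while get() reads the HTML attribute. The one-line HTML and the variable names below are illustrative only.

```python
from bs4 import BeautifulSoup

# Illustrative single link, modelled on the "Update" link in the question.
html = '<a class="playerLink" href="#" subtab="Update"><img src="note.gif"/></a>'
a = BeautifulSoup(html, "html.parser").a

# Attribute-style access searches for a child *tag* with that name.
child_lookup = a.subtab                  # there is no <subtab> child tag
getattr_lookup = getattr(a, "subtab")    # same lookup as a.subtab
img_lookup = a.img                       # finds the nested <img> tag

# HTML attributes are read with .get() (or a["subtab"]).
attr_value = a.get("subtab")

print(child_lookup, getattr_lookup, img_lookup.name, attr_value)
```

Here both child lookups come back as None, while get() returns the attribute value, which is why the comparison in the question always takes the else branch.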
Answer 2 (score: 1)
A simple, though not particularly elegant, solution is to search each element for the string 'subtab':
for a in pl:
    if 'subtab' in a.prettify():
        print "UPDATE"
    else:
        print "Player Name: " + a.text
Answer 3 (score: 0)
After experimenting with attrs, I found that this works:
if str(a.attrs).find('subtab') > 0:
It may not be the cleanest approach, but it does work.
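One caveat with string-based checks such as this one and the prettify() search: they match the text "subtab" anywhere, including inside attribute values. The sketch below uses a hypothetical link (not from the question's HTML) whose href happens to contain that word, and compares the string checks with has_attr(), which looks only at attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical link: no subtab attribute, but "subtab" occurs inside the href.
html = '<a class="playerLink" href="/players?view=subtab">Russell, Addison</a>'
a = BeautifulSoup(html, "html.parser").a

string_check = str(a.attrs).find("subtab") > 0   # substring search over the dict's repr
prettify_check = "subtab" in a.prettify()        # substring search over the rendered tag
exact_check = a.has_attr("subtab")               # tests attribute names only

print(string_check, prettify_check, exact_check)
```

Both substring checks report a match for this link even though it has no subtab attribute, so has_attr() (or 'subtab' in a.attrs) is the safer test.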
Answer 4 (score: 0)
You can try this:
containers = page_soup.findAll("a", {"class":"playerLink"})
for container in containers:
    # container is already the <a> tag, so use its text for the link label
    url = "<a href='%s'>%s</a>" % (container.get("href"), container.get_text(strip=True))
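The loop above does not yet skip the "Update" links. A minimal sketch of one way to combine it with an attribute check, assuming page_soup is built from a trimmed copy of the question's HTML with html.parser; the links list is an illustrative name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (whitespace condensed).
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">Russell, Addison</a>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">Vargas, Jason</a>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update"><img src="http://sports.cbsimg.net/images/news-note-recent.gif"/></a>
</td>
"""

page_soup = BeautifulSoup(html, "html.parser")

links = []
for container in page_soup.find_all("a", {"class": "playerLink"}):
    # Skip the news-note links, which are the only ones carrying subtab.
    if container.has_attr("subtab"):
        continue
    links.append((container.get_text(strip=True), container.get("href")))

for name, href in links:
    print("<a href='%s'>%s</a>" % (href, name))
```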