I'm trying to match something like this with BeautifulSoup.
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
As a regular expression it would look like this. I want to capture the URL.
<a href="\.(.*)">
<b>.*</b>
</a>
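How can I do something like this with BeautifulSoup? I need the b tag to be inside the 'a' tags I want, because that is the only thing that distinguishes these 'a' tags from every other link on the page. It seems I can only write regexps that match tag names or specific attributes?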
Answer 0 (score: 2)
If you only want the href from every a tag that contains exactly one b tag and nothing else:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """
... <html><head><title>Title</title></head><body>
... <a href="first/index.php"><b>first</b></a>
... <a><b>no-href</b></a>
... <div><a href="second/index.php"><b>second</b></a></div>
... <div><a href="third/index.php"><b>third</b></a></div>
... <a href="foo/index.php">no-bold-tag</a>
... <a href="foo/index.php"><b>text</b><p>other-stuff</p></a>
... </body></html>
... """
>>> soup = BeautifulSoup(html)
>>> [a['href'] for a in soup('a', href=True) if a.b and len(a) == 1]
[u'first/index.php', u'second/index.php', u'third/index.php']
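The session above uses the old BeautifulSoup 3 import. As a rough sketch of the same idea for the newer bs4 package on Python 3 (an assumption, since the question does not say which version is in use), the filter below counts only element children, so the whitespace around the b tag in the question's markup does not get in the way. The helper name hrefs_of_bold_only_links and the extra sample links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: the first link is the one from the question; the rest are
# hypothetical links that should NOT match.
html = """
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Some/URL.php">plain link</a>
<a href="./Another/URL.php"><b>foo</b><p>bar</p></a>
<a><b>no-href</b></a>
"""


def hrefs_of_bold_only_links(markup):
    """Return the href of every <a> whose only child element is a <b>."""
    soup = BeautifulSoup(markup, "html.parser")
    hrefs = []
    for a in soup.find_all("a", href=True):
        # find_all(True, recursive=False) lists direct child *tags* only,
        # so whitespace text nodes around <b> are ignored.
        children = a.find_all(True, recursive=False)
        if len(children) == 1 and children[0].name == "b":
            hrefs.append(a["href"])
    return hrefs


links = hrefs_of_bold_only_links(html)
print(links)
```

Note that this version only looks at child tags, so an 'a' that mixes loose text with a single b would also match; tighten the condition if that matters for your page.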
Answer 1 (score: 1)
If you don't mind using lxml, this can be done neatly with an XPath expression.
import lxml.html as lh
html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Some/URL.php"></a>
<a href="./Another/URL.php">
<b>foo</b>
<p>bar</p>
</a>
'''
tree = lh.fromstring(html)
for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
    print lh.tostring(link)
Result:
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
Or, if you would rather use an approach with lxml that is closer to @ekhumoro's, you can do this:
[a for a in tree.xpath('a[@href]') if a.find('b') is not None and len(a) == 1]
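Since the question only asks for the URLs, the XPath can also return the href values directly instead of the elements. This is just a sketch under a few assumptions: Python 3, the links wrapped in a made-up div so the document has a single root, and a span standing in for the extra child element.

```python
import lxml.html as lh

# Hypothetical page fragment; only the first link has a <b> as its sole child.
html = '''
<div>
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Some/URL.php"></a>
<a href="./Another/URL.php"><b>foo</b><span>bar</span></a>
</div>
'''

tree = lh.fromstring(html)

# Select <a> elements that have an href, exactly one child element, and
# that child is a <b>; the trailing /@href returns the attribute values.
hrefs = [str(h) for h in tree.xpath('.//a[@href and count(*) = 1 and b]/@href')]
print(hrefs)
```

The leading .// makes the search independent of how deep the links sit in the page, which is usually what you want on a real document.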