Matching an HTML subtree with BeautifulSoup

Time: 2011-11-17 23:57:17

Tags: python regex beautifulsoup

I'm trying to match something like this with BeautifulSoup.

<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>

As a regular expression it would look something like this. I want to capture the URL.

<a href="\.(.*)">
<b>.*</b>
</a>
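For reference, a minimal sketch of that pattern with Python's re module. It assumes the tags are separated only by whitespace, and it tightens the pattern slightly (`[^"]*` instead of `.*`, plus `\s*` between tags, since `.` does not match newlines by default). The sample string, including the second plain link, is made up for illustration. Regular expressions are fragile on real HTML, so this only shows what the pattern is meant to capture.

```python
import re

# Sample markup: the link from the question plus a hypothetical plain link.
html = '''<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./other.php">plain link</a>
'''

# Capture everything after the leading "." in href, but only when the
# link body is a single <b>...</b> element surrounded by whitespace.
pattern = re.compile(r'<a href="\.([^"]*)">\s*<b>.*?</b>\s*</a>')

urls = pattern.findall(html)
print(urls)  # ['/SlimLineUSB3/SlimLine1BayUSB3.php']
```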

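How can I do something like this with BeautifulSoup? I need to require the b tag inside the 'a' tags I want, because that is the only thing that distinguishes these 'a' tags from any other link on the page. It seems I can only write regexps that match tag names or specific attributes?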

2 answers:

Answer 0: (score: 2)

If you just want to get the href from every a tag that contains exactly one b tag:

>>> from BeautifulSoup import BeautifulSoup
>>> html = """
... <html><head><title>Title</title></head><body>
... <a href="first/index.php"><b>first</b></a>
... <a><b>no-href</b></a>
... <div><a href="second/index.php"><b>second</b></a></div>
... <div><a href="third/index.php"><b>third</b></a></div>
... <a href="foo/index.php">no-bold-tag</a>
... <a href="foo/index.php"><b>text</b><p>other-stuff</p></a>
... </body></html>
... """
>>> soup = BeautifulSoup(html)
>>> [a['href'] for a in soup('a', href=True) if a.b and len(a) == 1]
[u'first/index.php', u'second/index.php', u'third/index.php']
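
The session above targets BeautifulSoup 3 under Python 2. Below is a rough port to BeautifulSoup 4 and Python 3, offered as a sketch rather than the answerer's own code. It assumes the `bs4` package is installed and uses the built-in "html.parser" backend. The helper name `only_child_is_b` is hypothetical. It counts only element children, because `len(a)` also counts text nodes and so would likely reject the question's markup, where newlines surround the b tag.

```python
from bs4 import BeautifulSoup

# Sample markup: the question's link (with newlines around <b>) plus
# a few of the answer's test cases.
html = """
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="first/index.php"><b>first</b></a>
<a><b>no-href</b></a>
<a href="foo/index.php">no-bold-tag</a>
<a href="foo/index.php"><b>text</b><p>other-stuff</p></a>
"""

soup = BeautifulSoup(html, "html.parser")

def only_child_is_b(a):
    """True if the tag's only direct child element is a <b> (text nodes ignored)."""
    children = a.find_all(True, recursive=False)
    return len(children) == 1 and children[0].name == "b"

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True) if only_child_is_b(a)]
print(hrefs)  # ['./SlimLineUSB3/SlimLine1BayUSB3.php', 'first/index.php']
```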

Answer 1: (score: 1)

If you don't mind using lxml, an XPath expression handles this neatly.

import lxml.html as lh

html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
    <b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>

<a href="./Some/URL.php"></a>

<a href="./Another/URL.php">
    <b>foo</b>
    <p>bar</p>
</a>
'''

tree = lh.fromstring(html)

for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
    print lh.tostring(link)

Result:

<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
    <b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>

Or, if you want an lxml approach that is closer to @ekhumoro's, you can do this:

[a for a in tree.xpath('a[@href]') if a.find('b') is not None and len(a) == 1]
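
Since the question asks for the URL rather than the whole element, the same XPath predicate can select the href attribute directly. This is a sketch for Python 3, assuming lxml is installed. The sample markup is modeled on the one above but is not identical, and `//a` is used so the query does not depend on how `fromstring` wraps the fragment.

```python
import lxml.html as lh

# Sample markup modeled on the answer's: one matching link, one empty
# link, and one link with a second child element.
html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
    <b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>

<a href="./Some/URL.php"></a>

<a href="./Another/URL.php">
    <b>foo</b>
    <i>bar</i>
</a>
'''

tree = lh.fromstring(html)

# count(b) = 1 and count(*) = 1: exactly one child element, and it is a <b>.
# The trailing /@href returns the attribute values instead of the elements.
hrefs = tree.xpath('//a[count(b) = 1 and count(*) = 1]/@href')
print(hrefs)  # ['./SlimLineUSB3/SlimLine1BayUSB3.php']
```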