Question

I can't seem to extract the href from the following HTML soup (there is only one <strong>Website:</strong> on the page):

<div id='id_Website'>
<strong>Website:</strong> 
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>

This is what I thought should work:

href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]

Answer 1

In this case, .next is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= argument matches NavigableStrings, not elements.

I think the following is what you want:

import re
from BeautifulSoup import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong> 
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''

soup = BeautifulSoup(html)

for t in soup.findAll(text=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        print s.nextSibling.nextSibling['href']

...but this isn't very robust. If the enclosing div has a predictable ID, it is better to find that div and then take the first <a> element inside it.
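A minimal sketch of that ID-based lookup, under a few assumptions: it uses the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import above, it names the parser explicitly as "html.parser", and website_href is a hypothetical helper name chosen for illustration. It looks up the div by its id and returns the href of the first <a> inside it, or None if either is missing.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, including the stray closing </div>.
html = """<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>"""


def website_href(markup, div_id="id_Website"):
    """Return the href of the first <a> inside the div with the given id.

    Hypothetical helper: returns None when the div or the link is absent.
    """
    soup = BeautifulSoup(markup, "html.parser")

    # Locate the enclosing div by its (assumed predictable) id.
    container = soup.find("div", id=div_id)
    if container is None:
        return None

    # href=True restricts the match to <a> tags that carry an href attribute.
    link = container.find("a", href=True)
    return link["href"] if link is not None else None


print(website_href(html))
```

Because the lookup is anchored on the div's id, it does not depend on the whitespace between <strong> and <a>, so it should keep working if the markup inside the div is reformatted.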
