Question

I want to extract only the relative URLs from an HTML page. Someone suggested this:

import re
find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

But it returns:

1/ all the absolute and relative URLs in the page.

2/ URLs that may still be wrapped in "" or '', seemingly at random.

Answer 1

Use the right tool for the job: an HTML parser, such as BeautifulSoup.

You can pass a function as an attribute value to find_all() and check whether the href starts with http:

from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
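soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
# href is None for <a> tags without the attribute, so guard before startswith
print(soup.find_all('a', href=lambda x: x and not x.startswith('http')))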
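Note that the prefix test is only a heuristic. The sketch below is a rough illustration, not part of the original answer: the helper name not_http and the sample hrefs are made up. It shows a few values where the result may not be what you expect, namely non-http schemes, an upper-case scheme, and a missing href (BeautifulSoup passes None to the function in that case).

```python
def not_http(href):
    """Prefix heuristic: anything not starting with 'http' counts as relative.

    Hypothetical helper; href is None when the tag has no href attribute.
    """
    return href is not None and not href.startswith('http')


# Illustrative hrefs, not taken from a real page
samples = [
    "test2",                  # plain relative path
    "http://google.com",      # absolute, caught by the prefix test
    "HTTP://GOOGLE.COM",      # absolute, but the prefix test is case-sensitive
    "mailto:me@example.com",  # not a page link, yet reported as relative
    "ftp://example.com/file", # absolute with another scheme
    None,                     # tag without an href attribute
]

for href in samples:
    print(repr(href), not_http(href))
```

Only the second and the last sample come back as False here, so the upper-case, mailto and ftp values would all be reported as relative.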

Alternatively, use urlparse and check for the network location part:

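from urllib.parse import urlparse  # Python 3; on Python 2 this lived in the urlparse module

def is_relative(url):
    # No network location (netloc) means the URL is not absolute; url is None when href is missing
    return url is not None and not urlparse(url).netloc

print(soup.find_all('a', href=is_relative))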

Both solutions print:

[<a href="test2">test2</a>, 
 <a href="here/we/go">test4</a>]
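
If you want the URL strings themselves rather than the tags, and cannot install BeautifulSoup, something along these lines might work with only the standard library. This is a minimal sketch under stated assumptions: the class name RelativeLinkCollector is made up, and "relative" is taken to mean "no scheme and no network location", which is slightly stricter than the netloc-only check above. Since html.parser hands over attribute values with the quotes already stripped, the "" versus '' problem from the question should not come up.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_relative(url):
    """Assumption of this sketch: relative means no scheme and no netloc."""
    parts = urlparse(url)
    return not parts.scheme and not parts.netloc


class RelativeLinkCollector(HTMLParser):
    """Hypothetical helper that collects relative href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value is not None and is_relative(value):
                self.links.append(value)


data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""

collector = RelativeLinkCollector()
collector.feed(data)
print(collector.links)  # ['test2', 'here/we/go']
```

Checking the scheme as well as the netloc keeps values like mailto: links out of the result, which the netloc-only test would let through.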
