Question

Possible duplicate:
What is the best regular expression to check if a string is a valid URL?

Consider the following string:

string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

How can I extract the URLs from the href attributes of the anchor tags using Python? Something like:

>>> url = getURLs(string)
>>> url
['http://example.com', 'http://example2.com']

Thanks!

Answer 1

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

# Raw string so that \w and \d reach the regex engine unescaped
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://example2.com']

Answer 2

The best answer is...

Don't use a regular expression

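The expression in the accepted answer misses many cases. Among other things, URLs can contain Unicode characters. The regular expression you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten thousand characters long.

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, you might need that ten-thousand-character regex. But if your input is structured, use the structure. Your stated goal is to "extract the URLs from the href of anchor tags." Why use a ten-thousand-character regex when you can do something much simpler?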
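To make the "misses many cases" point concrete, here is a small sketch that runs the pattern from Answer 1 on a different input. The sample HTML and the name `PATTERN` are made up for illustration and do not come from the question.

```python
import re

# Pattern copied from Answer 1, written as a raw string.
PATTERN = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

# Hypothetical input: one link with a path and query string,
# plus a URL that appears in plain text rather than in an href.
html = ('<a href="http://example.com/docs/page.html?id=7">Docs</a>'
        '<p>See http://not-a-link.example for details</p>')

matches = re.findall(PATTERN, html)
print(matches)
# ['http://example.com', 'http://not-a-link.example']
```

The path and query string are cut off because `/` and `?` are not in the character class, and the second match is not an href at all. Both problems go away once the anchor tags are parsed rather than pattern-matched.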

Parse the HTML instead

For many tasks, Beautiful Soup is faster and easier to use (here `s` is the HTML string from the question):

>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']

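If you prefer not to use external tools, you can also use Python's built-in HTML parsing library directly. Here is a very simple subclass of HTMLParser that does exactly what you want: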

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://example2.com']

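You could even add a new method that accepts a string, calls feed, and returns output_list. This is a much more powerful and extensible way than regular expressions to extract information from HTML.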
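One way that convenience method might look is sketched below. The names `HrefParser`, `extract`, and `get_urls` are hypothetical, chosen to mirror the `getURLs` call in the question. Unlike `MyParser` above, this variant skips anchors that have no href instead of appending `None`, which is an assumption about the desired behavior.

```python
from html.parser import HTMLParser


class HrefParser(HTMLParser):
    """Collect the href values of <a> tags (hypothetical variant of MyParser)."""

    def __init__(self):
        super().__init__()
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:  # assumption: ignore anchors without an href
                self.output_list.append(href)

    def extract(self, html):
        """Feed one HTML string and return the hrefs collected from it."""
        self.feed(html)
        self.close()
        return self.output_list


def get_urls(html):
    """Use a fresh parser per call so results do not accumulate across inputs."""
    return HrefParser().extract(html)


s = ('<p>Hello World</p><a href="http://example.com">More Examples</a>'
     '<a href="http://example2.com">Even More Examples</a>')
print(get_urls(s))
# ['http://example.com', 'http://example2.com']
```

Creating a new parser inside `get_urls` keeps each call independent; reusing one instance would keep appending to the same `output_list`.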
