Question

I am trying to extract hyperlinks from a web page using a regex in Python.

Suppose my text string is:

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

I want to extract ALL and ASSIGN, and I am using this regular expression:

re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

This returns only ASSIGN.

Can someone help me find the mistake in the regular expression? I am new to this topic.

Answer 1

You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
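For the record, the immediate reason the pattern returns only ASSIGN is that `.*` is greedy: with re.DOTALL it runs to the end of the string and backtracks only as far as the last `>`. A minimal sketch of the difference, using the string from the question; the variable names are just for illustration, and the lazy `.*?` and the negated class `[^>]*` are two common fixes rather than the only ones:

```python
import re

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

# Greedy: .* consumes as much as possible, so only the last link text is captured.
greedy = re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

# Lazy: .*? stops at the first '>' that lets the rest of the pattern match.
lazy = re.findall(r'<a href=.*?>(\w+)</a>', text, re.DOTALL)

# Negated class: [^>]* can never run past the end of the opening tag.
negated = re.findall(r'<a href=[^>]*>(\w+)</a>', text)

print(greedy)   # ['ASSIGN']
print(lazy)     # ['ALL', 'ASSIGN']
print(negated)  # ['ALL', 'ASSIGN']
```

This works for the sample string, but it breaks as soon as the markup varies (extra attributes before href, nested tags inside the link, link text with spaces).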

Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:

ElementTree example:

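from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed markup.
tree = ElementTree.parse('filename.html')
# iter('a') walks the whole tree; findall('a') only matches direct children of the root.
for elem in tree.iter('a'):
    print(ElementTree.tostring(elem, encoding='unicode'))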
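If the input is an HTML fragment that is not well-formed XML, like the string in the question with its stray `</td>` tags, the standard library's html.parser is more forgiving. The following is only a sketch under that assumption; the class name LinkTextCollector and its attributes are made up for illustration:

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_anchor = False
        self._href = None
        self._chunks = []      # text pieces seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_anchor:
            self.links.append((self._href, ''.join(self._chunks)))
            self._in_anchor = False


text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

parser = LinkTextCollector()
parser.feed(text)
parser.close()

print(parser.links)
# [('/status/ALL', 'ALL'), ('/status/ASSIGN', 'ASSIGN')]
```

Unlike the regex, this keeps working when the link has extra attributes or the text contains spaces, because the parser tracks tag boundaries itself.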