Question

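I am using a Python regular expression to retrieve all links from a web page. My goal is to retrieve links from different attributes such as src, href, content, and image background links. Below is the code I am using.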

import re
import urllib2  # Python 2 module; Python 3 moved this to urllib.request

page = urllib2.urlopen('http://xxxx.html').read()
# Strip HTML comments so links inside commented-out markup are ignored.
page2 = re.sub(r'(?s)<!--.*?-->', '', page)
links = re.findall(r'\s*[\"\'\(](http|\/)(.*?)[\"\'\)]', page2)

The code above retrieves all links from the page, but the problem is that it sometimes does not retrieve the complete URL:

<a href="http://xxx.yyy.com/help/xxx(yyy)_Help.exe"

The expected output is 'http://xxx.yyy.com/help/xxx(yyy)_Help.exe', but the actual output is 'http://xxx.yyy.com/help/xxx'.

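Is there a way to enforce that, if the string starts with a single quote, a double quote, or (, the link is retrieved up to the matching closing delimiter?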

Answer 1

Yes, this should be possible, but it only works for quotes.

>>> s = '<a href="http://xxx.yyy.com/help/xxx(yyy)_Help.exe" '
>>> re.findall(r'\s*(["\'])((?:http|\/).*?)\1[>\s]', s)
[('"', 'http://xxx.yyy.com/help/xxx(yyy)_Help.exe')]
>>> [i[1] for i in re.findall(r'\s*(["\'])((?:http|\/).*?)\1[>\s]', s)]
['http://xxx.yyy.com/help/xxx(yyy)_Help.exe']
>>>


If you want your regex to handle parentheses as well, use:

\s*(["\'(])((?:http|\/).*?)\1[>\s]|\(((?:http|\/).*?)\)[>\s]
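A minimal sketch of how the pattern above might be used from Python. Because the regex has two alternatives, the URL lands in group 2 for quoted links and in group 3 for parenthesized ones, so the code picks whichever is set. The helper name `extract_links` and the two-line `sample` are assumptions made for illustration, not part of the original answer.

```python
import re

# Regex from the answer: group 1 is the opening delimiter, group 2 a quoted URL,
# group 3 a URL wrapped in parentheses.
PATTERN = re.compile(r'\s*(["\'(])((?:http|\/).*?)\1[>\s]|\(((?:http|\/).*?)\)[>\s]')

def extract_links(html):
    """Hypothetical helper: return the URL from whichever alternative matched."""
    return [m.group(2) or m.group(3) for m in PATTERN.finditer(html)]

# Illustrative input: one double-quoted link and one parenthesized link.
sample = (
    '<a href="http://xxx.yyy.com/help/xxx(yyy)_Help.exe">\n'
    '<a href=(http://xxx.yyy.com/help/xxx(yyy)_Help.exe)>\n'
)
print(extract_links(sample))
```

Both lines should yield the full URL, including the inner (yyy), because the match only ends at a closing delimiter that is followed by whitespace or >.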


Answer 2

The following regex matches a URL enclosed in a matching pair of single quotes, double quotes, or parentheses:

import re
page2 = """
<a href="https://xxx.yyy.com/help/xxx(yyy)_Help.exe">
<a href=(http://xxx.yyy.com/help/xxx(yyy)_Help.exe)>
<a href='http://xxx.yyy.com/help/xxx(yyy)_Help'.exe'>
<a href="http://xxx.yyy.com/help/xxx(yyy)_Help.exe'>
<a href='http://xxx.yyy.com/help/xxx(yyy)_Help.exe)>
... """

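# Named group q captures the opening quote; (?P=q) requires the same quote to close.
pattern = r'(?P<q>"|\')(https?:\/\/.*)(?P=q)|\((https?:\/\/.*)\)'
# Each findall result is (quote, quoted_url, parenthesized_url); keep whichever URL is non-empty.
links = [m[1] if m[1] else m[2] for m in re.findall(pattern, page2)]
print(links)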

Output:

['https://xxx.yyy.com/help/xxx(yyy)_Help.exe',
 'http://xxx.yyy.com/help/xxx(yyy)_Help.exe',
 "http://xxx.yyy.com/help/xxx(yyy)_Help'.exe"]
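One caveat worth checking: the `.*` in the pattern above is greedy, so when two links share a line it runs to the last matching quote on that line. The sketch below compares it with a lazy variant that also requires whitespace or > after the closing delimiter. The names `GREEDY`, `LAZY`, and `urls`, and the `a.example` / `b.example` URLs, are hypothetical and used only for illustration; the lazy variant is an assumption about one possible fix, not the answerer's method, and it will miss a link whose closing delimiter is the last character of the input.

```python
import re

# Pattern from the answer above: greedy ".*" extends to the last matching quote on the line.
GREEDY = re.compile(r'(?P<q>"|\')(https?://.*)(?P=q)|\((https?://.*)\)')

# Hypothetical variant: lazy ".*?" plus a lookahead for whitespace or ">" after the closer.
LAZY = re.compile(r'(?P<q>"|\')(https?://.*?)(?P=q)(?=[\s>])|\((https?://.*?)\)(?=[\s>])')

def urls(pattern, text):
    # Group 2 holds a quoted URL, group 3 a parenthesized one.
    return [m.group(2) or m.group(3) for m in pattern.finditer(text)]

# Two links on the same line (illustrative URLs).
line = '<a href="http://a.example/x(1).exe"> <a href="http://b.example/y.exe">'

print(urls(GREEDY, line))  # a single merged string spanning both links
print(urls(LAZY, line))    # the two URLs separately
```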

Python regex to extract a URL between matching delimiters, whether single quotes, double quotes, or parentheses
