Question

How do I extract a string that has a "?" in it (i.e. a link with parameters)? When I try the following:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""
print re.findall( r'(href=|src=)"([^"]*)"', html, re.U)
print re.findall( r'(href=|src=)"(.*?)"', html, re.U)

The string is simply ignored. It would also be nice to get ?ver=1.3 separately, as a third group. Any help?

Answer 1

Attribute values can be enclosed not only in " but also in '.

You need to modify the regular expression:

print re.findall( r'''(href=|src=)["']([^"']*)["']''', html, re.U)

Use ["'] to match either " or '.
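As a rough sketch, here is the same pattern in Python 3 syntax, run against the HTML from the question. The names ATTR_RE and matches are only illustrative and do not come from the original answer.

```python
import re

html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""

# ["'] is a character class that accepts either quote character, so both
# single-quoted and double-quoted attribute values are matched.
ATTR_RE = re.compile(r'''(href=|src=)["']([^"']*)["']''')

matches = ATTR_RE.findall(html)
for attr, url in matches:
    print(attr, url)
```

Note that this pattern does not check that the opening and closing quotes are the same character. That is fine for the sample above, but it is an assumption about the input.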

UPDATE

To get the ver=1.3 part, you are better off using urlparse.urlparse (in Python 3.x, urllib.parse.urlparse).

>>> import re
>>> import urlparse
>>>
>>> html = """
... <script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
... <a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
... """
>>> for attrname, value in re.findall(r'''(href=|src=)["']([^"']*)["']''', html, re.U):
...     print value, urlparse.urlparse(value).query
...
http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3 ver=1.3
http://www.somesite.com/hard-circuit-editor-double-layout-design-now/
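For Python 3, a sketch along the same lines might look like the following. It uses urllib.parse.urlsplit to separate the query string, which gives the "third group" the question asks for without complicating the regex. The helper name split_links is hypothetical, and the choice to return the URL with its query removed is an assumption about what the asker wants.

```python
import re
from urllib.parse import urlsplit, parse_qs

html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""

ATTR_RE = re.compile(r'''(href=|src=)["']([^"']*)["']''')


def split_links(text):
    """Return (attribute, url_without_query, query) triples (hypothetical helper)."""
    results = []
    for attr, url in ATTR_RE.findall(text):
        parts = urlsplit(url)
        # Rebuild the URL with an empty query string; everything else is kept.
        base = parts._replace(query="").geturl()
        results.append((attr, base, parts.query))
    return results


links = split_links(html)
for attr, base, query in links:
    # parse_qs turns "ver=1.3" into a dict of lists; an empty query gives {}.
    print(attr, base, query, parse_qs(query))
```

Letting the URL parser handle the query string keeps the regex simple and avoids edge cases such as a ? appearing more than once.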

Answer 2

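It has nothing to do with the ? character (I'm not sure why you thought it did).

You are delimiting the URL with the ' character rather than the " character. Just change the string to: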

html = """
<script type='text/javascript' src="http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3"></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""

and it produces the correct result:

>>> print(re.findall( r'(href=|src=)"([^"]*)"', html, re.U))
[('src=', 'http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'), ('href=', 'http://www.somesite.com/hard-circuit-editor-double-layout-design-now/')]
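To check that the quote character, not the ?, decides whether the question's pattern matches, a small Python 3 sketch can run that pattern against the same URL in both quoting styles. The variable names here are illustrative only.

```python
import re

# The first pattern from the question: it only accepts double quotes.
PATTERN = re.compile(r'(href=|src=)"([^"]*)"')

url = "http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3"

# Same URL, same "?", only the quoting differs.
single_quoted = "<script src='%s'></script>" % url
double_quoted = '<script src="%s"></script>' % url

single_result = PATTERN.findall(single_quoted)
double_result = PATTERN.findall(double_quoted)

print(single_result)  # no match: the value is wrapped in '
print(double_result)  # one match, with the "?ver=1.3" part intact
```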

Regex for extracting strings containing a question mark (?) in Python
