Question

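I am trying to write a regular expression in Python to find URLs in a string of Markdown text. Once a URL is found, I want to check whether it is wrapped in a Markdown link: [text](url). I am having a problem with the latter. I am using a regular expression (link_exp) to search, but the results are not what I expected and I cannot figure out why.

It is probably something simple that I am not seeing.

Here is the code, with an explanation of the link_exp regex: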

import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text) #find all urls
for url in urls:
    url = re.escape(url)
    link_exp = re.compile('\[.*\]\(\s*{0}\s*\)'.format(url) ) # expression with url wrapped in link syntax.     
    search = re.search(link_exp, text)
    if search != None:
        print url

# expression should translate to:
# \[ - literal [
# .* - any character or no character 
# \] - literal ]
# \( - literal (
# \s* - whitespaces or no whitespace 
# {0} - the url
# \s* - whitespaces or no whitespace 
# \) - literal )
# NOTE: I am including whitespaces to encompass cases like [foo]( http://www.foo.sexy   )

The only output I get is:

http\:\/\/en\.wikipedia\.org\/wiki\/Vocoder

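This means the expression only matches the link that has whitespace before the closing parenthesis. That is not what I want: links without that whitespace should be matched as well.

Do you think you can help me with this? Cheers.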

Answer 1

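The problem here is the regex you use to pull out the URLs in the first place: it includes the ) in the URL. That means you end up looking for the closing parenthesis twice. This happens for every link except the first one (the space saves you there).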
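To make this visible, here is a minimal sketch (Python 3 syntax; the name url_pattern is only for illustration) that runs the URL regex from the question on the same text and prints what it captures. Two of the three linked URLs come back with the trailing ) attached:

```python
import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

# The URL regex from the question, unchanged apart from the raw-string prefix.
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

urls = re.findall(url_pattern, text)
for url in urls:
    # repr() makes a trailing ')' or space easy to spot.
    print(repr(url))
```

Only the Vocoder URL stops before the parenthesis, because the space ends the match first. For the other two, link_exp would need a second ) after the one already swallowed by the URL.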

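I'm not quite sure what each part of your URL regex is trying to do, but the part that reads [$-_@.&+] includes the range from $ (ASCII 36) to _ (ASCII 95, or 137 in octal), which covers a large number of characters you probably did not intend, including ).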
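A quick way to check what that range covers is to enumerate it. The sketch below is only an illustration; the second character class assumes the hyphen was meant literally, which is a guess about the intent of the original pattern:

```python
import re

# Every character in the range $-_ as written inside the character class.
covered = ''.join(chr(c) for c in range(ord('$'), ord('_') + 1))
print(ord('$'), ord('_'))  # 36 95
print(covered)

# The class from the question accepts ')'.
range_class_matches = re.fullmatch(r'[$-_@.&+]', ')') is not None

# Assumed intent: a literal hyphen. Placing '-' last in the class
# stops it from being read as a range.
literal_class_matches = re.fullmatch(r'[$_@.&+-]', ')') is not None

print(range_class_matches, literal_class_matches)
```

Moving the hyphen to the end of the class (or escaping it) is the usual way to keep it from forming a range.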

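Instead of finding the URLs and then checking whether they are inside a link, why not do both at once? That way your URL regex can be lazier, because the extra constraints make it less likely to match anything else: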


import re

# `text` is the Markdown string defined in the question.

# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# http:// or https:// followed by anything but a closing paren
url_regex = r"http[s]?://[^)]+"

# Group 1 is the link text, group 2 is the URL
markup_regex = r'\[({0})]\(\s*({1})\s*\)'.format(name_regex, url_regex)

for match in re.findall(markup_regex, text):
    print(match)

If you need to be stricter, you can refine the URL regex.
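As one possible refinement (a sketch, not the only way to do it), the URL part can also exclude whitespace, so that a space before the closing parenthesis is not captured as part of the URL. The names link_re, linked and bare are made up for this example, and the simple bare-URL pattern is an assumption. It also shows how to tell wrapped URLs from bare ones, which was the original goal:

```python
import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

# Link text: anything but ']'. URL: anything but whitespace or ')'.
link_re = re.compile(r'\[(?P<name>[^\]]+)\]\(\s*(?P<url>https?://[^\s)]+)\s*\)')

# (name, url) pairs for URLs wrapped in Markdown link syntax.
linked = [(m.group('name'), m.group('url')) for m in link_re.finditer(text)]

# All URLs in the text, then the ones that are not wrapped in a link.
all_urls = re.findall(r'https?://[^\s)]+', text)
linked_urls = {url for _, url in linked}
bare = [url for url in all_urls if url not in linked_urls]

for name, url in linked:
    print(name, url)
print(bare)
```

One limitation to keep in mind: a URL that itself contains parentheses would be cut short at the first ), so this pattern trades completeness for simplicity.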
