Remove all forms of URLs from a given string in Python

Time: 2012-12-29 11:07:13

Tags: python regex

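I'm new to Python and was wondering whether there is a better way to match all the forms of URLs that might be found in a given string. Searching Google turns up many solutions that extract the domain, replace the URL with a link, and so on, but none that strip the URLs out of the string. I have included an example below for reference. Thanks!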

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

2 answers:

Answer 0 (score: 7)

There are errors in your code (two, actually):

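1. You should put a backslash in front of the second-to-last single quote to escape it; otherwise that quote ends the string literal early: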

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

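2. You shouldn't use str as a variable name. It is not a reserved keyword, but it is the name of the built-in string type, and assigning to it hides that built-in. Name the variable thestring or anything else. Note also that your call passes thestring, which is never defined in your snippet.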
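As a small illustration of the second point, here is a sketch of what happens once the name str has been rebound. The function names shadowing_demo and safe_demo are made up for this example and are not part of the original answer.

```python
def shadowing_demo():
    str = 'some text'            # legal, but hides the built-in str type in this scope
    try:
        return str(42)           # tries to call a string object, not the type
    except TypeError as exc:
        return 'TypeError: %s' % exc


def safe_demo():
    thestring = 'some text'      # a different name leaves the built-in untouched
    return str(42)               # the built-in str still works here


shadow_result = shadowing_demo()
safe_result = safe_demo()
```

The first function ends up reporting a TypeError, while the second converts the number as expected.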

For example:

import re

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

Result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
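If the substitution is needed in more than one place, it could be wrapped in a helper along these lines. This is only a sketch: it assumes Python 3, the pattern is copied from the corrected call above, and the names URL_PATTERN and remove_urls are introduced here for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Pattern taken from the corrected answer; compiled once so it can be reused.
URL_PATTERN = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')


def remove_urls(text):
    """Return text with every match of URL_PATTERN deleted."""
    return URL_PATTERN.sub('', text)


sample = 'see http://www.google.com and www.domain.co.uk now'
cleaned = remove_urls(sample)   # 'see  and  now'
```

Only the URL itself is deleted, so the spaces on either side of each removed URL stay behind as a double space. Collapsing them is a separate step if the caller wants it.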

Answer 1 (score: 6)

Include an encoding line at the top of the source file (the regex string contains non-ASCII symbols such as »), for example:

# -*- coding: utf-8 -*-
import re
...

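Also enclose the regex string in triple single (or double) quotes, ''' or """, instead of single quotes, because the string itself already contains quote characters (' and ").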
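To make the quoting advice concrete, here is a sketch that writes one small character class both ways and checks that they match the same characters. It assumes Python 3; the character class is a shortened stand-in for the one in the full pattern, and the variable names are chosen here for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Single-quoted raw string: the quote must be escaped. The backslash stays in
# the pattern, and the regex engine reads \' as a literal single quote.
escaped = re.compile(r'[\'"«»]')

# Triple-quoted raw string: both quote characters can appear unescaped.
triple = re.compile(r'''['"«»]''')

text = 'say "hi" to \'bob\' «ok»'
found_escaped = escaped.findall(text)
found_triple = triple.findall(text)
```

Both patterns find the same six quote characters. The triple-quoted form is simply easier to read and harder to get wrong when the pattern contains both ' and ".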