Question

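I am trying to filter certain special characters out of a string, using the code and regular expression below.

I expect it to remove everything except -, + and # (letters, digits and spaces should also stay), but some other characters are not being removed.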

text = "This is a long string~!@#$%^&*()_+|\=-{}[];':<>?with special characters"

print sub(r'[^a-zA-Z0-9 -+#]+', '', text)

The result being displayed is:

This is a long string!#$%&*()+'with special characters

What I intend to print is:

This is a long string with #+- special characters

Can anyone explain why this happens, and how I can correct my regex or code so that the remaining characters are filtered out?

Answer 1

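You must not use an unescaped hyphen in the middle of a character class. There it defines a range, so " -+" matches every character from space to +. Put the hyphen at the end of the class (or escape it) so it is taken literally: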

print re.sub(r'[^a-zA-Z0-9 +#-]+', '', text)
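
To see why the original pattern lets some characters through, it may help to list what the accidental range actually covers. The following is a minimal sketch written for Python 3 (an assumption, since the question uses the Python 2 print statement); the names range_chars, broken and fixed are illustrative only.

```python
import re

# Same sample string as in the question; a raw string keeps the backslash literal.
text = r"This is a long string~!@#$%^&*()_+|\=-{}[];':<>?with special characters"

# Inside a character class, ' -+' is a range from ' ' (0x20) to '+' (0x2B).
range_chars = "".join(chr(c) for c in range(ord(" "), ord("+") + 1))
print(repr(range_chars))

# Hyphen between two characters: forms a range, so !"#$%&'()* survive as well.
broken = re.sub(r"[^a-zA-Z0-9 -+#]+", "", text)

# Hyphen placed last: it is a literal '-', so only ' ', '+', '#', '-' survive.
fixed = re.sub(r"[^a-zA-Z0-9 +#-]+", "", text)

print(broken)
print(fixed)
```

Every character that survived in the question's output falls inside that range, while the hyphen itself (0x2D) lies just outside it and is removed.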

Answer 2

You can also do it like this.

>>> re.sub(r'(?![#+ -])[_\W]', '', text)
'This is a long string#+-with special characters'
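
The pattern above can be read in two parts: the negative lookahead (?![#+ -]) refuses to match when the next character is one to keep, and [_\W] then matches an underscore or any non-word character. A small sketch, again assuming Python 3 and using illustrative variable names, suggests that on this input it gives the same result as the negated character class from Answer 1:

```python
import re

text = r"This is a long string~!@#$%^&*()_+|\=-{}[];':<>?with special characters"

# (?![#+ -]) : negative lookahead, fails if the next char is '#', '+', ' ' or '-'
# [_\W]      : otherwise match '_' or any non-word character
lookahead = re.sub(r"(?![#+ -])[_\W]", "", text)

# Negated character class with the hyphen placed last, for comparison.
negated = re.sub(r"[^a-zA-Z0-9 +#-]+", "", text)

print(lookahead)
print(lookahead == negated)
```

Note that in Python 3 \W is Unicode-aware for str patterns, so the two patterns can differ on non-ASCII input: accented letters count as word characters for \W but are removed by the explicit a-zA-Z0-9 class.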

Answer 3

You can use str.translate:

text = "This is a long string~!@#$%^&*()_+|\=-{}[];':<>?"

# string.punctuation - +-#
rem ="""!"$%&'()*,./:;<=>?@[\]^_`{|}~"""

# Python 2: str.translate(None, deletechars) deletes the listed characters
print(text.translate(None, rem))
This is a long string#+-

When you only want to delete characters, translate is more efficient:

In [28]: r  = re.compile(r'[^a-zA-Z0-9 +#-]+')    
In [29]: timeit text.translate(None,rem)
1000000 loops, best of 3: 408 ns per loop   

In [30]: timeit r.sub("", text)
100000 loops, best of 3: 2.66 µs per loop  

In [31]: r.sub("", text)
Out[31]: 'This is a long string#+-'    
In [32]: text.translate(None,rem)
Out[32]: 'This is a long string#+-'
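
The two-argument form text.translate(None, rem) exists only for Python 2 byte strings. On Python 3 the usual equivalent builds a deletion table with str.maketrans; the sketch below assumes Python 3 and derives rem from string.punctuation rather than typing it out by hand.

```python
import string

text = r"This is a long string~!@#$%^&*()_+|\=-{}[];':<>?"

# Characters to delete: all ASCII punctuation except '+', '-' and '#'.
rem = "".join(c for c in string.punctuation if c not in "+-#")

# The third argument of str.maketrans lists characters to map to None (delete).
table = str.maketrans("", "", rem)

result = text.translate(table)
print(result)
```

Unlike the regex versions, this only removes the characters listed in rem, so anything outside ASCII punctuation (for example control characters or non-ASCII symbols) is left in place.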
