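How to replace all characters except letters, digits, forward slashes and backslashes

Asked: 2014-05-08 03:01:51

Tags: python regex

I want to parse text and return only letters, digits, forward slashes and backslashes, replacing everything else with ''.

Is it possible to do this with a single regex pattern instead of several patterns applied in a loop? I can't get the pattern below to leave the backslashes and forward slashes in place.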

import re

line1 = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line = line1
line2 = line1
RGX_PATTERN = r"[^\w]", "_"

for pattern in RGX_PATTERN:
    line = re.sub(pattern, '', line)
print("replace1: " + line)
# Prints: replace1: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

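The code below, taken from SO, has been tested and found to be faster than the regex, but it removes all special characters, including the / and \ that I want to keep. Is there a way to adapt it to my use case while keeping its speed advantage over the regex?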

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
# Prints: replace2: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

As an extra hurdle, the text being parsed is supposed to be ASCII, so if possible, characters from any other encoding should also be replaced with ''.

2 Answers:

Answer 0 (score: 8)

This is faster and works with Unicode input:

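import re

# Match anything that is not an ASCII letter, digit, backslash or forward slash.
# A raw string avoids the invalid "\/" escape; "_" is already excluded, so "|_" is not needed.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return full_pattern.sub('', string)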
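As a quick check of the pattern, here is a minimal sketch run on a small made-up input. The sample string is hypothetical and not taken from the question. It uses the raw-string form of the character class, which should behave the same as the pattern above:

```python
import re

# Anything that is not an ASCII letter, digit, backslash or forward slash.
pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return pattern.sub('', string)

# Hypothetical input: underscore, space, "é" and "!" should all be dropped.
sample = "a_b/c\\d é!1"
print(re_replace(sample))  # ab/c\d1
```

Because the class lists a-z, A-Z and 0-9 explicitly instead of using \w, non-ASCII letters such as "é" are removed as well, which matches the ASCII-only requirement in the question.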

If you want it really fast, this is by far the best (if slightly obscure) approach:

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
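To make the translation-table trick a little less obscure, the sketch below rebuilds the same kind of table and compares the result with a regex on a made-up input. The sample string and the name `ascii_table` are illustrative assumptions, not part of the answer:

```python
import re

def wanted(character):
    return character.isalnum() or character in '\\/'

# Index i holds chr(i) if that character is kept, or None if
# str.translate should delete it. Only code points 0..127 are covered.
ascii_table = [chr(i) if wanted(chr(i)) else None for i in range(128)]

def fast_replace(string):
    # Drop non-ASCII characters first so every remaining code point
    # is a valid index into ascii_table.
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_table)

sample = "C:\\temp/file_1 (copy).txt €ü"  # hypothetical input
result = fast_replace(sample)
print(result)  # C\temp/file1copytxt

# The regex approach should agree on the same input.
assert result == re.sub(r'[^a-zA-Z0-9\\/]', '', sample)
```

The encode/decode step matters: without it, a non-ASCII character would index past the end of the 128-entry list, and `str.translate` leaves characters it cannot look up untouched instead of removing them.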

Timings:

# Backslashes are doubled inside SETUP because the shell's double quotes consume one level of escaping.
SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
full_pattern = re.compile(r'[^a-zA-Z0-9\\\\/]')

def in_whitelist(character):
    return character.isalnum() or character in '\\\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

Results:

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
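For anyone who prefers to reproduce the comparison from inside Python instead of the shell, the following is a rough sketch using the standard `timeit` module. The helper name `join_replace`, the variable `ascii_filter` and the loop count are assumptions made for illustration. Absolute numbers will differ by machine and Python version, so only the relative ordering is meaningful:

```python
import re
import timeit

busy = ''.join(chr(i) for i in range(512))
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_filter = [chr(i) if wanted(chr(i)) else None for i in range(128)]

def re_replace(string):
    return full_pattern.sub('', string)

def join_replace(string):
    # Note: str.isalnum() also accepts non-ASCII letters and digits,
    # so this variant keeps more characters than the other two.
    return ''.join(c for c in string if wanted(c))

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_filter)

LOOPS = 2_000  # arbitrary; raise it for steadier numbers
for func in (re_replace, join_replace, fast_replace):
    seconds = timeit.timeit(lambda: func(busy), number=LOOPS)
    print(f"{func.__name__}: {seconds / LOOPS * 1e6:.2f} usec per call")
```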

Answer 1 (score: 3)

Why can't you just do something like this:

def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']

line2 = ''.join(e for e in line2 if in_whitelist(e))

Edited to shorten the function, as suggested.
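One caveat with the whitelist above: `str.isalnum()` returns True for non-ASCII letters and digits too (for example "é" or "²"), so they would survive the filter even though the question asks for ASCII only. A possible variation, sketched here with hypothetical names (`ALLOWED`, `in_ascii_whitelist`, `join_replace`), tests membership in an explicit set instead:

```python
import string

# Explicit ASCII whitelist: letters, digits, backslash and forward slash.
ALLOWED = frozenset(string.ascii_letters + string.digits + "\\/")

def in_ascii_whitelist(character):
    return character in ALLOWED

def join_replace(text):
    return ''.join(c for c in text if in_ascii_whitelist(c))

# Hypothetical input: "é" and "²" pass str.isalnum() but are not ASCII.
sample = "né/1\\²x"
print(join_replace(sample))  # n/1\x
```

Set membership is a single hash lookup per character, so this should not be slower than calling `isalnum()`, though it has not been benchmarked here.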