Question

想要解析文本并仅返回字母，数字，正斜杠和反斜杠，并用''替换所有其他内容。

是否可以只使用一个正则表达式模式而不是几个然后调用循环？我无法得到下面的模式，不能替换后退和正斜杠。

line1 = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
RGX_PATTERN = "[^\w]", "_"

for pattern in RGX_PATTERN:
    line = re.sub(r"%s" %pattern, '', line)
print("replace1: " + line)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

code below from SO已经过测试，发现比正则表达式快，但它会替换所有特殊字符，包括我想要保留的/和\。有没有办法编辑它以适用于我的用例并仍然保持其优于正则表达式？

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

作为额外的障碍，正在解析的文本应该是ASCII格式，因此如果可能的话，来自任何其他编码的字符也应该替换为''

Answer 1

速度更快，适用于Unicode：

full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def re_replace(string):
    return re.sub(full_pattern, '', string)

如果你想要真的快，这是迄今为止最好（但有点模糊）的方法：

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)

时序：

SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def in_whitelist(character):
    return character.isalnum() or character in '\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

结果：

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop

Answer 2

为什么你不能这样做：

def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']

line2 = ''.join(e for e in line2 if in_whitelist(e))

根据建议来编辑缩小功能。

如何替换除字母，数字，正斜杠和反斜杠之外的所有字符

2 个答案: