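I want to parse text and return only letters, digits, forward slashes, and backslashes, replacing everything else with ''.
Is it possible to use a single regex pattern instead of several patterns applied in a loop? I could not get the pattern below to leave backslashes and forward slashes alone.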
import re

line = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
# Two patterns: everything that is not a word character, then the underscore.
RGX_PATTERN = r"[^\w]", "_"
for pattern in RGX_PATTERN:
    line = re.sub(pattern, '', line)
print("replace1: " + line)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2
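The code below, from SO, has been tested and found to be faster than the regex, but it replaces all special characters, including the / and \ that I want to keep. Is there a way to edit it to fit my use case while keeping its speed advantage over the regex?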
line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2
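As an additional hurdle, the text being parsed should be ASCII, so if possible, characters from any other encoding should also be replaced with ''.
Answer 0 (score: 8)
This is faster and works with Unicode: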
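# Raw string avoids the invalid '\/' escape. The negated class already
# matches '_', so no separate alternative is needed.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')
def re_replace(string):
    return re.sub(full_pattern, '', string)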
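For illustration, here is a minimal self-contained sketch of the single-pattern approach applied to a short sample. The names KEEP_ALNUM_AND_SLASHES, strip_unwanted, and sample are made up for this example and are not part of the original answer.

```python
import re

# Negated character class: anything that is NOT an ASCII letter, a digit,
# a backslash, or a forward slash is matched and therefore removed.
# (Hypothetical names, chosen for this example only.)
KEEP_ALNUM_AND_SLASHES = re.compile(r"[^a-zA-Z0-9\\/]")

def strip_unwanted(text):
    """Return text with every character outside the allowed set removed."""
    return KEEP_ALNUM_AND_SLASHES.sub("", text)

# Sample containing punctuation, a space, a non-ASCII letter, and an underscore.
sample = "1/R~e`p!l@@a#c$e%% caf\u00e9_\\2"
print(strip_unwanted(sample))  # prints 1/Replacecaf\2
```

Because the class lists only ASCII ranges, non-ASCII letters such as 'é' are removed as well, which matches the ASCII-only requirement in the question.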
If you want it really fast, this is by far the best (though slightly obscure) approach:
def wanted(character):
    return character.isalnum() or character in '\\/'
ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]
def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')
    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
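The trick relies on how str.translate treats its table: a code point mapped to None is deleted, while a code point the table does not cover is left unchanged. That is why the non-ASCII characters have to be dropped before translating. A rough sketch of the same idea with a dict-based table follows; KEEP, DELETE_TABLE, and keep_alnum_and_slashes are hypothetical names used only here.

```python
import string

# Characters to keep (hypothetical name, for this example only).
KEEP = set(string.ascii_letters + string.digits + "\\/")

# Map every unwanted ASCII code point to None, which tells str.translate
# to delete it. Code points missing from the dict are left unchanged.
DELETE_TABLE = {code: None for code in range(128) if chr(code) not in KEEP}

def keep_alnum_and_slashes(text):
    # Drop non-ASCII characters first, because the table only covers 0..127.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    return ascii_only.translate(DELETE_TABLE)

print(keep_alnum_and_slashes("a/b\\c_d \u00e9!1"))  # prints a/b\cd1
print("\u00e9".translate(DELETE_TABLE))             # unmapped, so unchanged
```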
Timings:
# A quoted heredoc keeps the shell from altering the backslashes in the Python code.
SETUP=$(cat <<'EOF'
busy = ''.join(chr(i) for i in range(512))
import re
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')
def in_whitelist(character):
    return character.isalnum() or character in '\\/'
def re_replace(string):
    return re.sub(full_pattern, '', string)
def wanted(character):
    return character.isalnum() or character in '\\/'
ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]
def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
EOF
)
python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"
Results:
10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
Answer 1 (score: 3)
Why not just do this:
def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']
line2 = ''.join(e for e in line2 if in_whitelist(e))
Edited to condense the function, as suggested.
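One caveat: str.isalnum() returns True for non-ASCII letters and digits such as 'é', so the filter above keeps them, which conflicts with the ASCII-only requirement in the question. A possible variant, assuming an explicit set of allowed characters is acceptable, is sketched below; ALLOWED, in_ascii_whitelist, and filter_line are names invented for this example.

```python
import string

# Explicit ASCII whitelist: letters, digits, backslash, and forward slash.
# (Hypothetical names, for this example only.)
ALLOWED = frozenset(string.ascii_letters + string.digits + "\\/")

def in_ascii_whitelist(character):
    # Set membership rejects non-ASCII alphanumerics that isalnum() accepts.
    return character in ALLOWED

def filter_line(text):
    return "".join(ch for ch in text if in_ascii_whitelist(ch))

print("\u00e9".isalnum())                  # prints True
print(filter_line("caf\u00e9/x\\y_1!"))    # prints caf/x\y1
```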