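I am trying to find a single word or an n-word phrase in a string and replace it with asterisks. The challenge is that I want to do this even when the word or phrase has been obfuscated with certain characters.
Assume the following. REPLACE_CHAR is the character I want to use to replace the word or n-word phrase. ILLEGAL_CHAR contains the characters I want to ignore. I also want to ignore case.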
REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
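Here, I want to replace "dolor" with asterisks. In the string you can see that "dolor" is present, but it is obfuscated by random symbols and capital letters.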
string = "Lorem ipsum %@do^l&oR sit amet"
find = "dolor"
The expected result is "Lorem ipsum ***** sit amet", where the number of asterisks matches the length of the word found.
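Here, I want to replace "dolor sit" with asterisks while preserving the space. In the string you can see that "dolor sit" is present, but it is obfuscated by random symbols and capital letters.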
string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"
find = "dolor sit"
The expected result is "Lorem ipsum ***** *** amet", where the number of asterisks matches the length of each word found.
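This solution is based on @Ajax1234's answer.
Instead of using re.sub to remove the ILLEGAL_CHAR characters, we use str.translate and build the translation table once, outside the function. This gives a slight performance improvement.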
import re
REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
trans = str.maketrans("", "", ILLEGAL_CHAR)
text = "Lorem ipsum %@do^l&oR sit amet"
token = "dolor sit"
def replace(data, token):
    # strip the illegal characters first, then mask each word of the match
    data = data.translate(trans)
    return re.sub(token, lambda x: ' '.join(REPLACE_CHAR * len(i) for i in x.group().split(' ')), data, flags=re.I)
print(replace(text, token))
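The performance claim above can be checked with a rough timing sketch like the one below. The helper names strip_translate and strip_regex are made up for this illustration, the regex variant is precompiled to keep the comparison fair, and the numbers will vary by machine and Python version, so treat it as a way to measure rather than as a result.

```python
import re
import timeit

ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Translation table built once, outside any function.
TRANS = str.maketrans("", "", ILLEGAL_CHAR)
# Equivalent precompiled character class; re.escape guards the metacharacters.
ILLEGAL_RE = re.compile("[" + re.escape(ILLEGAL_CHAR) + "]")

def strip_translate(data):
    # hypothetical helper: remove illegal characters with str.translate
    return data.translate(TRANS)

def strip_regex(data):
    # hypothetical helper: remove illegal characters with re.sub
    return ILLEGAL_RE.sub("", data)

sample = "Lorem ipsum %@do^l&oR s%)i!T~ amet"

# Both approaches should produce the same cleaned string.
assert strip_translate(sample) == strip_regex(sample)

for fn in (strip_translate, strip_regex):
    seconds = timeit.timeit(lambda: fn(sample), number=20_000)
    print(f"{fn.__name__}: {seconds:.4f}s for 20,000 calls")
```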
Answer 0 (score: 2)
import re
ignore_chars = "!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~"
string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"
clean_string = "".join(char for char in string if char not in ignore_chars)
bad_words = ["dolor", "sit"]
for bad_word in bad_words:
    pattern = f"\\b{bad_word}\\b"
    replace = "*" * len(bad_word)
    clean_string = re.sub(pattern, replace, clean_string, flags=re.IGNORECASE)
print(clean_string)
Output:
Lorem ipsum ***** *** amet
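The \b word boundaries in this answer matter once the text contains longer words that include a bad word. A small sketch of the difference, using a made-up sentence that is not from the question:

```python
import re

text = "Please visit the site and sit down"

# Without word boundaries, "sit" is also masked inside "visit" and "site".
loose = re.sub("sit", "***", text, flags=re.IGNORECASE)

# With \b on both sides, only the standalone word is masked.
strict = re.sub(r"\bsit\b", "***", text, flags=re.IGNORECASE)

print(loose)   # Please vi*** the ***e and *** down
print(strict)  # Please visit the site and *** down
```

Whether the loose or strict behavior is wanted depends on the filtering policy; the question does not say either way.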
Answer 1 (score: 2)
You can use re.sub to remove the illegal characters, and then re.sub again with re.I for a case-insensitive match:
import re
def replace(word, target):
    # raw string avoids invalid-escape warnings for sequences such as \! and \<
    w = re.sub(r'[\!"#\$%\&\'\(\)\*\+,\-\./:;\<\=\>\?@\[\]\^_`\{\|\}~]+', '', word)
    return re.sub(target, lambda x:' '.join('*'*len(i) for i in x.group().split(' ')), w, flags=re.I)
string = "Lorem ipsum %@do^l&oR sit amet"
find = "dolor"
r = replace(string, find)
Output:
'Lorem ipsum ***** sit amet'
string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"
find = "dolor sit"
r = replace(string, find)
Output:
'Lorem ipsum ***** *** amet'
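Stripping the illegal characters up front also removes punctuation everywhere else in the string, which the question's examples do not happen to exercise. If that matters, one possible variant is to leave the text alone and build a pattern that tolerates illegal characters around each letter of the target. This is only a sketch: mask_in_place and NOISE are hypothetical names, and it uses no word boundaries, so it would also match inside longer words and swallow punctuation directly attached to the match.

```python
import re

REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Zero or more illegal characters, tolerated around every letter of the target.
NOISE = "[" + re.escape(ILLEGAL_CHAR) + "]*"

def mask_in_place(text, target):
    """Hypothetical helper: mask `target` in `text` without stripping the rest."""
    words = target.split()
    if not words:
        return text
    # One sub-pattern per word: noise, letter, noise, letter, ..., noise.
    word_patterns = [
        NOISE + NOISE.join(re.escape(ch) for ch in word) + NOISE
        for word in words
    ]
    # Words of the phrase are separated by whitespace in the text.
    pattern = r"\s+".join(word_patterns)
    # Asterisks match the length of each target word; spaces are preserved.
    masked = " ".join(REPLACE_CHAR * len(word) for word in words)
    return re.sub(pattern, lambda m: masked, text, flags=re.I)

print(mask_in_place("Lorem, ipsum %@do^l&oR s%)i!T~ amet.", "dolor sit"))
# Lorem, ipsum ***** *** amet.
```

Note that the comma after "Lorem" and the final period survive, because only the matched span is rewritten.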
Answer 2 (score: 1)
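You can use re.sub for both steps: stripping the obfuscation and masking the words! There are already plenty of good answers here; this script is designed to be easy to edit, especially if you plan to take input from a user or another external source.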
#we'll be using regex to solve this problem
import re
#establish some constants - these can be changed later, or even read as user input
REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
#your search string - this can be read as user input
search = "Lorem ipsum %@do^l&oR sit amet"
#this regex will remove the illegal characters - specifically, it substitutes an empty
#string ('') in place of any illegal character we find.
#re.escape makes every symbol literal inside the brackets, so the user can directly
#input illegal symbols themselves without worrying about regex formatting
strip = re.sub('[' + re.escape(ILLEGAL_CHAR) + ']', '', search)
#the string to obfuscate - this can also be read as user input
find = "ipsum dolor sit"
#this splits the words on spaces, so there are spaces between the asterisks
find_words = find.split(' ')
#now we'll check each find_word - we'll look for it in the string, and if we find it,
#we'll replace it with asterisks of the same length as the original word.
#(we'll use a ranged for loop to go over the words)
for f_word in find_words:
    #check each f_word to see if it appears in the string. note "flags=re.I" - this
    #tells our regex to use case-insensitive matching
    if(re.search(f_word, strip, flags=re.I)):
        #we found a word! check the length of the word, then substitute an equal number of
        #REPLACE_CHARs
        strip = re.sub(f_word, (REPLACE_CHAR * len(f_word)), strip, flags=re.I)
#ta-daa!
print(strip)
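Since this answer is aimed at user-supplied input, it may be worth treating the search phrase as literal text rather than as a regex. A minimal sketch, assuming the phrase is cleaned with the same illegal-character set as the text; mask_user_terms is a hypothetical name. With the default ILLEGAL_CHAR the cleaning step already removes the ASCII regex metacharacters, so re.escape acts as a safeguard in case ILLEGAL_CHAR is later edited to cover fewer symbols.

```python
import re

REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
ILLEGAL_RE = re.compile("[" + re.escape(ILLEGAL_CHAR) + "]")

def mask_user_terms(text, find):
    """Hypothetical helper: mask each word of `find`, treating it as literal text."""
    cleaned = ILLEGAL_RE.sub("", text)
    # Clean the search phrase the same way, then split it into words.
    for word in ILLEGAL_RE.sub("", find).split():
        # re.escape makes any remaining metacharacters match literally.
        cleaned = re.sub(
            re.escape(word),
            lambda m: REPLACE_CHAR * len(m.group()),
            cleaned,
            flags=re.I,
        )
    return cleaned

# A stray "(" in the phrase would raise re.error if used as a raw pattern.
print(mask_user_terms("Lorem ipsum %@do^l&oR sit amet", "ipsum (dolor sit"))
# Lorem ***** ***** *** amet
```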