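email_two is a global string containing several paragraphs, including the words "researchers" and "herself". I have to censor every word in email_two that appears in the list proprietary_terms (called term inside the function). However, when I use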
    email_two_new = email_two.split()
    for item in email_two_new:
        for i in range(len(term)):
            if item in term[i]:
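it cuts "her" out of both "researchers" and "herself". "researchers" should not be censored at all, while "herself" should be censored in full. I checked that "researchers" is not in "her", so it should not be cut, and item prints as the whole string of each word rather than each character of the word, so I don't know what went wrong.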
    proprietary_terms = ["she", "personality matrix", "sense of self", "self-preservation", "learning algorithm", "her", "herself"]

    def censor_email_two(term):
        result = email_two
        email_two_new = email_two.split()
        for item in email_two_new:
            for i in range(len(term)):
                if item in term[i]:
                    result = ''.join(result.split(term[i]))
                else:
                    continue
        return result
Answer 0 (score: 0)
I think a regular expression is the better tool here.
    import re

    proprietary_terms = [
        "she", "personality matrix", "sense of self",
        "self-preservation", "learning algorithm", "her", "herself"
    ]

    def censor_email_two(email_string, terms, rep_str):
        subbed_str = email_string
        for t in terms:
            # \b on both sides so that only whole words/phrases match
            pat = r'\b%s\b' % t
            subbed_str = re.sub(pat, rep_str, subbed_str)
        # Run a split and join to remove double spaces created by the re.sub
        return ' '.join(subbed_str.split())

    estr = "Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!"
    censor_email_two(estr, proprietary_terms, '')
Resulting string:
"Not only that, but we have configured to allow for communication between the system and our team of researchers. That's how we know considers to be a ! We asked!"
You can use the rep_str argument to see more easily where the censoring happened:
censor_email_two(estr, proprietary_terms, "CENSORED")
"Not only that, but we have configured CENSORED CENSORED to allow for communication between the system and our team of researchers. That's how we know CENSORED considers CENSORED to be a CENSORED! We asked!"
Edit: added the rep_str feature.
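As a variation on the loop above, the terms can also be combined into a single pattern so the text is scanned once. This is only a sketch under a few assumptions: censor_terms is a hypothetical helper (not part of the answer's code), the terms are sorted longest first so that "herself" is tried before "her", and re.escape is applied in case a term ever contains regex metacharacters.

```python
import re

proprietary_terms = [
    "she", "personality matrix", "sense of self",
    "self-preservation", "learning algorithm", "her", "herself"
]

def censor_terms(text, terms, rep_str="CENSORED"):
    # Hypothetical helper: build one alternation out of all terms.
    # Longest terms come first so longer phrases win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps any special characters in a term literal.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.sub(pattern, rep_str, text)

sample = "We know she considers herself to be a she, unlike the researchers."
print(censor_terms(sample, proprietary_terms))
```

With the sample sentence, "she" and "herself" are replaced while "researchers" is left alone, because the "her" inside it is not surrounded by word boundaries.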
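Edit 2: some further explanation of the regex.
The r prefix marks the pattern as a raw string, so the backslashes reach the regex engine unchanged.
Then \b looks for a word boundary. From the docs: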
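Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.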
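To make the difference concrete, here is a small sketch comparing plain substring removal (what the split/join in the question effectively does) with a \b-delimited pattern. The sample text and the variable names are made up for illustration.

```python
import re

text = "her researchers know she considers herself"

# Plain substring removal: every occurrence of "her" disappears,
# including the ones inside "researchers" and "herself".
substring_result = "".join(text.split("her"))

# With \b on both sides, only the standalone word "her" is matched.
boundary_result = re.sub(r"\bher\b", "", text)

print(repr(substring_result))
print(repr(boundary_result))
```

The first result mangles "researchers" and "herself"; the second removes only the leading standalone "her" and leaves both longer words intact.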
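%s is string formatting; it is replaced by each term t in the loop. If you are using Python 3.6 or later, you could instead combine f-string notation with the r raw-string prefix: fr'\b{t}\b'.
I think you could technically use the .format() syntax as well, but because of how raw strings behave, the old % style is easier to use here.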
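As a quick check, the three spellings can be compared directly. This sketch assumes Python 3.6+ for the f-string; the variable names are only illustrative.

```python
import re

t = "her"

pat_percent = r'\b%s\b' % t        # old %-style formatting
pat_format = r'\b{}\b'.format(t)   # str.format on a raw string
pat_fstring = fr'\b{t}\b'          # raw f-string (Python 3.6+)

# All three build the same pattern text.
print(pat_percent == pat_format == pat_fstring)

# If a term might contain regex metacharacters, re.escape keeps it literal.
pat_escaped = r'\b%s\b' % re.escape("self-preservation")
print(re.sub(pat_escaped, "CENSORED", "its self-preservation instinct"))
```

One practical difference: with .format() or an f-string, any literal braces in the pattern (for example a quantifier such as {2}) have to be doubled, while with % a literal percent sign has to be doubled.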