Question

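email_two is a global string holding several paragraphs, which include the words "researchers" and "herself". I have to censor the words in email_two that appear in the proprietary_terms list (inside the function it is called term). However, when I use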

email_two_new = email_two.split()

for item in email_two_new:
    for i in range(len(term)):
      if item in term[i]:

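it cuts "her" out of both "researchers" and "herself". "researchers" should not be censored at all, and "herself" should be censored as a whole. I checked that "researchers" is not in "her", so it should not be cut, and item prints as the whole string of each word rather than as each character of the word, so I don't know what went wrong.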

proprietary_terms = ["she", "personality matrix", "sense of self", "self-preservation", "learning algorithm", "her", "herself"]
def censor_email_two(term):
  result = email_two
  email_two_new = email_two.split()

  for item in email_two_new:
    for i in range(len(term)):
      if item in term[i]:
        result = ''.join(result.split(term[i]))
      else:
        continue
  return result

Answer 1

I think this is best done with a regular expression.

import re

proprietary_terms = [
    "she", "personality matrix", "sense of self",
    "self-preservation", "learning algorithm", "her", "herself"
]

def censor_email_two(email_string, terms, rep_str):
    subbed_str = email_string
    for t in terms:
        # \b on both sides makes the term match only as a whole word or phrase
        pat = r'\b%s\b' % t
        subbed_str = re.sub(pat, rep_str, subbed_str)
    # Split and re-join to collapse the double spaces left behind by re.sub
    return ' '.join(subbed_str.split())

estr = "Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!"

censor_email_two(estr, proprietary_terms, '')

Resulting string:

"Not only that, but we have configured to allow for communication between the system and our team of researchers. That's how we know considers to be a ! We asked!"

You can use the rep_str parameter to see more easily where the censoring happened:

censor_email_two(estr, proprietary_terms, "CENSORED")

"Not only that, but we have configured CENSORED CENSORED to allow for communication between the system and our team of researchers. That's how we know CENSORED considers CENSORED to be a CENSORED! We asked!"

Edit: added the rep_str feature.
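A possible variation, not part of the approach above: if a term could ever contain regex metacharacters (for example "." or "+"), it may be safer to pass each term through re.escape first. The sketch below also folds all terms into one alternation so the text is scanned once. The function name censor_single_pass and the sample sentence are made up for illustration.

```python
import re

proprietary_terms = [
    "she", "personality matrix", "sense of self",
    "self-preservation", "learning algorithm", "her", "herself"
]

def censor_single_pass(email_string, terms, rep_str):
    # Hypothetical helper: escape each term so it is matched literally
    alternatives = "|".join(re.escape(t) for t in terms)
    # \b on both sides still restricts matches to whole words or phrases
    pattern = re.compile(r"\b(?:%s)\b" % alternatives)
    subbed_str = pattern.sub(rep_str, email_string)
    # Collapse any double spaces left behind by the substitution
    return " ".join(subbed_str.split())

sample = "That's how we know she considers herself to be a she!"
print(censor_single_pass(sample, proprietary_terms, "CENSORED"))
```

Because the trailing \b has to succeed, the regex engine backtracks from "her" to "herself" on its own, so the order of the terms in the list should not matter here.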

Edit 2: a further explanation of the regular expression.

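The r prefix denotes a raw string.

Then \b looks for a word boundary. From the documentation:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.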
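To make the boundary behaviour concrete, here is a minimal sketch comparing plain substring removal (which is effectively what the split/join in the question does) with a \b-anchored pattern. The sample sentence and variable names are made up for illustration.

```python
import re

text = "our team of researchers asked her about herself"

# Plain substring removal: every occurrence of "her" goes,
# including the ones inside "researchers" and "herself".
naive = "".join(text.split("her"))

# Word-boundary match: only the standalone word "her" is removed.
bounded = re.sub(r"\bher\b", "", text)

print(naive)    # our team of researcs asked  about self
print(bounded)  # our team of researchers asked  about herself
```

In the question's code the `in` check is made against a single word, but the removal is then applied to the whole email string, which is why "researchers" and "herself" get damaged as well.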

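%s is string formatting and is replaced by each term t in the loop. If you are using Python 3.6 or later, you can instead combine f-string notation with the r raw-string prefix: fr'\b{t}\b'.

I think you could technically use the .format() syntax as well, for example r'\b{}\b'.format(t), but with raw strings I find the old % style easier.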
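As a quick check, the three formatting styles mentioned here should build the same pattern string. This is a minimal sketch; the variable names are illustrative, and the f-string line assumes Python 3.6 or later.

```python
import re

t = "personality matrix"

pat_percent = r'\b%s\b' % t        # old %-style formatting
pat_fstring = fr'\b{t}\b'          # raw f-string, Python 3.6+
pat_format = r'\b{}\b'.format(t)   # str.format on a raw string

# All three produce the same pattern text
print(pat_percent == pat_fstring == pat_format)

# The pattern matches the phrase only as whole words
print(re.sub(pat_fstring, "CENSORED", "her personality matrix"))
```

One practical difference: in f-strings and .format(), literal braces in a pattern (such as a {2,3} quantifier) have to be doubled, whereas % formatting leaves braces alone.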
