I have a text and a list of concepts, as shown below.
concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"
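I want to check whether the concepts in the list occur in text, and replace every occurrence of the terms in concepts[1:] with concepts[0]. So the result for the text above should be: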
"levels and data mining of dna data mining methylation"
My code is as follows:
concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"
if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
print(text)
However, I get this output:
levels and data mining mining of dna data mining source methylation
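It looks like the concept data has also been replaced by data mining, which is incorrect. More specifically, I want the longest option to be considered first when replacing.
It does not work even when I change the order of concepts.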
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
print(text)
For the code above, I get the following output.
levels and data mining mining of dna data mining mining methylation
I am happy to provide more details if needed.
Answer 0 (score: 3)
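The problem here is your iterative strategy of making one replacement at a time. Because your replacement term contains one of the terms you are replacing, you end up making substitutions inside text that an earlier iteration had already changed to the replacement term.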
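To make the cascade concrete, here is a minimal sketch that applies the same replacements one at a time and records the text after each step. It reuses the question's concepts and text; the names replacement and steps are introduced here purely for illustration.

```python
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Illustration only: "replacement" and "steps" are names made up for this sketch.
replacement = concepts[0]
steps = []
for term in concepts[1:]:
    # Each pass runs on the output of the previous pass, so text that was
    # just turned into "data mining" is scanned again for the next term.
    text = text.replace(term, replacement)
    steps.append((term, text))

for term, snapshot in steps:
    print(f"after replacing {term!r}: {snapshot}")
```

The first pass turns "data source" into "data mining", and the second pass then finds "data" inside every "data mining", which is where the extra "mining" comes from.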
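One way around this is to make all of these substitutions atomically, so that they all happen at the same time and the output of one never affects the result of another. There are a few strategies for this.
One of them is to use a function that performs atomic replacement over several alternatives, such as the sub() function of Python's re library. Here is an example of its use: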
import re
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
targets = sorted(concepts[1:], key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)
result = re.sub(target_re, concepts[0], text)
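Note that this will still produce data mining mining with your original set of replacements, because it has no notion of the mining that already follows data. If you want to avoid that, you can simply include the replacement term itself as one of the replacement targets, so that it is matched ahead of the shorter term: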
import re
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
#
# !!!No [1:] !!!
#
targets = sorted(concepts, key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)
result = re.sub(target_re, concepts[0], text)
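If it helps, the same idea can be wrapped in a small helper. The sketch below is not part of the original answer: the function name replace_concepts and the whole_words flag are assumptions made for illustration. With whole_words=True it wraps the alternation in \b word boundaries, which is one possible way to keep a short target such as data from matching inside a longer word such as metadata.

```python
import re


def replace_concepts(text, concepts, whole_words=True):
    """Replace every concept in `concepts` with concepts[0] in one pass.

    Hypothetical helper: the name and the `whole_words` flag are
    assumptions for this sketch, not part of the original answer.
    """
    replacement = concepts[0]
    # Longest targets first, so "data source" is tried before "data".
    targets = sorted(concepts, key=len, reverse=True)
    pattern = "|".join(re.escape(item) for item in targets)
    if whole_words:
        # Only match whole words; this is an added assumption.
        pattern = r"\b(?:" + pattern + r")\b"
    # A function replacement avoids backslash processing in the replacement.
    return re.sub(pattern, lambda match: replacement, text)


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_concepts(text, concepts))
```

Whether whole-word matching is wanted depends on the data; passing whole_words=False gives the same pattern as the snippet above.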
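Answer 1 (score: 1)
Amber's solution is very clean. I wrote a long-form version with some comments that walks through the words and looks ahead to check for matches. It should help you with the concepts your original code was missing (checking for multi-word matches and avoiding repeated replacement). It will not work for every "concepts" list, because it only handles replacements where the match has the same number of words as the replacement term or is a single word.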
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
textSplit = text.split()
maxX = len(textSplit)
tempSplit = concepts[0].split()
tempMax = len(tempSplit)
# walk the words with an explicit index so that matched words can be skipped
x = 0
while x < maxX:
    # look ahead for a multi-word match of the replacement term itself
    foundFullMatch = True
    for y in range(0, tempMax):
        if (x + tempMax <= maxX):
            if (textSplit[x+y] != tempSplit[y]):
                foundFullMatch = False
        else:
            foundFullMatch = False
    if (foundFullMatch):
        # already the replacement term - skip past it
        x = x + tempMax
        continue
    # by default, advance one word
    step = 1
    # now look at the rest of the list - make sure it is sorted with most words first
    for terms in concepts[1:]:
        tempSplit2 = terms.split()
        tempMax2 = len(tempSplit2)
        foundFullMatch = True
        for y in range(0, tempMax2):
            if (x + tempMax2 <= maxX):
                if (textSplit[x+y] != tempSplit2[y]):
                    foundFullMatch = False
            else:
                foundFullMatch = False
        if (foundFullMatch):
            if (tempMax == tempMax2):
                # found a match with the same number of words - replace word by word
                for y in range(0, tempMax2):
                    textSplit[x+y] = tempSplit[y]
                step = tempMax
                break
            elif (tempMax2 == 1):
                # found a one-word match - put the whole replacement in this slot
                textSplit[x] = concepts[0]
                break
    x = x + step
print(" ".join(textSplit))
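For lists where the terms have different word counts, one possible variation on the same look-ahead idea is to build a new word list instead of editing the split text in place. This is only a sketch under that assumption; the function name replace_concepts_by_words is made up for illustration, and it matches on whitespace-separated words exactly as the loop above does.

```python
def replace_concepts_by_words(text, concepts):
    """Word-level look-ahead replacement that builds a new list of words.

    Hypothetical helper for illustration; not part of the original answer.
    """
    words = text.split()
    replacement = concepts[0].split()
    # Try the candidates with the most words first. concepts[0] is included
    # so that an existing "data mining" is consumed whole and left unchanged.
    candidates = sorted((c.split() for c in concepts), key=len, reverse=True)
    out = []
    i = 0
    while i < len(words):
        for cand in candidates:
            if cand and words[i:i + len(cand)] == cand:
                out.extend(replacement)
                i += len(cand)  # skip every word that was just matched
                break
        else:
            # No candidate starts at this word: keep it as is.
            out.append(words[i])
            i += 1
    return " ".join(out)


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_concepts_by_words(text, concepts))
```

Because matched words are copied to a separate output list and then skipped, a replacement is never scanned again, so the number of words in each term no longer matters.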