Question

I have a text and a list of concepts, as shown below.

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"

I want to check whether the concepts in the list occur in text, and replace every occurrence of concepts[1:] with concepts[0]. So the result for the text above should be:

"levels and data mining of dna data mining methylation"

My code is as follows:

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
    print(text)

However, the output I get is:

levels and data mining mining of dna data mining source methylation

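It looks like the concept data has been replaced by data mining, which is not correct. More specifically, I want the longest option to be considered first when replacing.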

Even when I change the order of concepts, it does not work.

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
    print(text)

For the code above, I get the following output.

levels and data mining mining of dna data mining mining methylation

I am happy to provide more details if needed.

Answer 1

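The problem here is your iteration strategy, which replaces one term at a time. Because your replacement term contains one of the terms you are replacing, you end up making substitutions on text that an earlier iteration already changed into the replacement term.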
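To make the effect visible, here is a small sketch that records the text after each replace() call, using the reordered concepts list from the question. The steps list is only an illustration aid and is not part of the original code.

```python
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Keep every intermediate string so the cascade can be inspected.
steps = [text]
for term in concepts[1:]:
    # Each replace() operates on the output of the previous one,
    # so "data" also matches inside the freshly inserted "data mining".
    steps.append(steps[-1].replace(term, concepts[0]))

for step in steps:
    print(step)
# levels and data mining of dna data source methylation
# levels and data mining of dna data mining methylation
# levels and data mining mining of dna data mining mining methylation
```

The second line is already the desired result; the damage is done by the later pass for "data".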

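One way to solve this is to make all of these substitutions atomic, so that they all happen at once and the output of one substitution never affects the result of another. There are a few strategies:

1. You can break the string into tokens that match your various terms, and then replace them after the fact (making sure that nothing overlaps).
2. You can use a function that performs an atomic replacement over multiple options.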
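As a rough sketch of the first strategy, the text can be split into pieces with re.split and the matched pieces swapped afterwards. The helper name replace_by_tokens is hypothetical, and listing the replacement itself as a term (so that existing occurrences are left alone) is an assumption of this sketch.

```python
import re

def replace_by_tokens(text, replacement, targets):
    # Longest terms first, so "data source" is tried before "data".
    # The replacement is listed too, so existing copies become their own token.
    terms = sorted([replacement, *targets], key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in terms) + ")")
    # With one capturing group, re.split alternates between
    # unmatched text (even indices) and matched terms (odd indices).
    tokens = pattern.split(text)
    return "".join(
        replacement if index % 2 == 1 else token
        for index, token in enumerate(tokens)
    )

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_by_tokens(text, concepts[0], concepts[1:]))
# levels and data mining of dna data mining methylation
```

Because the tokens are cut once from the original text, a replacement can never be scanned again.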

An example of the second strategy is the sub() function in Python's re library. Here is an example of its usage:

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
targets = sorted(concepts[1:], key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)
print(result)  # levels and data mining mining of dna data mining methylation

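Note that this still produces data mining mining with your original replacement set, because it has no notion of the mining that already follows data. If you want to avoid that, you can simply include the item you are replacing with as a replacement target as well, so that it gets matched before the shorter term: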

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
#
# !!!No [1:] !!!
#
targets = sorted(concepts, key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)
print(result)  # levels and data mining of dna data mining methylation
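For reuse, the same idea can be wrapped in a function. This is only a sketch: the name replace_concepts and the whole_words option are assumptions that go beyond the answer. The option adds \b word boundaries so that "data" inside a longer word such as "database" is left alone, which presumes that every term starts and ends with a word character.

```python
import re

def replace_concepts(text, concepts, whole_words=False):
    """Replace every concept in `concepts` with concepts[0] in one pass."""
    replacement = concepts[0]
    # Longest first, so longer terms win over the shorter ones they contain.
    targets = sorted(concepts, key=len, reverse=True)
    pattern = "|".join(re.escape(item) for item in targets)
    if whole_words:
        # Assumption: terms begin and end with word characters.
        pattern = r"\b(?:" + pattern + r")\b"
    # A function replacement is inserted literally, without backslash processing.
    return re.sub(pattern, lambda match: replacement, text)

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_concepts(text, concepts))
# levels and data mining of dna data mining methylation
print(replace_concepts("database of data", concepts, whole_words=True))
# database of data mining
```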

Answer 2

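Amber's solution is very clean. I wrote a long-form version with some comments that walks through the words and looks ahead to check for matches. It should help with the concepts your original code was missing (checking for multi-word matches and avoiding repeated replacement). It does not work for every "concepts" list, because it only handles replacing matches that have the same number of words as concepts[0] or that are a single word.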

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
textSplit = text.split()
maxX = len(textSplit)
tempSplit = concepts[0].split()
tempMax = len(tempSplit)

def matchesAt(words, x):
    # look ahead: True if the word list occurs in textSplit starting at index x
    # (a slice that runs off the end is shorter, so it compares unequal)
    return textSplit[x:x + len(words)] == words

# a while loop lets the index jump past multi-word matches
x = 0
while x < maxX:
    if matchesAt(tempSplit, x):
        # already the replacement concept - skip past it
        x += tempMax
        continue
    step = 1
    # now look at the rest of the list - make sure it is sorted with most words first
    for terms in concepts[1:]:
        tempSplit2 = terms.split()
        tempMax2 = len(tempSplit2)
        if not matchesAt(tempSplit2, x):
            continue
        if tempMax2 == tempMax:
            # found a match with the same number of words - replace word by word
            textSplit[x:x + tempMax2] = tempSplit
            step = tempMax
            break
        elif tempMax2 == 1:
            # covers a one-word match - the whole concept goes into one slot
            textSplit[x] = concepts[0]
            break
    x += step
print(" ".join(textSplit))
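The word-count restriction of the approach above comes from writing the result back into the same list. A sketch that builds a new list of words instead, and so accepts concepts of any length, could look like this. The function name replace_concepts_by_words is hypothetical, and treating concepts[0] as a candidate (so existing occurrences are copied through unchanged) is an assumption of the sketch.

```python
def replace_concepts_by_words(text, concepts):
    words = text.split()
    replacement = concepts[0].split()
    # Split each concept into words, ignoring empty entries,
    # and try the concepts with the most words first.
    candidates = [c.split() for c in concepts if c.strip()]
    candidates.sort(key=len, reverse=True)
    out = []
    i = 0
    while i < len(words):
        for cand in candidates:
            if words[i:i + len(cand)] == cand:
                # Matched a concept: emit the replacement and skip its words.
                out.extend(replacement)
                i += len(cand)
                break
        else:
            # No concept starts here: copy the word through.
            out.append(words[i])
            i += 1
    return " ".join(out)

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_concepts_by_words(text, concepts))
# levels and data mining of dna data mining methylation
```

Like the code above, this compares whole words only and normalizes whitespace to single spaces.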

How to check for the longest substring in Python
