How to check for the longest substring in Python

Time: 2019-01-22 00:04:27

Tags: python

I have a text and a list of concepts, as follows.

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"

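I want to identify whether the concepts in the list occur in text, and replace every occurrence of concepts[1:] with concepts[0]. So the result for the text above should be: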

"levels and data mining of dna data mining methylation"

My code is as follows:

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
    print(text)

However, the output I get is:

levels and data mining mining of dna data mining source methylation

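It looks like the concept data has been replaced with data mining, which is incorrect. More specifically, I want the longest option to be considered first when replacing.

Even when I change the order of concepts, it still does not work.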

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text=text.replace(terms,concepts[0])
        text=' '.join(text.split())
    print(text)

For the code above, I get the following output.

levels and data mining mining of dna data mining mining methylation

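I am happy to provide more details if needed.

2 answers:

Answer 0: (score: 3)

The problem here is your iteration strategy, which performs one replacement at a time. Because your replacement term contains one of the terms you are replacing, you end up making replacements on text that a previous iteration had already changed into the replacement term.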
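To make the cascade visible, here is a minimal sketch that records the text after each pass of the loop from the question. The steps list is only a name introduced for this illustration.

```python
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

steps = [text]  # hypothetical helper list: the text after each pass
for term in concepts[1:]:
    # Each pass works on the output of the previous pass, so it also
    # sees every "data" that an earlier replacement introduced
    text = text.replace(term, concepts[0])
    steps.append(text)

for step in steps:
    print(step)
# levels and data mining of dna data source methylation
# levels and data mining of dna data mining methylation
# levels and data mining mining of dna data mining mining methylation
```

The second pass is where things go wrong: replacing data also hits the data inside every data mining, including the one the first pass had just written.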

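One way to get around this is to perform all of the replacements atomically, so that they all happen at the same time and the output of one replacement never affects the result of another. There are a few strategies for this:

  1. You can break the string into tokens that match your various terms, and then replace them after the fact (making sure that none of them overlap).
  2. You can use a function that performs atomic replacement on multiple options.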
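Strategy 1 could be sketched roughly as follows. This is only one possible reading of it, not code from the answer: tokenize is a hypothetical helper that scans the string once, prefers the longest term at each position, and never lets two tokens overlap. The replacement is applied only after tokenizing is finished.

```python
def tokenize(text, terms):
    """Split text into (is_term, piece) tokens, longest term first at each position."""
    # Ignore empty terms so the scan always makes progress
    ordered = sorted((t for t in terms if t), key=len, reverse=True)
    tokens = []
    literal = []  # characters that do not belong to any term
    i = 0
    while i < len(text):
        match = next((t for t in ordered if text.startswith(t, i)), None)
        if match is None:
            literal.append(text[i])
            i += 1
        else:
            if literal:
                tokens.append((False, "".join(literal)))
                literal = []
            tokens.append((True, match))
            i += len(match)  # jump past the match so tokens never overlap
    if literal:
        tokens.append((False, "".join(literal)))
    return tokens


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Replace after the fact: every term token becomes concepts[0]
tokens = tokenize(text, concepts)
result = "".join(concepts[0] if is_term else piece for is_term, piece in tokens)
print(result)
# levels and data mining of dna data mining methylation
```

Because the replacement text is never scanned again, a data inside an already matched data mining cannot be replaced a second time.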

An example of #2 is the sub() function of Python's re module. Here is an example of its use:

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
targets = sorted(concepts[1:], key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)

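Note that this will still produce data mining mining with your original set of replacements, because the pattern has no notion of the mining that already follows data. If you want to avoid that, you can simply include the replacement term itself among the replacement targets, so that it is matched before the shorter term: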

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
#
# !!!No [1:] !!!
#
targets = sorted(concepts, key=lambda x: len(x), reverse=True)
# Use re.escape to generate version of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)
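For reuse, the same single-pass idea might be wrapped in a function. The sketch below is not part of the answer: the name replace_concepts and the whole_words option are assumptions added for illustration. The option wraps the pattern in \b word boundaries so that data inside a longer word such as database is left alone.

```python
import re


def replace_concepts(text, concepts, whole_words=False):
    """Replace every concept found in text with concepts[0] in a single pass."""
    # Longest first, so "data source" wins over "data" at the same position
    targets = sorted(concepts, key=len, reverse=True)
    pattern = "|".join(re.escape(item) for item in targets)
    if whole_words:
        # Assumption, not in the answer above: match whole words only
        pattern = r"\b(?:" + pattern + r")\b"
    # A function as replacement keeps backslashes in concepts[0] literal
    return re.sub(pattern, lambda match: concepts[0], text)


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

print(replace_concepts(text, concepts))
# levels and data mining of dna data mining methylation
print(replace_concepts("database of data", concepts, whole_words=True))
# database of data mining
```

Without whole_words=True, the second call would also rewrite the data at the start of database.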

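Answer 1: (score: 1)

Amber's solution is very clean. I wrote a long-form version with some comments that walks through the words and looks ahead to check for matches. It should help with the concepts your original code was missing (checking for multi-word matches and avoiding double replacement). It will not work for every "concepts" list, because it only handles replacements for matches that have the same number of words as concepts[0] or that are a single word.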

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
textSplit = text.split()
maxX = len(textSplit)
tempSplit = concepts[0].split()
tempMax = len(tempSplit)
# use a while loop so the index can really skip past words already handled
# (reassigning x inside "for x in range(...)" has no effect on the next iteration)
x = 0
while x < maxX:
    # look ahead for the multi-word replacement term itself
    if textSplit[x:x+tempMax] == tempSplit:
        # already the replacement term - skip past it
        x = x + tempMax
        continue
    step = 1
    # now look at the rest of the list - make sure it is sorted with most words first
    for terms in concepts[1:]:
        tempSplit2 = terms.split()
        tempMax2 = len(tempSplit2)
        if textSplit[x:x+tempMax2] != tempSplit2:
            continue
        if tempMax == tempMax2:
            # found a match with the same number of words - replace word by word
            textSplit[x:x+tempMax2] = tempSplit
            step = tempMax
            break
        elif tempMax2 == 1:
            # covers a single-word match
            textSplit[x] = concepts[0]
            break
    # stop checking terms after a replacement, so it is never replaced again
    x = x + step
print(" ".join(textSplit))
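The word-count limitation mentioned in this answer could be lifted by building a new word list instead of editing the split text in place. The following is a rough sketch under that assumption, not the answerer's code, and replace_by_words is a hypothetical name.

```python
def replace_by_words(text, concepts):
    """Word-level, longest-match-first replacement of every concept by concepts[0]."""
    words = text.split()
    replacement = concepts[0].split()
    # Try the concepts with the most words first
    candidates = sorted((c.split() for c in concepts), key=len, reverse=True)
    out = []
    i = 0
    while i < len(words):
        for cand in candidates:
            if cand and words[i:i + len(cand)] == cand:
                out.extend(replacement)
                i += len(cand)  # skip the matched words
                break
        else:
            # no concept starts at this word - keep it unchanged
            out.append(words[i])
            i += 1
    return " ".join(out)


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_by_words(text, concepts))
# levels and data mining of dna data mining methylation
```

Since matching happens on whole words, data inside a longer word is never touched, and a match may have any number of words. Like the code above, it normalizes runs of whitespace to single spaces.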