Match longest substring in Python

Time: 2011-07-04 21:23:40

Tags: python substring

Consider that I have the following string in a text file, with a tab between the left part and the right part:

The dreams of REM (Geo) sleep         The sleep paralysis

I want to match the above string, both the left part and the right part, against each line of another file:

The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep. 

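If the full string cannot be matched, then try to match substrings.

I want to search with the leftmost and rightmost patterns. For example (leftmost cases):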

The dreams of REM  sleep     paralysis
The dreams of REM  sleep     The sleep

For example (rightmost cases):

REM  sleep    The sleep paralysis
The dreams of   The sleep paralysis

Thanks again for any help.

1 answer:

Answer 0: (score: 3)

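(OK, you clarified most of what you want. Let me restate it, then list the points below that are still unclear... Also, take the starter code I show you, adapt it, and post the result.)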

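You want to search line by line, case-insensitively, for the longest contiguous match to each pattern in a pair of match patterns. All your patterns seem to be disjoint (a line cannot match both patternX and patternY, since they use different phrases, e.g. it cannot match both 'frontal lobe' and 'prefrontal cortex').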

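Your patterns are supplied as a sequence of pairs ('dom', 'rang') => let's refer to them by their subscripts [0] and [1]; you can use string.split('\t') to get them. The important thing is that a matching line must match both the dom and the rang pattern (fully or partially). Order does not matter, so we can match rang then dom, or vice versa => use 2 separate regexes per line, and test that both d and r matched.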
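As a minimal sketch of that step, assuming a tab-separated pattern line (the sample pair and the helper name `matches_both` are illustrative, not from the question), splitting the pair and testing both halves independently could look like this:

```python
import re

# Hypothetical tab-separated pattern line: dom on the left, rang on the right.
pattern_line = "frontal lobe\tinferior frontal gyrus"
dom, rang = pattern_line.split("\t")

# One regex per half; re.escape makes the text match literally.
d = re.compile(re.escape(dom), re.IGNORECASE)
r = re.compile(re.escape(rang), re.IGNORECASE)

def matches_both(line):
    """True when the line contains both halves, in either order."""
    return bool(d.search(line) and r.search(line))

line = "The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain."
print(matches_both(line))
```

Because each half is searched separately, it does not matter whether rang appears before or after dom in the line.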

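Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using the (optionaltext)? form, keeping the separating space inside the optional group so that omitting it does not leave a double space, e.g.: re.compile('Frontallobes of (leftside )?the brain', re.IGNORECASE)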
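If you would rather not rewrite each pattern by hand, here is one possible automatic conversion. It is only a sketch: the function name `to_regex`, the use of non-capturing groups, and matching any run of whitespace with `\s+` are my assumptions, not something your pattern file requires.

```python
import re

def to_regex(pattern_text):
    """Compile text such as 'dreams of REM (Geo) sleep', treating '(Geo)' as optional."""
    # Tokens are either a parenthesised chunk or a run of non-space characters.
    tokens = re.findall(r"\([^)]*\)|\S+", pattern_text)
    regex = ""
    seen_word = False  # has a mandatory word been emitted yet?
    for tok in tokens:
        if tok.startswith("(") and tok.endswith(")"):
            inner = r"\s+".join(re.escape(w) for w in tok[1:-1].split())
            # Keep the separator inside the optional group so that dropping
            # the group never demands an extra space.
            if seen_word:
                regex += r"(?:\s+" + inner + ")?"
            else:
                regex += "(?:" + inner + r"\s+)?"
        else:
            regex += (r"\s+" if seen_word else "") + re.escape(tok)
            seen_word = True
    return re.compile(regex, re.IGNORECASE)

rx = to_regex("The dreams of REM (Geo) sleep")
print(bool(rx.search("as well as generating the dreams of REM sleep.")))
print(bool(rx.search("the dreams of rem geo sleep")))
```

Both searches succeed: the first line omits the optional word, the second includes it in a different case.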

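The return value should be the string buffer holding the longest substring match found so far.

Now, several things still need clarifying; please edit your question to explain the following: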

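  • If you find a full match to any pair of patterns, return it.
  • If you cannot find any full match, then search for partial matches to both patterns in the pair. Does a 'partial match' to a pattern mean 'the most words' or 'the highest proportion (%) of words'? Presumably we exclude spurious matches on words like 'the'; in that case we lose nothing by simply omitting 'the' from the search patterns, which guarantees that every partial match to any pattern is significant.
  • We score partial matches (somehow), e.g. 'contains the most words from pattern X' or 'contains the highest % of words from pattern X'. We should do that for all patterns, then return the pattern with the highest score. You need to think about this a little: is it better to match 2 words of a 5-word pattern (40%), e.g. 'dreams of', or 1 of 2 (50%), e.g. 'prefrontal' BUT NOT 'cortex'? How do we break ties, etc.? And what happens if we match only 'sleep'?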
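To make the two candidate scores concrete, here is a rough sketch. It is not a recommendation of either metric: the stop-word list, the word tokenisation, and the function names `words` and `score` are all assumptions for illustration, and it counts words anywhere in the line rather than contiguous matches.

```python
import re

STOP_WORDS = {"the"}  # assumed list; extend it as needed

def words(text):
    """Lower-cased alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def score(pattern_text, line):
    """Return (number of pattern words found in line, proportion of pattern words found)."""
    pattern_words = words(pattern_text)
    line_words = set(words(line))
    hits = sum(1 for w in pattern_words if w in line_words)
    proportion = hits / len(pattern_words) if pattern_words else 0.0
    return hits, proportion

# 2 of 5 words: more words matched, lower proportion.
print(score("dreams of REM Geo sleep", "he talked about dreams of flying"))
# 1 of 2 words: fewer words matched, higher proportion.
print(score("prefrontal cortex", "the prefrontal area"))
```

The two calls rank differently depending on which element of the tuple you sort by, which is exactly the decision you need to make.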

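Each of the questions above affects the solution, so you need to answer them for us. There is no point writing pages of code to handle the most general case when you only need something simple. In general, this is called 'NLP' (natural language processing). You may end up using an NLP library.

The general structure of the code so far looks something like this: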

import re

# Normally you would read the input directly from a file; an inline string lets us test.
# Note: split('\n') leaves a trailing empty string, which is processed as one extra line.
lines = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern  tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')

patterns = [
    # Note: '(Geo)? sleep' keeps a space on both sides of the optional group, so plain
    # 'REM sleep' (one space) does not match it; '(Geo )?sleep' would allow that.
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)',            '(inferior frontal gyrus)'),
    ('(prefrontal cortex)',       '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)',            '(probably mediodorsal nucleus)') ]

# Compile the patterns as case-insensitive regexes
patterns = [ (re.compile(dstr, re.IGNORECASE), re.compile(rstr, re.IGNORECASE))
             for (dstr, rstr) in patterns ]

def longest(t):
    """Get the longest item from a sequence of findall() results."""
    return max(t, key=len)

def custommatch(line):
    for (d, r) in patterns:
        # If we get a full match to both (d,r), return it immediately...
        (dm, rm) = (d.findall(line), r.findall(line))
        # Slight design problem: with several groups, findall() gives tuples like:
        #   [('frontallobes of the brain', '', 'the ')]
        # ... so return the longest match for each of dm, rm
        if dm and rm: # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d,r) - how exactly?
        # TBD...
    # We got here because we only have partial matches (or none)
    # TBD: return the 'highest-scoring' partial match
    return 'TBD... partial match'

for line in lines:
    print(custommatch(line))

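Running it on the 7 lines of input you supplied (plus the empty string that split('\n') leaves after the final newline, hence 8 results) gives: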

TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match
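The tuple in the fourth result appears because re.findall returns the captured groups, not the whole match, when a pattern contains more than one group. One way around it, sketched here with a shortened copy of that input line, is to use finditer and take group(0) of each match:

```python
import re

r = re.compile(r'(frontallobes of (leftside )?(the )?brain)', re.IGNORECASE)
line = "The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain."

# findall() yields one tuple of groups per match for a multi-group pattern.
print(r.findall(line))

# finditer() yields match objects; group(0) is always the full matched text.
whole = [m.group(0) for m in r.finditer(line)]
print(max(whole, key=len) if whole else None)
```

With plain strings in hand, picking the longest by len compares actual match lengths rather than the number of groups in a tuple.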