Suppose I have the following string in a text file, with a tab between its left and right parts:
The dreams of REM (Geo) sleep The sleep paralysis
I want to match the string above, both its left part and its right part, against each line of another file:
The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
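If the full string cannot be matched, then try to match a substring of it.
I want to search using the leftmost and rightmost patterns. For example (leftmost case):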
The dreams of REM sleep paralysis
The dreams of REM sleep The sleep
For example (rightmost case):
REM sleep The sleep paralysis
The dreams of The sleep paralysis
Thanks again for any help.
Answer 0 (score: 3)
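(OK, you have clarified most of what you want. Let me restate it, then list the points that are still unclear. Also, take the starter code I show you here, adapt it, and post the results.)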
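You want to search line by line, case-insensitively, for the longest contiguous match to each pattern in a pair of patterns. All the patterns appear to be disjoint (a line cannot match both patternX and patternY, because they use different phrases; e.g. it cannot match both 'frontal lobe' and 'prefrontal cortex').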
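Your patterns are supplied as a sequence of pairs ('dom', 'rang') => let's refer to them by their subscripts [0] and [1] (you can use string.split('\t') to get them). The important thing is that a matching line must match both the dom and the rang pattern (fully or partially). Order does not matter, so we can match rang then dom, or vice versa => use 2 separate regexes per line and test that both d and r match.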
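The pair handling described above can be sketched as follows. This is only an illustration under stated assumptions: `load_pair` and `matches_both` are hypothetical helper names, the pattern text is treated literally (optional parts in parentheses are not handled here), and each pattern line is assumed to contain exactly one tab.

```python
import re

def load_pair(raw_line):
    """Split one tab-separated pattern line into compiled (dom, rang) regexes."""
    dom_text, rang_text = raw_line.rstrip("\n").split("\t")
    # re.escape treats the text literally; this sketch ignores optional parts.
    return (re.compile(re.escape(dom_text), re.IGNORECASE),
            re.compile(re.escape(rang_text), re.IGNORECASE))

def matches_both(pair, line):
    """True when both halves occur somewhere in the line, in either order."""
    dom, rang = pair
    return dom.search(line) is not None and rang.search(line) is not None

pair = load_pair("frontal lobe\tinferior frontal gyrus")
line = "The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain."
print(matches_both(pair, line))
```

Because each half is searched independently, it does not matter whether the rang text appears before or after the dom text in the line.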
Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using the (optionaltext)? syntax, e.g.: re.compile('Frontallobes of (leftside)? the brain', re.IGNORECASE)
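One detail to watch when converting by hand: a pattern such as 'REM (Geo)? sleep' still requires both spaces around the optional word, so it will not match plain "REM sleep". A possible way around this, sketched under assumptions, is to fold the whitespace into the optional group. The helper name `to_regex` is hypothetical, and the sketch assumes parentheses are balanced and not nested.

```python
import re

def to_regex(pattern_text):
    """Turn 'REM (Geo) sleep' into a regex source where '(Geo)' is optional."""
    # Tokens are either a parenthesised chunk or a plain word.
    tokens = re.findall(r"\([^()]*\)|[^\s()]+", pattern_text)
    pieces = []
    started = False  # becomes True once a required word has been emitted
    for tok in tokens:
        optional = tok.startswith("(")
        body = r"\s+".join(re.escape(w) for w in tok.strip("()").split())
        if not body:
            continue  # skip empty parentheses
        if optional:
            # Keep the separating whitespace inside the optional group.
            pieces.append(r"(?:\s+%s)?" % body if started else r"(?:%s\s+)?" % body)
        else:
            pieces.append(r"\s+" + body if started else body)
            started = True
    return "".join(pieces)

rx = re.compile(to_regex("The dreams of REM (Geo) sleep"), re.IGNORECASE)
line = ("The pons also contains the sleep paralysis center of the brain "
        "as well as generating the dreams of REM sleep.")
print(rx.search(line).group(0))
```

Using non-capturing groups `(?:...)` here also keeps the optional parts out of any `findall()` results.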
The return value should be the string buffer holding the longest substring match found so far.
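How to score a partial match is still an open question, so the following is just one possible reading, not a settled design: treat both strings as word lists and take the longest run of consecutive words they share, using the standard-library `difflib.SequenceMatcher`. The function name `longest_shared_run` and the choice to compare lower-cased words with punctuation stripped are assumptions made for this sketch.

```python
import re
from difflib import SequenceMatcher

def longest_shared_run(pattern_text, line):
    """Return the longest run of consecutive words found in both strings."""
    # Compare lower-cased words; punctuation and parentheses are dropped.
    p_words = re.findall(r"\w+", pattern_text.lower())
    l_words = re.findall(r"\w+", line.lower())
    sm = SequenceMatcher(None, p_words, l_words, autojunk=False)
    m = sm.find_longest_match(0, len(p_words), 0, len(l_words))
    return " ".join(p_words[m.a:m.a + m.size])

line = ("The pons also contains the sleep paralysis center of the brain "
        "as well as generating the dreams of REM sleep.")
print(longest_shared_run("The dreams of REM (Geo) sleep", line))
print(longest_shared_run("The sleep paralysis", line))
```

The length of the returned run (in words or characters) could then serve as the score for choosing between candidate partial matches.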
Now, a few things still need clarifying. Please edit your question to explain the following:
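Each of the questions above affects the solution, so you need to answer them for us. There is no point in writing pages of code to solve the most general case when you only need something simple. In general, this kind of task is called "NLP" (natural language processing). You may end up using an NLP library.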
So far, the general structure of the code sounds like this:
import re

# Normally you would read your input directly from a file, but this allows us to test:
lines = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')

patterns = [
    # Note: as written, '(Geo)? ' still requires two spaces when 'Geo' is absent,
    # so this pair does not fully match plain 'dreams of REM sleep'.
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)', '(inferior frontal gyrus)'),
    ('(prefrontal cortex)', '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)', '(probably mediodorsal nucleus)'),
]

# Compile the patterns as case-insensitive regexes
patterns = [(re.compile(dstr, re.IGNORECASE), re.compile(rstr, re.IGNORECASE))
            for (dstr, rstr) in patterns]

def longest(t):
    """Get the longest item from a sequence of findall() results."""
    l = list(t)  # copy into a list so it can be sorted in place
    l.sort(key=len, reverse=True)
    return l[0]

def custommatch(line):
    for (d, r) in patterns:
        # If we get a full match to both (d, r), return it immediately...
        (dm, rm) = (d.findall(line), r.findall(line))
        # Slight design problem: with several groups, findall() gives tuples
        # like [('frontallobes of the brain', '', 'the ')]
        # ... so return the longest match for each of dm, rm
        if dm and rm:  # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d, r) - how exactly?
        # TBD...
    else:
        # for/else: we got here because no pair matched fully
        # (only partial matches, or none)
        # TBD: return the 'highest-scoring' partial match
        return 'TBD... partial match'

for line in lines:
    print(custommatch(line))
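Running it on the 7 lines of input you supplied gives the following (the eighth result comes from the empty string that split('\n') leaves after the final newline):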
TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match
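The tuple in the fourth result comes from `findall()`, which returns one tuple per match when a pattern contains more than one group. One way to get plain strings instead, sketched here with a hypothetical helper `longest_match_text`, is to iterate with `finditer()` and take `group(0)`, the full text of each match.

```python
import re

def longest_match_text(regex, line):
    """Longest full-match text of regex in line, or None if there is no match."""
    texts = [m.group(0) for m in regex.finditer(line)]
    return max(texts, key=len) if texts else None

rang = re.compile(r"frontallobes of (leftside )?(the )?brain", re.IGNORECASE)
line = ("The prefrontal cortex (PFC) is the anterior part of the frontallobes "
        "of the brain, lying in front of the motor and premotor areas.")
print(longest_match_text(rang, line))
```

With this approach the outer parentheses around each whole pattern are no longer needed, since `group(0)` always refers to the entire match.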