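Python regex: matching a line that may or may not exist

Time: 2011-02-17 17:52:09

Tags: python regex line

I have a small problem with my regular expression.

Here is a sample of the text to parse: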

output = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

I want to parse this and get back the country, the continent and, when it is present, the planet.

So I wrote this regex:

results = re.findall(
    r"""(?mx)
        ^country\s:\s*(.+)\s
        (?:^.+\s)*?
        ^continent\s:\s*(.+)\s
        (?:^.+\s)*?
        (?:^planet\s:\s*(.+)\s)*?
""",output)

But the result is:

[('USA', 'Americ', ''), ('China', 'Asia', ''), ('Izbud', 'Gladiora', '')]

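And I can't see what is wrong with my regex.

If anyone has an idea, thanks.

6 answers:

Answer 0 (score: 1)

I'll suggest what I would do, which is to avoid such a complicated regex altogether. Perhaps something like: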

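records = []
record = {}
for line in output.splitlines():
    words = line.split()
    if line.startswith('---'):
        # Separator line: store the finished record and start a new one
        records.append(record)
        record = {}
    elif words[:1] == ['country']:
        record['country'] = words[2]
    elif words[:1] == ['continent']:
        record['continent'] = words[2]
    # etc. for 'planet'
if record:
    records.append(record)  # the last record has no trailing separator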
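
The line-by-line idea can be fleshed out into a small runnable sketch. It assumes that records are separated by lines starting with `---` and that fields look like `key : value`; the function name `parse_records` and the tuple layout are illustrative choices, not something taken from the question.

```python
def parse_records(text):
    """Return (country, continent, planet) tuples; planet is '' when absent."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.startswith("---"):
            # Separator line: close the current record, if any
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("country", "continent", "planet"):
            current[key] = value.strip()
    if current:
        records.append(current)  # last record has no trailing separator
    return [
        (r.get("country", ""), r.get("continent", ""), r.get("planet", ""))
        for r in records
    ]


sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

print(parse_records(sample))
# [('USA', 'Americ', ''), ('China', 'Asia', 'Earth'), ('Izbud', 'Gladiora', 'Mars')]
```

Because the parser keys on the field name rather than on line position, junk lines between fields need no special handling.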

Answer 1 (score: 1)

I found a pattern that seems to work:

r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:^.+\s)*?
    (?:^(?:planet\s:\s*(.+)\s|-+\s|\Z))
"""

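Basically, I changed the last part so that it must match one of the following: the planet line, a run of dashes, or the end of the string. It's a bit ugly, but it's the only way I could find to make sure it picks up the planet line. One problem with my solution is that the string has to end with a newline (as in your example), otherwise it won't get the last match.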
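
To see both the behaviour and the trailing-newline caveat, the pattern can be run against the question's sample. This is only a sketch: `sample` is copied from the question, and `PATTERN`, `with_newline` and `without_newline` are names chosen here for illustration.

```python
import re

sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

PATTERN = re.compile(r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:^.+\s)*?
    (?:^(?:planet\s:\s*(.+)\s|-+\s|\Z))
""")

# With the final newline, all three records are found.
with_newline = PATTERN.findall(sample)
print(with_newline)
# [('USA', 'Americ', ''), ('China', 'Asia', 'Earth'), ('Izbud', 'Gladiora', 'Mars')]

# Without it, the last planet line has no whitespace after the value,
# so the last record is lost.
without_newline = PATTERN.findall(sample.rstrip("\n"))
print(without_newline)
# [('USA', 'Americ', ''), ('China', 'Asia', 'Earth')]
```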

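By the way, a partial solution is to fix the last line of the OP's pattern so that it ends with a single ? instead of *?. However, it would then only match the planet line when it comes immediately after the continent line. The reason nothing was captured before is that *? is lazy: it avoids matching whenever it can.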
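
The difference between the lazy `*?` and a plain `?` at the end of a pattern can be checked on two tiny inputs. This is a minimal sketch; the strings `adjacent` and `separated` and the names `lazy` and `optional` are made up for the illustration.

```python
import re

adjacent = "continent : Asia\nplanet : Earth\n"
separated = "continent : Gladiora\nzzzzzzz\nplanet : Mars\n"

# Tail as in the question: the lazy *? is already satisfied by zero
# repetitions, so the planet group never takes part in the match.
lazy = re.compile(
    r"(?m)^continent\s:\s*(.+)\s(?:^.+\s)*?(?:^planet\s:\s*(.+)\s)*?"
)

# Same tail with a plain ?: one repetition is tried before giving up.
optional = re.compile(
    r"(?m)^continent\s:\s*(.+)\s(?:^.+\s)*?(?:^planet\s:\s*(.+)\s)?"
)

print(lazy.findall(adjacent))       # [('Asia', '')]
print(optional.findall(adjacent))   # [('Asia', 'Earth')]

# The lazy junk-line skipper still prefers zero lines, so a planet line
# that is not directly after the continent line is missed.
print(optional.findall(separated))  # [('Gladiora', '')]
```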

Answer 2 (score: 0)

Try this:

"""(?mx)
        ^country\s:\s*(.+)\s
        (?:^.+\s)*?
        ^continent\s:\s*(.+)\s
        (?:^.+\s)*?
        ^planet\s:\s*(.+)\s.*
"""

Answer 3 (score: 0)

# Splitting on '\n' breaks the string into a list of lines
n_line = output.split('\n')

# Keep only the lines that contain a ':' (the "key : value" lines)
n_line = [n_text for n_text in n_line if ':' in n_text]

for l_text in n_line:
    n_split = l_text.split(':')
    if 'country' in n_split[0]:
        print(n_split[1])
    elif 'continent' in n_split[0]:
        print(n_split[1])
    elif 'planet' in n_split[0]:
        print(n_split[1])

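Answer 4 (score: 0)

I very much agree with everyone else who says you shouldn't do this with a regex. That said, you can make it work if you put a negative lookahead in front of each "junk" line before consuming it. E.g.: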

print(re.findall(r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:(?:(?!(?:planet|country|continent)\s:)^.+\s)*
       (?:^planet\s:\s*(.+)\s))?
""", output))
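
The guarding effect of the negative lookahead is easier to see in isolation. The sketch below first shows the junk-line skipper stopping at the next keyword line, then runs the full pattern on the question's sample; the names `junk`, `skipped`, `pattern` and `results` are illustrative only.

```python
import re

# Junk-line skipper on its own: it consumes whole lines, but refuses to
# start on a line that begins with one of the keywords.
junk = re.compile(r"(?m)(?:(?!(?:planet|country|continent)\s:)^.+\n)*")
skipped = junk.match("zzzzzzz\n------\ncountry : China\nzzzzzzz\n").group()
print(repr(skipped))  # 'zzzzzzz\n------\n'

sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

pattern = re.compile(r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:(?:(?!(?:planet|country|continent)\s:)^.+\s)*
       (?:^planet\s:\s*(.+)\s))?
""")

results = pattern.findall(sample)
print(results)
# [('USA', 'Americ', ''), ('China', 'Asia', 'Earth'), ('Izbud', 'Gladiora', 'Mars')]
```

Since the skipper cannot cross a `country :` line, the optional planet group can be greedy without borrowing a planet from the next record.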

Answer 5 (score: 0)

import re

# Raw string so that \Z reaches the regex engine as written
pat = re.compile(r'country : (.+)\n.+\ncontinent : (.+)(?:\n.*)*?(?:\nplanet : (.+)|\n-+|\n?\Z)')

output1 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars """

output2 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
"""

output3 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug"""

output4 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
"""

output5 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora"""

for ch in (output1, output2, output3, output4, output5):
    print(ch)
    print()
    print(repr(ch))
    print()
    print('\n'.join(repr(u) for u in pat.findall(ch)))
    print('======================================================================')

Result:

country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars 

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : China\nzziiiiiiiiiiiizz\ncontinent : Asia\nplanet : Earth\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\nplanet : Mars '

('USA', 'Americ', '')
('China', 'Asia', 'Earth')
('Izbud', 'Gladiora', 'Mars ')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug


'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\n'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora


'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\n'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================