I'm having some trouble with my regular expression.
Here is a sample of the text to parse:
output = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""
I want to parse this and get back the country, the continent, and finally the planet.
So I wrote this regular expression:
results = re.findall(
r"""(?mx)
^country\s:\s*(.+)\s
(?:^.+\s)*?
^continent\s:\s*(.+)\s
(?:^.+\s)*?
(?:^planet\s:\s*(.+)\s)*?
""",output)
But what it returns is:
[('USA', 'Americ', ''), ('China', 'Asia', ''), ('Izbud', 'Gladiora', '')]
And I can't see what is wrong with my regex.
If anyone has an idea, thanks.
Answer 0 (score: 1)
I'll suggest what I would do, which is to avoid such a complicated regex altogether. Perhaps something like:
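records = []   # one dict per block
record = {}
for line in output.splitlines():
    words = line.split()
    if line.startswith('---'):
        # Do cleanup stuff: store the finished record and start a new one
        records.append(record)
        record = {}
    elif 'country' in words:
        record['country'] = words[2]
    elif 'continent' in words:
        record['continent'] = words[2]
    # etc...
if record:
    records.append(record)   # update your list or dict or w/e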
Answer 1 (score: 1)
I found a pattern that seems to work:
r"""(?mx)
^country\s:\s*(.+)\s
(?:^.+\s)*?
^continent\s:\s*(.+)\s
(?:^.+\s)*?
(?:^(?:planet\s:\s*(.+)\s|-+\s|\Z))
"""
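Basically, I changed the last part so that it must match one of the following: the planet line, a run of dashes, or the end of the string. It's a bit ugly, but it's the only way I could find to make sure it picks up the planet line. One problem with my solution is that the string has to end with a newline (as in your example), otherwise it won't get the last match.
By the way, a partial solution is to fix the last line of the OP's pattern so that it ends with just ? instead of *?. However, it would then only match the planet line when that line comes immediately after the continent line. The reason nothing was captured before is that *? is lazy: it avoids matching whenever it can.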
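To see the laziness point in isolation, here is a minimal sketch (not from the original post; the one-line `line` string is just an illustrative input) that compares the OP's `*?` quantifier with a plain `?` on a single planet line:

```python
import re

# Illustrative input: a single planet line, as in the question's sample.
line = "planet : Earth\n"

# Lazy quantifier at the end of a pattern: zero repetitions already satisfy it.
lazy = re.match(r"(?:planet\s:\s*(.+)\s)*?", line)

# Greedy optional group: it tries to match the planet line first.
greedy = re.match(r"(?:planet\s:\s*(.+)\s)?", line)

print(repr(lazy.group(0)), lazy.group(1))      # '' None
print(repr(greedy.group(0)), greedy.group(1))  # 'planet : Earth\n' Earth
```

re.findall reports a group that did not take part in the match as an empty string, which is why the third element of every tuple in the question is ''.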
Answer 2 (score: 0)
Try this:
"""(?mx)
^country\s:\s*(.+)\s
(?:^.+\s)*?
^continent\s:\s*(.+)\s
(?:^.+\s)*?
^planet\s:\s*(.+)\s.*
"""
Answer 3 (score: 0)
# Splitting on '\n' breaks the string into a list of lines
n_line = output.split('\n')
tempn_line = n_line[:]
# Loop over a copy of the list and drop every line without a ':'
for n_text in tempn_line:
    if ':' not in n_text:
        # print(n_text)
        n_line.remove(n_text)
for l_text in n_line:
    n_split = l_text.split(':')
    # print(n_split)
    if 'country' in n_split[0]:
        print(n_split[1])
    elif 'continent' in n_split[0]:
        print(n_split[1])
    elif 'planet' in n_split[0]:
        print(n_split[1])
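If you want the values grouped per record rather than printed one by one, one possible sketch is to split the text on the dashed separator lines first and then pick the fields out of each block. The function name `parse_blocks` and the `FIELDS` tuple are made up for this example, and it assumes records are always separated by a line consisting only of dashes:

```python
import re

# Hypothetical helper names; the field list mirrors the question's sample.
FIELDS = ("country", "continent", "planet")

def parse_blocks(text):
    """Return one (country, continent, planet) tuple per dash-separated block."""
    rows = []
    # Split on lines made up only of dashes.
    for block in re.split(r"(?m)^-+$", text):
        values = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                values[key.strip()] = value.strip()
        if values:
            # A missing field becomes '', like an unmatched group in re.findall.
            rows.append(tuple(values.get(name, "") for name in FIELDS))
    return rows

sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

print(parse_blocks(sample))
# [('USA', 'Americ', ''), ('China', 'Asia', 'Earth'), ('Izbud', 'Gladiora', 'Mars')]
```

Because each block is handled on its own, a record without a planet line cannot pick up the planet of the next record.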
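Answer 4 (score: 0)
I very much agree with everyone else that you shouldn't be doing this with a regex. That said, you can make it work if you put a negative lookahead in front of each "junk" line before consuming it. E.g.: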
print(re.findall(r"""(?mx)
^country\s:\s*(.+)\s
(?:^.+\s)*?
^continent\s:\s*(.+)\s
(?:(?:(?!(?:planet|country|continent)\s:)^.+\s)*
(?:^planet\s:\s*(.+)\s))?
""", output))
Answer 5 (score: 0)
import re
pat = re.compile(r'country : (.+)\n.+\ncontinent : (.+)(?:\n.*)*?(?:\nplanet : (.+)|\n-+|\n?\Z)')
output1 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars """
output2 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
"""
output3 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug"""
output4 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
"""
output5 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora"""
for ch in (output1, output2, output3, output4, output5):
    print(ch)
    print()
    print(repr(ch))
    print()
    print('\n'.join(repr(u) for u in pat.findall(ch)))
    print('======================================================================')
Result:
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars
'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : China\nzziiiiiiiiiiiizz\ncontinent : Asia\nplanet : Earth\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\nplanet : Mars '
('USA', 'Americ', '')
('China', 'Asia', 'Earth')
('Izbud', 'Gladiora', 'Mars ')
======================================================================
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\n'
('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug'
('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\n'
('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora'
('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================