Question

My input looks something like this:

# file contents
US,This
is the title
CA, New Title
CA, Newer Title

I want to get one entry per country. The final output should be:

# 3 items
['US, This is the title', 'CA, New Title', 'CA, Newer Title']

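I can split on the ISO code, but I also need to keep the code in the result. How should I modify the following regular expression so that it does that?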

re.split(r'\n[A-Z]{2,3},', contents)

Answer 1

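Use a lookahead. It asserts that a country code follows the newline without consuming it, so the code stays in the result.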

>>> re.split(r'\n(?=[A-Z]{2,3},)', contents)
['US,This\nis the title', 'CA, New Title', 'CA, Newer Title']
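A minimal sketch of why the lookahead matters, comparing the pattern from the question with the lookahead version. The `contents` string here is an assumption, reconstructed from the sample input in the question.

```python
import re

# Assumed sample input, reconstructed from the question.
contents = "US,This\nis the title\nCA, New Title\nCA, Newer Title"

# The original pattern consumes the country code together with the newline,
# so the code is lost from every piece after the first.
consumed = re.split(r'\n[A-Z]{2,3},', contents)

# The lookahead only asserts that a code follows; the split removes
# just the newline and leaves the code in place.
kept = re.split(r'\n(?=[A-Z]{2,3},)', contents)

print(consumed)
print(kept)
```

The line break inside the first title survives both splits because `is the title` does not start with an uppercase code followed by a comma.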

Answer 2

Using re.split and str.replace:

>>> s = """US,This
is the title
CA, New Title
CA, Newer Title"""
>>> [i.replace('\n', ' ') for i in re.split(r'\n(?=[A-Z]{2,3},)', s)]
['US,This is the title', 'CA, New Title', 'CA, Newer Title']

Using re.findall and str.replace:

>>> [i.replace('\n', ' ') for i in re.findall(r'(?s)(?:^|\n)([A-Z]{2,3},.*?)(?=\n[A-Z]{2,3},|$)', s)]
['US,This is the title', 'CA, New Title', 'CA, Newer Title']

To get exactly the desired output (with a space after every comma), use re.sub instead of str.replace.

>>> [re.sub(r'(?<=,)(?!\s)|\n', ' ', i) for i in re.findall(r'(?s)(?:^|\n)([A-Z]{2,3},.*?)(?=\n[A-Z]{2,3},|$)', s)]
['US, This is the title', 'CA, New Title', 'CA, Newer Title']

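(?<=,)(?!\s) matches the zero-width position immediately after a comma, provided that position is not followed by a whitespace character
| means OR
\n matches a newline character.

Replacing each matched position and each newline with a single space gives the desired output.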
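The two steps above (split on the lookahead, then normalize with re.sub) can be combined into one small function. This is only a sketch: the name `split_entries` and the `sample` string are hypothetical, and the `strip()` call is an added assumption to cope with a trailing newline at the end of the file.

```python
import re

def split_entries(text):
    """Return one normalized entry per country code (hypothetical helper)."""
    # Split at each newline that is followed by a 2-3 letter code and a comma.
    parts = re.split(r'\n(?=[A-Z]{2,3},)', text.strip())
    # Insert a space after any comma not followed by whitespace,
    # and turn the remaining newlines into spaces.
    return [re.sub(r'(?<=,)(?!\s)|\n', ' ', part) for part in parts]

# Assumed sample input, reconstructed from the question.
sample = "US,This\nis the title\nCA, New Title\nCA, Newer Title\n"
entries = split_entries(sample)
print(entries)
```

Note that a wrapped title line that itself begins with two or three capital letters and a comma would be treated as a new entry, so the pattern may need tightening for real data.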
