Question

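I am looking to split up run-together typos in my strings using a regex, inserting a space between the matched expressions.

I tried the solution from a similar question (Insert space between characters regex), which is to use '\1 \2' as the replacement string in re.sub, but it did not work for me.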

import re

corpus = ''' 
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''

clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)

clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)

Expected output:

This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.

Answer 1

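You need to put capture-group parentheses around each part of the pattern whose matched text you want to copy into the result.

There is also no need for + after \d. You only need to match the last digit of the number.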

clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
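As a quick check of the two points above, the following sketch runs the pattern on the sample sentence from the question and compares it with a `\d+` variant. The variable names (`sample`, `last_digit`, `all_digits`) are illustrative and not part of the original answer.

```python
import re

# Sample sentence from the question, on a single line for readability.
sample = "This is my corpus1a.I am looking to convert it into a 2corpus 2b."

# Pattern from this answer: one digit, then one character that is
# not a digit, comma, or whitespace.
last_digit = re.compile(r'(\d)([^\d,\s])')
# Variant with + after \d, for comparison.
all_digits = re.compile(r'(\d+)([^\d,\s])')

result_last = last_digit.sub(r'\1 \2', sample)
result_all = all_digits.sub(r'\1 \2', sample)

print(result_last)
# This is my corpus1 a.I am looking to convert it into a 2 corpus 2 b.
print(result_last == result_all)  # True
```

Both variants give the same result here, because the space is always inserted after the last digit of a run. This pattern only handles a digit followed by a non-digit, so "corpus1" and ".I" remain joined in the output.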


Answer 2

I am not sure about other possible inputs, but we might be able to add the spaces with an expression similar to:

(\d+)([a-z]+)\b

After that, we would replace any run of two or more spaces with a single space. It might work, though I am not sure:

import re

print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))

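The expression is explained in the top right panel of this demo, if you wish to explore or modify it further. In this link, you can watch step by step how it matches against some sample inputs, if you like.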
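To make the one-liner easier to follow, here is the same logic split into its two steps, with the intermediate string shown. This is only a sketch; the names `text`, `spaced`, and `collapsed` are illustrative.

```python
import re

# Same input as in the one-liner above (no trailing period).
text = "This is my corpus1a.I am looking to convert it into a 2corpus 2b"

# Step 1: put a space before the digits and between digits and the
# lowercase letters that follow them.
spaced = re.sub(r"(\d+)([a-z]+)\b", r" \1 \2", text)
print(spaced)
# This is my corpus 1 a.I am looking to convert it into a  2 corpus  2 b

# Step 2: collapse the doubled spaces that step 1 introduced.
collapsed = re.sub(r"\s{2,}", " ", spaced)
print(collapsed)
# This is my corpus 1 a.I am looking to convert it into a 2 corpus 2 b
```

Two limitations are worth noting: `[a-z]` matches lowercase ASCII letters only, so an input such as "2Corpus" would be left alone, and the missing space after the period in "a.I" is not addressed by this expression.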

Answer 3

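Capture groups, marked by the parentheses ( and ), should wrap the patterns whose matched text you want to reuse.

This should work for you: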

clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)

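The regex (\d+)([^\d,\s]) reads: match one or more digits (\d+) as group 1 (the first set of parentheses), then match one character that is not a digit, comma, or whitespace as group 2.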
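The group numbering can be inspected directly with `re.search`, as in the sketch below. It also shows why the replacement should be a raw string, and what happens when the pattern has no groups at all. The input "a 2corpus" is just an illustrative fragment of the question's sample.

```python
import re

pattern = re.compile(r'(\d+)([^\d,\s])')

m = pattern.search("a 2corpus")
print(m.group(0))  # 2c  (the whole match)
print(m.group(1))  # 2   (first set of parentheses)
print(m.group(2))  # c   (second set of parentheses)

# In a non-raw string, Python reads \1 and \2 as octal escapes,
# so re.sub would receive control characters, not backreferences.
print('\1 \2' == '\x01 \x02')  # True
print(len(r'\1 \2'))           # 5: backslash, 1, space, backslash, 2

# With a raw replacement but no groups in the pattern, the
# backreference has nothing to refer to and re.sub raises re.error.
try:
    re.sub(r'\d+[^\d,\s]', r'\1 \2', "a 2corpus")
except re.error as exc:
    print("re.error:", exc)
```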

The reason your attempt did not work is that you had no parentheses around the patterns you wanted to reuse.
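The capture-group fix handles a digit followed by a letter. To reach the exact expected output from the question, the letter-then-digit case ("corpus1") and the missing space after a period ("a.I") also need handling. The following is a minimal sketch under some assumptions: the function name `insert_spaces` is hypothetical, letters are taken to be ASCII only, and a period followed by a digit is left alone so that decimals such as 3.14 stay intact.

```python
import re

def insert_spaces(text):
    """Hypothetical helper: split digit/letter runs and space out periods."""
    # Digit followed by a letter: "1a" -> "1 a"
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)
    # Letter followed by a digit: "corpus1" -> "corpus 1"
    text = re.sub(r'([A-Za-z])(\d)', r'\1 \2', text)
    # Period followed directly by a non-digit, non-space character:
    # "a.I" -> "a. I"
    text = re.sub(r'\.(?=[^\d\s])', '. ', text)
    return text

sample = "This is my corpus1a.I am looking to convert it into a 2corpus 2b."
cleaned = insert_spaces(sample)
print(cleaned)
# This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
```

Running three simple substitutions in sequence keeps each rule easy to read and adjust. Whether these rules suit other inputs (abbreviations, identifiers such as "b2b", URLs) depends on the corpus.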

Insert space between regex matches
