Question

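To clean up my corpus of forum messages, I would like to remove the leading spaces before punctuation and add one space after it where needed, using two regular expressions. The latter was no problem ((?<=[.,!?()])(?! )), but I am having some trouble with the former.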

I used this expression: \s([?.!,;:"](?:\s|$))

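But it is not flexible enough so far:

It matches even when there is already a space (or several) before the punctuation character

It does not match if there is no space after the punctuation character

It does not match any punctuation character that is not listed (though I suppose I can use [:punct:] for that in the end)

Finally, both expressions match decimal points (which they should not)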
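The second problem can be reproduced with a minimal R sketch. The replacement \1 and the use of gsub with perl = TRUE are assumptions, since the post does not say how the expression was applied:

```r
# Assumption: each match is replaced by capture group 1,
# which drops the single leading whitespace character.
pattern <- "\\s([?.!,;:\"](?:\\s|$))"

x <- c("This is the end . Hello world!",
       "This is the end .Hello world!")

cleaned <- gsub(pattern, "\\1", x, perl = TRUE)
print(cleaned)
# The first string is fixed; the second is left unchanged because
# the "." is followed by "H" rather than by whitespace or end of string.
```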

How can I rewrite the expression so that it meets my needs?

Example strings and expected output

This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)

Answer 1

Use \p{P} to match any punctuation character. Use \h* instead of \s*, because \s would also match newline characters.

(?<!\d)\h*(\p{P}+)\h*(?!\d)

Replace the matched strings with \1<space>

DEMO

> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl=T)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
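To illustrate why \h is preferred over \s here, the following sketch runs both variants on a two-line message. The input string and the variable names are made up for illustration; only the \h pattern comes from the answer above.

```r
# Same pattern as above, once with \h (horizontal whitespace only)
# and once with \s (any whitespace, including newlines).
pattern_h <- "(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)"
pattern_s <- "(?<!\\d)\\s*(\\p{P}+)\\s*(?!\\d)"

msg <- "First line .\nSecond line"  # hypothetical two-line message

with_h <- gsub(pattern_h, "\\1 ", msg, perl = TRUE)
with_s <- gsub(pattern_s, "\\1 ", msg, perl = TRUE)

# with_h keeps the line break (a space is added before it);
# with_s swallows the newline and joins the two lines.
print(grepl("\n", with_h, fixed = TRUE))
print(grepl("\n", with_s, fixed = TRUE))
```

Note that the \h variant leaves a trailing space before the newline, so a separate pass to trim line ends may still be wanted.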

Answer 2

Here is an expression that detects the substrings that need to be replaced:

\s*\.\s*(?!\d)

You need to replace them with ". " (a dot followed by a space).

Here is a demo showing how it works: http://regex101.com/r/zB2bY3/1

Explanation of the regular expression:

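\s* - matches whitespace, any number of times (0 to unlimited)
\. - matches a literal dot
\s* - same as above
(?!\d) - negative lookahead. For the pattern to match, the matched text must not be followed by a digit (this takes care of your last test case).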
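As a rough sketch, the expression above could be applied in R as follows, assuming gsub with perl = TRUE (needed for the lookahead) and the example strings from the question. It only handles the dot, not other punctuation.

```r
# Collapse any whitespace around a dot into ". ",
# unless the dot is directly followed by a digit.
pattern <- "\\s*\\.\\s*(?!\\d)"

x <- c("This is the end .Hello world!",
       "This is the end, Hello world!",
       "This is the end . Hello world!",
       "This is a .15mm tube")

fixed <- gsub(pattern, ". ", x, perl = TRUE)
print(fixed)
# The first and third strings become "This is the end. Hello world!";
# the comma case and the decimal case are left untouched.
```

Because this pattern uses \s rather than \h, it would also consume a newline that follows a dot.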

Regex: remove leading space / add trailing space before punctuation
