Question

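I'm having trouble with regular expressions in R. The goal is to parse Markdown / reST / knitr report text files in R and remove my own custom comments. These comments take the following form: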

Some sentence is about something <find a citation to this>.

Markdown uses <> for HTML tags, so I need to remove these comments (using my custom function) to avoid confusion. After I do so, the sentence takes the following form:

Some sentence is about something .

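Note the space between the last word and the dot. Removing it would be easy, but the text may also contain reST comments that hold R code (knitr), and those start with ..: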

.. {r chunk-name}
.. some R code 
.. ..

So basically I need to replace " ." in the former case, but not in the latter. I thought I could achieve this with the repetition modifier for R regexp atoms:

gsub(pattern=" \\.{1}",replacement=".",x="Something ..")
[1] "Something.."

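I expected this expression to match a single space followed by one (but not more) dot. However, the string gets replaced regardless of whether there is one dot or two. I'm a real newbie with regexps, so I have probably missed something obvious. Even so, any help would be much appreciated.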

Regards, Maxim

Answer 1

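A match happens as soon as the pattern is satisfied. There is no lookahead to make sure the pattern does not occur again right after. I'm not sure whether it is general enough, but for the single test case provided, a character class with the negation operator works: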

> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something .")
[1] "Something."
> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something ..")
[1] "Something .."
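
To make the point above concrete, here is a minimal sketch (not part of the original answer) showing that `{1}` adds nothing to the pattern, and that the negated-class pattern swallows the character after the dot when the sentence continues. The capture-group variant and the names `x`, `res_consuming` and `res_keeping` are assumptions introduced only for illustration.

```r
x <- c("Something .", "Something ..", "Something . Next one", ".. ..")

# {1} means "exactly one repetition of the atom"; it does not inspect
# what follows, so " \\.{1}" behaves exactly like " \\."
same_as_plain <- identical(gsub(" \\.{1}", ".", x), gsub(" \\.", ".", x))

# The negated-class pattern also consumes the character after the dot,
# so mid-text the following space is lost
res_consuming <- gsub(" \\.[^.]| \\.$", ".", x)
# res_consuming[3] is "Something.Next one"

# Illustrative variant: capture the dot plus the following non-dot
# character (or the end of the string) and put it back with \\1
res_keeping <- gsub(" (\\.([^.]|$))", "\\1", x)
# "Something."  "Something .."  "Something. Next one"  ".. .."
```

The variant keeps the knitr-style `.. ..` lines untouched because a dot that is followed by another dot never satisfies the group.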

Answer 2

You could remove everything from the last space up to the . and paste a . onto the end of the string, couldn't you?

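# the sentence from the question
tt <- "Some sentence is about something <find a citation to this>."
# (.*) captures everything up to the last space before the "<";
# the rest of the string, through to its end, is dropped
paste0(gsub("(.*)[ ].*<.*$", "\\1", tt), ".")
# [1] "Some sentence is about something."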

That said, you should really read this.

Alternatively, if the tags appear in the middle of a sentence and you just want to remove them along with the spaces in front of them, then:

# remove everything within <...>, including < and >,
# and any spaces in front of the <
# (also stripping the spaces after > would glue "are" and "right" together)
gsub("[ ]*<.*?>", "", tt)
# [1] "Some sentence is about something."

# example:
tt <- ".. some sentences are wrong <bla bla>. But some are <bla bla> right."
gsub("[ ]*<.*?>", "", tt)
# [1] ".. some sentences are wrong. But some are right."

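Note the difference between .*> and .*?>. The first is "greedy": it matches all characters up to the last >. The second stops at the first > it reaches, which is what is wanted here, since you want to remove each tag separately.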
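
As a small sketch of that difference (not from the original answer; the variable names are chosen here for illustration), the two quantifiers can be compared on a string with two tags. A negated character class, `<[^>]*>`, is shown as an equivalent alternative to the lazy form for this input.

```r
tt <- "wrong <bla bla>. But some are <bla bla> right."

# Greedy: .* runs to the LAST ">", so everything between the first "<"
# and the last ">" disappears, including the real text in between
greedy <- gsub("<.*>", "", tt)
# "wrong  right."

# Lazy: .*? stops at the FIRST ">" after each "<", so each tag is
# removed on its own
lazy <- gsub("<.*?>", "", tt)
# "wrong . But some are  right."

# A negated class gives the same result without relying on lazy matching
negated <- gsub("<[^>]*>", "", tt)
```

The leftover spaces are intentional here: the patterns only target the tags, so the spacing around them is untouched.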

Answer 3

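You can use a negative lookahead from Perl regular expressions to do what you want. It essentially says: match this pattern, but only where it is not followed by that other pattern. A simple example: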

> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something .", perl=TRUE)
[1] "Something."
> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something ..", perl=TRUE)
[1] "Something .."
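
Putting the pieces together, the following is a rough sketch of how the lookahead might be applied to a whole report read in as a character vector of lines. The helper `strip_notes` and the sample `report` are hypothetical, not from the thread. It also assumes that no R code line contains both a `<` and a later `>` (for example `x <- y > 1`), since such a line would be mistaken for a note.

```r
# Hypothetical helper: drop <...> notes, then fix the stray space
# that is left in front of a single dot
strip_notes <- function(lines) {
  # remove each <...> note; [^>]* keeps every match inside one tag
  no_notes <- gsub("<[^>]*>", "", lines)
  # " ." becomes "." only when no second dot follows (negative lookahead)
  gsub(" \\.(?!\\.)", ".", no_notes, perl = TRUE)
}

report <- c(
  "Some sentence is about something <find a citation to this>.",
  ".. {r chunk-name}",
  ".. some R code ",
  ".. .."
)

cleaned <- strip_notes(report)
# cleaned[1] is "Some sentence is about something."
# the three knitr comment lines come back unchanged
```

Because `gsub` is vectorised over `x`, the same two calls handle every line, and the `.. ..` marker survives since its space-dot pair is followed by another dot.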

Regular expressions in R: pattern repetition with {}
