Question

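I'm trying to build a regex for R's gsub that matches a string up to and through a newline character, so that the match can be removed.

Example string: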

text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

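Ideally, gsub would replace the first two blocks of text so that only the text after them remains:

At the end of the day, the criminal Valjean escaped once more.

That is, get rid of the categories and tags.

This is the pattern I'm using: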

^categor*.\n{1}

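It should match the start of the line, the word fragment, and everything after it until it reaches the first newline, but it only matches the fragment. What am I doing wrong?

Also, is there a better way to do this than two gsubs?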

Answer 1

1) Here is what the question asks for: this first option removes the first two lines:

sub("^categor([^\n]*\n){2}", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

If the categor part does not matter, then

tail(strsplit(text, "\n")[[1]], -2)
## [1] "At the end of the day, the criminal Valjean escaped once more."
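If the body itself may span several lines, the same positional approach should still work as long as the remaining lines are joined back together. A minimal sketch, assuming a made-up input `text2` (not from the question) whose first two lines are always the header lines:

```r
# Hypothetical input: two header lines followed by a two-line body.
text2 <- "categories: a\nTags: b\nFirst body line.\nSecond body line."

# Split on literal newlines, drop the first two lines, and rejoin the rest.
lines <- strsplit(text2, "\n", fixed = TRUE)[[1]]
body <- paste(tail(lines, -2), collapse = "\n")
body
## [1] "First body line.\nSecond body line."
```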

2) If you want to remove any line of the form ...:....\n, where the characters before the colon on each line must be word characters:

gsub("\\w+:[^\n]+\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

gsub("\\w+:.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

grep("^\\w+:", unlist(strsplit(text, "\n")), invert = TRUE, value = TRUE)
## [1] "At the end of the day, the criminal Valjean escaped once more."
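One difference between these variants is worth keeping in mind: the two gsub patterns are not anchored, so they can also remove a word: fragment from the middle of a body line, while the grep variant only drops lines that start with word:. A small sketch, using a made-up input `text3` that is not part of the question:

```r
# Hypothetical input: the second line contains a colon but is not a "tag:" line.
text3 <- "Tags: x\nHe said this: run\nThe end."

# Unanchored pattern: also removes "this: run\n" from the middle of line 2.
unanchored <- gsub("\\w+:[^\n]+\n", "", text3)
unanchored
## [1] "He said The end."

# Line-wise filter: "^" ties the word and colon to the start of each line.
kept <- grep("^\\w+:", unlist(strsplit(text3, "\n")), invert = TRUE, value = TRUE)
kept
## [1] "He said this: run" "The end."
```

If the body can contain colons, the line-wise filter is probably the safer choice.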

3) Or, if we want to remove only lines with certain tags:

gsub("(categories|Tags):.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
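When the list of tags is longer or held in a variable, the alternation can be assembled from a character vector. A sketch, where `tag_names` is a name introduced here for illustration and the tag names are assumed to contain no regex metacharacters (otherwise they would need escaping first):

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Assumed list of tag names to strip; plain words only.
tag_names <- c("categories", "Tags")

# Build "(categories|Tags):.+?\n" from the vector.
pattern <- paste0("(", paste(tag_names, collapse = "|"), "):.+?\n")
result <- gsub(pattern, "", text)
result
## [1] "At the end of the day, the criminal Valjean escaped once more."
```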

4) If you also want to capture the tags, you might want to use read.dcf instead.

s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")
out <- read.dcf(textConnection(s))

giving a 3-column matrix:

> out
     categories                  Tags                     
[1,] "crime, punishment, france" "valjean, javert,les mis"
     Content                                                         
[1,] "At the end of the day, the criminal Valjean escaped once more."
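Individual fields can then be pulled out of the matrix by column name. A sketch that repeats the steps above and additionally splits the tags; the names `con`, `content` and `tags` are introduced here for illustration, and splitting on a comma followed by optional whitespace is an assumption about how the tags are delimited:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Same preparation as above: label the untagged line as "Content".
s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")

con <- textConnection(s)
out <- read.dcf(con)
close(con)  # close the connection explicitly

# Extract single fields from the 1-row matrix by column name.
content <- unname(out[1, "Content"])
tags <- strsplit(unname(out[1, "Tags"]), ",\\s*")[[1]]
tags
## [1] "valjean" "javert"  "les mis"
```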

Answer 2

Try this (a newline is matched with \\n):

gsub("^categor.*\\n",  "", text)
# [1] "At the end of the day, the criminal Valjean escaped once more."
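As for why the pattern in the question does not match: in ^categor*.\n{1} the * applies only to the preceding r, and the single . then matches exactly one character, so a newline would have to follow almost immediately after categor. Putting the quantifier on the wildcard instead lets the match run to the end of the line. A short sketch of this; it uses [^\n]* rather than .* so that only the first line is removed, and `first_removed` is a name introduced here:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Original pattern: "r*", then one arbitrary character, then a newline.
grepl("^categor*.\n{1}", text)
## [1] FALSE

# Quantifier applied to "anything but a newline": matches through the first \n.
first_removed <- sub("^categor[^\n]*\n", "", text)
first_removed
## [1] "Tags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."
```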

Answer 3

Perhaps the following regex:

sub("^categor.*\\n([^\n]*$)", "\\1", text)
#[1] "At the end of the day, the criminal Valjean escaped once more."

Answer 4

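There is no need to use [^\n], since you can just use . to match any character other than a newline. Note that with TRE (the default regex engine for (g)sub and (g)regexpr) you need the (?n) modifier to get this behavior, whereas with perl=TRUE it is the default behavior of .: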

text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."
sub("(?n)^categor(?:.*\n){2}", "", text)
sub("^categor(?:.*\n){2}", "", text, perl=TRUE)

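Here, the first two lines are removed if the string starts with categor.

See the R demo online.

Pattern details

^ - start-of-string anchor
categor - a literal substring
(?:.*\n){2} - exactly two consecutive occurrences ({2}) of: any character other than a newline (.), zero or more times (*), followed by an LF character.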
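The difference between the two engines can be seen by running the same pattern without the (?n) modifier. A small sketch (the variable names `tre_default` and `pcre_default` are introduced here for illustration):

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# TRE without (?n): "." also matches "\n", so the greedy ".*" runs to the last newline.
tre_default <- sub("^categor.*\n", "", text)
tre_default
## [1] "At the end of the day, the criminal Valjean escaped once more."

# PCRE: "." stops at "\n", so only the first line is removed.
pcre_default <- sub("^categor.*\n", "", text, perl = TRUE)
pcre_default
## [1] "Tags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."
```

This is why the explicit {2} repetition is needed once . no longer crosses line boundaries.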

Regex for gsub to match a line up to and through the newline \n character
