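How to remove subclauses containing a specific word

Asked: 2017-10-29 09:10:25

Tags: r regex

Goal: In R, I am trying to remove the subclauses that contain the word "normal" from various sentences. A subclause is defined as text that starts at a comma and ends at a period or a comma. I want to remove every such subclause.

Input sentences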

I walked down the hill, which was normal, but I also walked up another hill which was dull.

I looked at him and although he looked normal, he was not normal.

I am fine, but he is not normal, and she is fine and she is normal, but I think her brother is not normal.

Desired output

I walked down the hill but I also walked up another hill which was dull

I looked at him and although he looked normal.

I am fine, and she is fine and she is normal.

Attempt

gsub(", .*normal.*?(\\.|,|$)\\R*", "", input_string, perl = T, ignore.case = T)

Current output:

I walked down the hill.
I looked at him and although he looked normal.
I am fine.

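However, when a sentence has several subclauses this does not give the expected output, mainly because it removes everything from the first comma onward. How can I make it match from the comma closest to "normal"?

1 answer:

Answer 0 (score: 0)

Your examples and rules are inconsistent (see @janos's comment). For example, you removed the last clause of the last example sentence, "but I think her brother is not normal", even though it does not end with a period.

That aside, the following should get you started: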

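ss <- c(
    "I walked down the hill, which was normal, but I also walked up another hill which was dull",
    "I looked at him and although he looked normal, he was not normal.",
    "I am fine, but he is not normal, and she is fine and she is normal, but I think her brother is not normal")

# Remove any span that starts at a comma, contains only letters, digits,
# underscores and spaces, and ends at the next comma or period.
# Note: this pattern does not check for the word "normal".
lapply(ss, function(x) gsub(",[a-zA-Z0-9_ ]+[,.]", "", x))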
#[[1]]
#[1] "I walked down the hill but I also walked up another hill which was dull"

#[[2]]
#[1] "I looked at him and although he looked normal"

#[[3]]
#[1] "I am fine and she is fine and she is normal, but I think her brother is not normal"
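
If the goal is to drop only the clauses that actually mention "normal", one possible refinement is to require the word inside the comma-delimited span. The sketch below is not part of the original answer: `remove_clauses_with` is a hypothetical helper name, and it assumes a clause runs from a comma up to and including the next comma, period, or end of string. Using `[^,.]*` instead of `.*` keeps the match from running past the nearest comma.

```r
# Hypothetical helper (an assumption, not from the original answer).
# Removes every span that starts at a comma, contains `word` as a whole
# word, and ends at the next comma, period, or end of string.
# `word` is pasted into the pattern as-is, so it should not contain
# regex metacharacters.
remove_clauses_with <- function(x, word = "normal") {
  pattern <- paste0(",[^,.]*\\b", word, "\\b[^,.]*(?:[,.]|$)")
  gsub(pattern, "", x, perl = TRUE, ignore.case = TRUE)
}

ss <- c(
  "I walked down the hill, which was normal, but I also walked up another hill which was dull",
  "I looked at him and although he looked normal, he was not normal.",
  "I am fine, but he is not normal, and she is fine and she is normal, but I think her brother is not normal",
  "I ate, which was nice, and left.")

res <- remove_clauses_with(ss)
print(res)
#[1] "I walked down the hill but I also walked up another hill which was dull"
#[2] "I looked at him and although he looked normal"
#[3] "I am fine and she is fine and she is normal"
#[4] "I ate, which was nice, and left."
```

Because the closing comma or period is consumed along with the clause, a clause that directly follows a removed one no longer starts at a comma and is kept, as in the third sentence. That is close to the desired output in the question, but not identical to it, since the examples and the stated rule do not fully agree.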