Removing text outside brackets in R

Date: 2014-08-21 04:38:25

Tags: regex r text

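I have a set of texts like the one below, in which every named entity is wrapped in square brackets. I want to delete the text outside the brackets while keeping the punctuation. Any ideas?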

Sample_Text <- "[PERSON Meredith Vieira] will become the first woman to host [MISC Olympics] primetime coverage on her own when she fills on Friday night for the ailing [PERSON Bob Costas] , who is battling a continuing eye infection. \" It 's an honor to fill in for him , \" [PERSON Vieira] said on TODAY Friday ."

Ideally, I would end up with this vector:

Entities <- "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas]. [PERSON Vieira]."

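My first attempt was to extract the text between each pair of brackets, but I got no text back and my regex did not work. Then I realized I also need the punctuation. My regex attempt is below. Thoughts?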

grep("\\[.*?\\]", "", d, perl=TRUE)

1 answer:

Answer 0 (score: 0)

You could try the following regex, which uses the PCRE verbs (*SKIP)(*F):
> gsub("(?:\\[.*?\\]|\\.)(*SKIP)(*F)|[\\w' ,\\\"]+", " ", Sample_Text, perl=TRUE);
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas] . [PERSON Vieira] ."
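
To see why the pattern above keeps the bracketed spans, here is a minimal sketch on a toy string. The input x and the shortened character class are made up for illustration and are not from the question. The left alternative matches what should survive and then (*SKIP)(*F) rejects that match and resumes scanning after it, so only text matched by the right alternative is replaced.

```r
# Toy input (hypothetical, for illustration only).
x <- "keep [A b] drop this . and [C d] too"

# Left alternative: a bracketed span or a literal period. (*SKIP)(*F)
# forces the match to fail and moves the scan past the matched text,
# so those spans are never replaced.
# Right alternative: runs of word characters and spaces, which are replaced.
pattern <- "(?:\\[.*?\\]|\\.)(*SKIP)(*F)|[\\w ]+"

out <- gsub(pattern, " ", x, perl = TRUE)
print(out)
```

Note that only the characters listed in the right-hand character class are removed; any other character outside the brackets (a question mark, say) would be left in place.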

To remove the space before each period:

> result <- gsub("(?:\\[.*?\\]|\\.)(*SKIP)(*F)|[\\w' ,\\\"]+", " ", Sample_Text, perl=TRUE);
> result
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas] . [PERSON Vieira] ."
> gsub(" +(?=\\.)", "", result, perl=TRUE);
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas]. [PERSON Vieira]."
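
As an alternative sketch (not part of the original answer), the entities and periods can also be extracted directly instead of deleting everything else. This assumes that the only punctuation to keep is the period, as in the desired output shown in the question. It uses base R's gregexpr and regmatches.

```r
# Sample text from the question, with the inner double quotes escaped.
Sample_Text <- "[PERSON Meredith Vieira] will become the first woman to host [MISC Olympics] primetime coverage on her own when she fills on Friday night for the ailing [PERSON Bob Costas] , who is battling a continuing eye infection. \" It 's an honor to fill in for him , \" [PERSON Vieira] said on TODAY Friday ."

# Match either a bracketed entity or a literal period.
hits <- gregexpr("\\[[^]]*\\]|\\.", Sample_Text, perl = TRUE)
tokens <- regmatches(Sample_Text, hits)[[1]]

# Join the pieces with single spaces, then drop the space before each period.
Entities <- gsub(" .", ".", paste(tokens, collapse = " "), fixed = TRUE)
print(Entities)
```

Extracting rather than deleting avoids having to list every character that may appear outside the brackets.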