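I have a body of text like the one below, in which every named entity is wrapped in brackets. I want to delete the text outside the brackets while keeping all the punctuation. Any ideas?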
Sample_Text <- "[PERSON Meredith Vieira] will become the first woman to host [MISC Olympics] primetime coverage on her own when she fills on Friday night for the ailing [PERSON Bob Costas] , who is battling a continuing eye infection. \" It 's an honor to fill in for him , \" [PERSON Vieira] said on TODAY Friday ."
Ideally, I would end up with this vector.
Entities <- "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas]. [PERSON Vieira]."
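My first attempt was to match the text between the brackets, but I got no text back and my regex did not work properly. I then realized that I also need the punctuation. My regex attempt is below. Thoughts?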
grep("\\[.*?\\]", "", d, perl=TRUE)
Answer 0 (score: 0)
You could try the PCRE verbs (*SKIP)(*F):
> gsub("(?:\\[.*?\\]|\\.)(*SKIP)(*F)|[\\w' ,\\\"]+", " ", Sample_Text, perl=TRUE);
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas] . [PERSON Vieira] ."
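How the pattern works can be seen on a smaller case. The sketch below uses a made-up input string (not from the question) and a simplified character class. The left-hand alternative matches what should be kept, and (*SKIP)(*F) then forces that match to fail without letting the engine retry inside it, so only text matched by the right-hand alternative is replaced.

```r
# Hypothetical toy input, chosen only to illustrate the technique.
x <- "keep [A b] drop this [C d] and this"

# Left alternative: a bracketed chunk, matched and then discarded via (*SKIP)(*F).
# Right alternative: runs of word characters and spaces, each replaced by one space.
toy <- gsub("\\[.*?\\](*SKIP)(*F)|[\\w ]+", " ", x, perl = TRUE)
toy
# [1] " [A b] [C d] "
```

Every run of text outside the brackets collapses to a single space, which is why the full pattern above adds `\\.` to the skipped alternative to protect the periods as well.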
To remove the space before each period:
> result <- gsub("(?:\\[.*?\\]|\\.)(*SKIP)(*F)|[\\w' ,\\\"]+", " ", Sample_Text, perl=TRUE);
> result
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas] . [PERSON Vieira] ."
> gsub(" +(?=\\.)", "", result, perl=TRUE);
[1] "[PERSON Meredith Vieira] [MISC Olympics] [PERSON Bob Costas]. [PERSON Vieira]."
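If what is wanted is a character vector with one entity per element, a different base R approach is to extract the bracketed chunks instead of deleting everything else. This is only a sketch of that alternative, not part of the answer above. It uses a shortened stand-in string (`txt` is a hypothetical name), and it drops the periods, so it fits only when the punctuation is not needed in the result.

```r
# Shortened stand-in for Sample_Text; the name `txt` is made up for this sketch.
txt <- "[PERSON Bob Costas] , who is ill . \" An honor , \" [PERSON Vieira] said ."

# gregexpr() locates every lazy [...] match; regmatches() pulls the matches out.
# [[1]] selects the matches for the first (and only) input string.
entities <- regmatches(txt, gregexpr("\\[.*?\\]", txt, perl = TRUE))[[1]]
entities
# [1] "[PERSON Bob Costas]" "[PERSON Vieira]"
```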