How can I extract a string from a sentence using regular expressions in R?

Asked: 2019-01-21 16:29:43

Tags: r regex string web-scraping regex-group

I want to use a regex in R to extract a string from a sentence. I am new to R, so I don't know where to start or how to do this.

string<-c(".\n                Written by\nJ-S-Golden            \n        
\n        \n         \n                Plot Summary\n    |\n        Plot 
Synopsis\n    \n        \n            Plot Keywords:\n wrongful 
imprisonment\n                        |\n escape from prison\n                        
|\n based on the works of stephen king\n                        |\n 
prison\n                        |\n voice over narration\n            | See 
All (296) »      \n        \n            Taglines:\nFear can hold you 
prisoner. Hope can set you free.        \n        \n")

Given the string above, the output I want is:

Plot Keywords:
\n wrongful imprisonment\n
|\n escape from prison\n
|\n based on the works of stephen king\n                        
|\n prison\n                        
|\n voice over narration\n            
| See All (296) »      \n        \n

I don't know how to extract clean data from the string. Can someone help me?

1 answer:

Answer 0: (score: 1)

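Here is a solution using base R's sub function. The pattern matches, and captures, the leading text Plot Keywords:. It then uses a tempered dot, (?:(?![^: ]+:).)*, to consume any character (including newlines, thanks to the (?s) flag) up to, but not including, the next label followed by a colon, which in this input is Taglines:. The surrounding .* on both sides match the rest of the string, so replacing the whole match with \\1 leaves only the captured block.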

sub("(?s).*(Plot Keywords:(?:(?![^: ]+:).)*).*", "\\1", string, perl=TRUE)

[1] "Plot Keywords:\n wrongful \nimprisonment\n
                    |\n escape from prison\n
                    \n|\n based on the works of
     stephen king\n
                    |\n \nprison\n                        |\n voice over narration\n
        | See \nAll (296) »      \n        \n            "
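
If the goal is a clean vector of keywords rather than the raw block, the captured text can be post-processed with a few more base R calls. The following is only a minimal sketch and not part of the original answer: `example` is a shortened, hypothetical stand-in for the question's `string`, and the choice to drop the trailing "See All" link text is an assumption about what counts as clean data.

```r
# Shortened, hypothetical stand-in for the question's input string.
example <- "Written by\nJ-S-Golden\n  Plot Summary\n |\n Plot Synopsis\n\n   Plot Keywords:\n wrongful imprisonment\n   |\n escape from prison\n   |\n prison\n   | See All (296)   \n\n   Taglines:\nFear can hold you prisoner.\n"

# Same pattern as in the answer: capture from "Plot Keywords:" up to,
# but not including, the next "label:".
block <- sub("(?s).*(Plot Keywords:(?:(?![^: ]+:).)*).*", "\\1",
             example, perl = TRUE)

# Drop the label, split on the literal "|" separators,
# then collapse runs of whitespace and trim each piece.
items <- strsplit(sub("^Plot Keywords:", "", block), "|", fixed = TRUE)[[1]]
items <- trimws(gsub("\\s+", " ", items))

# Discard empty pieces and the trailing "See All (...)" link text.
keywords <- items[nzchar(items) & !grepl("^See All", items)]
print(keywords)
# [1] "wrongful imprisonment" "escape from prison"    "prison"
```

Splitting with `fixed = TRUE` avoids having to escape `|`, which would otherwise be read as regex alternation.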

In this particular case, a pure regex demo is probably more useful than an R demo, so here is a link:

Demo