String matching in R, like PRXMATCH() in SAS

Date: 2016-05-30 13:13:29

Tags: r sas text-analysis

I am looking for help with matching strings in R, similar to what PRXMATCH() does in SAS.

List1 <-c("lead","good")
List2 <-c("Quality","understand")
Name  <-c("grp1","grp2")

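I have a data frame with a column named sentence. For each sentence I need to:

  • Look for a word from List1.
  • If a word is found, look for the corresponding word from List2.
  • If that word is found within ±5 words of the word from List1, add the corresponding name from Name to the result column.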

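For example, search all sentences for "lead". Once "lead" is found, if "Quality" is also found in that sentence within ±5 words of it, then "grp1" should be added to the result column of test.cs; otherwise the match is discarded.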

1 answer:

Answer 0: (score: 0)

Something like this, maybe?

myData <- data.frame(sentence = c("The quality bla bla bla lead bla", 
                                  "The quality bla bla bla bla bla lead bla",
                                  "The lead quality bla bla",
                                  "The lead bla bla quality",
                                  "The lead bla bla bla bla bla quality of",
                                  "It allows us to understand how good bla",
                                  "It is good to understand that bla",
                                  "It is also good bla bla bla if we understand",
                                  "lead quality is good to understand"),
                     Result = "",
                     stringsAsFactors = FALSE)

List1 <-c("lead","good")
List2 <-c("quality","understand")
Name  <-c("grp1","grp2")

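# Build one pattern per (List1[i], List2[i]) pair. {0,4} permits up to four
# words in between, so the pair is at most five words apart, in either order.
# The vector is called patterns to avoid masking base R's regexpr() function.
patterns <- paste0("(\\b",List1,"\\s+(\\w+\\s+){0,4}",List2,"\\b)|(\\b",List2,"\\s+(\\w+\\s+){0,4}",List1,"\\b)")

# Append the group name to Result for every sentence that matches pattern i.
for(i in seq_along(patterns)) {
  myData$Result <- ifelse(grepl(pattern = patterns[i], x = myData$sentence), 
                          yes = paste(myData$Result, Name[i]), 
                          no = myData$Result)
}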
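
To see what the vectorised paste0() call produces, it can help to print the pattern for a single pair and try it on a few sentences. The following is only a minimal sketch: the name pat is illustrative, and the use of ignore.case = TRUE is an assumption about the desired behaviour (the question writes "Quality" with a capital Q, while grepl() is case-sensitive by default).

```r
List1 <- c("lead", "good")
List2 <- c("quality", "understand")

# paste0() is vectorised, so element i of the result pairs List1[i] with List2[i].
pat <- paste0("(\\b", List1, "\\s+(\\w+\\s+){0,4}", List2, "\\b)|",
              "(\\b", List2, "\\s+(\\w+\\s+){0,4}", List1, "\\b)")

# Show the first pattern as the regex engine sees it.
cat(pat[1], "\n")

# {0,4} allows at most four words in between, i.e. a distance of at most five words.
grepl(pat[1], "The lead bla bla quality")                 # TRUE
grepl(pat[1], "The lead bla bla bla bla bla quality of")  # FALSE

# Assumption: case should be ignored so that "Quality" also matches.
grepl(pat[1], "Quality bla lead")                         # FALSE
grepl(pat[1], "Quality bla lead", ignore.case = TRUE)     # TRUE
```

If case-insensitive matching is wanted, the same ignore.case = TRUE argument can be passed to the grepl() call inside the loop.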

Result:

> myData
                                      sentence     Result
1             The quality bla bla bla lead bla       grp1
2     The quality bla bla bla bla bla lead bla           
3                     The lead quality bla bla       grp1
4                     The lead bla bla quality       grp1
5      The lead bla bla bla bla bla quality of           
6      It allows us to understand how good bla       grp2
7            It is good to understand that bla       grp2
8 It is also good bla bla bla if we understand           
9           lead quality is good to understand  grp1 grp2
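
As a cross-check on the regex, the same "within five words" rule can be expressed by splitting each sentence into words and comparing positions. This is only a sketch under stated assumptions: within_distance is a hypothetical helper (not part of base R or any package), it compares words case-insensitively, and it treats any run of non-word characters as a separator, so its results can differ from the regex on sentences with punctuation or capitalised keywords.

```r
# Hypothetical helper: TRUE if words a and b occur within max_dist words of each other.
within_distance <- function(sentence, a, b, max_dist = 5) {
  # Split on runs of non-word characters and lower-case everything.
  words <- tolower(strsplit(sentence, "\\W+")[[1]])
  pos_a <- which(words == tolower(a))
  pos_b <- which(words == tolower(b))
  if (length(pos_a) == 0 || length(pos_b) == 0) return(FALSE)
  # Smallest gap between any occurrence of a and any occurrence of b.
  min(abs(outer(pos_a, pos_b, "-"))) <= max_dist
}

within_distance("The quality bla bla bla lead bla", "lead", "quality")          # TRUE
within_distance("The quality bla bla bla bla bla lead bla", "lead", "quality")  # FALSE
```

The helper handles one sentence at a time; to apply it to a whole column, something like vapply(myData$sentence, within_distance, logical(1), a = "lead", b = "quality") should work.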