Character matching and ranking in R

Time: 2015-05-24 16:48:51

Tags: r grep pattern-matching string-matching

I have a character vector

var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
                 "clean water", "pine")

and a list

var2 <- list(c("tall tree", "fruits", "star"),  c("tree tall", "pine tree",
  "tree pine", "black forest", "water"), c("apple", "orange", "grapes"))

I want to match the words in var1 against the elements of var2 and get the elements of var2 in ranked order. For example, the desired output here is:

"tree tall"    "pine tree"    "tree pine"    "black forest" "water"

var2[2] is rank 1 (4 phrases in var1 match var2[2]: "pine tree", "dense forest", "pine", and "clean water")

"tall tree" "fruits"    "star" 

var2[1] is rank 2 (3 phrases in var1 match var2[1]: "pine tree", "red fruits", and "green fruits")

 "apple"  "orange" "grapes"

var2[3] is rank 3, with no matches in var1.

I tried

indx1 <- sapply(var2, function(x) sum(grepl(var1, x)))

but it did not give the desired output.

How can I solve this? A code snippet would be much appreciated. Thanks.

EDIT:

The new data is as follows:

var11 <- c("nature", "environmental", "ringing", "valley", "status", "climate",
           "forge", "environmental", "common",
           "birdwatch", "big", "link",
           "day", "pintail", "morning",
           "big garden", "birdwatch deadline", "deadline february",
           "mu condition", "garden birdwatch", "status",
           "chorus walk", "dawn choru", "walk sunday",
           "climate lobby", "lobby parliament", "u status",
           "sandwell valley", "my status of", "environmental lake")


var22 <- list(c("environmental condition"),  c("condition", "status"), c("water", "ocean water"))

2 answers:

Answer 0 (score: 1)

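We can loop over 'var2' (sapply(var2, ...)), split each string on spaces (strsplit(x, ' ')), and use the resulting words as a grep pattern against 'var1'. We then check whether there is any match, sum the logical vector, and rank the sums. The ranks can be used to reorder the elements of 'var2'.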

 indx <- rank(-sapply(var2, function(x) sum(sapply(strsplit(x, ' '),
              function(y) any(grepl(paste(y,collapse='|'), var1))))),
                 ties.method='first')
 indx
 #[1] 2 1 3


# rank() gives each element's rank; order() turns the ranks into positions for indexing
var2[order(indx)]
#[[1]]
#[1] "tree tall"    "pine tree"    "tree pine"    "black forest" "water"       

#[[2]]
#[1] "tall tree" "fruits"    "star"     

#[[3]]
#[1] "apple"  "orange" "grapes"
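
The attempt in the question fails because grepl() takes a single pattern; when given all of var1 it uses only the first element (and warns). As a rough sketch of the steps described above, the one-liner can be unpacked so that the intermediate counts are visible. The helper name count_matches is hypothetical and not part of the answer, and the sketch assumes the words contain no regex metacharacters.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper (not part of the answer): number of strings in `x`
# that share at least one word with some entry of `targets`.
count_matches <- function(x, targets) {
  word_sets <- strsplit(x, " ", fixed = TRUE)
  hits <- vapply(word_sets, function(w) {
    # Build one alternation pattern per string, e.g. "tall|tree"
    any(grepl(paste(w, collapse = "|"), targets))
  }, logical(1))
  sum(hits)
}

counts <- vapply(var2, count_matches, integer(1), targets = var1)
counts                  # 2 5 0

ord <- order(-counts)   # positions of var2, most matches first
ranked <- var2[ord]

# rank() and order() happen to coincide for this data, but not in general:
demo <- c(0, 5, 2)
rank(-demo)             # 3 1 2
order(-demo)            # 2 3 1
```

Indexing with order() rather than rank() is probably the safer habit, since a rank vector reorders a list correctly only when the permutation is its own inverse, as c(2, 1, 3) is.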

Update

If we also need to count duplicates, try

indx <- rank(-sapply(var22, function(x) sum(sapply(strsplit(x, ' '), 
        function(y) sum(sapply(strsplit(var11, ' '), 
          function(z) any(grepl(paste(y, collapse="|"), z))))))),
             ties.method='random')
indx
#[1] 2 1 3
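
To see where the ranks for the new data come from, here is a minimal sketch that computes the same per-element totals in a flatter form. The helper count_entries is hypothetical. It assumes the terms are plain words with no regex metacharacters, in which case matching against each whole entry of var11 is equivalent to matching against its individual words.

```r
var11 <- c("nature", "environmental", "ringing", "valley", "status", "climate",
           "forge", "environmental", "common", "birdwatch", "big", "link",
           "day", "pintail", "morning", "big garden", "birdwatch deadline",
           "deadline february", "mu condition", "garden birdwatch", "status",
           "chorus walk", "dawn choru", "walk sunday", "climate lobby",
           "lobby parliament", "u status", "sandwell valley", "my status of",
           "environmental lake")
var22 <- list(c("environmental condition"), c("condition", "status"),
              c("water", "ocean water"))

# Hypothetical helper: for one term, count the entries that contain any of
# the term's words. Duplicated entries are counted each time they occur.
count_entries <- function(term, entries) {
  pattern <- paste(strsplit(term, " ", fixed = TRUE)[[1]], collapse = "|")
  sum(grepl(pattern, entries))
}

# Sum the per-term counts within each element of var22
totals <- vapply(var22, function(x) {
  sum(vapply(x, count_entries, integer(1), entries = var11))
}, integer(1))
totals   # 4 5 0
```

The second element scores highest because "status" alone appears in four entries of var11, two of which are exact duplicates.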

Update 2

If we need to filter out the elements of 'var2' that have no match in 'var1':

pat <- paste(unique(unlist(strsplit(var1, ' '))), collapse="|")
Filter(function(x) any(grepl(pat, x)), var2[indx])
#[[1]]
#[1] "tree tall"    "pine tree"    "tree pine"    "black forest" "water"       

#[[2]]
#[1] "tall tree" "fruits"    "star"     
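
One caveat with a plain alternation pattern is that it also matches inside longer words, for example "red" inside "bored". If whole-word matching is wanted, a possible variant (an assumption about the intent, not something the answer states) wraps the pattern in \\b word boundaries. The names pat_plain, pat_word, and kept are illustrative only.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

words <- unique(unlist(strsplit(var1, " ", fixed = TRUE)))
pat_plain <- paste(words, collapse = "|")
# Same alternation, but each alternative must match a whole word
pat_word <- paste0("\\b(", pat_plain, ")\\b")

grepl(pat_plain, "bored sparrow")   # TRUE: "red" is found inside "bored"
grepl(pat_word, "bored sparrow")    # FALSE: no whole-word match

# Keep only the elements of var2 with at least one whole-word match
kept <- Filter(function(x) any(grepl(pat_word, x)), var2)
length(kept)                        # 2
```

For the data in the question both patterns keep the same two elements, so the difference only matters for less tidy input.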

Answer 1 (score: 0)

The following code works:

idx <- rank(-sapply(var2, 
         function(x) sum(unlist(sapply(strsplit(var1,split=' '), 
           function(y) any(unlist(sapply(y,
             function(z) grepl(z,x))>0))>0)))),
  ties.method='random')
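
The nested call above is hard to read, so here is a sketch of what it appears to compute: for each element of var2, the number of var1 phrases that have at least one word occurring in that element. The helper count_phrases is hypothetical, and fixed = TRUE is swapped in for the regex matching of the original, which gives the same result for plain words.

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper: how many phrases have at least one word that occurs
# somewhere in the strings of `x`
count_phrases <- function(x, phrases) {
  word_sets <- strsplit(phrases, " ", fixed = TRUE)
  matched <- vapply(word_sets, function(words) {
    any(vapply(words, function(w) any(grepl(w, x, fixed = TRUE)), logical(1)))
  }, logical(1))
  sum(matched)
}

counts <- vapply(var2, count_phrases, integer(1), phrases = var1)
counts                                   # 4 3 0
idx <- rank(-counts, ties.method = "first")
idx                                      # 2 1 3
```

These counts (4, 3, and 0) agree with the numbers given in the question, because this version counts matching var1 phrases rather than matching var2 strings.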