I have a list of character vectors. I want an exact match for each element of the list using grep in R

Asked: 2015-02-23 07:06:00

Tags: r grep

I have a list of character vectors.

mylist <- list(c("apple", "banana", "cat", "dog", "elephant", "fish"), 
              c("apple", "banana", "camel", "doll", "egg"),
              c("apple", "bag", "cat", "donkey", "elephant", "frog", "gun"),
              c("apple", "ball", "cage", "dolphin", "doggy", "fishy"),
              c("apple", "baggy", "catty", "doggy", "eggie", "gun_powder"))

I want to use the grep function in R to match each element of the list exactly against the other elements, but I am getting partial matches.

This is the code I wrote:

matched <- vector("list", length(mylist))
  for(i in 1:length(mylist))
  {
    index <- NULL
    indexx <- vector("list", length(mylist[[i]]))
    for(j in 1:length(mylist[[i]]))
    {
      dummy <- NULL
      for(k in 1:length(mylist))
      {
        c <- grep(mylist[[i]][j], mylist[[k]], value = TRUE, fixed = TRUE)
        ind <- c(dummy, c)
        dummy <- ind
      }
      indexx[[j]] <- ind
    }
    matched[[i]] <- indexx
  }

Please help me.

3 Answers:

Answer 0: (score: 2)

Unlist the list

ulist = unlist(mylist)

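For each element of ulist, find all elements of ulist that match it exactly. Do this with the equality operator == rather than grep(), which searches for the pattern inside each string and so also returns partial matches.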

matches0 = lapply(ulist, function(elt) ulist[ulist == elt])
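To make the difference concrete, here is a minimal sketch contrasting the two comparisons. The vector `words` is a hypothetical example introduced only for illustration; it is not part of the original data.

```r
# Hypothetical sample vector (illustration only, not from the question)
words <- c("dog", "doggy", "dog", "hotdog")

# grep() with fixed = TRUE finds "dog" anywhere inside each element,
# so "doggy" and "hotdog" are returned as well
partial <- grep("dog", words, value = TRUE, fixed = TRUE)

# == compares whole strings, so only identical elements are kept
exact <- words[words == "dog"]

print(partial)
print(exact)
```

With `==` only the two identical "dog" entries should come back, which is the exact-match behaviour the question asks for.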

Finally, relist the matches into the shape of the original list

relist(matches0, mylist)

Summarizing the results this way seems odd; perhaps instead count the number of times each word occurs

tbl = table(ulist)

and use these counts as the entries

relist(tbl[ulist], mylist)

A little tidying is to remove the name of the dimnames returned by table()

names(dimnames(tbl)) = NULL
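As a rough sketch of the counting idea on a smaller input, the following assumes a hypothetical list `small` (not from the question) and converts the looked-up counts to a plain integer vector with `as.vector()` before relisting. That conversion is a simplification of this sketch, not part of the answer above.

```r
# Hypothetical small list (illustration only)
small <- list(c("apple", "cat"), c("apple", "dog", "cat"), c("apple"))

flat <- unlist(small)

# Count how often each distinct word occurs across the whole list
counts <- table(flat)

# Look up each word's count by name and drop the table attributes
per_word <- as.vector(counts[flat])

# Put the counts back into the shape of the original list
result <- relist(per_word, small)

print(result)
```

Each position in `result` should then hold the number of times the word at the same position in `small` occurs overall.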

Answer 1: (score: 0)

If I understand correctly, this is what you want to achieve:

    mylist <- list(c("apple", "banana", "cat", "dog", "elephant", "fish"), 
            c("apple", "banana", "camel", "doll", "egg"),
            c("apple", "bag", "cat", "donkey", "elephant", "frog", "gun"),
            c("apple", "ball", "cage", "dolphin", "doggy", "fishy"),
            c("apple", "baggy", "catty", "doggy", "eggie", "gun_powder"))

    ulist <- unique(unlist(mylist))
    matched <- vector("list", length(ulist))
    names(matched) <- ulist

    ### Counting every fruit
    countList = function(ls, container) { 
        sapply(ls, function(elem) {
                    isEmpty = is.null(container[[elem]])
                    container[[elem]] <<- ifelse(isEmpty, 1, container[[elem]] + 1)
                })
        container
    }
    counted = countList(unlist(mylist), matched)
    lapply(names(counted), function(lab) rep(lab, counted[[lab]]))

The output looks like this:

[[1]]
[1] "apple" "apple" "apple" "apple" "apple"

[[2]]
[1] "banana" "banana"

[[3]]
[1] "cat" "cat"

[[4]]
[1] "dog"

[[5]]
[1] "elephant" "elephant"

[[6]]
[1] "fish"

[[7]]
[1] "camel"

[[8]]
[1] "doll"

[[9]]
[1] "egg"

[[10]]
[1] "bag"

[[11]]
[1] "donkey"

[[12]]
[1] "frog"

[[13]]
[1] "gun"

[[14]]
[1] "ball"

[[15]]
[1] "cage"

[[16]]
[1] "dolphin"

[[17]]
[1] "doggy" "doggy"

[[18]]
[1] "fishy"

[[19]]
[1] "baggy"

[[20]]
[1] "catty"

[[21]]
[1] "eggie"

[[22]]
[1] "gun_powder"

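Answer 2: (score: 0)

You should read a tutorial on regular expressions. They are not easy, but they are very useful if you work with strings. Here is the code using a regexp: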

matched <- vector("list", length(mylist))
  for(i in 1:length(mylist))
  {
    index <- NULL
    indexx <- vector("list", length(mylist[[i]]))
    for(j in 1:length(mylist[[i]]))
    {
      dummy <- NULL
      for(k in 1:length(mylist))
      {
        c <- grep(paste("^",mylist[[i]][j],"$",sep=""),mylist[[k]],perl = TRUE, value = TRUE)
        ind <- c(dummy, c)
        dummy <- ind
      }
      indexx[[j]] <- ind
    }
    matched[[i]] <- indexx
  }

The ^ symbol marks the start of the string and $ marks the end, so the anchored pattern only matches the whole string exactly.
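The effect of the anchors can be checked in isolation. This is a minimal sketch on a hypothetical vector `words` (not from the question). It also illustrates a caveat that may matter for other data: the search term is still interpreted as a regular expression, so metacharacters such as "." are not matched literally.

```r
# Hypothetical sample vector (illustration only)
words <- c("dog", "doggy", "hotdog", "gun", "gun_powder")

# Unanchored: "dog" matches anywhere in the string
unanchored <- grep("dog", words, value = TRUE)

# Anchored: ^ and $ force the whole string to equal "dog"
anchored <- grep(paste0("^", "dog", "$"), words, value = TRUE)

# Caveat: "." is a regex metacharacter and matches any single character,
# so the anchored pattern "^a.c$" matches "abc" as well as "a.c"
dotted <- grep(paste0("^", "a.c", "$"), c("a.c", "abc"), value = TRUE)

print(unanchored)
print(anchored)
print(dotted)
```

For the words in the question this caveat does not apply, since none of them contains a regex metacharacter.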