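Return the original search term from grep in R

Time: 2013-05-09 19:00:43

Tags: regex r search grep

I have a list of items and a list of search terms, and I am trying to do two things:

  1. Search for items that match any of the search terms, and return TRUE if and only if a match is found.
  2. For every item that returns TRUE (i.e., a match), also return the original search term that matched in step 1.

So, given the following data frame: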

                 items
    1             alex
    2 alex is a person
    3   this is a test
    4            false
    5    this is cathy
    

and the following list of search terms:

    "alex"      "bob"       "cathy"     "derrick"   "erica"     "ferdinand"
    

I would like to create the following output:

                 items matches original
    1             alex    TRUE     alex
    2 alex is a person    TRUE     alex
    3   this is a test   FALSE     <NA>
    4            false   FALSE     <NA>
    5    this is cathy    TRUE     cathy
    

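Step 1 is straightforward, but I am running into problems with step 2. To create the "matches" column, I use grepl() to build a variable that is TRUE if the row in d$items contains one of the search terms and FALSE otherwise.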

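For step 2, my thinking was that I should be able to use grep() with value = T, as shown in the code below. However, this returns the wrong value: instead of the original search term that grep matched, it returns the value of the matching item. So I get the following output:

                 items matches         original
    1             alex    TRUE             alex
    2 alex is a person    TRUE alex is a person
    3   this is a test   FALSE             <NA>
    4            false   FALSE             <NA>
    5    this is cathy    TRUE    this is cathy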

    # Dummy data and search terms
    d = data.frame(items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"))
    searchTerms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")
    
    # Return TRUE iff a search term is found in the items column as a whole word, not inside a longer word
    d$matches = grepl(paste("(^| |[^a-zA-Z])", 
        searchTerms, "($| |[^a-zA-Z])", sep = "", 
        collapse = "|"), d[,1], ignore.case = TRUE
    )
    
    # Subset data
    dMatched = d[d$matches==T,]   
    
    # This is where the problem is: return the value that was originally matched with grepl above
    dMatched$original = grep(paste("(^| |[^a-zA-Z])", 
        searchTerms, "($| |[^a-zA-Z])", sep = "", 
        collapse = "|"), dMatched[,1], ignore.case = TRUE, value = TRUE
    )
    
    
    d$original[d$matches==T] = dMatched$original
    

The code above is what I am currently using. Any ideas would be much appreciated!


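2 Answers:

Answer 0 (score: 3)

Thanks to Dason for the helpful hint! I was able to solve my problem using regmatches(). Here is my code, picking up at the problem spot in the original question: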

# Locate the match that grepl found above, so that regmatches can extract the matched text
m = regexpr(paste("(^| |[^a-zA-Z])", 
    searchTerms, "($| |[^a-zA-Z])", sep = "", 
    collapse = "|"), dMatched[,1], ignore.case = TRUE 
)

dMatched$original = regmatches(dMatched[,1], m)

d$original[d$matches==T] = dMatched$original

This returns the following output, which is exactly what I want:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE    alex 
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
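
One caveat with the output above: the pattern consumes the delimiter next to the term, so row 2 comes back as "alex " with a trailing space and row 5 as " cathy" with a leading space. A minimal sketch of one way around this, assuming the search terms are plain words with no regex metacharacters, is to use the zero-width word boundary \b instead of matching the delimiter, and then map the matched text back to the entry in searchTerms. Note that \b also treats digits and underscores as word characters, which differs slightly from the "not a letter" rule in the original pattern.

```r
d <- data.frame(
  items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"),
  stringsAsFactors = FALSE
)
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Single alternation wrapped in word boundaries: \b(alex|bob|...)\b
# Assumption: the terms contain no regex metacharacters.
pattern <- paste0("\\b(", paste(searchTerms, collapse = "|"), ")\\b")

# Step 1: TRUE where any term occurs as a whole word
d$matches <- grepl(pattern, d$items, ignore.case = TRUE, perl = TRUE)

# Step 2: position of the first match in each item (-1 where there is none)
m <- regexpr(pattern, d$items, ignore.case = TRUE, perl = TRUE)

# regmatches() returns the matched text for the matching items only.
# Look it up case-insensitively so the term is returned as spelled in searchTerms.
hit <- match(tolower(regmatches(d$items, m)), tolower(searchTerms))

d$original <- NA_character_
d$original[d$matches] <- searchTerms[hit]

print(d)
```

Because \b matches a position rather than a character, nothing needs to be trimmed afterwards, and the subsetting into dMatched is no longer required.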

Answer 1 (score: 2)

Not exactly what you asked for, but you could use the termco function from qdap for this. It helps if you have two names in the same sentence. To get this with qdap, you can use:

library(qdap)
termco(d$items, 1:nrow(d), searchTerms)

## > termco(d$items, 1:nrow(d), searchTerms)
##   nrow(d word.count       alex bob     cathy derrick erica ferdinand
## 1      1          1 1(100.00%)   0         0       0     0         0
## 2      2          4  1(25.00%)   0         0       0     0         0
## 3      3          4          0   0         0       0     0         0
## 4      4          1          0   0         0       0     0         0
## 5      5          3          0   0 1(33.33%)       0     0         0
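
If installing qdap is not an option, the "two names in the same sentence" case can also be approximated in base R. The following is only a sketch, not a replacement for termco: it uses gregexpr() to find every match per item and collapses them into one string. The items vector here is made-up example data, and lower-casing the hits assumes the search terms themselves are lower case.

```r
# Hypothetical example data with two names in one item
items <- c("alex met cathy", "this is a test", "bob")
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Assumption: the terms contain no regex metacharacters.
pattern <- paste0("\\b(", paste(searchTerms, collapse = "|"), ")\\b")

# gregexpr() finds all matches per item; regmatches() then returns a list
# with one character vector per item (character(0) where nothing matched)
g <- gregexpr(pattern, items, ignore.case = TRUE, perl = TRUE)
found <- regmatches(items, g)

# Collapse each item's hits into a single string, NA when there are none
original <- vapply(
  found,
  function(x) if (length(x) > 0) paste(tolower(x), collapse = ", ") else NA_character_,
  character(1)
)

print(data.frame(items, matches = !is.na(original), original))
```

Unlike termco, this gives the matched terms per item rather than counts and percentages per term.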