Lookup table with subset/grepl in R

Time: 2015-07-08 20:35:11

Tags: regex r dplyr

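I'm analyzing a set of URLs and values extracted with a crawler. While I could extract substrings from the URL, I'd really rather not use regex extraction for this. Is there a simple way to do a lookup-table-style replacement using subset / grepl, without resorting to dplyr (doing a conditional mutate on the variables)?

My current process: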

test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA
)

test[grepl('testing', test$url), ]$subdir <- 'testing'
test[grepl('evaluating', test$url), ]$subdir <- 'evaluating'
test[grepl('analyzing', test$url), ]$subdir <- 'analyzing'

Obviously, this is a bit clumsy and doesn't scale well. With dplyr, I could do something with conditionals:

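library(dplyr)
library(magrittr)  # provides the %<>% compound-assignment pipe

# Match against `url` (the `subdir` column is all NA at this point)
test %<>% as_tibble() %>% 
  mutate(subdir = ifelse(
    grepl('testing', url), 
    'test r', 
    ifelse(
      grepl('evaluating', url), 
      'eval r', 
      ifelse(
        grepl('analyzing', url), 
        'anal r', 
        NA_character_
      ))))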

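But again, that's really silly, and I'd rather not take on a package dependency if I can avoid it. Is there a way to do regex-based subsetting with some kind of lookup table?

Edit: just a few clarifications:

  1. For extracting the subdirectory, yes, a regex is the most efficient option; however, I'm hoping for a more general pattern that can map a dictionary-like structure of strings to other, arbitrary values.
  2. Of course the nested ifelse is ugly and error-prone; I just wanted a quick-and-dirty example of the dplyr approach.
  3. Edit 2: Thought I'd circle back and post what I ended up with, based on BondedDust's approach. I decided to practice some mapping and non-standard eval while I was at it: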

    test <- data.frame(
      url = c(
        'google.com/testing/duck',
        'google.com/testing/dog',
        'google.com/testing/cat',
        'google.com/evaluating/duck', 
        'google.com/evaluating/dog', 
        'google.com/evaluating/cat', 
        'google.com/analyzing/duck',
        'google.com/analyzing/dog',
        'google.com/analyzing/cat',
        'banana'
      ),
      content = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
      subdir = NA
    )
    
    # Named character vector used for key/value lookup; names can be regex
    lookup <- c(
      "testing" = "Testing is important",
      "Eval.*" = 'eval in R',
      "analy(z|s)ing" = 'R is fun'
    )
    
    # Dumb test for error handling:
    # lookup <- c('test', 'hey')
    
    # Defining new lookup function
    regexLookup <- function(data, dict, searchColumn, targetColumn, ignore.case = TRUE){
      # Basic check—need to separate errors/handling
      if(is.null(names(dict)) || is.null(dict[[1]])) {
        stop("Not a valid replacement value; use a key/value store for `dict`.")
      }
    
      # Non-standard eval for the column names; not sure if I should
      # add safety/type checks for these
      searchColumn <- eval(substitute(searchColumn), data)
      targetColumn <- deparse(substitute(targetColumn))
    
      # Define find-and-replace utility
      findAndReplace <- function (key, val){
        data[grepl(key, searchColumn, ignore.case = ignore.case), targetColumn] <- val
        data <<- data
      }
    
      # Map over the key/value store
      mapply(findAndReplace, names(dict), dict)
    
      # Return result, with non-matching rows preserved
      return(data)
    }
    
    regexLookup(test, lookup, url, subdir, ignore.case = FALSE)
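
For comparison, here is a minimal sketch of the same key/value idea without `<<-` or non-standard evaluation. The function name `match_first` is hypothetical (it is not part of the original post), and it assumes `dict` is a named atomic vector whose names are regex patterns. Unlike `regexLookup`, where a later pattern overwrites an earlier one, the first matching pattern wins here, and the function returns a plain vector instead of modifying the data frame.

```r
# Hypothetical helper: for each element of `x`, return the value of the
# first pattern in `dict` (a named atomic vector) that matches it, else NA.
match_first <- function(x, dict, ignore.case = TRUE) {
  if (is.null(names(dict)) || any(names(dict) == "")) {
    stop("`dict` must be a fully named vector (pattern = value).")
  }
  idx <- rep(NA_integer_, length(x))
  for (i in seq_along(dict)) {
    # Only fill positions that no earlier pattern has claimed
    hit <- is.na(idx) & grepl(names(dict)[i], x, ignore.case = ignore.case)
    idx[hit] <- i
  }
  unname(dict[idx])  # indexing with NA yields NA for unmatched elements
}

urls <- c("google.com/testing/duck", "google.com/evaluating/dog",
          "google.com/analyzing/cat", "banana")
lookup <- c("testing" = "Testing is important",
            "Eval.*" = "eval in R",
            "analy(z|s)ing" = "R is fun")

res_insensitive <- match_first(urls, lookup)
res_sensitive   <- match_first(urls, lookup, ignore.case = FALSE)
```

With the data frame from the post, this would be used as `test$subdir <- match_first(test$url, lookup)`. With `ignore.case = FALSE`, the `"Eval.*"` key no longer matches the lowercase `evaluating` URLs, so those rows stay `NA`.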
    

2 Answers:

Answer 0 (score: 3)

for (target in c('testing', 'evaluating', 'analyzing')) {
  test[grepl(target, test$url), 'subdir'] <- target
}

 test
                        url content     subdir
1   google.com/testing/duck       1    testing
2 google.com/evaluating/dog       2 evaluating
3  google.com/analyzing/cat       3  analyzing

The target vector could instead be the name of a vector in your workspace.

targets <-   c('testing','evaluating','analyzing') 
for( target in targets ) { ...}
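
The same loop extends to the key/value case the question asks about. This is a sketch rather than part of the original answer: it assumes a named character vector, here called `lookup`, whose names are the patterns and whose values are the replacements (taken from the question's dplyr example).

```r
test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog',
          'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA,
  stringsAsFactors = FALSE
)

# Assumed lookup table: names are regex patterns, values are replacements
lookup <- c(testing = 'test r', evaluating = 'eval r', analyzing = 'anal r')

for (pattern in names(lookup)) {
  # Rows whose url matches the pattern get the corresponding value;
  # a pattern that matches no rows leaves the data frame unchanged
  test[grepl(pattern, test$url), 'subdir'] <- lookup[[pattern]]
}
```

If several patterns match the same row, the last one in `lookup` wins, because each pass overwrites earlier assignments.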

Answer 1 (score: 2)

Try this:

test$subdir<-gsub('.*\\/(.*)\\/.*','\\1',test$url)
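
One caveat: when the pattern does not match at all (for example a bare string such as 'banana'), gsub returns the input unchanged rather than NA. The sketch below is an assumption-laden variant, not the answerer's code. It uses a stricter, anchored pattern that captures the first path segment after the host, and it sets non-matching URLs to NA explicitly.

```r
urls <- c('google.com/testing/duck', 'google.com/evaluating/dog', 'banana')

# Assumed pattern: host, slash, first path segment (captured), slash, rest
pattern <- '^[^/]+/([^/]+)/.*$'

# Extract the captured segment only where the pattern matches; otherwise NA
subdir <- ifelse(grepl(pattern, urls), sub(pattern, '\\1', urls), NA_character_)

# For comparison: the original pattern leaves a non-matching string as-is
unchanged <- gsub('.*\\/(.*)\\/.*', '\\1', 'banana')
```

Note that the two patterns differ on longer paths: the greedy `.*` in the original captures the second-to-last segment, while the anchored version captures the first segment after the host.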