Fast text search function with a variable number of arguments in R

Date: 2014-09-12 17:08:05

Tags: r search data.table text-search

I have sample data from a large data.table:

ddf = structure(list(id = 1:5, country = c("United States of America", 
 "United Kingdom", "United Arab Emirates", "Saudi Arabia", "Brazil"
 ), area = c("North America", "Europe", "Arab", "Arab", "South America"
 ), city = c("first", "second", "second", "first", "third")), .Names = c("id", 
 "country", "area", "city"), class = c("data.table", "data.frame"
 ), row.names = c(NA, -5L))

ddf
   id                  country          area   city
1:  1 United States of America North America  first
2:  2           United Kingdom        Europe second
3:  3     United Arab Emirates          Arab second
4:  4             Saudi Arabia          Arab  first
5:  5                   Brazil South America  third

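I need to create a function that accepts a variable number of text arguments, performs an AND search on the data, and returns every row that contains all of the search strings. Different search strings may occur in different columns.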

For example, searchfn(ddf, 'brazil', 'third') should print only the last row.

The search should be case-insensitive.

The data is large, so the search needs to be fast (hence the use of data.table).

I tried:

searchfn = function(ddf, ...){
    ll = list(...)
    print(sapply(ll, function(x) grep(x, ddf, ignore.case=T)))
}

It takes all of the search strings passed in and prints some numbers, but the search results are not correct.
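A likely reason the attempt misbehaves: when grep() is given a list such as a data.frame or data.table, it first converts it with as.character(), which collapses each column into a single string. The numbers printed are therefore column positions, not row positions. The small sketch below illustrates this on a hypothetical two-row data frame (`df`, `col_hits`, and `row_hits` are illustrative names, not from the post):

```r
# Hypothetical two-row example, not the original data
df <- data.frame(country = c("Brazil", "Saudi Arabia"),
                 city    = c("third", "first"),
                 stringsAsFactors = FALSE)

# grep() on the whole table coerces each column to ONE string,
# so the result indexes columns: "brazil" is found in column 1
col_hits <- grep("brazil", df, ignore.case = TRUE)

# grepl() on a single column gives one TRUE/FALSE per row instead
row_hits <- grepl("brazil", df$country, ignore.case = TRUE)

print(col_hits)
print(row_hits)
```

Searching row by row therefore needs either a per-column grepl() or a per-row string built from all columns.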

1 answer:

Answer 0: (score: 2)

This seems to work, but I doubt it is an optimal solution:

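searchfn = function(ddf, ...){
  ll = list(...)
  # Combine all search terms into a single alternation pattern
  pat <- paste(unlist(ll), collapse = "|")
  # Paste every column together so that each row becomes one string
  X <- do.call(paste, ddf)
  # Extract every case-insensitive match found in each row
  Y <- regmatches(X, gregexpr(pat, X, ignore.case = TRUE))
  # Keep rows whose distinct matches (compared in lower case) cover all terms;
  # this assumes the terms are plain literal strings, not regular expressions
  ddf[which(vapply(Y, function(x) length(unique(tolower(x))), 1L) == length(ll)), ]
}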

Here are some tests to try:

searchfn(ddf, 'brazil', 'third')
searchfn(ddf, 'arab', 'first')
searchfn(ddf, "united", "second")
searchfn(ddf, "united", "second", "2")
searchfn(ddf, "united", "second", "Euro")
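As an alternative sketch (not part of the original answer, and not benchmarked), the AND condition can also be expressed directly: test each term separately with grepl() and combine the logical vectors with `&`. The helper name `search_all` is hypothetical. The example uses a plain data.frame copy of the sample data so it runs without extra packages, and it treats terms as literal substrings (fixed = TRUE) on lower-cased text rather than as regular expressions.

```r
# Hypothetical helper: keep rows that contain ALL search terms (case-insensitive)
search_all <- function(df, ...) {
  terms <- tolower(c(...))
  # One lower-cased string per row, built from every column
  rows <- tolower(do.call(paste, c(unname(as.list(df)), sep = " ")))
  # One logical vector per term: which rows contain it as a literal substring
  hits <- lapply(terms, function(term) grepl(term, rows, fixed = TRUE))
  # AND the vectors together; with no terms, every row is kept
  keep <- Reduce(`&`, hits, rep(TRUE, length(rows)))
  df[keep, , drop = FALSE]
}

# Plain data.frame copy of the sample data from the question
ddf_df <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  city = c("first", "second", "second", "first", "third"),
  stringsAsFactors = FALSE
)

print(search_all(ddf_df, "brazil", "third"))
print(search_all(ddf_df, "united", "second", "Euro"))
```

Because all columns are pasted into one string per row, a term could in principle match across the boundary between two adjacent columns; searching column by column would avoid that at the cost of more grepl() calls.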