Filter a list for elements that contain all of several partial strings in R

Asked: 2018-01-29 17:03:39

Tags: r string filter grepl

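I am trying to filter a list of file names based on a set of keywords that the user selects in a Shiny app. The final list should contain only the file names that contain all of the partial keywords.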

So far I have been trying this code:

sapply(filenames, grepl, keywords)

But how do I get from that output to only the elements for which every keyword matches? I tried the solution from this related SO question, but

all(sapply(filenames, grepl, keywords))

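of course returns a single FALSE for my list. I could write an lapply function that applies sapply(....) to each element, but perhaps there is a more efficient way to do this for all elements at once?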
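For reference, a small sketch of what the attempt above computes. sapply() passes each element of its first argument as the first argument of the function, so every file name ends up as the grepl() pattern and the keywords become the text being searched, which is the reverse of what is wanted. The shortened file names are made up for illustration only.

```r
keywords  <- c("Syn", "2017")
# Shortened, made-up file names for illustration only
filenames <- c("Euk 2017-06-19.csv", "Syn 2016-06-18.csv", "Syn 2017-06-20.csv")

# Attempt from the question: calls grepl(pattern = <file name>, x = keywords)
attempt <- sapply(filenames, grepl, keywords)
dim(attempt)   # one row per keyword, one column per file name
all(attempt)   # collapses the whole matrix to a single value

# Swapping the roles calls grepl(pattern = <keyword>, x = filenames)
hits <- sapply(keywords, grepl, filenames)
dim(hits)      # one row per file name, one column per keyword
```

With the roles swapped, each row of the matrix describes one file name, so a row-wise "all TRUE" test identifies the files that contain every keyword.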

I also looked at the grep / grepl options, but they only accept OR arguments, not AND.
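As an aside, an AND condition can be expressed inside a single regular expression with lookaheads, provided perl = TRUE is set. This is only a sketch and not one of the answers below; it assumes the keywords contain no regex metacharacters, and the shortened file names are made up for illustration.

```r
keywords  <- c("Syn", "2017")
# Shortened, made-up file names for illustration only
filenames <- c("Euk 2017-06-19.csv", "Syn 2016-06-18.csv", "Syn 2017-06-20.csv")

# Build "^(?=.*Syn)(?=.*2017)": every lookahead must succeed from the start
# of the string, so the order of the keywords does not matter
pattern <- paste0("^", paste0("(?=.*", keywords, ")", collapse = ""))

# perl = TRUE is required because lookaheads are a PCRE feature
matched <- filenames[grepl(pattern, filenames, perl = TRUE)]
matched
```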

Example keywords:

keywords <- c("Syn", "2017") 

Example list:

 filenames <- 
c("AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2016-06-18 13u22.csv",       "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2016-06-19 13u26.csv",      
"AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2017-06-19 13u27.csv",       "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2017-06-20 13u11.csv",      
"AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2018-06-21 13u12.csv",       "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2018-06-22 16u00.csv",      
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2016-06-18 13u25.csv", "AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2016-06-19 13u29.csv",
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2017-06-20 13u14.csv", "AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2017-06-21 13u15.csv",
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2018-06-22 16u03.csv", "AdditionalListMode_M1bI Syn 60 90 90 110 2016-06-18 13u31.csv",            
"AdditionalListMode_M1bI Syn 60 90 90 110 2016-06-19 13u35.csv",             "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv",           
"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv",             "AdditionalListMode_M1bI Syn 60 90 90 110 2018-06-22 16u09.csv")

Expected result:

"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"           
"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

This may be a slight duplicate, but after a long search on SO and Google I could not find a real solution.

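Edit, results: I used a dataset of 359 file names to get microbenchmark results for all valid answers (including the ones that are sensitive to keyword order):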

Unit: microseconds
                                                                                                                       expr      min       lq      mean    median       uq       max neval
 filesshort <- filenames[apply(sapply(keywords, function(x) grepl(x,      filenames)), 1, function(y) sum(y) == length(y))] 1220.588 1318.093 1691.7377 1366.2530 1635.477  5718.049    50
                               filesshort <- filenames[Reduce("&", lapply(keywords, function(x) grepl(x,      filenames)))]  532.922  568.055  640.7301  591.5435  637.137  1971.415    50
                                            filesshort <- grep(paste(keywords, collapse = ".*"), filenames,      value = T)  302.779  331.991  379.9144  343.4390  380.941   790.303    50
                             filesshort <- regmatches(filenames, regexpr(paste(keywords, collapse = ".*"),      filenames)) 2244.587 2310.905 2668.2153 2456.9655 2708.820  5758.314    50
                    filesshort <- unlist(regmatches(filenames, gregexpr(paste(keywords,      collapse = ".*"), filenames))) 3768.742 3985.463 5491.8536 4654.5750 5322.109 42538.964    50

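Expression 3, which uses grep, is by far the fastest, but it is also sensitive to keyword order. If we consider both speed and tolerance of keyword order, expression 2, which uses Reduce, is the clear winner over the other four.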
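The order sensitivity can be seen in a small sketch. The shortened file names are made up for illustration, and the keywords are deliberately given in the reverse of the order in which they appear in the file names.

```r
# Shortened, made-up file names for illustration only
filenames <- c("Euk 2017-06-19.csv", "Syn 2016-06-18.csv", "Syn 2017-06-20.csv")
reversed  <- c("2017", "Syn")

# Collapsing with ".*" builds "2017.*Syn", which requires "2017" to come first
ordered_match <- grep(paste(reversed, collapse = ".*"), filenames, value = TRUE)

# Testing each keyword separately and combining with "&" ignores the order
any_order <- filenames[Reduce("&", lapply(reversed, function(k) grepl(k, filenames)))]

length(ordered_match)
any_order
```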

4 answers:

Answer 0 (score: 2)

filenames[Reduce("&", lapply(keywords, function(x) grepl(x, filenames)))]
#[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
#[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
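Since the keywords come from user input in a Shiny app, the approach above could be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the name filter_all_keywords is hypothetical, fixed = TRUE is an assumption that the keywords should be matched literally and not as regular expressions, and the init value is there so that an empty keyword selection keeps every file name.

```r
# Hypothetical helper, not part of the original answer
filter_all_keywords <- function(filenames, keywords) {
  # One logical vector per keyword; fixed = TRUE treats keywords literally
  hits <- lapply(keywords, function(k) grepl(k, filenames, fixed = TRUE))
  # Starting from all TRUE keeps every file name when no keyword is selected
  keep <- Reduce(`&`, hits, init = rep(TRUE, length(filenames)))
  filenames[keep]
}

# Shortened, made-up file names for illustration only
files <- c("Euk 2017-06-19.csv", "Syn 2016-06-18.csv", "Syn 2017-06-20.csv")
filter_all_keywords(files, c("2017", "Syn"))
filter_all_keywords(files, character(0))
```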

Answer 1 (score: 1)

filenames[apply(sapply(keywords, function(x) grepl(x, filenames)), 1, function(y) sum(y) == length(y))]
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

Answer 2 (score: 1)

keywords <- c("Syn.*2017")

> filenames[grep(keywords,filenames)]
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

Answer 3 (score: 1)

 grep("Syn.*?2017",filenames,value = T)
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

regmatches(filenames,regexpr("(.*Syn).*?2017(.*)",filenames))
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

unlist(regmatches(filenames,gregexpr("(.*Syn).*?2017(.*)",filenames)))
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"

You can use whichever one suits the task at hand.