Subsetting a list by partial string match against a tibble data frame

Time: 2019-06-12 10:32:47

Tags: r regex dplyr

I have a list of files called listOfCELfiles:

listOfCELfiles <- c(
"GSE20489/GSE20489_RAW//GSM514737.CEL.gz",
"GSE20489/GSE20489_RAW//GSM514738.CEL.gz",
"GSE20489/GSE20489_RAW//GSM514739.CEL.gz",
"GSE20489/GSE20489_RAW//GSM514740.CEL.gz",
"GSE20489/GSE20489_RAW//GSM514741.CEL.gz", 
"GSE20489/GSE20489_RAW//GSM514742.CEL.gz",
"GSE20489/GSE20489_RAW//GSM514743.CEL.gz", 
"GSE20489/GSE20489_RAW//GSM514744.CEL.gz", 
"GSE20489/GSE20489_RAW//GSM514745.CEL.gz"
)

and a data frame called timepoint_table:

library(tidyverse)  # provides tibble(), filter(), select() and str_detect()
timepoint_table <- tibble(SampleID = c("GSM514737","GSM514738","GSM514739","GSM514740","GSM514741","GSM514742","GSM514743","GSM514744","GSM514745"),
SampleName = c("Blood_alcohol_T1_S13", "Blood_alcohol_T2_S13","Blood_OJalcohol_T3_S13","Blood_alcohol_T4_S13","Blood_OJalcohol_T5_S13","Blood_alcohol_T1_S15","Blood_alcohol_T2_S15","Blood_OJalcohol_T3_S15","Blood_OJalcohol_T4_S15"))

So timepoint_table looks like this:

# A tibble: 9 x 2
  SampleID  SampleName            
  <chr>     <chr>                 
1 GSM514737 Blood_alcohol_T1_S13  
2 GSM514738 Blood_alcohol_T2_S13  
3 GSM514739 Blood_OJalcohol_T3_S13
4 GSM514740 Blood_alcohol_T4_S13  
5 GSM514741 Blood_OJalcohol_T5_S13
6 GSM514742 Blood_alcohol_T1_S15  
7 GSM514743 Blood_alcohol_T2_S15  
8 GSM514744 Blood_OJalcohol_T3_S15
9 GSM514745 Blood_OJalcohol_T4_S15

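SampleID is part of each file name in listOfCELfiles. I now want to remove every sample except the Blood_alcohol ones from listOfCELfiles by matching against the SampleID column of timepoint_table. The following code selects the matching SampleID values from timepoint_table:

timepoint_table %>% filter(str_detect(SampleName, "^Blood_alcohol")) %>% select(SampleID)

But I have not been able to filter listOfCELfiles with the matched SampleID values (using grepl or str_detect).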

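My expected output would be a list containing only the files for the Blood_alcohol samples, that is, the listOfCELfiles entries for GSM514737, GSM514738, GSM514740, GSM514742 and GSM514743.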

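1 Answer:

Answer 0 (score: 0)

You can pull the matching IDs and then use grep to filter the file list. Collapsing the IDs with "|" builds a single regular expression that matches any one of them:
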
library(tidyverse)

ids <- timepoint_table %>%  
         filter(str_detect(SampleName, "^Blood_alcohol")) %>%  
         pull(SampleID)

grep(paste0(ids, collapse = "|"), listOfCELfiles, value = TRUE)

#[1] "GSE20489/GSE20489_RAW//GSM514737.CEL.gz" "GSE20489/GSE20489_RAW//GSM514738.CEL.gz"
#[3] "GSE20489/GSE20489_RAW//GSM514740.CEL.gz" "GSE20489/GSE20489_RAW//GSM514742.CEL.gz"
#[5] "GSE20489/GSE20489_RAW//GSM514743.CEL.gz"
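
As a possible alternative (not part of the original answer), the same filtering can be sketched in base R without building a regular expression from the IDs. This assumes every file name has the form <SampleID>.CEL.gz, as in the question. The names file_ids and selected_files are made up for illustration, and the data is shortened to three rows to keep the example small.

```r
# Shortened copy of the question's data (three files instead of nine)
listOfCELfiles <- c(
  "GSE20489/GSE20489_RAW//GSM514737.CEL.gz",
  "GSE20489/GSE20489_RAW//GSM514738.CEL.gz",
  "GSE20489/GSE20489_RAW//GSM514739.CEL.gz"
)
timepoint_table <- data.frame(
  SampleID   = c("GSM514737", "GSM514738", "GSM514739"),
  SampleName = c("Blood_alcohol_T1_S13", "Blood_alcohol_T2_S13",
                 "Blood_OJalcohol_T3_S13"),
  stringsAsFactors = FALSE
)

# IDs of the samples whose name starts with "Blood_alcohol"
ids <- timepoint_table$SampleID[
  startsWith(timepoint_table$SampleName, "Blood_alcohol")
]

# Drop the directory part and the ".CEL.gz" suffix to recover each file's ID
# (assumption: the file name is exactly <SampleID>.CEL.gz)
file_ids <- sub("\\.CEL\\.gz$", "", basename(listOfCELfiles))

# Keep the files whose ID is among the selected IDs (exact comparison)
selected_files <- listOfCELfiles[file_ids %in% ids]
print(selected_files)
```

Comparing with %in% matches whole IDs only, so an ID that happens to be a prefix of a longer one would not be picked up by accident, and no regex metacharacters in the IDs need escaping. The grep approach above is shorter and works fine for the IDs shown in the question.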