Match a regular expression against any column of a data frame

Time: 2014-06-14 09:47:06

Tags: regex r dataframe subset

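From a data frame, I want all rows that contain a certain pattern, e.g. "A" or "36" or "1?2". I don't care which column matches the pattern, as long as there is a match somewhere in the row.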

The data frame:

aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801

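My actual data frame has many columns, and I don't want to hard-code their names. The patterns may be more complex, so I would like to use regular expressions.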

Code to read this data frame into R:

data <- read.table(textConnection("
aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
"), header = TRUE)

2 answers:

Answer 0: (score: 2)

You can use grepl, apply, and rowSums:

> rowSums(apply(data, 2, grepl, pattern = "A")) > 0
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> rowSums(apply(data, 2, grepl, pattern = "1?2")) > 0
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> rowSums(apply(data, 2, grepl, pattern = "36")) > 0
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

> out <- rowSums(apply(data, 2, grepl, pattern = "36")) > 0
> data[out,]
        aName      bName          pName call alleles logRatio  strength
1 AX-11086564 F08_ADN103 2011-02-10_R10   AB      CG 0.363371 10.184215
6 AX-11086564 B05_CD3630  2011-02-02_R7   AA      CC 2.349906  9.153076

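Note that apply coerces the data frame with as.matrix, so every column, including the numeric ones, is matched as character strings.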
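To make the coercion concrete, here is a minimal sketch on a small stand-in data frame (`df` and its three rows are made up for illustration, borrowing values from the question). It first looks at what as.matrix produces, then shows a column-wise alternative with lapply and Reduce that skips the matrix step; this is one possible variation, not part of the answer above.

```r
# Small stand-in data frame (assumed for illustration only).
df <- data.frame(
  bName    = c("F08_ADN103", "A01_CD1919", "B05_CD3630"),
  logRatio = c(0.363371, -1.352707, 2.349906),
  strength = c(10.184215, 9.54909, 9.153076),
  stringsAsFactors = FALSE
)

# apply() on a data frame goes through as.matrix(); with a character column
# present, the whole matrix becomes character, numbers included.
m <- as.matrix(df)
is.character(m)

# Column-wise alternative: grepl() converts each column with as.character(),
# and Reduce() ORs the per-column results into one logical value per row.
hit <- Reduce(`|`, lapply(df, grepl, pattern = "36"))
df[hit, ]
```

Here rows 1 and 3 of `df` are kept: row 1 matches through logRatio (0.363371) and row 3 through bName (B05_CD3630).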

Answer 1: (score: 2)

Here I define a grep wrapper that searches a data.frame row by row:

search_data_frame <- 
  function(patt,data)
    # indices of the rows in which at least one cell matches patt
    which(vapply(seq_len(nrow(data)), function(i) length(grep(patt, data[i,])) > 0, logical(1)))

Then you use it like this:

  data[search_data_frame('36',data),]

        aName      bName          pName call alleles logRatio  strength
1 AX-11086564 F08_ADN103 2011-02-10_R10   AB      CG 0.363371 10.184215
6 AX-11086564 B05_CD3630  2011-02-02_R7   AA      CC 2.349906  9.153076

Note that I read your data with stringsAsFactors=FALSE; otherwise you should coerce your factors to character.
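If the data was read with factors, the coercion mentioned above could look roughly like the sketch below. The data frame `df` and the helper name `rows_matching` are hypothetical, chosen only for this illustration; the helper is a row-wise search in the same spirit as the wrapper, not the answer's own code.

```r
# Stand-in data with factor columns, mimicking stringsAsFactors = TRUE
# (assumed example data, values borrowed from the question).
df <- data.frame(
  bName    = c("F08_ADN103", "A01_CD1919", "B05_CD3630"),
  call     = c("AB", "BB", "AA"),
  logRatio = c(0.363371, -1.352707, 2.349906),
  stringsAsFactors = TRUE
)

# Convert only the factor columns to character; numeric columns stay numeric.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# Hypothetical helper: row numbers where any cell matches `patt`.
# unlist() turns the one-row data frame into a character vector.
rows_matching <- function(patt, d) {
  which(vapply(seq_len(nrow(d)),
               function(i) any(grepl(patt, unlist(d[i, ]))),
               logical(1)))
}

df[rows_matching("36", df), ]
```

Converting first means the search sees the labels themselves rather than whatever a factor happens to be coerced to inside a row.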