Test whether a string contains any element of a vector

Asked: 2014-04-11 10:36:12

Tags: regex string r extract

Here is my problem: I have 2 txt files. The first looks like this:

ABCG1
ABLIM1
ABP1
ACOT11
ACP5

It contains more than 700 strings. The second looks like this:

1       2       3       4       5       6       GENE_NAME
0.01857 0.02975 0.02206 0.01847 0.01684 0.01588 NIPA2;NIPA2;NIPA2;NIPA2
0.81992 0.8168  0.76963 0.83116 0.78114 0.85544 MAN1B1
0.13053 0.12308 0.10654 0.11675 0.13664 0.10312 TSEN34;TSEN34
0.91888 0.93095 0.91498 0.91558 0.91126 0.91569 LRRC16A

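Its size is 90+ x 640,000+.

I want to extract the rows of the second, tab-delimited file whose GENE_NAME contains any of the values from the first file. I came up with something like this: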

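data <- test[0, ]  # empty data frame with the same columns (x and test assumed to be the same table)
for (i in seq_len(nrow(test))) {
    # "gene_name" stands for a single gene name from the first file
    if (grepl("gene_name", test$GENE_NAME[i])) {
        data <- rbind(data, test[i, ])
    }
}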

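The problem is that I would have to repeat this code more than 700 times, once per gene name. Is there a way to write something like this: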

value = c(vector that contains my gene names)
string = (one of the strings of my table)
grepl(any(value), string)

I ran into a problem with any, because it turns the vector into a logical instead of a character. Thanks in advance.
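
For reference, the intent of the pseudocode above can be written by calling grepl once per name and only then reducing the results with any. This is a minimal sketch, not part of the original post; the two inputs are shortened, hypothetical stand-ins for the real data, and fixed = TRUE is an assumption that the gene names should be matched as literal text.

```r
# Hypothetical inputs, shortened from the question
value  <- c("ABCG1", "ABLIM1", "ABP1")
string <- "TSEN34;ABP1"

# grepl() takes one pattern at a time, so apply it once per name;
# fixed = TRUE treats each name as literal text rather than a regex
hits <- vapply(value, grepl, logical(1), x = string, fixed = TRUE)

# any() then collapses the per-name results into a single TRUE/FALSE
any(hits)
```

Here any receives a logical vector, which is what it expects, instead of the character vector of names.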

1 answer:

Answer 0: (score: 0)

Would this work for you?

value <- c("ABCG1",
          "ABLIM1",
          "ABP1",
          "ACOT11",
          "ACP5")


GENE_NAME <- c("ABCG1;NIPA2;NIPA2",
           "ABLIM1",
           "ABP1;ABCG1",
           "ACOT11",
           "TSEN34;TSEN34",
           "ACP5",
           "LRRC16A") # This is the test$GENE_NAME column

lapply(value, function(x) GENE_NAME[grepl(x, GENE_NAME)])
# [[1]]
# [1] "ABCG1;NIPA2;NIPA2" "ABP1;ABCG1"       
# 
# [[2]]
# [1] "ABLIM1"
# 
# [[3]]
# [1] "ABP1;ABCG1"
# 
# [[4]]
# [1] "ACOT11"
# 
# [[5]]
# [1] "ACP5"

If you prefer, you can unlist the result:

unlist(lapply(value, function(x) GENE_NAME[grepl(x, GENE_NAME)]))
# [1] "ABCG1;NIPA2;NIPA2" "ABP1;ABCG1"        "ABLIM1"            "ABP1;ABCG1"        "ACOT11"           
# [6] "ACP5"
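
Two things to be aware of in the output above: "ABP1;ABCG1" appears twice because it matches two names, and grepl(x, GENE_NAME) matches substrings, so "ABP1" would also match a name such as "ABP10". If whole rows of the table are wanted, each at most once, a possible alternative is to split GENE_NAME on ";" and compare the pieces with %in%. This is only a sketch under assumptions: the data frame test, its column s1, and the GENE_NAME values below are hypothetical stand-ins for the real file, and ";" is assumed to be the only separator.

```r
# Hypothetical stand-ins for the two files in the question
value <- c("ABCG1", "ABLIM1", "ABP1", "ACOT11", "ACP5")
test <- data.frame(
  s1 = c(0.01857, 0.81992, 0.13053, 0.91888),
  GENE_NAME = c("NIPA2;ABP1", "MAN1B1", "ABCG1;ABCG1", "ABP10"),
  stringsAsFactors = FALSE
)

# Split each GENE_NAME on ";" and check whether any piece is in value
tokens <- strsplit(test$GENE_NAME, ";", fixed = TRUE)
keep <- vapply(tokens, function(tok) any(tok %in% value), logical(1))

# Logical indexing returns each matching row exactly once
result <- test[keep, ]
result$GENE_NAME
# [1] "NIPA2;ABP1"  "ABCG1;ABCG1"
```

The row with "ABP10" is dropped because the comparison is on whole names, and no loop over the 700+ names is needed.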