Splitting a data.frame column in R using gsub

Time: 2015-03-03 12:55:04

Tags: regex r substr gsub strsplit

I have a data.frame named rbp with a single column that looks like this:

> rbp
          V1
    dd_smadV1_39992_0_1
    Protein: AGBT(Dm)
    Sequence Position
    234
    290
    567
    126
    Protein: ATF1(Dm)
    Sequence Position
    534
    890
    105
    34
    128
    301
    Protein: Pox(Dm)
    201
    875
    453
    *********************
    dd_smadv1_9_02
    Protein: foxc2(Mm)
    Sequence Position
    145
    987
    345
    907
    Protein: Lor(Hs)
    876
    512

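I want to discard the sequence positions and extract only specific details, namely each sequence name and its corresponding protein names, like this: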

dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)
dd_smadv1_9_02 foxc2(Mm);Lor(Hs)  

I tried the following code in R, but it failed:

library(gsubfn)
Sub(rbp$V1,"Protein:(.*?) ")

Could someone please guide me?

1 Answer:

Answer 0 (score: 1)

Here is one approach:

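# Collapse the column into one string, then split it into one chunk per sequence
x <- strsplit(paste(rbp$V1, collapse = "\n"), "*********************", fixed = TRUE)[[1]]
# Find every "Protein: ..." line in each chunk and keep only the protein name
m <- gregexpr("Protein: (.*?)\n", x)
proteins <- lapply(regmatches(x, m), function(p) sub("Protein: (.*)\n", "\\1", p))
# The sequence name is the first run of name characters that ends a line
seq_names <- sub(".*?([A-Za-z0-9_]+)\n.*", "\\1", x)
sprintf("%s %s", seq_names, sapply(proteins, paste, collapse = ";"))
# [1] "dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)"
# [2] "dd_smadv1_9_02 foxc2(Mm);Lor(Hs)"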
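
If you would rather not collapse everything into one string, the same result can probably be reached by grouping the rows instead. What follows is only a minimal sketch under a few assumptions: records are separated by a row made up entirely of asterisks, the first row of each record is the sequence name, and protein rows start with "Protein: ". The sample data is a shortened copy of the question's column (most position rows are left out), and helper names such as is_sep and group are my own, not part of the original code.

```r
# Shortened copy of the question's data (assumption: one string per row)
rbp <- data.frame(V1 = c(
  "dd_smadV1_39992_0_1",
  "Protein: AGBT(Dm)", "Sequence Position", "234", "290",
  "Protein: ATF1(Dm)", "Sequence Position", "534",
  "Protein: Pox(Dm)", "201",
  "*********************",
  "dd_smadv1_9_02",
  "Protein: foxc2(Mm)", "Sequence Position", "145",
  "Protein: Lor(Hs)", "876"
), stringsAsFactors = FALSE)

# Strip stray whitespace so the anchored patterns below can match
v <- trimws(as.character(rbp$V1))

# Every separator row starts a new record; cumsum() turns that into a group id
is_sep <- grepl("^\\*+$", v)
group  <- cumsum(is_sep)[!is_sep]
v      <- v[!is_sep]

# Build one "name protein;protein;..." string per record
result <- unname(vapply(split(v, group), function(chunk) {
  is_protein <- grepl("^Protein: ", chunk)
  proteins   <- sub("^Protein: ", "", chunk[is_protein])
  # Assumption: the first row of each record is the sequence name
  paste(chunk[1], paste(proteins, collapse = ";"))
}, character(1)))

result
```

Working row by row avoids relying on how the regex engine treats newlines, at the cost of assuming that the sequence name always comes first in its record.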