Moving columns to rows in R

Time: 2017-03-28 15:42:05

Tags: r dataframe

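I am trying to move all the columns of a data frame into rows, so that they are stacked on top of one another. This means repeating the first column (named location below) once for each of the other columns.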

            location         160095-T_S2_L001_R1_001.bam  160096-N_S4_L001_R1_001.bam  160094-T_S12_L001_R1_001.bam  160095-N_S1_L001_R1_001.bam
            1:1-100000       NA                           NA                           NA                            NA
            1:100001-200000  2                            2                            4                             1
            1:200001-300000  1                            NA                           NA                            NA
            1:300001-400000  3                            3                            3                             3
            2:1-100000       NA                           NA                           NA                            NA
            2:100001-200000  1                            1                            NA                            NA

So that it looks like this:

            location    sample_id   number
            1:1-100000                 160095-T_S2_L001_R1_001.bam  NA
            1:100001-200000                160095-T_S2_L001_R1_001.bam  2
            1:200001-300000                160095-T_S2_L001_R1_001.bam  1
            1:300001-400000                160095-T_S2_L001_R1_001.bam  3
            2:1-100000                 160095-T_S2_L001_R1_001.bam  NA
            2:100001-200000                160095-T_S2_L001_R1_001.bam  1
            1:1-100000   160096-N_S4_L001_R1_001.bam    NA
            1:100001-200000  160096-N_S4_L001_R1_001.bam    2
            1:200001-300000  160096-N_S4_L001_R1_001.bam    NA
            1:300001-400000  160096-N_S4_L001_R1_001.bam    3
            2:1-100000   160096-N_S4_L001_R1_001.bam    NA
            2:100001-200000  160096-N_S4_L001_R1_001.bam    1
            1:1-100000   160094-T_S12_L001_R1_001.bam   NA
            1:100001-200000  160094-T_S12_L001_R1_001.bam   4
            1:200001-300000  160094-T_S12_L001_R1_001.bam   NA
            1:300001-400000  160094-T_S12_L001_R1_001.bam   3
            2:1-100000   160094-T_S12_L001_R1_001.bam   NA
            2:100001-200000  160094-T_S12_L001_R1_001.bam   NA
            1:1-100000  160095-N_S1_L001_R1_001.bam NA
            1:100001-200000 160095-N_S1_L001_R1_001.bam 1
            1:200001-300000 160095-N_S1_L001_R1_001.bam NA
            1:300001-400000 160095-N_S1_L001_R1_001.bam 3
            2:1-100000  160095-N_S1_L001_R1_001.bam NA
            2:100001-200000 160095-N_S1_L001_R1_001.bam NA

I tried transposing with t(dataframe), but that transposes the entire data frame rather than just the columns I want.

I would also like to split the location column into three separate columns, first on the colon and then on the dash.

            chromosome  start   stop    sample_id   number
            1   1   100000                 160095-T_S2_L001_R1_001.bam  NA
            1   100001  200000                 160095-T_S2_L001_R1_001.bam  2
            1   200001  300000                 160095-T_S2_L001_R1_001.bam  1
            1   300001  400000                 160095-T_S2_L001_R1_001.bam  3
            2   1   100000                 160095-T_S2_L001_R1_001.bam  NA
            2   100001  200000                 160095-T_S2_L001_R1_001.bam  1
            1   1   100000   160096-N_S4_L001_R1_001.bam    NA
            1   100001  200000   160096-N_S4_L001_R1_001.bam    2
            1   200001  300000   160096-N_S4_L001_R1_001.bam    NA
            1   300001  400000   160096-N_S4_L001_R1_001.bam    3
            2   1   100000   160096-N_S4_L001_R1_001.bam    NA
            2   100001  200000   160096-N_S4_L001_R1_001.bam    1
            1   1   100000   160094-T_S12_L001_R1_001.bam   NA
            1   100001  200000   160094-T_S12_L001_R1_001.bam   4
            1   200001  300000   160094-T_S12_L001_R1_001.bam   NA
            1   300001  400000   160094-T_S12_L001_R1_001.bam   3
            2   1   100000   160094-T_S12_L001_R1_001.bam   NA
            2   100001  200000   160094-T_S12_L001_R1_001.bam   NA
            1   1   100000  160095-N_S1_L001_R1_001.bam NA
            1   100001  200000  160095-N_S1_L001_R1_001.bam 1
            1   200001  300000  160095-N_S1_L001_R1_001.bam NA
            1   300001  400000  160095-N_S1_L001_R1_001.bam 3
            2   1   100000  160095-N_S1_L001_R1_001.bam NA
            2   100001  200000  160095-N_S1_L001_R1_001.bam NA

1 Answer:

Answer 0: (score: 1)

Here is a solution using base R.

This code recreates the data.frame from your example, so that others can use it too.

d <- structure(list(location = c("1:1-100000", "1:100001-200000", 
  "1:200001-300000", "1:300001-400000", "2:1-100000", "2:100001-200000"
), `160095-T_S2_L001_R1_001.bam` = c(NA, 2L, 1L, 3L, NA, 1L), 
  `160096-N_S4_L001_R1_001.bam` = c(NA, 2L, NA, 3L, NA, 1L), 
  `160094-T_S12_L001_R1_001.bam` = c(NA, 4L, NA, 3L, NA, NA
  ), `160095-N_S1_L001_R1_001.bam` = c(NA, 1L, NA, 3L, NA, 
    NA)), .Names = c("location", "160095-T_S2_L001_R1_001.bam", 
      "160096-N_S4_L001_R1_001.bam", "160094-T_S12_L001_R1_001.bam", 
      "160095-N_S1_L001_R1_001.bam"), class = "data.frame", row.names = c(NA, 
        -6L))

First, use reshape to put the data into long format:

long <- reshape(d, varying=2:5, v.names="number", timevar="sample_id",
  times=names(d)[2:5], direction="long")

This function is not very intuitive; in my experience it takes a lot of experimentation to get the arguments right.

> head(long)
                                     location                   sample_id number id
1.160095-T_S2_L001_R1_001.bam      1:1-100000 160095-T_S2_L001_R1_001.bam     NA  1
2.160095-T_S2_L001_R1_001.bam 1:100001-200000 160095-T_S2_L001_R1_001.bam      2  2
3.160095-T_S2_L001_R1_001.bam 1:200001-300000 160095-T_S2_L001_R1_001.bam      1  3
4.160095-T_S2_L001_R1_001.bam 1:300001-400000 160095-T_S2_L001_R1_001.bam      3  4
5.160095-T_S2_L001_R1_001.bam      2:1-100000 160095-T_S2_L001_R1_001.bam     NA  5
6.160095-T_S2_L001_R1_001.bam 2:100001-200000 160095-T_S2_L001_R1_001.bam      1  6
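
If the reshape arguments feel opaque, base R's stack() is another way to reach the same long layout. The following is only a sketch on a small stand-in for the question's data (two locations, two samples); the name long2 is made up for illustration, and it assumes every value column has the same length.

```r
# Small stand-in for the question's data frame.
# check.names = FALSE keeps the hyphenated sample names intact.
d <- data.frame(
  location = c("1:1-100000", "1:100001-200000"),
  `160095-T_S2_L001_R1_001.bam` = c(NA, 2L),
  `160096-N_S4_L001_R1_001.bam` = c(NA, 2L),
  check.names = FALSE
)

# stack() concatenates the value columns one after another and records
# the source column name in `ind`.
stacked <- stack(d[-1])

# Repeat the location column once per value column, in the same order.
long2 <- data.frame(
  location  = rep(d$location, times = ncol(d) - 1),
  sample_id = as.character(stacked$ind),
  number    = stacked$values
)
```

Unlike reshape, this does not add an id column or composite row names, so there is nothing to clean up afterwards.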

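Next, split the location string into three parts using strsplit with a regular expression that matches the colon and dash separators. Binding the pieces together gives a character matrix, but the values need to be numeric, so I change the mode of the matrix.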

splt <- do.call(rbind, strsplit(long$location, "(:|-|\\s+)"))
mode(splt) <- "numeric"

colnames(splt) <- c("chromosome", "start", "stop")

> head(splt)
     chromosome  start   stop
[1,]          1      1 100000
[2,]          1 100001 200000
[3,]          1 200001 300000
[4,]          1 300001 400000
[5,]          2      1 100000
[6,]          2 100001 200000
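
One caveat: converting the whole matrix to numeric assumes every chromosome name is a number. The example data only has chromosomes 1 and 2, but a name such as X would become NA. A possible variant, sketched here on hypothetical locations, keeps chromosome as text and converts only start and stop; the names parts and coords are illustrative.

```r
# Hypothetical locations, including a non-numeric chromosome name.
location <- c("1:1-100000", "X:100001-200000")

# Split on colon or dash, giving a character matrix with three columns.
parts <- do.call(rbind, strsplit(location, "[:-]"))

# Convert only start and stop to numbers; leave chromosome as character.
coords <- data.frame(
  chromosome = parts[, 1],
  start      = as.numeric(parts[, 2]),
  stop       = as.numeric(parts[, 3]),
  stringsAsFactors = FALSE
)
```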

The final step is to create a data.frame containing all the fields you need:

result <- data.frame(splt, long[c("sample_id","number")], row.names = NULL)

> head(result)
  chromosome  start   stop                   sample_id number
1          1      1 100000 160095-T_S2_L001_R1_001.bam     NA
2          1 100001 200000 160095-T_S2_L001_R1_001.bam      2
3          1 200001 300000 160095-T_S2_L001_R1_001.bam      1
4          1 300001 400000 160095-T_S2_L001_R1_001.bam      3
5          2      1 100000 160095-T_S2_L001_R1_001.bam     NA
6          2 100001 200000 160095-T_S2_L001_R1_001.bam      1
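
Because reshape is easy to get subtly wrong, it can be worth checking the long table against the wide one. The sketch below repeats the same reshape call on a toy table (the names wide, sampleA and sampleB are made up) and confirms that there is one row per location and sample pair and that every value landed in the right place.

```r
# Toy wide table standing in for `d`; column names are illustrative.
wide <- data.frame(
  location = c("1:1-100000", "1:100001-200000", "2:1-100000"),
  sampleA  = c(NA, 2L, 1L),
  sampleB  = c(NA, 4L, NA)
)

# Same call shape as in the answer: columns 2:3 become rows.
long <- reshape(wide, varying = 2:3, v.names = "number",
                timevar = "sample_id", times = names(wide)[2:3],
                direction = "long")

# Expect one row per (location, sample) pair.
stopifnot(nrow(long) == nrow(wide) * (ncol(wide) - 1))

# Each long value should equal the wide cell it came from.
# identical() is used so that NA compares equal to NA.
for (i in seq_len(nrow(long))) {
  cell <- wide[wide$location == long$location[i], long$sample_id[i]]
  stopifnot(identical(cell, long$number[i]))
}
```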