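I am trying to stack all the columns of a data frame so that they sit underneath one another. This means repeating the first column (named location below) once for each of the other columns.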
location 160095-T_S2_L001_R1_001.bam 160096-N_S4_L001_R1_001.bam 160094-T_S12_L001_R1_001.bam 160095-N_S1_L001_R1_001.bam
1:1-100000 NA NA NA NA
1:100001-200000 2 2 4 1
1:200001-300000 1 NA NA NA
1:300001-400000 3 3 3 3
2:1-100000 NA NA NA NA
2:100001-200000 1 1 NA NA
So that it looks like this:
location sample_id number
1:1-100000 160095-T_S2_L001_R1_001.bam NA
1:100001-200000 160095-T_S2_L001_R1_001.bam 2
1:200001-300000 160095-T_S2_L001_R1_001.bam 1
1:300001-400000 160095-T_S2_L001_R1_001.bam 3
2:1-100000 160095-T_S2_L001_R1_001.bam NA
2:100001-200000 160095-T_S2_L001_R1_001.bam 1
1:1-100000 160096-N_S4_L001_R1_001.bam NA
1:100001-200000 160096-N_S4_L001_R1_001.bam 2
1:200001-300000 160096-N_S4_L001_R1_001.bam NA
1:300001-400000 160096-N_S4_L001_R1_001.bam 3
2:1-100000 160096-N_S4_L001_R1_001.bam NA
2:100001-200000 160096-N_S4_L001_R1_001.bam 1
1:1-100000 160094-T_S12_L001_R1_001.bam NA
1:100001-200000 160094-T_S12_L001_R1_001.bam 4
1:200001-300000 160094-T_S12_L001_R1_001.bam NA
1:300001-400000 160094-T_S12_L001_R1_001.bam 3
2:1-100000 160094-T_S12_L001_R1_001.bam NA
2:100001-200000 160094-T_S12_L001_R1_001.bam NA
1:1-100000 160095-N_S1_L001_R1_001.bam NA
1:100001-200000 160095-N_S1_L001_R1_001.bam 1
1:200001-300000 160095-N_S1_L001_R1_001.bam NA
1:300001-400000 160095-N_S1_L001_R1_001.bam 3
2:1-100000 160095-N_S1_L001_R1_001.bam NA
2:100001-200000 160095-N_S1_L001_R1_001.bam NA
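I have tried transposing with t(dataframe), but that transposes the whole data frame rather than stacking the columns the way I want.
I would also like to split the location column into three separate columns, splitting first on the colon and then on the dash.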
chromosome start stop sample_id number
1 1 100000 160095-T_S2_L001_R1_001.bam NA
1 100001 200000 160095-T_S2_L001_R1_001.bam 2
1 200001 300000 160095-T_S2_L001_R1_001.bam 1
1 300001 400000 160095-T_S2_L001_R1_001.bam 3
2 1 100000 160095-T_S2_L001_R1_001.bam NA
2 100001 200000 160095-T_S2_L001_R1_001.bam 1
1 1 100000 160096-N_S4_L001_R1_001.bam NA
1 100001 200000 160096-N_S4_L001_R1_001.bam 2
1 200001 300000 160096-N_S4_L001_R1_001.bam NA
1 300001 400000 160096-N_S4_L001_R1_001.bam 3
2 1 100000 160096-N_S4_L001_R1_001.bam NA
2 100001 200000 160096-N_S4_L001_R1_001.bam 1
1 1 100000 160094-T_S12_L001_R1_001.bam NA
1 100001 200000 160094-T_S12_L001_R1_001.bam 4
1 200001 300000 160094-T_S12_L001_R1_001.bam NA
1 300001 400000 160094-T_S12_L001_R1_001.bam 3
2 1 100000 160094-T_S12_L001_R1_001.bam NA
2 100001 200000 160094-T_S12_L001_R1_001.bam NA
1 1 100000 160095-N_S1_L001_R1_001.bam NA
1 100001 200000 160095-N_S1_L001_R1_001.bam 1
1 200001 300000 160095-N_S1_L001_R1_001.bam NA
1 300001 400000 160095-N_S1_L001_R1_001.bam 3
2 1 100000 160095-N_S1_L001_R1_001.bam NA
2 100001 200000 160095-N_S1_L001_R1_001.bam NA
Answer 0 (score: 1)
Here is a solution using base R.
The following creates the data.frame from your example, so that others can reproduce it.
d <- structure(list(location = c("1:1-100000", "1:100001-200000",
"1:200001-300000", "1:300001-400000", "2:1-100000", "2:100001-200000"
), `160095-T_S2_L001_R1_001.bam` = c(NA, 2L, 1L, 3L, NA, 1L),
`160096-N_S4_L001_R1_001.bam` = c(NA, 2L, NA, 3L, NA, 1L),
`160094-T_S12_L001_R1_001.bam` = c(NA, 4L, NA, 3L, NA, NA
), `160095-N_S1_L001_R1_001.bam` = c(NA, 1L, NA, 3L, NA,
NA)), .Names = c("location", "160095-T_S2_L001_R1_001.bam",
"160096-N_S4_L001_R1_001.bam", "160094-T_S12_L001_R1_001.bam",
"160095-N_S1_L001_R1_001.bam"), class = "data.frame", row.names = c(NA,
-6L))
First, use reshape to put the data into long format.
long <- reshape(d, varying=2:5, v.names="number", timevar="sample_id",
times=names(d)[2:5], direction="long")
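This function is not very intuitive, and in my experience it takes a fair amount of experimentation to get the arguments right.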
> head(long)
location sample_id number id
1.160095-T_S2_L001_R1_001.bam 1:1-100000 160095-T_S2_L001_R1_001.bam NA 1
2.160095-T_S2_L001_R1_001.bam 1:100001-200000 160095-T_S2_L001_R1_001.bam 2 2
3.160095-T_S2_L001_R1_001.bam 1:200001-300000 160095-T_S2_L001_R1_001.bam 1 3
4.160095-T_S2_L001_R1_001.bam 1:300001-400000 160095-T_S2_L001_R1_001.bam 3 4
5.160095-T_S2_L001_R1_001.bam 2:1-100000 160095-T_S2_L001_R1_001.bam NA 5
6.160095-T_S2_L001_R1_001.bam 2:100001-200000 160095-T_S2_L001_R1_001.bam 1 6
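As a cross-check on the reshape call, the same long layout can be sketched with stack(), which is also part of base R. This is not from the original answer; the two-row data frame d_small and the names stacked and long2 are made up for illustration, and the sample names are shortened.

```r
# Toy input (hypothetical, shortened sample names): one id column plus
# one column of counts per sample.
d_small <- data.frame(
  location = c("1:1-100000", "1:100001-200000"),
  `160095-T` = c(NA, 2L),
  `160096-N` = c(1L, NA),
  check.names = FALSE,       # keep the sample names exactly as written
  stringsAsFactors = FALSE
)

# stack() concatenates the sample columns one after another and records
# the originating column name in the factor `ind`.
stacked <- stack(d_small[-1])

# Repeat the location column once per sample column, as the question describes.
long2 <- data.frame(
  location  = rep(d_small$location, times = ncol(d_small) - 1),
  sample_id = as.character(stacked$ind),
  number    = stacked$values,
  stringsAsFactors = FALSE
)
```

The rows come out sample by sample, in the same order as the desired output in the question. Unlike reshape, this variant adds neither an id column nor composite row names.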
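Next, use strsplit with a regular expression that matches the colon and the dash to split the location string into three parts. The result is a character matrix, but the values need to be numeric, so I change the mode of the matrix.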
splt <- do.call(rbind, strsplit(long$location, "(:|-|\\s+)"))
mode(splt) <- "numeric"
colnames(splt) <- c("chromosome", "start", "stop")
> head(splt)
chromosome start stop
[1,] 1 1 100000
[2,] 1 100001 200000
[3,] 1 200001 300000
[4,] 1 300001 400000
[5,] 2 1 100000
[6,] 2 100001 200000
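One caveat about converting the whole matrix to numeric: a chromosome name that is not a number (for example "X") would become NA. The example data contains only chromosomes 1 and 2, so the following is just a sketch under the assumption that such names could occur. The vector location and the names parts and coords are illustrative and not part of the original answer.

```r
# Hypothetical input that includes a non-numeric chromosome name.
location <- c("1:1-100000", "X:100001-200000")

# Split on either a colon or a dash; every element yields three pieces,
# so rbind() gives a three-column character matrix.
parts <- do.call(rbind, strsplit(location, "[:-]"))

# Keep the chromosome as character and convert only the coordinates.
coords <- data.frame(
  chromosome = parts[, 1],
  start      = as.integer(parts[, 2]),
  stop       = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
```

Building a data.frame here, rather than a matrix, is what allows the columns to have different types.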
The last step is to create a data.frame containing all the fields you need.
result <- data.frame(splt, long[c("sample_id","number")], row.names = NULL)
> head(result)
chromosome start stop sample_id number
1 1 1 100000 160095-T_S2_L001_R1_001.bam NA
2 1 100001 200000 160095-T_S2_L001_R1_001.bam 2
3 1 200001 300000 160095-T_S2_L001_R1_001.bam 1
4 1 300001 400000 160095-T_S2_L001_R1_001.bam 3
5 2 1 100000 160095-T_S2_L001_R1_001.bam NA
6 2 100001 200000 160095-T_S2_L001_R1_001.bam 1
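For reuse, the three steps can be wrapped in a single function. This is a minimal sketch, not part of the original answer: the function name wide_to_long, its id_col argument, and the toy data frame d_toy are hypothetical, and it assumes that every column other than id_col holds one sample's counts and that all chromosome names are numeric.

```r
# Hypothetical helper combining the steps above:
# reshape to long format, split the location, assemble the result.
wide_to_long <- function(d, id_col = "location") {
  sample_cols <- setdiff(names(d), id_col)

  # Step 1: stack the sample columns under one another.
  long <- reshape(d, varying = sample_cols, v.names = "number",
                  timevar = "sample_id", times = sample_cols,
                  direction = "long")

  # Step 2: split "chr:start-stop" into three numeric columns.
  splt <- do.call(rbind, strsplit(as.character(long[[id_col]]), "[:-]"))
  mode(splt) <- "numeric"
  colnames(splt) <- c("chromosome", "start", "stop")

  # Step 3: combine, dropping the row names that reshape generated.
  data.frame(splt, long[c("sample_id", "number")], row.names = NULL)
}

# Small made-up input with shortened sample names.
d_toy <- data.frame(
  location = c("1:1-100000", "2:100001-200000"),
  `160095-T` = c(NA, 2L),
  `160096-N` = c(1L, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
res <- wide_to_long(d_toy)
```

A quick sanity check is that nrow(res) equals nrow(d_toy) multiplied by the number of sample columns.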