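I have a data frame called mydf. I want to split the contents of the ASM and GPM columns according to the format given in the FORMAT column and obtain result. Basically, ASM and GPM will each expand into more columns in result, one for every unique element (5 in total) among the colon-separated entries of the FORMAT column. The values then need to be placed in the correct columns (.GT, .FT, etc.) as indicated by the FORMAT column.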
mydf <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM = c("1/1:TRUE:4,2,333",
"./.:.:.", "0/1:.:VQLOW"), GPM = c("./.:.:.", "1/1:4:2,233",
"0/1:22:VQHIGH")), .Names = c("#CHROM", "POS", "FORMAT", "ASM",
"GPM"), class = "data.frame", row.names = c(NA, -3L))
Result:
result <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM.GT = c("1/1",
"./.", "0/1"), ASM.FT = c("TRUE", NA, "VQLOW"), ASM.GQ = c("4,2,333",
NA, NA), ASM.PS = c(NA, NA, NA), ASM.GL = c(NA, NA, NA), GPM.GT = c("./.",
"1/1", "0/1"), GPM.FT = c(NA, NA, "VQHIGH"), GPM.GQ = c(NA, NA,
NA), GPM.PS = c(NA, 4L, 22L), GPM.GL = c(NA, 2233L, NA)), .Names = c("#CHROM",
"POS", "FORMAT", "ASM.GT", "ASM.FT", "ASM.GQ", "ASM.PS", "ASM.GL",
"GPM.GT", "GPM.FT", "GPM.GQ", "GPM.PS", "GPM.GL"), class = "data.frame", row.names = c(NA,
-3L))
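Answer 0 (score: 2)
Since it looks like each of the columns to be split holds the same number of values per row, we can take advantage of dcast in "data.table" handling multiple value.var columns.
The splitting can be done with cSplit from my "splitstackshape" package.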
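library(data.table)       # attached explicitly so its dcast() (multiple value.var) is available
library(splitstackshape)  # provides cSplit()
# Split the three columns on ":" into long form, then cast wide by FORMAT key
dcast(cSplit(mydf, c("FORMAT", "ASM", "GPM"), ":", "long"),
      `#CHROM` + POS ~ FORMAT, value.var = c("ASM", "GPM"))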
#    #CHROM   POS ASM_FT ASM_GL  ASM_GQ ASM_GT ASM_PS GPM_FT GPM_GL GPM_GQ GPM_GT GPM_PS
# 1:      1 10490   TRUE     NA 4,2,333    1/1     NA      .     NA      .    ./.     NA
# 2:      1 10493     NA      .      NA    ./.      .     NA  2,233     NA    1/1      4
# 3:      1 10494  VQLOW     NA      NA    0/1      . VQHIGH     NA     NA    0/1     22
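For readers who want to see the logic without extra packages, here is a minimal base R sketch of the same reshaping. It is not the cSplit/dcast approach above, only an illustration of the idea; the helper name spread_sample is hypothetical, and the column names use "." as the separator to match the requested result.

```r
# Sample data from the question (check.names = FALSE keeps "#CHROM" intact)
mydf <- data.frame(
  `#CHROM` = c(1L, 1L, 1L),
  POS = c(10490L, 10493L, 10494L),
  FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"),
  ASM = c("1/1:TRUE:4,2,333", "./.:.:.", "0/1:.:VQLOW"),
  GPM = c("./.:.:.", "1/1:4:2,233", "0/1:22:VQHIGH"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Keys named in FORMAT, one character vector per row
keys <- strsplit(mydf$FORMAT, ":", fixed = TRUE)
# Distinct keys in order of first appearance
all_keys <- unique(unlist(keys))

# Hypothetical helper: spread one sample column into one column per key
spread_sample <- function(col) {
  vals <- strsplit(mydf[[col]], ":", fixed = TRUE)
  out <- matrix(NA_character_, nrow = nrow(mydf), ncol = length(all_keys),
                dimnames = list(NULL, paste(col, all_keys, sep = ".")))
  for (i in seq_len(nrow(mydf))) {
    # Put each value under the key found at the same position in FORMAT
    out[i, match(keys[[i]], all_keys)] <- vals[[i]]
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

wide <- cbind(mydf[c("#CHROM", "POS", "FORMAT")],
              spread_sample("ASM"), spread_sample("GPM"))
```

This sketch assumes every row has exactly as many values in ASM and GPM as there are keys in FORMAT, which is the same assumption the answer makes. The "." placeholders are left as they are.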
Note that "#CHROM" is a very unfriendly column name, because # is the comment character.
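As a small illustration of that point, the sketch below shows that the column can only be reached with backticks or a quoted string, and one possible way to rename it. The new name CHROM is an arbitrary choice, not something the question asks for.

```r
# check.names = FALSE keeps the original "#CHROM" name
df <- data.frame(`#CHROM` = c(1L, 1L), POS = c(10490L, 10493L),
                 check.names = FALSE)

# Backticks (or a quoted string) are required while the "#" is in the name
chrom_before <- df$`#CHROM`

# Rename to something that needs no quoting (the name is an arbitrary choice)
names(df)[names(df) == "#CHROM"] <- "CHROM"
chrom_after <- df$CHROM
```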
If you need to add the "FORMAT" column back, append [, FORMAT := mydf$FORMAT][] to the end of the dcast call above.
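Put together, the chained call might look like the sketch below. It assumes both packages are installed and that the rows of mydf are already sorted by #CHROM and POS, since dcast returns its rows ordered by those columns and the FORMAT values are assigned by position. The intermediate names long and wide are only for illustration.

```r
library(data.table)
library(splitstackshape)

mydf <- data.frame(
  `#CHROM` = c(1L, 1L, 1L),
  POS = c(10490L, 10493L, 10494L),
  FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"),
  ASM = c("1/1:TRUE:4,2,333", "./.:.:.", "0/1:.:VQLOW"),
  GPM = c("./.:.:.", "1/1:4:2,233", "0/1:22:VQHIGH"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Long form: one row per (record, FORMAT key)
long <- cSplit(mydf, c("FORMAT", "ASM", "GPM"), ":", "long")

# Wide form: one ASM_* and one GPM_* column per FORMAT key
wide <- dcast(long, `#CHROM` + POS ~ FORMAT, value.var = c("ASM", "GPM"))

# Add the original FORMAT strings back by position (relies on the row order
# assumption stated above); the trailing [] makes the result print
wide[, FORMAT := mydf$FORMAT][]
```

If the input is not sorted by the key columns, joining on #CHROM and POS instead of assigning by position would be the safer choice.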
I assume you can handle the remaining cleanup from here (for example, replacing . with NA and removing the thousands comma separator).
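One possible way to do that cleanup in base R is sketched below. The data frame wide is a small stand-in holding a few columns copied from the printed output above, and the choice of which columns to treat as numeric (num_cols) is an assumption; ASM_GQ is deliberately left alone because "4,2,333" stays a string in the requested result.

```r
# Stand-in for part of the reshaped output (values copied from the printout)
wide <- data.frame(
  ASM_GT = c("1/1", "./.", "0/1"),
  ASM_PS = c(NA, ".", "."),
  ASM_GQ = c("4,2,333", NA, NA),
  GPM_GL = c(NA, "2,233", NA),
  GPM_PS = c(NA, "4", "22"),
  stringsAsFactors = FALSE
)

# Replace cells that are exactly "." with NA; "./." is left untouched
wide[] <- lapply(wide, function(x) {
  x[x %in% "."] <- NA
  x
})

# Columns assumed to be numeric once the thousands separator is removed
num_cols <- c("GPM_GL", "GPM_PS")
wide[num_cols] <- lapply(wide[num_cols], function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
})
```

The same lapply pattern works on the full dcast result, provided num_cols is adjusted to the columns that really hold numbers.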