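I have a data frame named mydf that contains sample rows beginning with GS0000XXXX-ASM. Each sample has two parts: high confidence data and low confidence data. I want to separate the high and low confidence data for each sample row and get a result like the one shown below.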
mydf<-structure(list(assembly_id = c("GS000038075-ASM", "High confidence t(2:Y), t(5:7)",
NA, "Low confidence t(2:Y), t(5:7)", NA, NA, "GS000038040-ASM",
"High confidence t(1:17), t(2:6)", NA, "Low confidence t(1:17), t(2:6)",
NA, NA), sample_id = c("GS02589-DNA_E06", NA, NA, NA, NA, NA,
"GS02589-DNA_F01", NA, NA, NA, NA, NA), customer_sample_id = c("AMLM12001KP",
NA, NA, NA, NA, NA, "1114002", NA, NA, NA, NA, NA), `>Id` = c(NA,
"4264", NA, "217", "4264", "219", NA, "3329", "3764", "790",
"1586", "3329"), LeftChr = c(NA, "chr2", NA, "chr2", "chr2",
"chr2", NA, "chr1", "chr2", "chr1", "chr1", "chr1"), LeftPosition = c(NA,
"133017438", NA, "133012293", "133017438", "133018715", NA, "207868617",
"156528197", "91852788", "91852976", "207868617")), .Names = c("assembly_id",
"sample_id", "customer_sample_id", ">Id", "LeftChr", "LeftPosition"
), row.names = c(1L, 3L, 5L, 6L, 7L, 8L, 17L, 19L, 20L, 22L,
23L, 24L), class = "data.frame")
Result:
result <- structure(list(assembly_id = c("GS000038075-ASM", "High confidence t(2:Y), t(5:7)",
NA, "GS000038040-ASM", "High confidence t(1:17), t(2:6)", NA,
"GS000038075-ASM", "Low confidence t(2:Y), t(5:7)", NA, NA, "GS000038040-ASM",
"Low confidence t(1:17), t(2:6)", NA, NA), sample_id = c("GS02589-DNA_E06",
NA, NA, "GS02589-DNA_F01", NA, NA, "GS02589-DNA_E06", NA, NA,
NA, "GS02589-DNA_F01", NA, NA, NA), customer_sample_id = c("AMLM12001KP",
NA, NA, "1114002", NA, NA, "AMLM12001KP", NA, NA, NA, "1114002",
NA, NA, NA), `>Id` = c(NA, "4264", NA, NA, "3329", "3764", NA,
"217", "4264", "219", NA, "790", "1586", "3329"), LeftChr = c(NA,
"chr2", NA, NA, "chr1", "chr2", NA, "chr2", "chr2", "chr2", NA,
"chr1", "chr1", "chr1"), LeftPosition = c(NA, "133017438", NA,
NA, "207868617", "156528197", NA, "133012293", "133017438", "133018715",
NA, "91852788", "91852976", "207868617")), .Names = c("assembly_id",
"sample_id", "customer_sample_id", ">Id", "LeftChr", "LeftPosition"
), row.names = c("1", "3", "5", "17", "19", "20", "1.1", "6",
"7", "8", "17.1", "22", "23", "24"), class = "data.frame")
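Answer 0 (score: 1)
We first split the dataset into a list, grouping on the non-NA values in the 'sample_id' column, so that each non-NA value starts a new group.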
lst <- split(mydf, cumsum(!is.na(mydf$sample_id)))
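To see why this grouping works, here is a small sketch that uses only the 'sample_id' values from mydf. The name grp is introduced here for illustration and is not part of the answer.

```r
# The sample_id column of mydf: a non-NA value marks the start of a sample.
sample_id <- c("GS02589-DNA_E06", NA, NA, NA, NA, NA,
               "GS02589-DNA_F01", NA, NA, NA, NA, NA)

# !is.na() is TRUE only on the sample rows; cumsum() turns those TRUEs
# into a running counter, so every row inherits the id of the sample above it.
grp <- cumsum(!is.na(sample_id))
grp
# 1 1 1 1 1 1 2 2 2 2 2 2
```

split() then uses these ids as the grouping factor, which yields one data frame per sample.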
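Then we loop through the list and create another grouping variable from the non-NA values of 'assembly_id' (this could have been done in the first step, but it is kept separate for clarity). Within each list element we split the rows by that variable and rbind the element's first row onto each piece, flatten the resulting list of lists into a single list, and finally do.call(rbind, ...) all the list elements together.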
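The code for this second step did not survive in the text above, so what follows is only a minimal sketch of the described steps, not the answer's original code. The helper name separate_confidence is hypothetical. Using unlist(recursive = FALSE) to flatten the list of lists, and reordering the blocks so that the high confidence ones come first (as in the desired result), are both assumptions.

```r
# Hypothetical helper; a sketch of the steps described above.
separate_confidence <- function(df) {
  # One list element per sample: a new group starts at each non-NA sample_id.
  lst <- split(df, cumsum(!is.na(df$sample_id)))

  pieces <- lapply(lst, function(x) {
    header <- x[1, , drop = FALSE]   # the sample's own row (GS...-ASM)
    body   <- x[-1, , drop = FALSE]  # the confidence rows below it
    # A new block starts at each non-NA assembly_id
    # ("High confidence ..." or "Low confidence ...").
    blocks <- split(body, cumsum(!is.na(body$assembly_id)))
    # Put the sample's header row on top of every block.
    lapply(blocks, function(b) rbind(header, b))
  })

  # Flatten the list of lists into a single list of data frames.
  flat <- unlist(pieces, recursive = FALSE)

  # Assumption: high confidence blocks go first, as in the desired result.
  # Row 2 of each block holds its "High ..." or "Low ..." label.
  is_high <- vapply(flat, function(d) grepl("^High", d$assembly_id[2]),
                    logical(1))

  # unname() keeps rbind from prefixing row names with the list names.
  do.call(rbind, unname(c(flat[is_high], flat[!is_high])))
}
```

Calling separate_confidence(mydf) on the data frame from the question should then give the sample row followed by its high confidence rows for each sample, then the same for the low confidence rows.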