How to split a data frame into two data frames by matching keywords anywhere in a row

Time: 2015-12-13 04:00:12

Tags: regex r sorting

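I have a data frame named mydf. Each sample in it begins with a row whose assembly_id looks like GS0000XXXX-ASM, followed by two parts: the "High confidence" data and the "Low confidence" data. I want to separate the high- and low-confidence data of each sample and get a result like the one shown below.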

mydf<-structure(list(assembly_id = c("GS000038075-ASM", "High confidence t(2:Y), t(5:7)", 
NA, "Low confidence t(2:Y), t(5:7)", NA, NA, "GS000038040-ASM", 
"High confidence t(1:17), t(2:6)", NA, "Low confidence t(1:17), t(2:6)", 
NA, NA), sample_id = c("GS02589-DNA_E06", NA, NA, NA, NA, NA, 
"GS02589-DNA_F01", NA, NA, NA, NA, NA), customer_sample_id = c("AMLM12001KP", 
NA, NA, NA, NA, NA, "1114002", NA, NA, NA, NA, NA), `>Id` = c(NA, 
"4264", NA, "217", "4264", "219", NA, "3329", "3764", "790", 
"1586", "3329"), LeftChr = c(NA, "chr2", NA, "chr2", "chr2", 
"chr2", NA, "chr1", "chr2", "chr1", "chr1", "chr1"), LeftPosition = c(NA, 
"133017438", NA, "133012293", "133017438", "133018715", NA, "207868617", 
"156528197", "91852788", "91852976", "207868617")), .Names = c("assembly_id", 
"sample_id", "customer_sample_id", ">Id", "LeftChr", "LeftPosition"
), row.names = c(1L, 3L, 5L, 6L, 7L, 8L, 17L, 19L, 20L, 22L, 
23L, 24L), class = "data.frame")

Result

result <- structure(list(assembly_id = c("GS000038075-ASM", "High confidence t(2:Y), t(5:7)", 
NA, "GS000038040-ASM", "High confidence t(1:17), t(2:6)", NA, 
"GS000038075-ASM", "Low confidence t(2:Y), t(5:7)", NA, NA, "GS000038040-ASM", 
"Low confidence t(1:17), t(2:6)", NA, NA), sample_id = c("GS02589-DNA_E06", 
NA, NA, "GS02589-DNA_F01", NA, NA, "GS02589-DNA_E06", NA, NA, 
NA, "GS02589-DNA_F01", NA, NA, NA), customer_sample_id = c("AMLM12001KP", 
NA, NA, "1114002", NA, NA, "AMLM12001KP", NA, NA, NA, "1114002", 
NA, NA, NA), `>Id` = c(NA, "4264", NA, NA, "3329", "3764", NA, 
"217", "4264", "219", NA, "790", "1586", "3329"), LeftChr = c(NA, 
"chr2", NA, NA, "chr1", "chr2", NA, "chr2", "chr2", "chr2", NA, 
"chr1", "chr1", "chr1"), LeftPosition = c(NA, "133017438", NA, 
NA, "207868617", "156528197", NA, "133012293", "133017438", "133018715", 
NA, "91852788", "91852976", "207868617")), .Names = c("assembly_id", 
"sample_id", "customer_sample_id", ">Id", "LeftChr", "LeftPosition"
), row.names = c("1", "3", "5", "17", "19", "20", "1.1", "6", 
"7", "8", "17.1", "22", "23", "24"), class = "data.frame")

1 answer:

Answer 0 (score: 1)

We split the dataset into a list, grouping on the non-NA values in the 'sample_id' column.

lst <- split(mydf, cumsum(!is.na(mydf$sample_id)))
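To make the grouping trick concrete, here is a minimal sketch on a hypothetical toy column (the names `sample_id`, `toy`, `grp`, and `parts` below are illustrative only, not part of mydf). It assumes, as in the question, that a non-NA value marks the first row of each sample.

```r
# Hypothetical toy column: a non-NA value marks the first row of each sample.
sample_id <- c("S1", NA, NA, "S2", NA)

# !is.na() flags the header rows; cumsum() turns those flags into a running
# group number, so every row inherits the number of the header above it.
grp <- cumsum(!is.na(sample_id))
print(grp)

# split() then returns one list element per group number.
toy <- data.frame(sample_id = sample_id, value = 1:5)
parts <- split(toy, grp)
print(names(parts))
```

The same idea applies unchanged to mydf$sample_id, which is why lst ends up with one element per sample.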

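Then we loop through the list and create a second grouping variable from the non-NA values of 'assembly_id' (this could be done in the first step, but is kept separate for clarity). We split the rows of each list element by that variable and rbind the first row onto each piece, collapse the resulting list of lists with do.call(rbind, ...), and finally rbind all the list elements together.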
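One way the steps described above might be written is sketched below. This is an illustration under stated assumptions, not necessarily the answerer's exact code: the helper `split_sample` and the names `pieces`, `high`, and `low` are hypothetical, mydf is rebuilt compactly with default row names, and every confidence block is assumed to start with a row whose assembly_id begins with "High confidence" or "Low confidence".

```r
# Compact copy of the question's mydf (same values, default row names).
mydf <- data.frame(
  assembly_id = c("GS000038075-ASM", "High confidence t(2:Y), t(5:7)", NA,
                  "Low confidence t(2:Y), t(5:7)", NA, NA,
                  "GS000038040-ASM", "High confidence t(1:17), t(2:6)", NA,
                  "Low confidence t(1:17), t(2:6)", NA, NA),
  sample_id = c("GS02589-DNA_E06", rep(NA, 5), "GS02589-DNA_F01", rep(NA, 5)),
  customer_sample_id = c("AMLM12001KP", rep(NA, 5), "1114002", rep(NA, 5)),
  `>Id` = c(NA, "4264", NA, "217", "4264", "219",
            NA, "3329", "3764", "790", "1586", "3329"),
  LeftChr = c(NA, "chr2", NA, "chr2", "chr2", "chr2",
              NA, "chr1", "chr2", "chr1", "chr1", "chr1"),
  LeftPosition = c(NA, "133017438", NA, "133012293", "133017438", "133018715",
                   NA, "207868617", "156528197", "91852788", "91852976",
                   "207868617"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Step 1: one list element per sample (a new sample starts at each non-NA sample_id).
lst <- split(mydf, cumsum(!is.na(mydf$sample_id)))

# Step 2 (hypothetical helper): keep the sample's header row, cut the remaining
# rows into blocks that start at each non-NA assembly_id, and put the header
# back on top of every block.
split_sample <- function(x) {
  header <- x[1, ]
  body <- x[-1, ]
  blocks <- split(body, cumsum(!is.na(body$assembly_id)))
  # Label each block by the first word of its first assembly_id.
  first_label <- sapply(blocks, function(b) b$assembly_id[1])
  names(blocks) <- ifelse(grepl("^High confidence", first_label), "high", "low")
  lapply(blocks, function(b) rbind(header, b))
}
pieces <- lapply(lst, split_sample)

# Step 3: stack the matching blocks across samples.
high <- do.call(rbind, lapply(pieces, `[[`, "high"))
low <- do.call(rbind, lapply(pieces, `[[`, "low"))

# A single data frame in the order shown under "Result": high blocks, then low.
result <- rbind(high, low)
rownames(result) <- NULL
```

Here `high` and `low` are the two separate data frames, and `result` stacks them in the order of the expected output. Row names are reset at the end, so they will not match the "1.1" and "17.1" labels in the question's result.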
