数据操作和R循环的性能

时间:2015-09-04 14:42:01

标签: r performance statistics data-manipulation

我有一个由896 SNP in a genome行组成的数据集。然后我有来自特定创始人的每个SNP的概率值。创始人分为8组。

我想检查每个SNP概率,如果最大值有一个平局,那么检查每组8位创始人的最高值,看看是否重复。我只对双向关系感兴趣。然后我想得到平局的价值,它出现的SNP以及创始人拥有它。

问题在于我以一种方式向我提供了数据,即所有创建者组都在列中重复排列,其布局如下:

SNP P1_1 P1_2 P1_3 P1_4 P1_5 P1_6 P1_7 P1_8 ... Pn_1 Pn_2 Pn_3 Pn_4 Pn_5 Pn_6 Pn7_ Pn8
Asd 
FDA
FDG
GDE
GBD
SDF

我已经通过使用循环来解决这个问题,但总的来说,我需要花费17个小时(!!!)来完成它,896个SNP和600个以上的8个创建者组。

有没有提高效率?我在考虑像melt()这样的东西,并将最终数据集分成8列组?这是否可能,即便如此,它会使它更有效率?

到目前为止,我已经解决了这个问题:

require(reshape2)
A4 <- read.csv("sample.csv", header=TRUE)


SNP_count <-length(A4[,1])

A5 <- melt(A4)

#Give single index from 1-8 to each FOUNDER
A5[,4]<-rep(rep(seq(1,8,by=1),each=SNP_count),times=length(A5[,1])/SNP_count/8)

#Give a groupind number to each row so we know in which FOUNDER group it belongs
A5[,5]<-rep(seq(1,length(A5[,1])/SNP_count/8,by=1),each=SNP_count*8)

A5[,6]<-rep(seq(1,SNP_count))

colnames(A5) <- c("SNP","FOUNDER","PROB","FOUNDER_ID","GROUP_ID", "SNP_ID")

maxes<-NULL
x=1
i=1

for (x in 1:length(unique(A5$GROUP_ID))) { 

 for (i in 1:length(unique(A5$SNP_ID))) {
   e<-NULL

    e <- A5$PROB[A5$GROUP_ID==x & A5$SNP_ID==i] #save the rows of values here for each founder group

     if(length(which(e == max(e))) == 2){
       len <- length(which(e == max(e)))
       max1 <- max(e)
       founder <- as.character(A5[,2][A5$GROUP_ID==x & A5$SNP_ID==i])[1]
       SNP_INFO <- as.character(A5[,1][A5$GROUP_ID==x & A5$SNP_ID==i][1])
       ties <- length(which(e == max(e)))
       tie1 <- which(e==max(e))[1]
       tie2 <- which(e==max(e))[2]
       maxes <- rbind(maxes, c(e, SNP_INFO, ties, tie1, tie2, max1, founder))
       print(e)

}
 }
  }

从测试来看,似乎让它变得非常慢的那条线就是:

e<-A5$PROB[A5$GROUP_ID==x & A5$SNP_ID==i]

任何人都可以得到任何帮助。

下面给出的输出结果

structure(list(SNP = structure(c(571L, 690L, 7L, 204L, 317L, 
39L, 559L, 657L, 221L, 470L, 576L, 49L, 413L, 460L, 606L, 348L, 
111L, 129L, 30L, 290L, 291L, 767L, 96L, 172L, 38L, 48L, 54L, 
144L, 273L, 644L), .Label = c("BobWhite_c11298_512", "BobWhite_c11935_137", 
"BobWhite_c12428_371", "BobWhite_c1740_97", "BobWhite_c17852_511", 
"BobWhite_c17879_519", "BobWhite_c18256_105", "BobWhite_c18408_199", 
"BobWhite_c18593_955", "BobWhite_c1895_1953", "BobWhite_c21397_148", 
"BobWhite_c28950_147", "BobWhite_c29419_116", "BobWhite_c30009_285", 
"BobWhite_c30232_154", "BobWhite_c31500_144", "BobWhite_c34548_96", 
"BobWhite_c34866_232", "BobWhite_c35093_176", "BobWhite_c35303_192", 
"BobWhite_c42102_237", "BobWhite_c43213_184", "BobWhite_c43681_334", 
"BobWhite_c45534_535", "BobWhite_c46361_331", "BobWhite_c47722_613", 
"BobWhite_c5337_225", "BobWhite_c8340_511", "BobWhite_c9704_273", 
"BobWhite_c9961_402", "BobWhite_c9992_862a", "BobWhite_rep_c49102_169", 
"BobWhite_rep_c55551_53", "BobWhite_s65081_93", "BobWhite_s67516_159", 
"BS00000445_51", "BS00001478_51", "BS00003733_51", "BS00003932_51", 
"BS00003935_51", "BS00003956_51", "BS00003964_51", "BS00003971_51", 
"BS00006822_51", "BS00007502_51", "BS00009649_51", "BS00009657_51", 
"BS00009793_51", "BS00010059_51", "BS00010204_51", "BS00010531_51", 
"BS00010854_51", "BS00011231_51", "BS00011516_51", "BS00011612_51", 
"BS00012551_51", "BS00013584_51", "BS00017988_51", "BS00020459_51", 
"BS00021694_51", "BS00021871_51", "BS00021909_51", "BS00021981_51", 
"BS00022016_51", "BS00022029_51", "BS00022148_51", "BS00022159_51", 
"BS00022182_51", "BS00022256_51", "BS00022273_51", "BS00022368_51", 
"BS00022419_51", "BS00022424_51", "BS00022459_51", "BS00022516_51", 
"BS00022528_51", "BS00022586_51", "BS00022703_51", "BS00022746_51", 
"BS00022804_51", "BS00022845_51", "BS00022882_51", "BS00022884_51", 
"BS00022908_51", "BS00022968_51", "BS00022989_51", "BS00023026_51", 
"BS00023028_51", "BS00023189_51", "BS00023222_51", "BS00023337_51", 
"BS00023511_51", "BS00025191_51", "BS00026189_51", "BS00026396_51", 
"BS00027516_51", "BS00028660_51", "BS00029361_51", "BS00029569_51", 
"BS00030652_51", "BS00030876_51", "BS00032025_51", "BS00032524_51", 
"BS00032525_51", "BS00033658_51", "BS00034334_51", "BS00035364_51", 
"BS00035576_51", "BS00036168_51", "BS00036492_51", "BS00037189_51", 
"BS00037400_51", "BS00037537_51", "BS00038663_51", "BS00038664_51", 
"BS00039357_51", "BS00039498_51", "BS00040798_51", "BS00041121_51", 
"BS00041462_51", "BS00041742_51", "BS00042191_51", "BS00045327_51", 
"BS00047668_51", "BS00047836_51", "BS00048031_51", "BS00048633_51", 
"BS00048757_51", "BS00049420_51", "BS00049636_51", "BS00049637_51", 
"BS00049977_51", "BS00055211_51", "BS00056089_51", "BS00057444_51", 
"BS00057445_51", "BS00057523_51", "BS00057524_51", "BS00060029_51", 
"BS00060391_51", "BS00061173_51", "BS00061179_51", "BS00062645_51", 
"BS00062673_51", "BS00062808_51", "BS00063163_51", "BS00063300_51", 
"BS00063696_51", "BS00064039_51", "BS00064422_51", "BS00064487_51", 
"BS00064692_51", "BS00064703_51", "BS00065468_51", "BS00065673_51", 
"BS00065734_51", "BS00065840_51", "BS00065932_51", "BS00065956_51", 
"BS00066230_51", "BS00066475_51", "BS00066585_51", "BS00066714_51", 
"BS00066800_51", "BS00067190_51", "BS00067215_51", "BS00067216_51", 
"BS00067224_51", "BS00067499_51", "BS00067711_51", "BS00067748_51", 
"BS00067777_51", "BS00067940_51", "BS00067997_51", "BS00068094_51", 
"BS00068508_51", "BS00068520_51", "BS00069355_51", "BS00070464_51", 
"BS00070511_51", "BS00070870_51", "BS00071422_51", "BS00071823_51", 
"BS00071948_51", "BS00072186_51", "BS00072296_51", "BS00073009_51", 
"BS00073525_51", "BS00073854_51", "BS00074617_51", "BS00074926_51", 
"BS00075119_51", "BS00075515_51", "BS00075598_51", "BS00076772_51", 
"BS00077819_51", "BS00078430_51", "BS00078669_51", "BS00080546_51", 
"BS00080879_51", "BS00081418_51", "BS00081475_51", "BS00081610_51", 
"BS00081981_51", "BS00082982_51", "BS00083279_51", "BS00083398_51", 
"BS00084158_51", "BS00084159_51", "BS00085967_51", "BS00086050_51", 
"BS00086051_51", "BS00086247_51", "BS00088755_51", "BS00088756_51", 
"BS00089400_51", "BS00090225_51", "BS00090405_51", "BS00091002_51", 
"BS00091887_51", "BS00092728_51", "BS00092735_51", "BS00093252_51", 
"BS00093377_51", "BS00093871_51", "BS00093889_51", "BS00094366_51", 
"BS00094523_51", "BS00095228_51", "BS00095423_51", "BS00096642_51", 
"BS00097265_51", "BS00098840_51", "BS00099980_51", "BS00100626_51", 
"BS00101363_51", "BS00101401_51", "BS00104401_51", "BS00105963_51", 
"BS00105964_51", "BS00106008_51", "BS00106932_51", "BS00107444_51", 
"BS00109084_51", "BS00109700_51", "BS00109935_51", "BS00110266_51", 
"BS00110405_51", "BS00110452_51", "BS00110550_51", "BS00110564_51", 
"BS00110638_51a", "CAP11_c1022_117", "CAP11_c1022_66", "CAP11_c3472_60", 
"CAP11_c6193_232", "CAP11_rep_c6920_161", "CAP12_c1840_108", 
"CAP12_c1860_280", "CAP7_c1238_51", "CAP7_c2141_129", "CAP7_c3178_52", 
"CAP7_c3367_68", "CAP7_c915_121", "CAP7_rep_c12537_81", "CAP8_c1393_327", 
"CAP8_c359_95", "CAP8_c665_242", "CAP8_c665_409", "CAP8_rep_c3652_80", 
"CAP8_rep_c6382_237", "D_contig22919_290", "D_F5XZDLF02GN47Z_208", 
"D_GBB4FNX02FX302_98", "Ex_c104581_457", "Ex_c15087_564", "Ex_c17586_108", 
"Ex_c18484_2026", "Ex_c18484_2048", "Ex_c24554_1583", "Ex_c45438_377", 
"Ex_c5858_1681", "Ex_c5858_953", "Ex_c66357_866", "Ex_c67794_487", 
"Ex_c8871_1318", "Ex_c9685_1264", "Excalibur_c10079_1585", "Excalibur_c10383_432", 
"Excalibur_c11079_749", "Excalibur_c11079_923", "Excalibur_c11505_155", 
"Excalibur_c12875_864", "Excalibur_c13242_1178", "Excalibur_c13709_2568", 
"Excalibur_c1529_1644", "Excalibur_c18089_1116", "Excalibur_c20250_313", 
"Excalibur_c21051_515", "Excalibur_c24123_1597", "Excalibur_c24123_165", 
"Excalibur_c2419_531", "Excalibur_c24354_465", "Excalibur_c24829_189", 
"Excalibur_c25239_283", "Excalibur_c26754_433", "Excalibur_c29205_537", 
"Excalibur_c29600_173", "Excalibur_c31571_136", "Excalibur_c33545_134", 
"Excalibur_c39248_485", "Excalibur_c39309_171", "Excalibur_c39508_88", 
"Excalibur_c39808_453", "Excalibur_c40694_473", "Excalibur_c41477_1272", 
"Excalibur_c42951_136", "Excalibur_c42978_149", "Excalibur_c46082_440", 
"Excalibur_c47078_512", "Excalibur_c47907_517", "Excalibur_c49743_97", 
"Excalibur_c5097_1468", "Excalibur_c52446_519", "Excalibur_c5561_1013", 
"Excalibur_c56240_176", "Excalibur_c60452_196", "Excalibur_c60581_62", 
"Excalibur_c62042_175", "Excalibur_c64302_103", "Excalibur_c6501_477", 
"Excalibur_c9000_209", "Excalibur_c91154_164", "Excalibur_c91154_175", 
"Excalibur_c96921_206", "Excalibur_c9811_131", "Excalibur_c98205_83", 
"Excalibur_rep_c103091_266", "Excalibur_rep_c103408_632a", "Excalibur_rep_c104884_1325", 
"Excalibur_rep_c104884_417", "Excalibur_rep_c105085_102", "Excalibur_rep_c105978_544", 
"Excalibur_rep_c110501_525", "Excalibur_rep_c112985_337", "Excalibur_rep_c116073_259", 
"Excalibur_rep_c67269_496", "Excalibur_rep_c69004_1008", "Excalibur_s113043_59", 
"GENE_1464_73", "GENE_1549_110", "GENE_1606_307", "GENE_1634_405", 
"GENE_1661_228", "GENE_1674_631", "GENE_1820_661", "GENE_1826_718", 
"GENE_1838_79", "GENE_2052_186", "GENE_2087_300", "GENE_3343_183", 
"GENE_3665_61", "GENE_3939_576", "GENE_3939_653", "GENE_4252_246", 
"GENE_4795_75", "IAAV1155", "IAAV1334", "IAAV1410", "IAAV1523", 
"IAAV2646", "IAAV3826", "IAAV3900", "IAAV4190", "IAAV4286", "IAAV4781", 
"IAAV488", "IAAV5030", "IAAV5207", "IAAV5370", "IAAV5507", "IAAV5729", 
"IAAV5821", "IAAV5984", "IAAV6317", "IAAV6474", "IAAV6676", "IAAV6974", 
"IAAV7454", "IAAV7930", "IAAV8170", "IAAV8218", "IAAV8351", "IAAV8407", 
"IAAV8768", "IAAV8924", "IAAV902", "IAAV9044", "IAAV9068", "IACX11403", 
"IACX2728", "IACX2831", "IACX464", "IACX5909", "IACX5968", "IACX6054", 
"IACX6065", "IACX6092", "IACX6277", "IACX8519", "Jagger_c367_427", 
"Jagger_c5842_90", "Jagger_c6722_104", "Jagger_c765_61", "Jagger_c791_62", 
"JD_c1187_1398", "JG_c3077_225", "Ku_c103671_362", "Ku_c12191_1202", 
"Ku_c13307_1116", "Ku_c14313_1194", "Ku_c17569_905", "Ku_c18096_552", 
"Ku_c19456_645", "Ku_c26872_269", "Ku_c28597_514", "Ku_c41007_116", 
"Ku_c4231_940", "Ku_c56370_1155", "Ku_c64839_240", "Ku_c68484_1276", 
"Ku_c69970_624", "Ku_c69970_631", "Ku_c70534_1215", "Ku_c73010_143", 
"Kukri_c10210_1387", "Kukri_c10210_1700", "Kukri_c10511_94", 
"Kukri_c10751_1031", "Kukri_c10751_158", "Kukri_c10751_264", 
"Kukri_c10751_860", "Kukri_c10977_990", "Kukri_c11709_874", "Kukri_c12079_204", 
"Kukri_c12212_182", "Kukri_c13793_1152", "Kukri_c14012_358", 
"Kukri_c14029_117", "Kukri_c14971_367", "Kukri_c15950_244", "Kukri_c17313_667", 
"Kukri_c17966_561", "Kukri_c20889_526", "Kukri_c2123_1254", "Kukri_c23388_695", 
"Kukri_c23833_181", "Kukri_c2454_59", "Kukri_c25564_185", "Kukri_c28650_111", 
"Kukri_c28917_96", "Kukri_c30370_79", "Kukri_c31546_66", "Kukri_c33640_640", 
"Kukri_c34195_357", "Kukri_c34845_242", "Kukri_c35426_507", "Kukri_c36418_392", 
"Kukri_c3673_2442", "Kukri_c41361_186", "Kukri_c4324_74", "Kukri_c4568_1708", 
"Kukri_c46833_105", "Kukri_c47643_920", "Kukri_c49927_151", "Kukri_c51247_322", 
"Kukri_c51666_401", "Kukri_c5459_334", "Kukri_c54593_543", "Kukri_c54729_181", 
"Kukri_c55381_67", "Kukri_c62467_362", "Kukri_c6288_364", "Kukri_c63259_151", 
"Kukri_c63361_97", "Kukri_c64268_101", "Kukri_c68006_282", "Kukri_c80104_809", 
"Kukri_c82097_197", "Kukri_c82117_279", "Kukri_c8242_730", "Kukri_c8400_2315", 
"Kukri_c8465_54", "Kukri_c97654_90", "Kukri_rep_c102953_304", 
"Kukri_rep_c103783_1380", "Kukri_rep_c105819_114", "Kukri_rep_c107195_111", 
"Kukri_rep_c109051_116", "Kukri_rep_c110312_376", "Kukri_rep_c112061_617", 
"Kukri_rep_c114028_94", "Kukri_rep_c69028_139", "Kukri_rep_c69028_1398", 
"Kukri_rep_c69028_347", "Kukri_rep_c69970_717", "Kukri_rep_c70441_132", 
"Kukri_rep_c70479_411", "Kukri_rep_c71916_1548", "Kukri_rep_c73936_154", 
"Kukri_rep_c86903_184", "Kukri_rep_c87640_135", "Kukri_rep_c89183_282", 
"Kukri_rep_c89509_83", "Kukri_s110037_62", "Kukri_s117946_404", 
"Ra_c11263_2353", "Ra_c14565_1056", "Ra_c1619_250", "Ra_c1619_432", 
"Ra_c20211_1415", "Ra_c23771_496", "Ra_c38505_544", "Ra_c38505_555", 
"Ra_c3994_598", "Ra_c42858_503", "Ra_c42858_91", "Ra_c48730_581", 
"Ra_c5515_1723", "Ra_c5515_2396", "Ra_c5515_2469", "Ra_c70087_485", 
"Ra_c72106_227", "Ra_c73278_1234", "Ra_c88203_1055", "Ra_c88203_455", 
"Ra_c88203_468", "RAC875_c10194_673", "RAC875_c10194_910", "RAC875_c10286_1247", 
"RAC875_c10430_672", "RAC875_c14684_1128", "RAC875_c15109_106", 
"RAC875_c1619_120", "RAC875_c1666_126", "RAC875_c17907_181", 
"RAC875_c18570_893", "RAC875_c19138_402", "RAC875_c19570_671", 
"RAC875_c20134_535", "RAC875_c22333_504", "RAC875_c23470_220", 
"RAC875_c2464_274", "RAC875_c24982_202", "RAC875_c26514_127", 
"RAC875_c27323_867", "RAC875_c2786_1097", "RAC875_c30443_966", 
"RAC875_c3141_214", "RAC875_c31572_251", "RAC875_c31572_281", 
"RAC875_c3427_734", "RAC875_c34649_380", "RAC875_c35074_452", 
"RAC875_c36922_829", "RAC875_c371_251", "RAC875_c37118_595", 
"RAC875_c39728_187", "RAC875_c45016_79", "RAC875_c46323_1996", 
"RAC875_c46403_277", "RAC875_c47550_437", "RAC875_c47976_291", 
"RAC875_c50787_113", "RAC875_c50864_1921", "RAC875_c51781_238", 
"RAC875_c52195_324", "RAC875_c5310_1729", "RAC875_c53864_310", 
"RAC875_c55313_89", "RAC875_c58159_521", "RAC875_c6445_275", 
"RAC875_c66845_466", "RAC875_c66892_58", "RAC875_c67309_634", 
"RAC875_c67998_96", "RAC875_c68056_81", "RAC875_c75663_115", 
"RAC875_c787_431", "RAC875_c79551_167", "RAC875_c8010_155", "RAC875_c8174_1034", 
"RAC875_c82493_67", "RAC875_c842_1476", "RAC875_c842_1516", "RAC875_c99055_69", 
"RAC875_rep_c105428_270", "RAC875_rep_c108615_475", "RAC875_rep_c109228_400", 
"RAC875_rep_c109288_122", "RAC875_rep_c109433_782", "RAC875_rep_c111243_125", 
"RAC875_rep_c111698_383", "RAC875_rep_c111952_436", "RAC875_rep_c117959_132", 
"RAC875_rep_c69730_522", "RAC875_rep_c70746_582", "RAC875_rep_c75557_177", 
"RAC875_rep_c75768_245", "RAC875_rep_c77065_1532", "RAC875_rep_c77148_311", 
"RAC875_rep_c81701_424", "RAC875_rep_c90531_283", "RAC875_s120038_124", 
"RFL_Contig102_119", "RFL_Contig1175_354", "RFL_Contig1435_886", 
"RFL_Contig1488_671", "RFL_Contig1798_1606", "RFL_Contig2394_439", 
"RFL_Contig4282_1420", "RFL_Contig4399_956", "RFL_Contig4403_1034", 
"RFL_Contig4921_2420", "RFL_Contig497_1114", "RFL_Contig5153_2667", 
"RFL_Contig5153_958", "RFL_Contig5871_1650", "TA001038_0975", 
"TA001383_0516", "TA002103_1262", "TA002702_0616", "TA003179_0872", 
"TA003281_2379", "TA003550_0145", "TA003589_0518", "TA005161_0899", 
"TA005199_0585", "TA005216_0272", "TA015264_0958", "Tdurum_contig10307_375", 
"Tdurum_contig11121_739", "Tdurum_contig11121_834", "Tdurum_contig12008_842", 
"Tdurum_contig13011_381", "Tdurum_contig13548_158", "Tdurum_contig25520_363", 
"Tdurum_contig27982_568", "Tdurum_contig28699_54", "Tdurum_contig31235_99", 
"Tdurum_contig33100_127", "Tdurum_contig42495_389", "Tdurum_contig45726_1116", 
"Tdurum_contig5009_349", "Tdurum_contig5009_392", "Tdurum_contig5009_735", 
"Tdurum_contig50376_375", "Tdurum_contig50392_1355", "Tdurum_contig50577_620", 
"Tdurum_contig50617_713", "Tdurum_contig51995_377", "Tdurum_contig51995_474", 
"Tdurum_contig55443_1361", "Tdurum_contig56373_348", "Tdurum_contig56731_335", 
"Tdurum_contig57139_318", "Tdurum_contig57693_581", "Tdurum_contig57753_138", 
"Tdurum_contig58326_467", "Tdurum_contig59531_892", "Tdurum_contig59531_914", 
"Tdurum_contig59585_576", "Tdurum_contig59585_656", "Tdurum_contig59782_86", 
"Tdurum_contig61299_55", "Tdurum_contig62502_90", "Tdurum_contig62557_263", 
"Tdurum_contig63129_238", "Tdurum_contig67686_1149", "Tdurum_contig67686_1204", 
"Tdurum_contig67686_633", "Tdurum_contig67686_792", "Tdurum_contig67686_851", 
"Tdurum_contig68305_703", "Tdurum_contig68305_796", "Tdurum_contig68855_91", 
"Tdurum_contig697_73", "Tdurum_contig76105_124", "Tdurum_contig76105_201", 
"Tdurum_contig83209_316", "Tdurum_contig83663_371", "Tdurum_contig91865_242", 
"Tdurum_contig93570_848", "Tdurum_contig9912_216", "Tdurum_contig9912_228", 
"Tdurum_contig9912_451", "tplb0050h15_1287", "tplb0055d10_1024", 
"wsnp_BE426222A_Ta_2_1", "wsnp_BE443568A_Ta_2_1", "wsnp_BE443568A_Ta_2_2", 
"wsnp_BE443995B_Ta_2_2", "wsnp_BE490613A_Ta_2_1", "wsnp_BE494474A_Ta_2_2", 
"wsnp_BE604885A_Ta_2_1", "wsnp_BE604885A_Ta_2_2", "wsnp_BF293133A_Ta_2_2", 
"wsnp_BF429272A_Ta_2_1", "wsnp_BG262734A_Ta_2_3", "wsnp_BG262734A_Ta_2_7", 
"wsnp_BM137927A_Ta_2_1", "wsnp_BQ171931A_Ta_2_1", "wsnp_CAP11_c2438_1258747", 
"wsnp_CAP11_c318_261649", "wsnp_CAP11_rep_c4157_1965583", "wsnp_CAP11_rep_c4226_1995152", 
"wsnp_CAP12_c15_9559", "wsnp_CAP12_rep_c8867_3720285", "wsnp_CAP8_c6939_3242530", 
"wsnp_Ex_c10014_16477392", "wsnp_Ex_c10272_16842803", "wsnp_Ex_c10630_17338753", 
"wsnp_Ex_c10667_17387885", "wsnp_Ex_c11039_17902115", "wsnp_Ex_c11085_17973016", 
"wsnp_Ex_c11229_18163892", "wsnp_Ex_c11397_18400400", "wsnp_Ex_c1141_2191485", 
"wsnp_Ex_c1149_2206471", "wsnp_Ex_c11807_18960045", "wsnp_Ex_c12269_19597341", 
"wsnp_Ex_c12269_19597415", "wsnp_Ex_c12341_19693570", "wsnp_Ex_c12354_19711297", 
"wsnp_Ex_c12750_20243224", "wsnp_Ex_c12948_20511479", "wsnp_Ex_c1335_2556442", 
"wsnp_Ex_c13802_21639096", "wsnp_Ex_c14202_22144844", "wsnp_Ex_c14202_22145136", 
"wsnp_Ex_c14202_22145805", "wsnp_Ex_c14340_22315611", "wsnp_Ex_c14400_22381382", 
"wsnp_Ex_c14400_22381548", "wsnp_Ex_c14420_22402673", "wsnp_Ex_c15100_23284023", 
"wsnp_Ex_c15269_23491104", "wsnp_Ex_c15269_23492289", "wsnp_Ex_c1533_2930233", 
"wsnp_Ex_c1538_2937905", "wsnp_Ex_c15475_23756906", "wsnp_Ex_c15475_23757972", 
"wsnp_Ex_c15674_24004513", "wsnp_Ex_c15674_24005648", "wsnp_Ex_c16079_24507688", 
"wsnp_Ex_c1660_3159173", "wsnp_Ex_c16615_25147492", "wsnp_Ex_c18596_27457344", 
"wsnp_Ex_c18637_27508578", "wsnp_Ex_c20273_29326769", "wsnp_Ex_c2148_4035913", 
"wsnp_Ex_c22435_31629303", "wsnp_Ex_c22888_32105519", "wsnp_Ex_c24085_33332723", 
"wsnp_Ex_c24731_33983680", "wsnp_Ex_c2502_4675968", "wsnp_Ex_c25132_34396655", 
"wsnp_Ex_c26887_36107413", "wsnp_Ex_c27150_36365659", "wsnp_Ex_c28679_37784954", 
"wsnp_Ex_c28930_38008757", "wsnp_Ex_c29742_38738725", "wsnp_Ex_c32003_40728918", 
"wsnp_Ex_c3478_6369892", "wsnp_Ex_c35073_43285821", "wsnp_Ex_c35457_43602830", 
"wsnp_Ex_c361_707953", "wsnp_Ex_c361_708712", "wsnp_Ex_c37208_45002588", 
"wsnp_Ex_c41283_48140956", "wsnp_Ex_c41283_48141201", "wsnp_Ex_c47763_52874806", 
"wsnp_Ex_c47907_52974924", "wsnp_Ex_c48136_53140385", "wsnp_Ex_c5047_8963671", 
"wsnp_Ex_c51776_55603135", "wsnp_Ex_c53364_56625806", "wsnp_Ex_c55051_57706127", 
"wsnp_Ex_c5623_9891427", "wsnp_Ex_c5623_9891516", "wsnp_Ex_c5623_9891584", 
"wsnp_Ex_c56525_58609595", "wsnp_Ex_c57322_59083238", "wsnp_Ex_c57322_59084809", 
"wsnp_Ex_c57322_59084950", "wsnp_Ex_c5929_10402147", "wsnp_Ex_c5997_10512308", 
"wsnp_Ex_c60462_60905848", "wsnp_Ex_c6217_10848574", "wsnp_Ex_c6833_11782875", 
"wsnp_Ex_c742_1458743", "wsnp_Ex_c763_1503467", "wsnp_Ex_c8409_14170476", 
"wsnp_Ex_c88767_80001420", "wsnp_Ex_c8884_14841846", "wsnp_Ex_c9145_15214903", 
"wsnp_Ex_c943_1808232", "wsnp_Ex_c943_1808577", "wsnp_Ex_c9458_15679797", 
"wsnp_Ex_c9468_15696542", "wsnp_Ex_c9483_15722127", "wsnp_Ex_c9510_15761235", 
"wsnp_Ex_rep_c101340_86719115", "wsnp_Ex_rep_c101340_86719239", 
"wsnp_Ex_rep_c101942_87217430", "wsnp_Ex_rep_c102478_87635370", 
"wsnp_Ex_rep_c106152_90334299", "wsnp_Ex_rep_c108072_91444417", 
"wsnp_Ex_rep_c66357_64540369", "wsnp_Ex_rep_c66357_64540428", 
"wsnp_Ex_rep_c66685_65003087", "wsnp_Ex_rep_c66685_65003254", 
"wsnp_Ex_rep_c66685_65003625", "wsnp_Ex_rep_c66867_65267909", 
"wsnp_Ex_rep_c66907_65324299", "wsnp_Ex_rep_c67349_65914945", 
"wsnp_Ex_rep_c67460_66057400", "wsnp_Ex_rep_c67588_66227926", 
"wsnp_Ex_rep_c67635_66291944", "wsnp_Ex_rep_c67635_66292308", 
"wsnp_Ex_rep_c67635_66292689", "wsnp_Ex_rep_c67727_66398596", 
"wsnp_Ex_rep_c67786_66472568", "wsnp_Ex_rep_c69034_67934852", 
"wsnp_Ex_rep_c69034_67935465", "wsnp_Ex_rep_c69314_68244036", 
"wsnp_Ex_rep_c69314_68244502", "wsnp_Ex_rep_c69577_68526990", 
"wsnp_Ex_rep_c69752_68711460", "wsnp_Ex_rep_c69864_68823765", 
"wsnp_Ex_rep_c69864_68824236", "wsnp_Ex_rep_c69864_68824319", 
"wsnp_JD_c1187_1731186", "wsnp_JD_c29019_23208078", "wsnp_JD_c3034_4017676", 
"wsnp_JD_c5699_6859527", "wsnp_JD_c7532_8615717", "wsnp_JD_c8207_9234643", 
"wsnp_JD_c9434_10274598a", "wsnp_JD_c968_1427139", "wsnp_Ku_c10468_17301042", 
"wsnp_Ku_c11052_18135847", "wsnp_Ku_c18497_27803432", "wsnp_Ku_c19456_28944589", 
"wsnp_Ku_c23901_33846711", "wsnp_Ku_c30545_40369365", "wsnp_Ku_c32404_42016343", 
"wsnp_Ku_c3286_6111360", "wsnp_Ku_c38911_47455674", "wsnp_Ku_c38911_47455924", 
"wsnp_Ku_c40218_48484410", "wsnp_Ku_c4568_8243646", "wsnp_Ku_c4568_8243775", 
"wsnp_Ku_c4886_8753646", "wsnp_Ku_c5243_9344536", "wsnp_Ku_c5359_9531713", 
"wsnp_Ku_c5378_9559013", "wsnp_Ku_c7811_13387117", "wsnp_Ku_c8400_14280021", 
"wsnp_Ku_c9433_15811664", "wsnp_Ku_rep_c104691_91086871", "wsnp_Ku_rep_c69970_69476502", 
"wsnp_Ku_rep_c70479_70079622", "wsnp_Ku_rep_c71761_71496470", 
"wsnp_Ra_c10669_17515792", "wsnp_Ra_c11291_18338838", "wsnp_Ra_c132_291198", 
"wsnp_Ra_c16053_24607526", "wsnp_Ra_c16278_24893033", "wsnp_Ra_c16846_25598885", 
"wsnp_Ra_c19079_28210829", "wsnp_Ra_c19079_28210937", "wsnp_Ra_c27831_37346894", 
"wsnp_Ra_c29280_38672141", "wsnp_Ra_c35889_44345459", "wsnp_Ra_c44141_50623811", 
"wsnp_Ra_c4858_8709000", "wsnp_Ra_c7280_12576178", "wsnp_Ra_rep_c106523_90273922", 
"wsnp_Ra_rep_c72670_70836439a", "wsnp_RFL_Contig2011_1216801", 
"wsnp_RFL_Contig2767_2518373", "wsnp_RFL_Contig3866_4228783", 
"wsnp_RFL_Contig429_4978628", "wsnp_RFL_Contig4734_5671036", 
"wsnp_RFL_Contig4814_5829093"), class = "factor"), MEL_004.3_haplotype1 = c(0.00033, 
1e-05, 0, 0, 0, 0, 0, 0, 0.00013, 0.00013, 0.00014, 0.00022, 
0.00022, 0.00022, 0.00022, 0.00027, 0.00027, 0.00023, 0.00023, 
0.00023, 0.00022, 0.00022, 0.00015, 0.00014, 3e-05, 3e-05, 2e-05, 
1e-05, 1e-05, 0), MEL_004.3_haplotype2 = c(0.00033, 1e-05, 0, 
0, 0, 0, 0, 0, 0.00013, 0.00013, 0.00014, 0.00022, 0.00022, 0.00022, 
0.00022, 0.00027, 0.00027, 0.00023, 0.00023, 0.00023, 0.00022, 
0.00022, 0.00015, 0.00014, 3e-05, 3e-05, 2e-05, 1e-05, 1e-05, 
0), MEL_004.3_haplotype3 = c(0.00033, 1e-05, 0, 0, 0, 0, 0, 0, 
0.00013, 0.00013, 0.00014, 0.00022, 0.00022, 0.00022, 0.00022, 
0.00027, 0.00027, 0.00023, 0.00023, 0.00023, 0.00022, 0.00022, 
0.00015, 0.00014, 3e-05, 3e-05, 2e-05, 1e-05, 1e-05, 0), MEL_004.3_haplotype4 = c(0.00033, 
1e-05, 0, 0, 0, 4e-05, 4e-05, 4e-05, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00021), MEL_004.3_haplotype5 = c(0.00033, 
1e-05, 0, 0, 0, 4e-05, 4e-05, 4e-05, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), MEL_004.3_haplotype6 = c(0.96403, 
0.96583, 0.96858, 0.97124, 0.97155, 0.97417, 0.97446, 0.97749, 
0.98022, 0.9804, 0.98058, 0.98303, 0.98316, 0.9833, 0.98345, 
0.98632, 0.98946, 0.99245, 0.99259, 0.99272, 0.99286, 0.993, 
0.9958, 0.9961, 0.99927, 0.99939, 0.99948, 0.9996, 0.99972, 0.99973
), MEL_004.3_haplotype7 = c(0, 0, 0, 0, 0, 4e-05, 4e-05, 4e-05, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
2e-05), MEL_004.3_haplotype8 = c(0.03431, 0.03407, 0.03141, 0.02873, 
0.02844, 0.0257, 0.0254, 0.02237, 0.0194, 0.01919, 0.01897, 0.01631, 
0.01616, 0.01602, 0.01586, 0.01286, 0.00971, 0.00683, 0.0067, 
0.00658, 0.00646, 0.00633, 0.00374, 0.00346, 0.00065, 0.00055, 
0.00045, 0.00034, 0.00024, 3e-05), MEL_005.3_haplotype1 = c(0.21684, 
0.21698, 0.21721, 0.21746, 0.21741, 0.21704, 0.21694, 0.21582, 
0.21465, 0.21458, 0.21448, 0.21345, 0.2134, 0.21333, 0.21328, 
0.21213, 0.21094, 0.20988, 0.20983, 0.20977, 0.20973, 0.20969, 
0.20874, 0.20864, 0.20762, 0.20758, 0.20754, 0.20752, 0.20747, 
0.20744), MEL_005.3_haplotype2 = c(0.03821, 0.03799, 0.03566, 
0.03332, 0.03305, 0.0306, 0.03031, 0.02743, 0.02462, 0.02441, 
0.02421, 0.02168, 0.02154, 0.0214, 0.02124, 0.01837, 0.01534, 
0.01254, 0.01241, 0.01229, 0.01218, 0.01205, 0.00953, 0.00925, 
0.00652, 0.0064, 0.0063, 0.0062, 0.0061, 0.00599), MEL_005.3_haplotype3 = c(0.72948, 
0.73078, 0.73836, 0.74606, 0.74668, 0.75236, 0.75276, 0.75674, 
0.76033, 0.76059, 0.76085, 0.76416, 0.76434, 0.76454, 0.76474, 
0.76859, 0.77275, 0.77666, 0.77684, 0.777, 0.77718, 0.77736, 
0.78097, 0.78137, 0.7854, 0.78554, 0.7857, 0.78584, 0.78599, 
0.78614), MEL_005.3_haplotype4 = c(0.0089, 0.0086, 0.00586, 0.00311, 
0.0028, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), MEL_005.3_haplotype5 = c(0.00588, 0.00558, 
0.00281, 1e-05, 2e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), MEL_005.3_haplotype6 = c(0.00033, 
1e-05, 0, 0, 0, 0, 0, 0, 0.00018, 0.00019, 2e-04, 3e-04, 3e-04, 
0.00031, 0.00032, 0.00038, 0.00038, 0.00033, 0.00032, 0.00031, 
0.00031, 3e-04, 0.00021, 2e-04, 4e-05, 2e-05, 2e-05, 1e-05, 1e-05, 
0), MEL_005.3_haplotype7 = c(3e-05, 5e-05, 6e-05, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), MEL_005.3_haplotype8 = c(0.00033, 1e-05, 0, 0, 0, 0, 0, 
0, 0.00022, 0.00024, 0.00025, 4e-04, 0.00041, 0.00042, 0.00042, 
0.00053, 6e-04, 6e-04, 6e-04, 6e-04, 6e-04, 6e-04, 0.00054, 0.00053, 
0.00043, 0.00043, 0.00042, 0.00042, 0.00041, 0.00041), MEL_005.4_haplotype1 = c(0.16705, 
0.16711, 0.16662, 0.16614, 0.16603, 0.16508, 0.16492, 0.16332, 
0.16169, 0.16156, 0.16144, 0.16, 0.15991, 0.15983, 0.15975, 0.15811, 
0.1564, 0.15485, 0.15479, 0.15472, 0.15464, 0.15458, 0.1532, 
0.15304, 0.15154, 0.15148, 0.15142, 0.15138, 0.15132, 0.15126
), MEL_005.4_haplotype2 = c(0.03823, 0.03801, 0.03569, 0.03335, 
0.03308, 0.0306, 0.03032, 0.02746, 0.02463, 0.02443, 0.02423, 
0.0217, 0.02156, 0.02141, 0.02126, 0.01837, 0.01535, 0.01256, 
0.01243, 0.0123, 0.01218, 0.01206, 0.00955, 0.00925, 0.0065, 
0.00641, 0.00632, 0.0062, 0.0061, 0.006), MEL_005.4_haplotype3 = c(0.77923, 
0.78063, 0.78893, 0.79736, 0.79805, 0.80432, 0.80476, 0.80922, 
0.81328, 0.81357, 0.81386, 0.8176, 0.81781, 0.81801, 0.81825, 
0.82258, 0.82726, 0.83167, 0.83187, 0.83205, 0.83224, 0.83244, 
0.8365, 0.83695, 0.84147, 0.84164, 0.84181, 0.84198, 0.84215, 
0.8423), MEL_005.4_haplotype4 = c(0.00889, 0.00861, 0.00587, 
0.00312, 0.0028, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), MEL_005.4_haplotype5 = c(0.00589, 
0.00559, 0.00282, 1e-05, 2e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), MEL_005.4_haplotype6 = c(0.00033, 
1e-05, 0, 0, 0, 0, 0, 0, 0.00018, 0.00019, 2e-04, 3e-04, 3e-04, 
0.00031, 0.00031, 0.00037, 0.00037, 0.00032, 0.00031, 0.00031, 
0.00031, 3e-04, 0.00021, 2e-04, 4e-05, 2e-05, 2e-05, 1e-05, 1e-05, 
0), MEL_005.4_haplotype7 = c(3e-05, 5e-05, 6e-05, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0)), .Names = c("SNP", "MEL_004.3_haplotype1", "MEL_004.3_haplotype2", 
"MEL_004.3_haplotype3", "MEL_004.3_haplotype4", "MEL_004.3_haplotype5", 
"MEL_004.3_haplotype6", "MEL_004.3_haplotype7", "MEL_004.3_haplotype8", 
"MEL_005.3_haplotype1", "MEL_005.3_haplotype2", "MEL_005.3_haplotype3", 
"MEL_005.3_haplotype4", "MEL_005.3_haplotype5", "MEL_005.3_haplotype6", 
"MEL_005.3_haplotype7", "MEL_005.3_haplotype8", "MEL_005.4_haplotype1", 
"MEL_005.4_haplotype2", "MEL_005.4_haplotype3", "MEL_005.4_haplotype4", 
"MEL_005.4_haplotype5", "MEL_005.4_haplotype6", "MEL_005.4_haplotype7"
), row.names = c(NA, 30L), class = "data.frame")

谢谢

1 个答案:

答案 0 :(得分:3)

在首次将所有概率收集到一个列中之后,您可以使用dplyr包有效地执行此操作(非常类似于您在使用熔化时所执行的操作):

library(dplyr)
library(tidyr)

pairs <- A4 %>%
  gather(individual, value, -SNP) %>%
  separate(individual, c("group", "founder"), sep = "_haplotype") %>%
  group_by(SNP, group) %>%
  filter(value == max(value)) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  arrange(SNP)

(顺便说一下,ungrouparrange在这里并不是绝对必要的:在查看输出时按SNP排序是很好的。

您发布的数据子集与第一名没有双向关联。但我们可以人为地添加5个关系:

A4[1:5, 2:3] <- .999

此时,输出看起来像:

Source: local data frame [10 x 4]

                      SNP     group founder value
                   (fctr)     (chr)   (chr) (dbl)
1     BobWhite_c18256_105 MEL_004.3       1 0.999
2     BobWhite_c18256_105 MEL_004.3       2 0.999
3           BS00081981_51 MEL_004.3       1 0.999
4           BS00081981_51 MEL_004.3       2 0.999
5    Excalibur_c42951_136 MEL_004.3       1 0.999
6    Excalibur_c42951_136 MEL_004.3       2 0.999
7       RAC875_c46403_277 MEL_004.3       1 0.999
8       RAC875_c46403_277 MEL_004.3       2 0.999
9  Tdurum_contig83209_316 MEL_004.3       1 0.999
10 Tdurum_contig83209_316 MEL_004.3       2 0.999

这显示了在SNP和组内绑定的概率对。顺便提一下,您可能更愿意将其重新组织为每行一行,这可以通过tidyr的spread完成:

pair_ties <- pairs %>%
  group_by(SNP, group) %>%
  mutate(column = c("founder1", "founder2")) %>%
  spread(column, founder)

哪个会输出:

Source: local data frame [5 x 5]

                     SNP     group value founder1 founder2
                  (fctr)     (chr) (dbl)    (chr)    (chr)
1    BobWhite_c18256_105 MEL_004.3 0.999        1        2
2          BS00081981_51 MEL_004.3 0.999        1        2
3   Excalibur_c42951_136 MEL_004.3 0.999        1        2
4      RAC875_c46403_277 MEL_004.3 0.999        1        2
5 Tdurum_contig83209_316 MEL_004.3 0.999        1        2

作为对其工作方式的快速解释 - 它依赖于dplyr package,它根据顺序步骤定义数据操作的语法,与%>%“管道”运算符一起管道传输。 One of the best intros is here。对于每个函数的作用,这些都值得一读。在这种情况下,想象它发生在四个阶段:

x1 <- A4 %>%
  gather(individual, value, -SNP) %>%
  separate(individual, c("group", "founder"), sep = "_haplotype")

这些步骤与代码中的melt执行相同的操作:它们将所有概率收集到名为value的单个列中,并将列名称移动到名为individual的列中它。然后,它将个人分为“组”列(您的600多个组)和“创始人”列(1-8)。

x2 <- x1 %>%
  group_by(SNP, group) %>%
  filter(value == max(value))

这表示您希望查看由独特的SNP和group组成的每个组。这取代了上面嵌套的for循环。在每个组中,filter列适用于value列与最大值一样大的行。

x3 <- x2 %>%
  filter(n() == 2)

数据框仍然是分组的,并且只有第一个位置的关系,这就是说你想只包括大小为2的组。此时,工作完成了,最后两步只是对它进行排序通过SNP使其更具可读性:

result <- x3 %>%
  ungroup() %>%
  arrange(SNP)

顺便说一下,为了估计这对你的数据需要多长时间,我模拟了一些与你的数据大小相同的数据:

groups <- 600
founders <- 8
SNPs <- 896

vals <- as.data.frame(replicate(groups * founders, sample(10, SNPs, replace = TRUE) / 10))
names(vals) <- paste0("P", rep(seq_len(groups), each = founders), "_", seq_len(founders))

SNPnames <- paste("SNP", seq_len(SNPs))
A4 <- cbind(SNP = SNPnames, vals)

在这个A4上,代码在我的机器上运行大约需要30秒。