Unique combinations of data frame columns based on a criteria row

Asked: 2016-05-03 23:37:04

Tags: r bioinformatics

I have a data.frame with over 200 columns, and have included a subset below containing the columns relevant to this question:

>df
Variant   Pos     ID    DB.0.count    DB.1.count    sample1    sample2    sample3    sample4    sample5    sample6    sample7    sample8    sample9    sample10
variant5  1234567 A     5             5             1/0        1/0        1/0        1/1        1/1        0/0        1/0        0/0        1/0        1/1
.         .       .     .             .             F1         F1         F1         F2         F2         F3         F4         F4         F4         F5

I would like to:

1. Make all possible combinations of the sample1-sample10 columns, where each combination contains one sample from each F number, i.e. each combination contains 5 samples, one each from F1, F2, F3, F4 and F5.

So in the example above there would be 18 combinations, for example:

The first combination would be sample1, sample4, sample6, sample7, sample10

The second combination would be sample1, sample4, sample6, sample8, sample10

The third combination would be sample1, sample4, sample6, sample9, sample10

After reading related posts, I have tried unique, duplicated and distinct, but got nowhere.

Then I would like to output each unique combination to a new data.frame, count the genotypes of each variant across the samples and write the results to new columns, then run Fisher's exact test and write the p-value to a new column. The code and result should look something like the following (Fisher code learned from here: Fisher's exact test on values from large dataframe and bypassing errors):

df.combo.1$`pop.0/0.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("0/0",u))==TRUE) )
df.combo.1$`pop.1/0.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) )
df.combo.1$`pop.1/1.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/1",u))==TRUE) )

df.combo.1$pop.0.count <- ( 2*(apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("0/0",u))==TRUE) )) + apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) ) )
df.combo.1$pop.1.count <- ( 2*(apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/1",u))==TRUE) )) + apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) ) )

res <- NULL
for (i in 1:nrow(df.combo.1)){
  table <- matrix(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 14], df.combo.1[i, 15]), ncol = 2, byrow = TRUE)
  # if any NA occurs in your table save an error in p else run the fisher test
  if(any(is.na(table))) p <- "error" else p <- fisher.test(table)$p.value
  # save all p values in a vector
  res <- c(res,p)
}
df.combo.1$fishers <- res

>df.combo.1
Variant   Pos     ID    DB.0.count    DB.1.count    sample1    sample4    sample6    sample7    sample10    pop.0/0.count    pop.1/0.count    pop.1/1.count    pop.0.count    pop.1.count    fishers
variant5  1234567 A     5             5             1/0        1/1        0/0        1/0        1/1         1                2                2                4              6              1.0000
.         .       .     .             .             F1         F2         F3         F4         F5

2. Finally, I would like to create a data.frame listing the Fisher's exact p-value for each unique combination.

I think the whole exercise might need some kind of loop?

1 Answer:

Answer 0: (score: 1)

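I think I've got a grasp of what you want. For the bit I think you were struggling with in part 1, I've used a combination of which and expand.grid to sort it out.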

For part 2, once the data is arranged with each observation on 1 row, it's a fairly easy cbind.
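As a rough illustration of part 2 for a single combination, here is a minimal sketch using the demo values from the question. The object names (common, genotypes, combo.1, tab) are made up for this example, and counting alleles by splitting each genotype on "/" is one possible approach, not necessarily the one used in the code further down.

```r
# common columns from the demo row in the question
common <- data.frame(Variant = "variant5", Pos = 1234567, ID = "A",
                     DB.0.count = 5, DB.1.count = 5, stringsAsFactors = FALSE)

# genotypes of the first combination (sample1, sample4, sample6, sample7, sample10)
genotypes <- c(sample1 = "1/0", sample4 = "1/1", sample6 = "0/0",
               sample7 = "1/0", sample10 = "1/1")

# one row per observation: common columns bound to the chosen samples
combo.1 <- cbind(common, as.data.frame(t(genotypes), stringsAsFactors = FALSE))

# every genotype contributes two alleles, so split on "/" and count them
alleles <- unlist(strsplit(genotypes, "/", fixed = TRUE))
pop.0.count <- sum(alleles == "0")
pop.1.count <- sum(alleles == "1")

# 2x2 table: database allele counts vs. population allele counts
tab <- matrix(c(combo.1$DB.0.count, combo.1$DB.1.count, pop.0.count, pop.1.count),
              ncol = 2, byrow = TRUE)
p.value <- fisher.test(tab)$p.value
```

With these values the allele counts come out as 4 and 6, matching the pop.0.count and pop.1.count shown in the question's example output.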

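It looks like you're using 2 rows per observation (unless that's just a formatting thing), which makes it a lot harder (not impossible, it just needs more juggling), so I've merged the data into one row. This should be a pretty simple conversion: append the corresponding columns from every "second" row to every "first" row, then delete each second row.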
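In case it helps, that conversion could be sketched roughly as below. This assumes the rows strictly alternate (genotype row, then F-code row) and uses a small made-up data frame; two.row, first.rows, second.rows and the ".F" suffix are illustrative names only.

```r
# hypothetical two-row layout: genotypes in row 1, F codes in row 2
two.row <- data.frame(
  Variant = c("variant5", "."),
  sample1 = c("1/0", "F1"),
  sample2 = c("1/1", "F2"),
  sample3 = c("0/0", "F3"),
  stringsAsFactors = FALSE
)

# every "first" row holds the genotypes
first.rows <- two.row[seq(1, nrow(two.row), by = 2), , drop = FALSE]
# every "second" row holds the F codes; drop the "." placeholder column
second.rows <- two.row[seq(2, nrow(two.row), by = 2), -1, drop = FALSE]
names(second.rows) <- paste0(names(second.rows), ".F")

# append the F-code columns to the genotype row, leaving one row per observation
one.row <- cbind(first.rows, second.rows)
rownames(one.row) <- NULL
```

The result has the genotype columns followed by sample1.F, sample2.F and sample3.F on a single row, which is the shape the code below expects.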

This could be done more efficiently and more tidily, but I think it works and should be fairly easy to extend to other cases.

Regards,
Josh

# provided demo data
# Variant   Pos     ID    DB.0.count    DB.1.count    sample1    sample2    sample3    sample4    sample5    sample6    sample7    sample8    sample9    sample10 
# variant5  1234567 A     5             5             1/0        1/0        1/0        1/1        1/1        0/0        1/0        0/0        1/0        1/1
# .         .       .     .             .             F1         F1         F1         F2         F2         F3         F4         F4         F4         F5


# create a single-row data frame: the genotype row followed by the F codes
test.df <- as.data.frame(t(c("variant5",1234567,"A",5,5,"1/0","1/0","1/0","1/1","1/1","0/0","1/0","0/0","1/0","1/1","F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5")))
# ensure as character format
test.df[] <- lapply(test.df, as.character)
# get the column positions of each "F" code
F1.var <- which(test.df =="F1")
F2.var <- which(test.df =="F2")
F3.var <- which(test.df =="F3")
F4.var <- which(test.df =="F4")
F5.var <- which(test.df =="F5")
# get all combinations of the 5 F positions
Fcode.combinations <- expand.grid(F1.var,F2.var,F3.var,F4.var,F5.var)
# create results data frame
df.combo.1 <- as.data.frame(matrix(NA,ncol = 21, nrow = nrow(Fcode.combinations)))
# name variables
names(df.combo.1) <- c("Variant","Pos","ID","DB.0.count","DB.1.count",
                              "F1.sample.pos","F1.result",
                              "F2.sample.pos","F2.result",
                              "F3.sample.pos","F3.result",
                              "F4.sample.pos","F4.result",
                              "F5.sample.pos","F5.result",
                              "pop.0_0.count","pop.1_0.count","pop.1_1.count",
                              "pop.0.count","pop.1.count",
                              "fishers")
# copy in common data
df.combo.1[,1:5] <- test.df[,1:5]
# setup variables based on combination data
for(i in 1:nrow(Fcode.combinations)){
  df.combo.1[i,c(6,8,10,12,14)] <- Fcode.combinations[i,]
  # subtract 10 to move from each 'F type' column to its matching result (genotype) column
  cycle.results <- as.numeric(Fcode.combinations[i,] -10)
  df.combo.1[i,c(7,9,11,13,15)] <- test.df[cycle.results]
}

# this is essentially your code with the column references changed;
# heterozygotes are matched as "1/0" or "0/1" so they count towards both alleles
result.cols <- c(7,9,11,13,15)
df.combo.1$pop.0_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("0/0",u)) )
df.combo.1$pop.1_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("1/0|0/1",u)) )
df.combo.1$pop.1_1.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("1/1",u)) )

# allele counts: a homozygote contributes 2 copies, a heterozygote 1 copy of each allele
df.combo.1$pop.0.count <- 2*df.combo.1$pop.0_0.count + df.combo.1$pop.1_0.count
df.combo.1$pop.1.count <- 2*df.combo.1$pop.1_1.count + df.combo.1$pop.1_0.count

res <- NULL
for (i in 1:nrow(df.combo.1)){
  # 2x2 table: DB allele counts (columns 4, 5) vs. pop.0.count and pop.1.count (columns 19, 20)
  table <- matrix(as.numeric(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 19], df.combo.1[i, 20])), ncol = 2, byrow = TRUE)
  # if any NA occurs in your table save an error in p else run the fisher test
  if(any(is.na(table))) p <- "error" else p <- fisher.test(table)$p.value
  # save all p values in a vector
  res <- c(res,p)
}
df.combo.1$fishers <- res

# create results data
df.combo.1.results <- as.data.frame(cbind(1:nrow(df.combo.1),df.combo.1$fishers))
names(df.combo.1.results) <- c("combo","fishers")
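
On extending this to other cases: a minimal sketch of building the combinations by sample name for any number of F groups, assuming the sample names and their F codes are available as two parallel vectors. The names samples, f.codes, by.group and combos are illustrative and not objects from the code above. It relies on expand.grid accepting a list of vectors.

```r
# sample names and their F codes, in matching order (taken from the demo data)
samples <- paste0("sample", 1:10)
f.codes <- c("F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5")

# one vector of sample names per F code
by.group <- split(samples, f.codes)

# every way of picking exactly one sample from each F group
combos <- expand.grid(by.group, stringsAsFactors = FALSE)

nrow(combos)   # 3 * 2 * 1 * 3 * 1 = 18 combinations
combos[1, ]    # sample1, sample4, sample6, sample7, sample10
```

Each row of combos can then be used to pick the sample columns for one combination, so the separate F1.var to F5.var vectors are not needed when the number of F groups changes.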