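I am analyzing the effect of pairwise gene interactions (> 300K) on Drosophila behavior and have run into a problem. I want to count how often certain gene states (i.e. alleles) occur across my 177 lines.
Some example tables: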
state1 <- c("A","B","C","A","B","A")
state2 <- c("B","C","D","D","D","C")
df1 <- data.frame(state1,state2)
state <- c("A","B","C","D")
line_1 <- c(0,0,2,0)
line_2 <- c(0,0,0,2)
line_3 <- c(2,2,2,0)
line_4 <- c(0,2,2,2)
line_5 <- c(2,0,0,0)
df2 <- data.frame(state,line_1,line_2,line_3,line_4,line_5, stringsAsFactors = FALSE)
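I would like an output that returns, for each pair of states, the number of lines showing each combination (state1 = 0 and state2 = 0, state1 = 0 and state2 = 2, state1 = 2 and state2 = 0, state1 = 2 and state2 = 2), along with the lines that fall into each combination: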
> resultdf
  state1 state2 state10state20 state10state22 state12state20 state12state22       lines00       lines02       lines20       lines22
1      A      B              1              1              2              1 line_1,line_2        line_4        line_5        line_3
2      B      C              1              2              0              2        line_5 line_1,line_2            NA line_3,line_4
3      C      D              1              0              2              2        line_5            NA line_1,line_3 line_2,line_4
4      A      D              1              1              2              1        line_1        line_4 line_3,line_5        line_2
5      B      D              2              1              1              1 line_1,line_5        line_2        line_3        line_4
6      A      C              0              2              1              2            NA line_1,line_4        line_5 line_2,line_3
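I started looking into for loops and if statements, but found that R has better ways of doing this kind of thing. I'm new to R (and to coding in general), so I'm not sure where to turn next. Thanks in advance for your help.
Answer 0 (score: 0)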
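This should do it:
First, let's create a table containing all the gene combinations. (You already have this in df1, but we can generate it automatically like this rather than hard-coding it):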
com = combn(unique(df2$state), 2)
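For reference, here is a small self-contained sketch of what `com` holds for the example data. It also shows one way to build the same two-row layout from the pairs already listed in `df1`, which may be preferable if only some of the possible pairs are of interest. The name `com_df1` is made up for this illustration and is not used elsewhere in this answer.

```r
# Example data from the question
df1 <- data.frame(
  state1 = c("A", "B", "C", "A", "B", "A"),
  state2 = c("B", "C", "D", "D", "D", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  state  = c("A", "B", "C", "D"),
  line_1 = c(0, 0, 2, 0),
  line_2 = c(0, 0, 0, 2),
  line_3 = c(2, 2, 2, 0),
  line_4 = c(0, 2, 2, 2),
  line_5 = c(2, 0, 0, 0),
  stringsAsFactors = FALSE
)

# combn() returns a character matrix with one pair per column:
# 2 rows (first and second state) and choose(4, 2) = 6 columns
com <- combn(unique(df2$state), 2)
dim(com)

# Assumption: to restrict the analysis to the pairs in df1,
# transpose df1 into the same 2-row, one-pair-per-column layout
com_df1 <- t(as.matrix(df1))
dim(com_df1)
```

One caveat if you go the `com_df1` route: the subsetting `df2[df2[,1] %in% x, -1]` returns rows in the order they appear in `df2`, so "first" and "second" refer to that order rather than to the order within the pair. In the example `df1` the two orders agree.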
Now some helper functions that test for each possible combination of states:
both_true = function(x) x[1,] & x[2,]
neither_true = function(x) !x[1,] & !x[2,]
first_true = function(x) x[1,] & !x[2,]
second_true = function(x) !x[1,] & x[2,]
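To see what these helpers return, here is a sketch that applies them to a single pair (A and B) from the example data. They rely on R treating 0 as FALSE and any non-zero value (here 2) as TRUE. The variable `x` is only for illustration.

```r
df2 <- data.frame(
  state  = c("A", "B", "C", "D"),
  line_1 = c(0, 0, 2, 0),
  line_2 = c(0, 0, 0, 2),
  line_3 = c(2, 2, 2, 0),
  line_4 = c(0, 2, 2, 2),
  line_5 = c(2, 0, 0, 0),
  stringsAsFactors = FALSE
)

both_true    <- function(x)  x[1, ] &  x[2, ]
neither_true <- function(x) !x[1, ] & !x[2, ]
first_true   <- function(x)  x[1, ] & !x[2, ]
second_true  <- function(x) !x[1, ] &  x[2, ]

# Rows for states A and B, with the state column dropped:
#   A: 0 0 2 0 2
#   B: 0 0 2 2 0
x <- df2[df2$state %in% c("A", "B"), -1]

# Each helper yields one TRUE/FALSE value per line
sum(both_true(x))       # 1 (line_3)
sum(neither_true(x))    # 2 (line_1, line_2)
which(first_true(x))    # 5 (only A is set in line_5)
which(second_true(x))   # 4 (only B is set in line_4)
```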
Another helper function that applies each of the previous functions and combines the resulting counts:
all_pos = function(x) c(
  sum(both_true(x)),
  sum(neither_true(x)),
  sum(first_true(x)),
  sum(second_true(x)))
Now we can apply these functions to your data set:
res = apply(com, 2, function(x) all_pos(df2[df2[,1] %in% x, -1]))
and set the column/row names so that we can interpret the output:
colnames(res) = apply(com, 2, paste0, collapse='')
rownames(res) = c('both', 'neither', 'first', 'second')
res
#         AB AC AD BC BD CD
# both     1  1  0  2  1  1
# neither  2  1  1  2  2  1
# first    1  1  2  0  1  2
# second   1  2  2  1  1  1
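If you would rather have one row per pair, as in the layout you asked for, something along these lines should work. This is only a sketch: the mapping of neither/second/first/both onto your column names assumes that 0 means the state is absent and 2 that it is present, and `counts` is a name introduced here.

```r
df2 <- data.frame(
  state  = c("A", "B", "C", "D"),
  line_1 = c(0, 0, 2, 0),
  line_2 = c(0, 0, 0, 2),
  line_3 = c(2, 2, 2, 0),
  line_4 = c(0, 2, 2, 2),
  line_5 = c(2, 0, 0, 0),
  stringsAsFactors = FALSE
)
com <- combn(unique(df2$state), 2)

both_true    <- function(x)  x[1, ] &  x[2, ]
neither_true <- function(x) !x[1, ] & !x[2, ]
first_true   <- function(x)  x[1, ] & !x[2, ]
second_true  <- function(x) !x[1, ] &  x[2, ]
all_pos <- function(x) c(sum(both_true(x)), sum(neither_true(x)),
                         sum(first_true(x)), sum(second_true(x)))

res <- apply(com, 2, function(p) all_pos(df2[df2[, 1] %in% p, -1]))
rownames(res) <- c("both", "neither", "first", "second")

# One row per pair; column names follow the question's naming scheme
counts <- data.frame(
  state1 = com[1, ],
  state2 = com[2, ],
  state10state20 = res["neither", ],  # both states 0
  state10state22 = res["second", ],   # only the second state is 2
  state12state20 = res["first", ],    # only the first state is 2
  state12state22 = res["both", ],     # both states 2
  row.names = NULL
)
counts
```

Each row of `counts` should sum to the number of lines (5 in the example), which is a quick sanity check on real data.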
Getting the line numbers works in much the same way:
which_pos = function(x) c(
  paste(which(both_true(x)), collapse=','),
  paste(which(neither_true(x)), collapse=','),
  paste(which(first_true(x)), collapse=','),
  paste(which(second_true(x)), collapse=','))
res2 = apply(com, 2, function(x) which_pos(df2[df2[,1] %in% x, -1]))
colnames(res2) = apply(com, 2, paste0, collapse='')
rownames(res2) = c('both', 'neither', 'first', 'second')
res2
#         AB    AC    AD    BC    BD    CD
# both    "3"   "3"   ""    "3,4" "4"   "4"
# neither "1,2" "2"   "1"   "2,5" "1,5" "5"
# first   "5"   "5"   "3,5" ""    "3"   "1,3"
# second  "4"   "1,4" "2,4" "1"   "2"   "2"
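The indices above are positions within the line columns, so 1 means line_1 and so on. If you want the line names themselves, with NA where no line matches (as in your desired output), a possible variant is sketched below. The helpers `line_names` and `which_lines` and the result `res3` are hypothetical names for this illustration, and the sketch assumes each pair matches exactly two rows of df2, taken in the order they appear there.

```r
df2 <- data.frame(
  state  = c("A", "B", "C", "D"),
  line_1 = c(0, 0, 2, 0),
  line_2 = c(0, 0, 0, 2),
  line_3 = c(2, 2, 2, 0),
  line_4 = c(0, 2, 2, 2),
  line_5 = c(2, 0, 0, 0),
  stringsAsFactors = FALSE
)
com <- combn(unique(df2$state), 2)

# Names of the lines where `hit` is TRUE, comma-separated, or NA if none
line_names <- function(hit, lines) {
  if (any(hit)) paste(lines[which(hit)], collapse = ",") else NA_character_
}

# x: two-row data frame (one row per state of the pair, one column per line)
which_lines <- function(x) {
  a <- unlist(x[1, ]) != 0   # first state of the pair, one value per line
  b <- unlist(x[2, ]) != 0   # second state of the pair
  lines <- names(x)
  c(lines00 = line_names(!a & !b, lines),
    lines02 = line_names(!a &  b, lines),
    lines20 = line_names( a & !b, lines),
    lines22 = line_names( a &  b, lines))
}

res3 <- apply(com, 2, function(p) which_lines(df2[df2$state %in% p, -1]))
colnames(res3) <- apply(com, 2, paste0, collapse = "")

# Transpose to get one row per pair
t(res3)
```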