Question

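I am analysing the effects of pairwise gene interactions (> 300K) on Drosophila behaviour and have run into a problem. I want to count how often certain gene states (i.e., alleles) occur across my 177 lines.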

Some example tables:

state1 <- c("A","B","C","A","B","A")
state2 <- c("B","C","D","D","D","C")

df1 <- data.frame(state1,state2)

state <- c("A","B","C","D")
line_1 <- c(0,0,2,0)
line_2 <- c(0,0,0,2)
line_3 <- c(2,2,2,0)
line_4 <- c(0,2,2,2)
line_5 <- c(2,0,0,0)

df2 <- data.frame(state,line_1,line_2,line_3,line_4,line_5, stringsAsFactors = FALSE)

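I would like an output that returns, for each pair of states, the number of lines in each combination (state1 = 0 and state2 = 0, state1 = 0 and state2 = 2, state1 = 2 and state2 = 0, state1 = 2 and state2 = 2), together with the lines that fall into each combination: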

> resultdf
  state1 state2 state10state20 state10state22 state12state20 state12state22       lines00       lines02       lines20       lines22
1      A      B              1              1              2              1 line_1,line_2        line_4        line_5        line_3
2      B      C              1              2              0              2        line_5 line_1,line_2            NA line_3,line_4
3      C      D              1              0              2              2        line_5            NA line_1,line_3 line_2,line_4
4      A      D              1              1              2              1        line_1        line_4 line_3,line_5        line_2
5      B      D              2              1              1              1 line_1,line_5        line_2        line_3        line_4
6      A      C              0              2              1              2            NA line_1,line_4        line_5 line_2,line_3

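I started looking into for loops and if statements, but found out that R has better ways of doing this. I am new to R (and to coding in general), so I am not sure where to turn next. Thanks in advance for your help.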

Answer 1

This should do it:

First, let's create a table of all the gene combinations. (You already have it in df1, but we can automate it like this rather than hard-coding it):

com = combn(unique(df2$state), 2)
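As a quick check of what this step produces, here is a small sketch that uses the four state labels from the question directly (rather than reading them from df2). combn() returns a matrix with one pair per column, and the pasted labels are the same ones used as column names later on.

```r
# State labels copied from the question's df2
state <- c("A", "B", "C", "D")

# combn() returns a 2-row character matrix; each column is one unordered pair
com <- combn(state, 2)

dim(com)                               # 2 rows, choose(4, 2) = 6 columns
apply(com, 2, paste0, collapse = "")   # "AB" "AC" "AD" "BC" "BD" "CD"
```

Note that this enumerates every possible pair of states. If only the pairs listed in df1 are of interest, df1 itself can serve as the list of combinations.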

Now some helper functions that test for each possible combination of states:

both_true    = function(x)  x[1,] &  x[2,]
neither_true = function(x) !x[1,] & !x[2,]
first_true   = function(x)  x[1,] & !x[2,]
second_true  = function(x) !x[1,] &  x[2,]
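These helpers rely on R coercing numbers to logicals: 0 is treated as FALSE and any non-zero value (here 2) as TRUE. A minimal sketch of how they behave, assuming a plain two-row matrix built from the A and B rows of the question's df2 (the name `pair` is made up for this illustration):

```r
# The A and B rows of df2 from the question; columns are line_1 .. line_5
pair <- rbind(A = c(0, 0, 2, 0, 2),
              B = c(0, 0, 2, 2, 0))

# Same helpers as above, repeated so the snippet runs on its own
both_true    <- function(x)  x[1, ] &  x[2, ]
neither_true <- function(x) !x[1, ] & !x[2, ]
first_true   <- function(x)  x[1, ] & !x[2, ]
second_true  <- function(x) !x[1, ] &  x[2, ]

both_true(pair)      # FALSE FALSE  TRUE FALSE FALSE  (line 3 has both)
neither_true(pair)   #  TRUE  TRUE FALSE FALSE FALSE  (lines 1 and 2 have neither)
first_true(pair)     # FALSE FALSE FALSE FALSE  TRUE  (line 5 has only A)
second_true(pair)    # FALSE FALSE FALSE  TRUE FALSE  (line 4 has only B)
```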

Another helper function that calls each of the previous functions and combines the resulting counts:

all_pos = function(x) c(
   sum(both_true(x)), 
   sum(neither_true(x)), 
   sum(first_true(x)), 
   sum(second_true(x)))

Now we can apply these functions to your dataset:

res = apply(com, 2, function(x) all_pos(df2[df2[,1] %in% x, -1]))

And set the column/row names so that we can interpret the output:

colnames(res) = apply(com, 2, paste0, collapse='')
rownames(res) = c('both', 'neither', 'first', 'second')

res
#         AB AC AD BC BD CD
# both     1  1  0  2  1  1
# neither  2  1  1  2  2  1
# first    1  1  2  0  1  2
# second   1  2  2  1  1  1
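With several hundred thousand pairs, looping over the pairs with apply() may get slow. One possible alternative, which is not what the code above does, is to compute all the counts at once with matrix products on a 0/1 presence matrix. The sketch below assumes the values from the question's df2; the names `m`, `p` and `q` are made up for this illustration.

```r
# Values copied from the question's df2: rows = states, columns = line_1 .. line_5
m <- rbind(A = c(0, 0, 2, 0, 2),
           B = c(0, 0, 2, 2, 0),
           C = c(2, 0, 2, 2, 0),
           D = c(0, 2, 0, 2, 0))

p <- (m != 0) * 1   # 1 where the state is present in a line, 0 otherwise
q <- 1 - p          # 1 where the state is absent

# Entry [i, j] of each product counts the lines that satisfy the condition
both    <- p %*% t(p)   # states i and j both present
neither <- q %*% t(q)   # states i and j both absent
first   <- p %*% t(q)   # i present, j absent
second  <- q %*% t(p)   # i absent, j present

# Counts for the pair A/B, in the same order as the rows of res
c(both["A", "B"], neither["A", "B"], first["A", "B"], second["A", "B"])
```

Each result is a states-by-states matrix, so the count for any pair can be looked up by name; for the A/B pair the four values agree with the AB column of res above.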

Similarly, to get the line numbers:

which_pos = function(x) c(
  paste(which(both_true(x)), collapse=','), 
  paste(which(neither_true(x)), collapse=','), 
  paste(which(first_true(x)), collapse=','), 
  paste(which(second_true(x)), collapse=','))
res2 = apply(com, 2, function(x) which_pos(df2[df2[,1] %in% x, -1]))
colnames(res2) = apply(com, 2, paste0, collapse='')
rownames(res2) = c('both', 'neither', 'first', 'second')
res2
#         AB    AC    AD    BC    BD    CD   
# both    "3"   "3"   ""    "3,4" "4"   "4"  
# neither "1,2" "2"   "1"   "2,5" "1,5" "5"  
# first   "5"   "5"   "3,5" ""    "3"   "1,3"
# second  "4"   "1,4" "2,4" "1"   "2"   "2"
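The output above is indexed by line position and has one column per pair, while the question asked for one row per pair in df1 with line names. Here is one possible way to reshape the same idea into that layout. It is only a sketch: `present` and `summarise_pair()` are hypothetical names introduced here, and the count columns are called n00, n02, n20 and n22 instead of state10state20 and so on.

```r
# Data copied from the question
df1 <- data.frame(state1 = c("A", "B", "C", "A", "B", "A"),
                  state2 = c("B", "C", "D", "D", "D", "C"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(state  = c("A", "B", "C", "D"),
                  line_1 = c(0, 0, 2, 0), line_2 = c(0, 0, 0, 2),
                  line_3 = c(2, 2, 2, 0), line_4 = c(0, 2, 2, 2),
                  line_5 = c(2, 0, 0, 0), stringsAsFactors = FALSE)

# Logical matrix: rows = states, columns = lines; TRUE where the value is non-zero
present <- as.matrix(df2[, -1]) != 0
rownames(present) <- df2$state

# Hypothetical helper: one-row data frame of counts and line names for a pair
summarise_pair <- function(s1, s2) {
  a <- present[s1, ]
  b <- present[s2, ]
  groups <- list(`00` = !a & !b, `02` = !a & b, `20` = a & !b, `22` = a & b)
  counts <- vapply(groups, sum, integer(1))
  lines  <- vapply(groups, function(g) {
    # NA when no line falls into this combination, as in the requested output
    if (any(g)) paste(colnames(present)[g], collapse = ",") else NA_character_
  }, character(1))
  data.frame(state1 = s1, state2 = s2,
             as.list(setNames(counts, paste0("n", names(counts)))),
             as.list(setNames(lines, paste0("lines", names(lines)))),
             stringsAsFactors = FALSE)
}

# One row per pair listed in df1
resultdf <- do.call(rbind, unname(Map(summarise_pair, df1$state1, df1$state2)))
rownames(resultdf) <- NULL
resultdf
```

Binding one small data frame per pair is easy to read but may be slow for 300K pairs; in that case it might be worth computing the counts in a vectorised way and building only the line-name strings per pair.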

Conditional counting on tables in R
