Question

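I am trying to use the apply or aggregate functions to compute the Jensen-Shannon divergence (JS.dist) between some distributions.

I simulate data under four different models, and for each simulated data set I compute a series of summary statistics.

Imagine I have the following data.frame: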

dataframe1:
Model Factor1 Factor2 stats1 stats2
M1    0.0001  0.2     -1.0   0.9
M1    0.0001  0.2     -1.3   0.5
M1    0.0002  0.3     -1.9   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0002  0.3     -2.1   0.4
M3    0.0001  0.2      9.9   0.4
M3    0.0001  0.2      8.3   0.4
M3    0.0002  0.3      8.0   0.4
M4    0.0001  0.2      3.0   0.1
M4    0.0001  0.2      3.5   0.3
M4    0.0002  0.3      3.2   0.3

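The functions used to compute JS.dist are as follows:

# Function to replace Inf or -Inf in the logs with zero. It takes the logarithm of a number as its argument.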

 test.logs <- function(log.num){

  log.num[log.num == -Inf | log.num == Inf] <- 0
  return (log.num)

}

# Function to compute kl.dist (Kullback-Leibler divergence). It takes the probability vectors (see below) of two distributions (x.p and y.p) as arguments.

kl.dist <- function(x.p, y.p) {
  # x.p, y.p: probability vectors for x and y distributions

  log.x <- test.logs(log(x.p))
  log.y <- test.logs(log(y.p))

  sum(x.p * (log.x - log.y))

}

# Function to compute js.dist. It takes the probability vectors of x, y and M as arguments. M is the middle distribution.

js.dist <- function(x.p, y.p, M.p){
  0.5 * kl.dist(x.p, M.p) + .5 * kl.dist(y.p, M.p)
}
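As a quick sanity check of the three functions above, the following sketch evaluates js.dist on two made-up probability vectors. The vectors p and q are illustrative assumptions, not data from the question; the definitions are repeated only so the snippet runs on its own.

```r
# Definitions repeated from above so this snippet is self-contained.
test.logs <- function(log.num) {
  log.num[log.num == -Inf | log.num == Inf] <- 0
  log.num
}
kl.dist <- function(x.p, y.p) {
  sum(x.p * (test.logs(log(x.p)) - test.logs(log(y.p))))
}
js.dist <- function(x.p, y.p, M.p) {
  0.5 * kl.dist(x.p, M.p) + 0.5 * kl.dist(y.p, M.p)
}

# Toy probability vectors (made up for illustration).
p <- c(0.5, 0.5, 0, 0)
q <- c(0, 0, 0.5, 0.5)

same     <- js.dist(p, p, (p + p) / 2)  # identical distributions
disjoint <- js.dist(p, q, (p + q) / 2)  # distributions with no overlap
```

With natural logarithms, identical distributions should give 0 and fully disjoint ones should give log(2), which is the upper bound of the Jensen-Shannon divergence.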

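To use the functions above, I have to compute the density curves of my distributions (stats1 and stats2, by model and factor).

To do that, I have to set a minimum and a maximum value over which the density curve of each statistic is computed.

For example: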

x.d <- density(x, n=512, from=min, to=max)
y.d <- density(y, n=512, from=min, to=max)
M.d <- (x.d$y + y.d$y)/2

# width of the histogram
w <- x.d$x[2] - x.d$x[1]

# probability of x value in n-th bin
x.p <- x.d$y * w # (hist height) * (bin width)
y.p <- y.d$y * w
M.p <- M.d * w
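The steps above can be wrapped in a small helper. This is only a sketch: prob.on.grid is a hypothetical name, the samples are simulated with rnorm purely for illustration, and the shared limits lo and hi are taken from the pooled data so that both densities are evaluated on the same grid.

```r
# Hypothetical helper: probability vector of a sample on a fixed grid [lo, hi].
prob.on.grid <- function(x, lo, hi, n = 512) {
  d <- density(x, n = n, from = lo, to = hi)
  w <- d$x[2] - d$x[1]  # grid spacing, used as the "bin width"
  d$y * w               # density height * bin width
}

set.seed(1)
x <- rnorm(200, mean = 0)  # illustrative sample 1
y <- rnorm(200, mean = 1)  # illustrative sample 2

# Shared limits computed from all the data, not per sample.
lo <- min(x, y)
hi <- max(x, y)

x.p <- prob.on.grid(x, lo, hi)
y.p <- prob.on.grid(y, lo, hi)
M.p <- (x.p + y.p) / 2     # same as ((x.d$y + y.d$y) / 2) * w
```

Because the grid is truncated at lo and hi, the probability vectors will generally sum to slightly less than 1.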

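I tried to write code with two for loops (one per factor) in which I subset the data by model, using the minimum and maximum of each statistic computed beforehand. I then compute the densities and probabilities, and only after that can I compute JS.dist.

Example R code: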

density_js.dist <- function(data.df){
# gets the unique values for mutation rate
factor1 <- unique(data.df$Factor1)
# gets the unique values for rate of new copies
factor2 <- unique(data.df$Factor2)

# calculates the minimum and maximum value for each of the statistics
# showing only for stats1
stats1.min <- min(data.df$stats1)
stats1.max <- max(data.df$stats1)



# for loop to calculate the densities and probabilities and JS distance for each combination of factor1 and factor2

for (f1 in factor1){
  for (f2 in factor2){

  # compare against the data frame columns (Factor1, Factor2), not the vectors of unique values
  new.df <- subset(data.df, Factor1 == f1 & Factor2 == f2)

  # subsetting data. One data frame for each of the four models
  MM.df <- subset(new.df, Model == "M1")
  TM.df <- subset(new.df, Model == "M2")

  MI.df <- subset(new.df, Model == "M3")
  TI.df <- subset(new.df, Model == "M4")

  # density and probability for each statistic

  #1.stats1
  # calculating densities for model M1 and M2
  MM1.d <- density(MM.df$stats1, n=512, from=stats1.min, to=stats1.max)
  TM1.d <- density(TM.df$stats1, n=512, from=stats1.min, to=stats1.max)

  # Density for the middle distribution between models M1 and M2 
  Middle12.d <- (MM1.d$y + TM1.d$y)/2

  # width for models
  w12 <- MM1.d$x[2] - MM1.d$x[1]

  # calculating probabilities for each models
  MM1.p <- MM1.d$y * w12 # (hist height) * (bin width)
  TM1.p <- TM1.d$y * w12
  Middle12.p <- Middle12.d * w12 

  # calculating densities for models M3 and M4
  MI1.d <- density(MI.df$stats1, n=512, from=stats1.min, to=stats1.max)
  TI1.d <- density(TI.df$stats1, n=512, from=stats1.min, to=stats1.max)
  Middle34.d <- (MI1.d$y + TI1.d$y)/2

  w34 <- MI1.d$x[2] - MI1.d$x[1]

  # calculating probabilities for M3 and M4 models
  MI1.p <- MI1.d$y * w34 
  TI1.p <- TI1.d$y * w34
  Middle34.p <- Middle34.d * w34 


 js.dist(MM1.p, TM1.p, Middle12.p)
 js.dist(MI1.p, TI1.p, Middle34.p)
  }
 }
}

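My question is:

I have tried using apply or aggregate, but I cannot figure out how to pass the min and max of each statistic as arguments so that the density curves can be created. Note that this min and max are computed over all combinations of factors and models, not for each subset. For the comparison to make sense, I cannot, for example, compute the minimum and maximum of each subset by factor and model.

My real data are much more complex. I have 10 different statistics, and I want to compute the JS.dist between two distributions for each factor combination. My pairs of distributions are M1 versus M2 and M3 versus M4. The code above works, but it requires me to write more than 700 lines, which I really don't think is very smart.

I would really appreciate it if someone could help me.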

Answer 1

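Here is a hacky way to compute all 10 series at once using lists. Because of the length and verbosity of the code, a functional solution would require a complete rewrite. I could only test the code on the first two series (and not even fully, since several Factor1:Factor2 combinations have only 1 observation, so the density cannot be computed). I also removed the function wrapper, since it did absolutely nothing.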

library(dplyr)
L = list()

  # gets the unique values for mutation rate
  factor1 <- unique(data.df$Factor1)
  # gets the unique values for rate of new copies
  factor2 <- unique(data.df$Factor2)

  # calculates the minimum and maximum value for each of the statistics
  # Store all 10 min and max in a vector
  vector.min <- lapply(data.df %>% select(stats1:stats10), min)
  vector.max <- lapply(data.df %>% select(stats1:stats10), max)

  # for loop to calculate the densities and probabilities and JS distance for each combination of factor1 and factor2

  for (f1 in factor1){
    for (f2 in factor2){
      # compare against the data frame columns (Factor1, Factor2), not the vectors of unique values
      new.df <- subset(data.df, Factor1 == f1 & Factor2 == f2)
      # subsetting data. One data frame for each of the four models
      MM.df <- subset(new.df, Model == "M1")
      TM.df <- subset(new.df, Model == "M2")
      MI.df <- subset(new.df, Model == "M3")
      TI.df <- subset(new.df, Model == "M4")

      # density and probability for each statistic

      # calculating densities for model M1 and M2
      MM.d = lapply(1:10, function(i) density(MM.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      TM.d = lapply(1:10, function(i) density(TM.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))

      # Density for the middle distribution between models M1 and M2 
      Middle12.d <- mapply(function(d1, d2) (d1$y+d2$y)/2, MM.d, TM.d, SIMPLIFY = F)

      # width for models
      w12 = lapply(MM.d, function(y) {y$x[2] - y$x[1]})

      # calculating probabilities for each models
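      # SIMPLIFY = FALSE keeps one probability vector per statistic instead of a matrix
      MM1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, MM.d, w12, SIMPLIFY = FALSE)  # (hist height) * (bin width)
      TM1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, TM.d, w12, SIMPLIFY = FALSE)
      Middle12.p = mapply("*", Middle12.d, w12, SIMPLIFY = FALSE)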

      # calculating densities for models M3 and M4
      MI.d = lapply(1:10, function(i) density(MI.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      TI.d = lapply(1:10, function(i) density(TI.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      Middle34.d <- mapply(function(d1, d2) (d1$y+d2$y)/2, MI.d, TI.d, SIMPLIFY = FALSE)

      w34 = lapply(MI.d, function(y) {y$x[2] - y$x[1]})      

      # calculating probabilities for M3 and M4 models
      # SIMPLIFY = FALSE keeps one probability vector per statistic instead of a matrix
      MI1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, MI.d, w34, SIMPLIFY = FALSE)  # (hist height) * (bin width)
      TI1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, TI.d, w34, SIMPLIFY = FALSE)
      Middle34.p = mapply("*", Middle34.d, w34, SIMPLIFY = FALSE)

      L = c(L, list(mapply(js.dist, MM1.p, TM1.p, Middle12.p), mapply(js.dist, MI1.p, TI1.p, Middle34.p)))
    }
  }
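For comparison, a more functional variant of the loop above might look like the sketch below. It is not the code from the answer: js.samples and js.by.factor are hypothetical helper names, the example uses only the two statistics shown in the question, and groups with fewer than two observations per model return NA because density() cannot choose a default bandwidth for them. The global min and max per statistic are computed once and passed down as ordinary arguments.

```r
# Functions from the question, repeated so the sketch is self-contained.
test.logs <- function(log.num) {
  log.num[log.num == -Inf | log.num == Inf] <- 0
  log.num
}
kl.dist <- function(x.p, y.p) {
  sum(x.p * (test.logs(log(x.p)) - test.logs(log(y.p))))
}
js.dist <- function(x.p, y.p, M.p) {
  0.5 * kl.dist(x.p, M.p) + 0.5 * kl.dist(y.p, M.p)
}

# Hypothetical helper: JS distance between two samples on a shared grid [lo, hi].
js.samples <- function(x, y, lo, hi, n = 512) {
  if (length(x) < 2 || length(y) < 2) return(NA_real_)
  x.d <- density(x, n = n, from = lo, to = hi)
  y.d <- density(y, n = n, from = lo, to = hi)
  w <- x.d$x[2] - x.d$x[1]
  x.p <- x.d$y * w
  y.p <- y.d$y * w
  js.dist(x.p, y.p, (x.p + y.p) / 2)
}

# Hypothetical helper: one row per Factor1/Factor2 combination, one column per statistic.
js.by.factor <- function(data.df, stat.cols, model.a, model.b) {
  lo <- sapply(data.df[stat.cols], min)  # global limits, over all models and factors
  hi <- sapply(data.df[stat.cols], max)
  groups <- split(data.df, list(data.df$Factor1, data.df$Factor2), drop = TRUE)
  res <- lapply(groups, function(g) {
    a <- g[g$Model == model.a, , drop = FALSE]
    b <- g[g$Model == model.b, , drop = FALSE]
    sapply(stat.cols, function(s) js.samples(a[[s]], b[[s]], lo[[s]], hi[[s]]))
  })
  do.call(rbind, res)
}

# Example data copied from the question.
data.df <- data.frame(
  Model   = rep(c("M1", "M2", "M3", "M4"), each = 3),
  Factor1 = rep(c(0.0001, 0.0001, 0.0002), times = 4),
  Factor2 = rep(c(0.2, 0.2, 0.3), times = 4),
  stats1  = c(-1.0, -1.3, -1.9, -2.0, -2.0, -2.1, 9.9, 8.3, 8.0, 3.0, 3.5, 3.2),
  stats2  = c(0.9, 0.5, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 0.1, 0.3, 0.3)
)

stat.cols <- c("stats1", "stats2")
res12 <- js.by.factor(data.df, stat.cols, "M1", "M2")
res34 <- js.by.factor(data.df, stat.cols, "M3", "M4")
```

With 10 statistics, only stat.cols would need to change, for example to paste0("stats", 1:10), assuming the columns follow that naming.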

A smart way to write a loop to compute the Jensen-Shannon divergence
