A clever way to write the loop that calculates the Jensen-Shannon divergence

Posted: 2015-08-20 16:12:10

Tags: r aggregate apply

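I am trying to use the apply or aggregate functions to calculate the Jensen-Shannon divergence (JS.dist) between some distributions.

I simulated data under four different models, and for each simulated dataset I calculated a series of summary statistics.

Imagine I have the following data.frame: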

dataframe1:
Model Factor1 Factor2 stats1 stats2
M1    0.0001  0.2     -1.0   0.9
M1    0.0001  0.2     -1.3   0.5
M1    0.0002  0.3     -1.9   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0002  0.3     -2.1   0.4
M3    0.0001  0.2      9.9   0.4
M3    0.0001  0.2      8.3   0.4
M3    0.0002  0.3      8.0   0.4
M4    0.0001  0.2      3.0   0.1
M4    0.0001  0.2      3.5   0.3
M4    0.0002  0.3      3.2   0.3
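
For anyone who wants to reproduce the problem, the table above can be loaded into R roughly as follows. This is only a convenience sketch: the object name `data.df` is chosen to match the argument name used in the function further down, and nothing else is assumed beyond the values shown in the table.

```r
# Read the example table into a data frame called data.df
data.df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Model Factor1 Factor2 stats1 stats2
M1    0.0001  0.2     -1.0   0.9
M1    0.0001  0.2     -1.3   0.5
M1    0.0002  0.3     -1.9   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0001  0.2     -2.0   0.2
M2    0.0002  0.3     -2.1   0.4
M3    0.0001  0.2      9.9   0.4
M3    0.0001  0.2      8.3   0.4
M3    0.0002  0.3      8.0   0.4
M4    0.0001  0.2      3.0   0.1
M4    0.0001  0.2      3.5   0.3
M4    0.0002  0.3      3.2   0.3
")

# Quick look at the structure: 12 rows, 5 columns
str(data.df)
```

Note that with so few rows, some Model/Factor1/Factor2 combinations contain a single observation, which is too few for `density()` with its default bandwidth.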

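The functions used to calculate JS.dist are as follows:

# Function to replace Inf or -Inf with zero in the logs. It takes the log of a number as its argument.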

 test.logs <- function(log.num){

  log.num[log.num == -Inf | log.num == Inf] <- 0
  return (log.num)

}

# Function to calculate kl.dist (Kullback-Leibler divergence). It takes the probability
# vectors (see below) of two distributions (x.p and y.p) as arguments.
kl.dist <- function(x.p, y.p) {
  # x.p, y.p: probability vectors for x and y distributions

  log.x <- test.logs(log(x.p))
  log.y <- test.logs(log(y.p))

  sum(x.p * (log.x - log.y))

}

# Function to calculate js.dist. It takes the probability vectors of x, y and M as arguments. M is the middle distribution.

js.dist <- function(x.p, y.p, M.p){
  0.5 * kl.dist(x.p, M.p) + .5 * kl.dist(y.p, M.p)
}
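
As a quick sanity check of the three functions above, they can be run on small hand-made probability vectors where the answer is known: identical distributions should give 0, and distributions with no overlap should give log(2) when natural logs are used. The vectors `p` and `q` below are made up for illustration, and the function definitions are repeated only so the example runs on its own.

```r
# Helper functions as defined in the question, repeated for a self-contained example
test.logs <- function(log.num) {
  log.num[log.num == -Inf | log.num == Inf] <- 0
  log.num
}
kl.dist <- function(x.p, y.p) {
  sum(x.p * (test.logs(log(x.p)) - test.logs(log(y.p))))
}
js.dist <- function(x.p, y.p, M.p) {
  0.5 * kl.dist(x.p, M.p) + 0.5 * kl.dist(y.p, M.p)
}

# Two illustrative discrete distributions that overlap only in the middle bin
p <- c(0.5, 0.5, 0)
q <- c(0, 0.5, 0.5)
m <- (p + q) / 2  # middle distribution

js.same     <- js.dist(p, p, p)                            # identical distributions
js.partial  <- js.dist(p, q, m)                            # partial overlap
js.disjoint <- js.dist(c(1, 0), c(0, 1), c(0.5, 0.5))      # no overlap

print(c(same = js.same, partial = js.partial, disjoint = js.disjoint))
```

The partial-overlap case works out to 0.5 * log(2), which sits halfway between the two extremes, as expected.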

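To use the functions above, I have to calculate the density curves of my distributions (stats1 and stats2, by model and by factor).

To do that, I have to set a minimum and a maximum value over which the density curve is calculated for each statistic.

For example: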

x.d <- density(x, n=512, from=min, to=max)
y.d <- density(y, n=512, from=min, to=max)
M.d <- (x.d$y + y.d$y)/2

# width of the histogram
w <- x.d$x[2] - x.d$x[1]

# probability of x value in n-th bin
x.p <- x.d$y * w # (hist height) * (bin width)
y.p <- y.d$y * w
M.p <- M.d * w
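
The snippet above can be tried on simulated data to see why a shared range matters. The following is a minimal sketch under assumptions: the two samples are invented normal draws, and `lo` and `hi` are hypothetical names used in place of `min` and `max` to avoid shadowing the base functions. Because both calls to `density()` use the same `n`, `from` and `to`, both curves are evaluated on the same grid, so the probability vectors line up bin by bin.

```r
set.seed(42)

# Two made-up samples standing in for one statistic under two models
x <- rnorm(200, mean = 0)
y <- rnorm(200, mean = 1)

# Shared range computed over both samples together
lo <- min(x, y)
hi <- max(x, y)

# Density curves evaluated on the same 512-point grid
x.d <- density(x, n = 512, from = lo, to = hi)
y.d <- density(y, n = 512, from = lo, to = hi)
same.grid <- isTRUE(all.equal(x.d$x, y.d$x))

# Bin width and approximate bin probabilities
w   <- x.d$x[2] - x.d$x[1]
x.p <- x.d$y * w
y.p <- y.d$y * w

# The totals are close to, but not exactly, 1, because the kernel density
# estimate puts a little mass outside [lo, hi]
total.x <- sum(x.p)
total.y <- sum(y.p)
print(c(same.grid = same.grid, total.x = total.x, total.y = total.y))
```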

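I tried to write code with two for loops (one for each factor), in which I subset the data by model and calculate the minimum and maximum of each statistic. Finally I calculate the densities and probabilities, and only after that can I calculate JS.dist.

Example R code: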

density_js.dist <- function(data.df){
# gets the unique values for mutation rate
factor1 <- unique(data.df$Factor1)
# gets the unique values for rate of new copies
factor2 <- unique(data.df$Factor2)

# calculates the minimum and maximum value for each of the statistics
# showing only for stats1
stats1.min <- min(data.df$stats1)
stats1.max <- max(data.df$stats1)



# for loop to calculate the densities and probabilities and JS distance for each combination of factor1 and factor2

for (f1 in factor1){
  for (f2 in factor2){

  # compare against the data frame columns (Factor1, Factor2), not the local vectors factor1/factor2
  new.df <- subset(data.df, Factor1 == f1 & Factor2 == f2)

  # subsetting data. One data frame for each of the four models
  MM.df <- subset(new.df, Model == "M1")
  TM.df <- subset(new.df, Model == "M2")

  MI.df <- subset(new.df, Model == "M3")
  TI.df <- subset(new.df, Model == "M4")

  # density and probability for each stats

  #1.stats1
  # calculating densities for model M1 and M2
  MM1.d <- density(MM.df$stats1, n=512, from=stats1.min, to=stats1.max)
  TM1.d <- density(TM.df$stats1, n=512, from=stats1.min, to=stats1.max)

  # Density for the middle distribution between models M1 and M2 
  Middle12.d <- (MM1.d$y + TM1.d$y)/2

  # width for models
  w12 <- MM1.d$x[2] - MM1.d$x[1]

  # calculating probabilities for each models
  MM1.p <- MM1.d$y * w12 # (hist height) * (bin width)
  TM1.p <- TM1.d$y * w12
  Middle12.p <- Middle12.d * w12 

  # calculating densities for models M3 and M4
  MI1.d <- density(MI.df$stats1, n=512, from=stats1.min, to=stats1.max)
  TI1.d <- density(TI.df$stats1, n=512, from=stats1.min, to=stats1.max)
  Middle34.d <- (MI1.d$y + TI1.d$y)/2

  w34 <- MI1.d$x[2] - MI1.d$x[1]

  # calculating probabilities for M3 and M4 models
  MI1.p <- MI1.d$y * w34 
  TI1.p <- TI1.d$y * w34
  Middle34.p <- Middle34.d * w34 


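  # print the results; a bare call inside a for loop is silently discarded
  print(c(Factor1 = f1, Factor2 = f2,
          JS.M1.M2 = js.dist(MM1.p, TM1.p, Middle12.p),
          JS.M3.M4 = js.dist(MI1.p, TI1.p, Middle34.p)))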
  }
 }
}

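My question is:

I have tried to use apply or aggregate, but I cannot figure out how to pass the min and max of each statistic as arguments so that I can create the density curves. Note that these min and max values are calculated over all combinations of factors and models, not for each subset. For the distributions to be comparable, I cannot calculate the min and max per subset by factor and model.

My real data is much more complex. I have 10 different statistics, and I want to calculate JS.dist between two distributions by factor. My two pairs of distributions are M1 vs. M2 and M3 vs. M4. The code above works, but it requires me to write more than 700 lines, and I really don't think that is very smart.

I would really appreciate it if someone could help me.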

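1 Answer:

Answer 0 (score: 1)

Here is a hacky way to calculate all 10 series at once using lists. Because of the length and verbosity of the code, a complete rewrite would be needed if you want a functional solution. I could only test the code on the first two series (and not even fully, because several Factor1:Factor2 combinations have only 1 observation, which makes the density calculation impossible). I also removed the function wrapper, since it served no purpose.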

library(dplyr)
L = list()

  # gets the unique values for mutation rate
  factor1 <- unique(data.df$Factor1)
  # gets the unique values for rate of new copies
  factor2 <- unique(data.df$Factor2)

  # calculates the minimum and maximum value for each of the statistics
  # Store all 10 min and max in a vector
  vector.min <- lapply(data.df %>% select(stats1:stats10), min)
  vector.max <- lapply(data.df %>% select(stats1:stats10), max)

  # for loop to calculate the densities and probabilities and JS distance for each combination of factor1 and factor2

  for (f1 in factor1){
    for (f2 in factor2){
      # compare against the data frame columns (Factor1, Factor2), not the local vectors factor1/factor2
      new.df <- subset(data.df, Factor1 == f1 & Factor2 == f2)
      # subsetting data. One data frame for each of the four models
      MM.df <- subset(new.df, Model == "M1")
      TM.df <- subset(new.df, Model == "M2")
      MI.df <- subset(new.df, Model == "M3")
      TI.df <- subset(new.df, Model == "M4")

      # density and probability for each stats

      # calculating densities for model M1 and M2
      MM.d = lapply(1:10, function(i) density(MM.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      TM.d = lapply(1:10, function(i) density(TM.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))

      # Density for the middle distribution between models M1 and M2 
      Middle12.d <- mapply(function(d1, d2) (d1$y+d2$y)/2, MM.d, TM.d, SIMPLIFY = F)

      # width for models
      w12 = lapply(MM.d, function(y) {y$x[2] - y$x[1]})

      # calculating probabilities for each models
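      # SIMPLIFY = FALSE keeps one probability vector per statistic in a list,
      # so that mapply(js.dist, ...) below iterates over statistics, not single values
      MM1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, MM.d, w12, SIMPLIFY = FALSE)  # (hist height) * (bin width)
      TM1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, TM.d, w12, SIMPLIFY = FALSE)
      Middle12.p = mapply("*", Middle12.d, w12, SIMPLIFY = FALSE)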

      # calculating densities for models M3 and M4
      MI.d = lapply(1:10, function(i) density(MI.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      TI.d = lapply(1:10, function(i) density(TI.df %>% select(i+3) %>% unlist, n = 512, from = vector.min[[i]], to = vector.max[[i]]))
      # SIMPLIFY = FALSE keeps the results as lists with one element per statistic
      Middle34.d <- mapply(function(d1, d2) (d1$y+d2$y)/2, MI.d, TI.d, SIMPLIFY = FALSE)

      w34 = lapply(MI.d, function(y) {y$x[2] - y$x[1]})      

      # calculating probabilities for M3 and M4 models
      MI1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, MI.d, w34, SIMPLIFY = FALSE)  # (hist height) * (bin width)
      TI1.p = mapply(function(arg1, arg2) {arg1$y * arg2}, TI.d, w34, SIMPLIFY = FALSE)
      Middle34.p = mapply("*", Middle34.d, w34, SIMPLIFY = FALSE)

      L = c(L, list(mapply(js.dist, MM1.p, TM1.p, Middle12.p), mapply(js.dist, MI1.p, TI1.p, Middle34.p)))
    }
  }
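
For comparison, the same idea can be written more compactly by pulling the density-to-JS step into one helper and looping over statistic names. The following is only a rough sketch under assumptions, not a tested replacement for the code above: `js.from.samples` and `js.by.factor` are hypothetical names, the data is simulated for illustration, and groups with fewer than two observations are returned as NA because `density()` cannot estimate a default bandwidth for them. The global min and max of each statistic are computed once over the whole data frame and passed down as plain arguments, which is the part the question asks about.

```r
# Helper functions from the question, repeated so the sketch runs on its own
test.logs <- function(log.num) { log.num[is.infinite(log.num)] <- 0; log.num }
kl.dist <- function(x.p, y.p) sum(x.p * (test.logs(log(x.p)) - test.logs(log(y.p))))
js.dist <- function(x.p, y.p, M.p) 0.5 * kl.dist(x.p, M.p) + 0.5 * kl.dist(y.p, M.p)

# Hypothetical helper: JS divergence between two samples on a shared grid [lo, hi]
js.from.samples <- function(x, y, lo, hi, n = 512) {
  x.d <- density(x, n = n, from = lo, to = hi)
  y.d <- density(y, n = n, from = lo, to = hi)
  w   <- x.d$x[2] - x.d$x[1]          # bin width
  x.p <- x.d$y * w
  y.p <- y.d$y * w
  js.dist(x.p, y.p, (x.p + y.p) / 2)  # middle distribution is the average
}

# Hypothetical wrapper: one row per Factor1/Factor2 combination, model pair and statistic
js.by.factor <- function(data.df, stat.cols, pairs) {
  # global range of each statistic, over all models and factor combinations
  lo <- sapply(data.df[stat.cols], min)
  hi <- sapply(data.df[stat.cols], max)
  groups <- split(data.df, list(data.df$Factor1, data.df$Factor2), drop = TRUE)
  rows <- lapply(groups, function(g) {
    do.call(rbind, lapply(pairs, function(pr) {
      a <- g[g$Model == pr[1], , drop = FALSE]
      b <- g[g$Model == pr[2], , drop = FALSE]
      js <- vapply(stat.cols, function(s) {
        if (nrow(a) < 2 || nrow(b) < 2) return(NA_real_)  # too few points for density()
        js.from.samples(a[[s]], b[[s]], lo[[s]], hi[[s]])
      }, numeric(1))
      data.frame(Factor1 = g$Factor1[1], Factor2 = g$Factor2[1],
                 pair = paste(pr, collapse = "-"),
                 stat = stat.cols, js = js, row.names = NULL)
    }))
  })
  do.call(rbind, c(unname(rows), list(make.row.names = FALSE)))
}

# Simulated data for illustration only: 30 replicates per model and factor combination
set.seed(1)
sim <- expand.grid(rep = 1:30, Model = c("M1", "M2", "M3", "M4"),
                   Factor1 = c(0.0001, 0.0002), Factor2 = c(0.2, 0.3),
                   stringsAsFactors = FALSE)
sim$stats1 <- rnorm(nrow(sim), mean = match(sim$Model, c("M1", "M2", "M3", "M4")))
sim$stats2 <- runif(nrow(sim))

res <- js.by.factor(sim, stat.cols = c("stats1", "stats2"),
                    pairs = list(c("M1", "M2"), c("M3", "M4")))
print(res)
```

With 10 statistics, only `stat.cols` would need to change, and further model pairs can be added to `pairs` without touching the helper.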