Question

我已经实现了一种基于等式1-3的等位基因丰富度计算算法：

计数稀有性等位基因：私人等位基因和分层抽样设计-Steven T. Kalinowski，Link to PDF

和以下公式相同：

欧洲山毛榉在冰川后定居期间杂合性和等位基因丰富度之间的变化趋势-B. Comps，Link to paper

，需要帮助来进行全面优化。

等位基因丰富度是衡量遗传多样性的指标，但受样本量的影响。该公式使我们无需重新采样即可在较小的样本量下估计预期的等位基因丰富度。我用它来估计所有可能的子采样大小的等位基因富集度，从而绘制出稀疏曲线。

作为算法的第一步，我计算了在每个子采样级别的每个计数级别上未观察到等位基因的概率，从而创建了一个用于计算实际概率的查找表。我利用尽可能多的值仅为1-（以前的计算值）这一事实来计算尽可能少的值（我认为）。这仍然是最慢的部分，但我认为我可以按n * log（n）进行缩放。我最想知道是否有一种更有效的方法来创建向量并将它们连接到表（数据框架）中。

第二步是使用查找表来计算每个子采样级别的预期等位基因丰富度。我将这部分更改为更快的实现，并更新了以下代码。

这是当前的代码。您可以在my GitHub, DiDeoxy/PGDA: calc_allele_richness.R上找到它。

要运行代码，您可以使用以下软件包安装软件包：devtools::install_github("https://github.com/DiDeoxy/PGDA")并与library(PGDA)一起使用

#' Calculate mean allele richness at all sampling levels
#'
#' Calculates the mean allele richness across all markers for a sample at all
#' sampling levels. Missing data is not allowed. Based on the formula presented
#' in https://www.genetics.org/content/157/1/389
#'
#' @param pop a data frame with individuals in columns and markers in rows,
#' there must be atleast two individuals
#' @param allele_coding the coding used for indicating the different alleles
#' @param num_cores the number of cores to use, must be 1 on windows, can use 
#' detectCores() of the parallel package on linux
#'
#' @importFrom magrittr %>%
#' @importFrom parallel mclapply
#' @importFrom scrime rowTables
#' 
#' @return a table of expected allele richness for each marker at each 
#' subsampling level with markers in rows and sampling levels in columns
#'
#' @export

allele_richness <- function (pop, allele_coding = 1:2, num_cores = 1) {
  # the total number of alleles observed at each marker
  n <- ncol(pop)
  # probs contains the probability of not observing allele i at each 
  # sub-sampling level (n - k) for each possible count of allele i with
  # allele count in rows and k in columns
  #
  # for each subsampling level
  probs <- mclapply(0:(n - 1), function (k) {
    # a vector for containng the probs of not observing allele i at each count
    # level at each subsampling (n - k) level
    inter <- rep(0, n)
    # if n - k <= 1 then the prob of not observing allele i is 0 at all count
    # levels, the smaller k is compared to n the more levels will have probs of
    # not observing allel i greater than 0
    if (n - k > 1) {
      # probs of not observing allele i are linear decreasing, therefore the top
      # half and bottom half are 1 - mirrors, we can use this fact to skip a lot
      # of computation
      temp <- lapply(1:floor((n - k) / 2), function (n_i) {
        (n - n_i - k) / (n - k)
      }) %>% do.call(c, .)
      # concatenating the calced probs with their 1 - mirror, if n - k is odd
      # the middle value will equal 0.5 which we do not need to mirror
      temp <- c(temp, rev(1 - temp[which(temp != 0.5)]))
      inter[1:length(temp)] <- temp
      inter
    } else {
      inter
    }
  }, mc.cores = num_cores) %>% do.call(cbind, .)
  # creates a data frame containg the counts of each allele for each marker
  marker_allele_counts <- rowTables(pop, allele_coding)
  # we calcuate the mean allele richness across all markers at each subsampling
  # level (n - k) by calculating the product of not observing each allele at
  # each sub-sampling level then taking the sum of these for each marker and 
  # then taking the mean across all markers
  #
  # for each marker
  mclapply(1:nrow(marker_allele_counts), function (marker) {
    (1 - lapply(1:length(marker_allele_counts[marker, ]), function (allele) {
      # for each allele, calc the probability of not observing the allele at
      # each sub-sampling level
      cumprod(probs[marker_allele_counts[[marker, allele]], ])
    # rbind the probabilities for each allele at each sub-smapling level,
    # subtract from one to turn them into probabilities of observing the allele,
    # and sum the alleles together
    }) %>% do.call(rbind, .)) %>% colSums()
  # return a table with markers in rows and sub-sampling levels in columns
  }, mc.cores = num_cores) %>% do.call(rbind, .)
}

感谢您提供的任何帮助，这是我第一次编写这样的内容。

干杯

DiDeoxy。

优化算法，用于计算不同子采样级别的预期等位基因丰富度

0 个答案: