Question

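I am trying to overlay a plot of a mixture distribution with plots of its identified component distributions, using the ggplot2 package and a user-defined function passed to its stat_function(). I have tried two approaches. In both cases, the identification of the distributions works fine: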

number of iterations= 11 
summary of normalmixEM object:
         comp 1  comp 2
lambda 0.348900 0.65110
mu     2.019878 4.27454
sigma  0.237472 0.43542
loglik at estimate:  -276.3643

A) However, in the first approach, the output contains the following error:

Error in eval(expr, envir, enclos) : object 'comp.number' not found

A reproducible example for this approach follows (faithful is a built-in R dataset):

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number) 
  {mix$lambda[comp.number] *
     dnorm(x, mean = mix$mu[comp.number],
           sd = mix$sigma[comp.number])}, 
  geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)

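B) The second approach does not produce any errors. However, the only plot visible is that of the mixture distribution. The plots of its component distributions are either not produced or not visible (it seems to me that a straight horizontal line at y = 0 is also visible, but I'm not 100% sure).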


A reproducible example for this approach follows:

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)

Question: What is wrong with each approach? Which one is more correct?

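Update: A few minutes after posting, I realized that I had forgotten to add the line-drawing part of stat_function() for the second approach, so that the corresponding lines read as follows: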

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))

However, this update produces an error whose source I don't quite understand:

Error in FUN(1:2[[1L]], ...) : 
  unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))

Answer 1

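Finally, I have figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to this question (which I had not seen when I posted mine): Any suggestions for how I can plot mixEM type data using ggplot2. However, my solution is a little different. On one hand, I used @Spacedman's approach based on stat_function(), the same idea I tried to use in my original version. I like it better than the alternative, which seems a bit too complex (although more flexible). On the other hand, similarly to @jlhoward's approach, I simplified the parameter passing. I also introduced a few visual improvements, such as automatic selection of well-differentiated colors so that the component distributions are easier to identify. For my EDA, I refactored this code into an R module. However, there was still one issue I was trying to figure out: why the component distribution curves lie below the expected density plot. Any advice on this issue would be much appreciated!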

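Update: Finally, I have figured out the scaling issue and updated the code and the figure accordingly. The y values need to be multiplied by the value of binwidth (0.5 in this case) to account for the number of observations per bin.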
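The scaling relationship can be sanity-checked outside ggplot2. The sketch below is not part of the original solution: it uses base R's hist() and a hypothetical bin width of 5 (chosen only for this check) to illustrate the general identity that the count in a bin equals density times number of observations times bin width. Under that identity, multiplying a density by the bin width gives the proportion of observations per bin, and multiplying further by the sample size gives counts.

```r
# Illustrative check only; 'binwidth' and 'breaks' are assumptions for this sketch.
x <- faithful$waiting
binwidth <- 5
breaks <- seq(floor(min(x) / binwidth) * binwidth,
              ceiling(max(x) / binwidth) * binwidth,
              by = binwidth)

# Compute the histogram without drawing it
h <- hist(x, breaks = breaks, plot = FALSE)

# count per bin = density * number of observations * bin width
reconstructed <- h$density * length(x) * diff(h$breaks)
same <- isTRUE(all.equal(as.numeric(h$counts), reconstructed))
print(same)
```

A curve drawn with stat_function() has to be put on the same scale as the histogram's y axis (density, proportion, or count) for the overlay to line up.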


The complete, reproducible solution follows:

library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

g <- ggplot(data.frame(x = data)) +
  geom_histogram(aes(x = x, y = 0.5 * ..density..),
                 fill = "white", color = "black", binwidth = 0.5)

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                args = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 2,
                color = DISTRIB_COLORS[i]))
print(g + distComps)
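As a side note, a likely explanation for the "unused arguments" error in the question's update is parenthesis placement: there, geom and aes(...) come after the parenthesis that closes stat_function(), so they become extra arguments to lapply(), which forwards them to the anonymous function. The sketch below reproduces the same situation in base R only. The helper f and its geom parameter are hypothetical stand-ins for stat_function(), not ggplot2 code.

```r
# Hypothetical stand-in for a layer constructor that accepts a 'geom' argument
f <- function(i, geom = "point") paste(i, geom)

# Intended placement: 'geom' is inside the call made by the anonymous function
ok <- lapply(1:2, function(i) f(i, geom = "line"))

# Misplaced: 'geom' is given to lapply(), which passes it on to the anonymous
# function; that function only accepts 'i', so R raises an error
bad <- tryCatch(
  lapply(1:2, function(i) f(i), geom = "line"),
  error = function(e) conditionMessage(e)
)

print(ok)
print(bad)
```

In the solution above, geom, size and color are all inside the stat_function() call, which avoids the problem.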

Integrating ggplot2 with a user-defined stat_function()
