Integrating ggplot2 with a user-defined stat_function()

Time: 2014-08-29 12:57:54

Tags: r plot ggplot2 distribution data-visualization

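I'm trying to overlay the plot of a mixture distribution with the plots of its identified component distributions, using the ggplot2 package and a user-defined function passed to its stat_function(). I tried two approaches. In both cases, the identification of the distributions itself works correctly: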

number of iterations= 11 
summary of normalmixEM object:
         comp 1  comp 2
lambda 0.348900 0.65110
mu     2.019878 4.27454
sigma  0.237472 0.43542
loglik at estimate:  -276.3643 

A) However, in the first approach, the output contains the following error:

Error in eval(expr, envir, enclos) : object 'comp.number' not found

A reproducible example for this approach follows (using the built-in R dataset faithful):

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number) 
  {mix$lambda[comp.number] *
     dnorm(x, mean = mix$mu[comp.number],
           sd = mix$sigma[comp.number])}, 
  geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)
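A possible reading of why approach A fails, offered as a sketch rather than a definitive diagnosis: stat_function() calls `fun` with a vector of x values as its first argument, so a function declared as `function(mix, comp.number)` never receives `x`. In addition, `aes(colour = DISTRIB_COLORS[comp.number])` is evaluated later, outside the helper in which `comp.number` existed, which is the likely source of the "object 'comp.number' not found" error in the ggplot2 versions of that time. The version below wraps the parameters in a closure over `x` and sets the colour as a constant. The list `mix` (values copied from the summary above), the helper name `component_layer`, and the use of `after_stat()` (ggplot2 >= 3.3.0) are assumptions made for illustration.

```r
library(ggplot2)

# Parameters copied from the normalmixEM summary in the question, stored in a
# plain list as a stand-in for the fitted object (an assumption for this sketch).
mix <- list(lambda = c(0.348900, 0.65110),
            mu     = c(2.019878, 4.27454),
            sigma  = c(0.237472, 0.43542))
component_colours <- c("green", "red")

# Hypothetical helper: returns one stat_function() layer for component k.
component_layer <- function(mix, k, colour) {
  force(mix); force(k); force(colour)  # fix the values captured by the closure
  stat_function(
    # stat_function() passes the x grid as the first argument
    fun = function(x) mix$lambda[k] * dnorm(x, mean = mix$mu[k], sd = mix$sigma[k]),
    geom = "line",
    colour = colour  # set as a constant, not mapped through aes()
  )
}

layers <- lapply(seq_along(mix$mu),
                 function(k) component_layer(mix, k, component_colours[k]))

# Histogram on the density scale, on the same variable the mixture was fitted to.
p <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1) +
  layers
```

Calling `print(p)` draws the histogram with both component curves. The bin width of 0.1 is an arbitrary choice for this sketch.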

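B) The second approach produces no errors. However, the only visible plot is that of the mixture distribution. The plots of its component distributions are either not generated or not visible (it seems to me that a straight horizontal line at y = 0 is also visible, but I'm not 100% sure).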


A reproducible example for this approach follows:

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)
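One plausible explanation for the flat line at y = 0 in approach B, stated as an assumption rather than a confirmed diagnosis: the mixture is fitted on faithful$eruptions (component means of about 2.0 and 4.3), while the plot puts faithful$waiting on the x axis, whose values are more than ten times larger. Evaluated over that range, the component densities are numerically zero. The histogram is also drawn on the count scale, which would dwarf any density curve. The check below uses only base R and the parameters printed in the summary above.

```r
# Component parameters copied from the normalmixEM summary in the question
# (fitted on faithful$eruptions).
lambda <- c(0.348900, 0.65110)
mu     <- c(2.019878, 4.27454)
sigma  <- c(0.237472, 0.43542)

# x positions that the plot actually covers: the range of faithful$waiting.
x_grid <- seq(min(faithful$waiting), max(faithful$waiting), length.out = 101)

# One column of weighted densities per component, evaluated on that grid.
comp_density <- sapply(seq_along(mu), function(k) {
  lambda[k] * dnorm(x_grid, mean = mu[k], sd = sigma[k])
})

max_density <- max(comp_density)
print(max_density)                # largest curve height over the plotted range
print(range(faithful$eruptions))  # the range the mixture was actually fitted on
```

Fitting the mixture and drawing the histogram on the same variable removes this mismatch.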

Question: What is wrong with each approach? Which approach is more correct?

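UPDATE: A few minutes after posting, I realized that I had forgotten to add the line-drawing part to stat_function() in the second approach, so the corresponding lines become: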

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))

However, this update produces an error whose source I don't quite understand:

Error in FUN(1:2[[1L]], ...) : 
  unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))
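The "unused arguments" message is consistent with a misplaced parenthesis: in the updated lines, the call to stat_function() closes right after `args = list(...)`, so `geom` and `aes(...)` become extra arguments of lapply(), which forwards them to the anonymous `function(i)`. The same behaviour can be reproduced without ggplot2. In the sketch below, `scale_by` is a hypothetical stand-in for stat_function().

```r
# Extra arguments placed after FUN are forwarded by lapply() to FUN.
# FUN accepts only `i`, so the call fails with an "unused argument" error.
misplaced <- tryCatch(
  lapply(1:2, function(i) i * 10, geom = "line"),
  error = function(e) conditionMessage(e)
)
print(misplaced)

# Hypothetical stand-in for stat_function(): the option belongs inside this call.
scale_by <- function(i, factor) i * factor

# With the argument inside the inner call, FUN receives only `i`.
fixed <- lapply(1:2, function(i) scale_by(i, factor = 10))
print(fixed)
```

Applied to the question's code, this means moving the closing parenthesis of stat_function() so that it comes after `aes(colour = DISTRIB_COLORS[i])`.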

1 answer:

Answer 0 (score: 3)

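Finally, I have figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to this question (which I had not seen when I posted mine): Any suggestions for how I can plot mixEM type data using ggplot2. However, my solution is a little different. On the one hand, I used @Spacedman's approach based on stat_function(), the same idea I tried in my original version; I like it better than the alternative, which seems a bit too complex (though more flexible). On the other hand, similarly to @jlhoward's approach, I simplified the parameter passing. I also introduced some visual improvements, such as automatic selection of differentiated colors to make the components easier to identify. For my EDA, I refactored this code into an R module. However, there is still one issue that I am trying to figure out: why the component distribution plots are not positioned as expected relative to the density plot. Any advice on this issue would be much appreciated!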

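UPDATE: Finally, I have figured out the scaling issue and updated the code and the figure accordingly: the y values need to be multiplied by the value of binwidth (0.5 in this case) to account for the number of observations per bin.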
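The relationship between density, bin width, and counts can be checked in base R. This is a minimal sketch that uses hist() instead of ggplot2, with breaks chosen only for illustration (they are not necessarily the bins ggplot2 would pick): a bar's density times the bin width gives the share of observations in that bin, and multiplying by the sample size recovers the count.

```r
x <- faithful$waiting
binwidth <- 0.5

# Illustrative breaks that cover the data range (an assumption of this sketch).
breaks <- seq(floor(min(x)) - binwidth / 2,
              ceiling(max(x)) + binwidth / 2,
              by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

prop_per_bin  <- h$density * binwidth       # share of observations in each bin
count_per_bin <- prop_per_bin * length(x)   # back to raw counts

print(sum(prop_per_bin))                    # the shares add up to one
print(all.equal(count_per_bin, as.numeric(h$counts)))
```

A density curve drawn over a count histogram would therefore need the factor `length(x) * binwidth`, while a histogram drawn on the density scale needs no rescaling of the curves.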


Here is the complete, reusable, and reproducible solution:

library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

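g <- ggplot(data.frame(x = data)) +
  # map the data frame column `x`; in ggplot2 >= 3.3.0, `..density..`
  # can be written as after_stat(density)
  geom_histogram(aes(x = x, y = 0.5 * ..density..),
                 fill = "white", color = "black", binwidth = 0.5)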

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

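# one stat_function() layer per component, each with its own color
distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                args = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 2,      # named `linewidth` in ggplot2 >= 3.4.0
                color = DISTRIB_COLORS[i]))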
print(g + distComps)