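I am trying to overlay a plot of a mixture distribution with plots of its identified component distributions, using the ggplot2 package and its stat_function() with a user-defined function. I tried two approaches. In both cases, the distribution identification itself works correctly: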
number of iterations= 11
summary of normalmixEM object:
comp 1 comp 2
lambda 0.348900 0.65110
mu 2.019878 4.27454
sigma 0.237472 0.43542
loglik at estimate: -276.3643
A) In the first approach, however, the output contains the following error:
Error in eval(expr, envir, enclos) : object 'comp.number' not found
A reproducible example for this approach follows (faithful is a built-in R dataset):
library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)
mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number)
    {mix$lambda[comp.number] *
       dnorm(x, mean = mix$mu[comp.number],
             sd = mix$sigma[comp.number])},
    geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)
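B) The second approach does not produce any errors. However, the only visible plot is that of the mixture distribution. The plots of its component distributions are either not produced or not visible (it seems to me that a straight horizontal line at y = 0 is also visible, but I am not 100% sure):
Here is a reproducible example for this approach: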
library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)
mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)
Question: What is wrong with each approach? Which one is more correct?
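UPDATE: A few minutes after posting, I realized that I had forgotten to add the line-drawing part of stat_function() to the second approach, so the corresponding lines should read as follows: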
distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))
However, this update produces an error whose source I do not quite understand:
Error in FUN(1:2[[1L]], ...) :
unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))
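A likely reading of this error, offered as a hedged note rather than a confirmed diagnosis: in the updated lines the closing parenthesis of stat_function() comes right after args = list(...), so geom and aes(...) become extra arguments of lapply(). lapply() forwards them to the anonymous function(i), which accepts only i, hence "unused arguments". The base-R sketch below reproduces the same mechanism without ggplot2; `square` and the stray `geom` argument are hypothetical names used only for illustration.

```r
# Hypothetical one-argument function, standing in for function(i) ... above.
square <- function(i) i^2

# Correct call: nothing extra is forwarded to square().
ok <- lapply(seq(2), square)

# Misplaced argument: lapply() forwards `geom` to square(), which has no
# such parameter, so R signals an "unused argument" error.
bad <- tryCatch(
  lapply(seq(2), square, geom = "line"),
  error = function(e) e
)

print(ok)
print(conditionMessage(bad))
```

Moving the closing parenthesis so that geom ends up inside stat_function(...) removes this particular error. The answer below additionally sets the colour through a fixed color argument rather than through aes().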
Answer 0 (score: 3)
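Finally, I have figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to this question (which I had not seen when I posted mine): Any suggestions for how I can plot mixEM type data using ggplot2. My solution is a bit different, however. On the one hand, I used @Spacedman's approach based on stat_function(), the same idea I tried in my original version. I prefer it to the alternatives, which seem somewhat overcomplicated (though more flexible). On the other hand, similarly to @jlhoward's approach, I simplified the parameter passing. I also introduced some visual improvements, such as automatically selecting well-differentiated colors so that the components are easier to identify. For my EDA, I refactored this code into an R module. However, one issue remained that I was still trying to figure out: why the component distribution plots sit below the expected density plot, as shown below. Any advice on this issue would be much appreciated!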
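UPDATE: I finally figured out the scaling issue and updated the code and the figure accordingly: the y values need to be multiplied by the value of binwidth (0.5 in this case) to account for the number of observations per bin.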
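For reference, the general relationship between a histogram's density and its counts is count = density × n × binwidth, where n is the number of observations. It follows that density × binwidth (the 0.5 * density used here) is the fraction of observations falling into each bin. The sketch below checks this with base R's hist() on the same data; the break points from 40 to 100 are an assumption chosen only to cover the range of faithful$waiting.

```r
x <- faithful$waiting                  # built-in dataset used in this post
binwidth <- 0.5
breaks <- seq(40, 100, by = binwidth)  # assumed range covering the data

h <- hist(x, breaks = breaks, plot = FALSE)

# density * binwidth is the fraction of observations in each bin ...
fraction <- h$density * binwidth
# ... and multiplying by n turns that fraction into a count.
counts_from_density <- fraction * length(x)

print(sum(fraction))
print(all.equal(counts_from_density, as.numeric(h$counts)))
```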
Here is the complete reproducible solution:
library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

# weighted normal density of a single mixture component
calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

g <- ggplot(data.frame(x = data)) +
  geom_histogram(aes(x = x, y = 0.5 * ..density..),
                 fill = "white", color = "black", binwidth = 0.5)

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

# one stat_function() layer per component; 'args' is spelled out in full
distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                args = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 2,
                color = DISTRIB_COLORS[i]))

print(g + distComps)
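As a sanity check on what each stat_function() layer draws, the sketch below evaluates the same weighted normal density in base R. It uses the parameter values printed in the question's summary output (fitted to faithful$eruptions) rather than a fresh normalmixEM() run, so mixtools is not required. The `fit` list and the `mixture_density` helper are hypothetical names introduced for illustration, and the integration range 0 to 8 is an assumption that comfortably covers both components.

```r
# Parameter values copied from the summary output quoted in the question.
fit <- list(lambda = c(0.348900, 0.65110),
            mu     = c(2.019878, 4.27454),
            sigma  = c(0.237472, 0.43542))

# Same weighted-normal form as calc.components() in the solution above.
calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

# Hypothetical helper: the mixture density is the sum of its components.
mixture_density <- function(x, mix) {
  Reduce(`+`, lapply(seq_along(mix$lambda),
                     function(i) calc.components(x, mix, i)))
}

# Each component should integrate to roughly its weight lambda, and the
# whole mixture to roughly 1, so the curves are on a density scale.
comp_areas <- sapply(seq_along(fit$lambda), function(i)
  integrate(calc.components, 0, 8, mix = fit, comp.number = i)$value)
total_area <- integrate(mixture_density, 0, 8, mix = fit)$value

print(comp_areas)
print(total_area)
```

Because the curves live on a density scale, they only line up with a histogram whose y aesthetic is also expressed in density terms, which is why the histogram layer above maps y to a scaled density rather than to raw counts.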