Question

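I have some right-skewed data, and I would like to use ggplot to visually compare a fitted distribution against the data on both the regular scale and the log scale. However, when I transform the axis with scale_x_continuous() or scale_x_log10(), the distribution curve is not transformed correctly.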

x <- rlnorm(1000, meanlog = -4, sdlog = 1)
ggplot(data.frame(x)) +
  geom_histogram(aes(x, y = ..density.. * 25)) +
  scale_x_log10() +
  stat_function(fun = "dlnorm",
                args = list(meanlog = -4,
                            sdlog = 1))

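Notice that the mean of the lognormal curve does not match the mean of the histogram. Why not? Is there a way to make them match?

In another related post, the suggested answer was to include the argument inherit.aes = FALSE, but that does not help here.

I am using R version 3.4.3 and ggplot2 version 2.2.1.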

Answer 1

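First, recall that the log-normal distribution uses the natural logarithm by default, not the base-10 logarithm. Part of the problem in the plot above comes from mixing logarithm bases.

Let's start by generating a sample of observations of a log-normal random variable $X$ with meanlog -4 and sdlog 1, i.e.

$X \sim \mathrm{Lognormal}(\mathrm{meanlog} = -4,\ \mathrm{sdlog} = 1)$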

library(ggplot2)
library(gridExtra)

set.seed(42)

dat <- data.frame(x = rlnorm(1000, meanlog = -4, sdlog = 1))

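We will first plot the density on the standard x-axis. I use geom_histogram with stat = "density" so that the bars are scaled and the y = ..density.. aesthetic is not needed. This is very similar to your original plot, just without any attempt to rescale the x-axis.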

ggplot(dat) +
  geom_histogram(mapping = aes(x = x), stat = "density")  +
  stat_function(fun = "dlnorm",
                args = list(meanlog = -4, sdlog = 1),
                n = 501,
                color = "red")

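Now, recall that if

$X \sim \mathrm{Lognormal}(\mathrm{meanlog} = -4,\ \mathrm{sdlog} = 1)$

then

$\log(X) \sim N(\mathrm{mean} = -4,\ \mathrm{sd} = 1)$

where log is the natural logarithm.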
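The reason dlnorm and dnorm cannot simply be swapped between the two axes is the change-of-variables factor. As a short worked derivation (the symbols $\mu$ for meanlog, $\sigma$ for sdlog, and $Y$ for the logged variable are notation introduced here for illustration, not taken from the original answer):

```latex
\begin{align*}
Y &= \log X \sim N(\mu, \sigma^2) \\
f_X(x) &= f_Y(\log x)\,\left|\frac{d}{dx}\log x\right| = \frac{1}{x}\, f_Y(\log x) \\
\texttt{dlnorm}(x, \mu, \sigma) &= \frac{\texttt{dnorm}(\log x, \mu, \sigma)}{x}
\end{align*}
```

The two densities differ by the factor $1/x$, so they peak at different places: a histogram of log(x) should be compared with dnorm, while a histogram of x should be compared with dlnorm.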

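One way to plot the generated sample on the log scale is shown below. Note that the log transformation is explicit in the mapping of geom_histogram, and that stat_function uses dnorm rather than dlnorm.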

ggplot(dat) +
  geom_histogram(mapping = aes(x = log(x)), stat = "density")  +
  stat_function(fun = "dnorm",
                args = list(mean = -4, sd = 1),
                n = 501,
                color = "red")
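To see why overlaying dlnorm on a logged axis puts the peak in the wrong place, the following base R check may help. It is only a sketch: the grid `x` and the names `peak_dlnorm` and `peak_logdens` are made up for this illustration.

```r
meanlog <- -4
sdlog <- 1

# Grid of x values that is evenly spaced on the natural-log scale
x <- exp(seq(-8, 0, length.out = 2001))

# The two densities differ only by the Jacobian factor 1 / x
all.equal(dlnorm(x, meanlog, sdlog), dnorm(log(x), meanlog, sdlog) / x)

# Location of each curve's maximum, reported on the log scale
peak_dlnorm  <- x[which.max(dlnorm(x, meanlog, sdlog))]
peak_logdens <- x[which.max(dnorm(log(x), meanlog, sdlog))]

log(peak_dlnorm)   # near meanlog - sdlog^2, the log of the lognormal mode
log(peak_logdens)  # near meanlog, the centre of the histogram of log(x)
```

The density of x peaks at exp(meanlog - sdlog^2), while the density of log(x) peaks at meanlog, which is the offset visible in the question's plot (on top of the base-10 versus natural-log mismatch).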

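Now, to transform the x-axis itself, use ggplot2::scale_x_continuous with the trans = "log" argument. When this transformation is applied to the plot, the x-axis scale is modified and the histogram's density is computed on the transformed x values (not the original values), so the bars show the density of log(x). stat_function still passes x values on the original scale to the function, so the function has to return the density of log(x) evaluated at those points. You therefore need to define a function that uses dnorm(log(x)), as follows: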

ggplot(dat) +
  geom_histogram(mapping = aes(x = x), stat = "density")  +
  stat_function(fun = function(x, ...) {dnorm(log(x), ...) },
                args = list(mean = -4, sd = 1),
                n = 501,
                color = "red") +
  scale_x_continuous(trans = "log",
                     breaks = exp(seq(-6, 0, by = 2)),
                     labels = paste("exp(", seq(-6, 0, by = 2), ")"))

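It is worth noting that in the second plot the x-axis tick labels are integer values and the x-axis label is log(x), whereas in the third plot the x-axis tick labels are expressions and the axis label is plain "x". Be sure to use descriptive tick marks and axis labels.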

Answer 2

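The goal (in case it was not stated clearly) is still to view the lognormal data and the distribution on the log10 scale. To get there, the density (pdf) needs to be expressed on the log10 scale. (Thanks to the colleague who shared the following code!)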

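## packages used below: ggplot2 for plotting,
## dplyr for mutate(), magrittr for the %<>% operator
library(ggplot2)
library(dplyr)
library(magrittr)

## generate data:
x <- rlnorm(1000, meanlog = -4, sdlog = 1)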

## generate sequence of x values for the curve.
xx <- seq(min(x), max(x), length = 1000)
## Calculate the density for each xx value.
## Here, density is based on the lognormal distribution.
pdf <- dlnorm(xx, -4, 1)

## Repeat for log(xx).
xx_ln <- log(xx)
## This density is based on the normal distribution.
pdf_norm <- dnorm(xx_ln, -4, 1)

## As a reminder, the pdf's for the distributions are different:
head(cbind(pdf, pdf_norm))

When the data are viewed on the log10 scale, they have yet another pdf. The function and code below convert the normal pdf into a pdf on the log10 scale.

## Function: numerical integration for log10 distribution plots;
## essentially rescales pdf_norm so that it integrates to 1 over log10(x).
## step_size = width of each Riemann-sum step on the log10 axis
## xx_10 = x values after a log10 transformation
## pdf_norm = pdf values for the normal distribution (see above)
num_int <- function(df){
  df$step_size <- c(diff(df$xx_10), NA)
  int <- sum(df$step_size * df$pdf_norm, na.rm = TRUE)
  return(data.frame(int))
}

## to complete the numerical integration, need log10(values)
xx_10 <- log10(xx)
curve_df <- data.frame(xx, xx_10, pdf, pdf_norm)
int <- num_int(curve_df) 
curve_df$pdf_10 <- curve_df$pdf_norm / as.numeric(int)

## replace Inf rows with NA
## (not necessary with the example code)
curve_df %<>%
  mutate(pdf = replace(pdf, pdf == Inf, NA),
         pdf_norm = replace(pdf_norm, pdf_norm == Inf, NA),
         pdf_10 = replace(pdf_10, pdf_10 == Inf, NA))


ggplot() +
  geom_histogram(data = data.frame(x), aes(x = x, y = ..density..)) + 
  geom_line(data = curve_df,
            aes(xx, pdf_10), col="blue", size = I(1.2), linetype = 1) +
  scale_x_log10()
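The numerical normalisation above has a closed-form counterpart: since log10(X) = log(X) / log(10), the density of log10(X) is log(10) times the normal density evaluated at log(x). The sketch below uses that directly with stat_function. It assumes, as the third plot in Answer 1 does, that stat_function passes x on the original scale to the function; the helper name `dlnorm_log10` and the choice of `bins = 30` are introduced here for illustration only.

```r
library(ggplot2)

set.seed(42)
dat <- data.frame(x = rlnorm(1000, meanlog = -4, sdlog = 1))

# Density of log10(X) for lognormal X, written as a function of x on the
# original scale (hypothetical helper, not part of any package)
dlnorm_log10 <- function(x, meanlog = 0, sdlog = 1) {
  log(10) * dnorm(log(x), mean = meanlog, sd = sdlog)
}

# ..density.. matches the ggplot2 version used in this thread; newer
# releases prefer after_stat(density) and may warn about the old form
p <- ggplot(dat, aes(x = x)) +
  geom_histogram(aes(y = ..density..), bins = 30) +
  stat_function(fun = dlnorm_log10,
                args = list(meanlog = -4, sdlog = 1),
                n = 501,
                color = "red") +
  scale_x_log10()

print(p)
```

With this approach there is no need to build a separate data frame for the curve, and the factor log(10) plays the role of the 1 / int normalisation computed numerically above.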

ggplot scale transformation is inaccurate for stat_function
