Compute stat_summary with lines drawn between points, not raw data in ggplot2

Date: 2017-04-06 17:12:36

Tags: r ggplot2 mean

I have some data that are missing some observations, e.g.,

library(dplyr)
library(ggplot2)

dframe <- data.frame(height = c(1, 2, NA, 4, 
                                 1.2, 2.5, 3.8, 4.4,
                                 3, NA, 5, 7),
                     name = rep(c("A", "B", "C"), each = 4),
                     date = rep(c(1, 2, 3, 4), 3))  

So the data look like this:

   height name date
1    1.0    A    1
2    2.0    A    2
3     NA    A    3
4    4.0    A    4

But in my real data the rows with NA values don't exist at all, so dframe is actually:

dframe <- dframe %>% 
  filter(!height %in% NA)

I'd like to create a plot that shows the raw data for each "name" (A, B, and C) and also has a "mean height" line. I tried using:

ggplot(dframe, aes(date, height)) +
  geom_point() +
  geom_line(aes(group = name), color = "blue") +
  stat_summary(fun="mean", geom="line", size = 1) +  # fun.y was renamed to fun in ggplot2 3.3.0
  theme_bw()

But, as you can see, because of the missing values, ggplot's "mean" line appears jagged and misleading.

[Figure: ggplot points and mean line]

Is there a way to force ggplot to calculate the mean based on the LINES that it drew, not the raw data?
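
To make the jaggedness concrete, here is a minimal sketch (base R only, rebuilding the example data so that it runs on its own) of the per-date means that stat_summary draws, together with the number of series contributing to each. The names raw_means and n are introduced here purely for illustration.

```r
# Same example data as above, with the NA rows dropped
dframe <- data.frame(height = c(1, 2, NA, 4,
                                1.2, 2.5, 3.8, 4.4,
                                3, NA, 5, 7),
                     name = rep(c("A", "B", "C"), each = 4),
                     date = rep(c(1, 2, 3, 4), 3))
dframe <- dframe[!is.na(dframe$height), ]

# Mean height at each date, computed from whatever rows exist
raw_means <- aggregate(height ~ date, data = dframe, FUN = mean)

# Number of series that contribute to each of those means
raw_means$n <- as.vector(table(dframe$date))

print(raw_means)
```

Dates 2 and 3 each average only two of the three series, which is why the mean line bends away from what the drawn lines suggest.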

1 Answer:

Answer 0 (score: 2)

You can interpolate the missing values and then plot:

library(tidyverse)

# Starting data frame
dframe = dframe %>% filter(!is.na(height))

dframe %>% 
  complete(date, nesting(name)) %>% 
  arrange(name, date) %>%
  group_by(name) %>%
  # Interpolate height over date within each name (date passed explicitly as x)
  mutate(heightImp = approx(date, height, xout=date)$y,
         imputed.flag = ifelse(is.na(height), "Imputed", "Measured")) %>%
  ggplot(aes(date, heightImp)) +
   geom_line(aes(group = name), color = "blue") +
   geom_point(aes(colour=imputed.flag)) +
   stat_summary(fun="mean", geom="line", size = 1) +  # fun.y was renamed to fun in ggplot2 3.3.0
   scale_colour_manual(values=c("red","blue")) +
   labs(colour="") +
   theme_bw()
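
As a cross-check on the interpolation step, here is a minimal base R sketch, assuming linear interpolation with approx() within each name is what is wanted. It starts from the example data with the gaps present as NA, which is the shape complete() produces, and the names pieces, imputed and imp_means are illustrative only.

```r
# Example data with the gaps still present as NA
dframe <- data.frame(height = c(1, 2, NA, 4,
                                1.2, 2.5, 3.8, 4.4,
                                3, NA, 5, 7),
                     name = rep(c("A", "B", "C"), each = 4),
                     date = rep(c(1, 2, 3, 4), 3))

# Linearly interpolate height over date within each name;
# approx() drops the NA pairs and evaluates at every date in xout
pieces <- lapply(split(dframe, dframe$name), function(d) {
  d$heightImp <- approx(d$date, d$height, xout = d$date)$y
  d
})
imputed <- do.call(rbind, pieces)

# Per-date means of the interpolated series
imp_means <- aggregate(heightImp ~ date, data = imputed, FUN = mean)
print(imp_means)
```

With the gaps filled, every date averages all three series, so the mean line follows the lines that were drawn. Note that approx() does not extrapolate by default, so a gap at the first or last date of a series would remain NA.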


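You can also plot a regression line. It is the conditional mean at each x value, subject to the constraint that the line really is straight, rather than the piecewise-linear result you get by connecting means computed separately at each x value: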

ggplot(dframe, aes(date, height)) +
  geom_line(aes(group = name), color = "blue") +
  geom_point() +
  geom_smooth(method="lm", colour="black", se=FALSE) +
  theme_bw()
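
To see what that straight line represents, here is a rough sketch of the equivalent fit outside ggplot2. It assumes geom_smooth(method = "lm") is used with its default formula y ~ x, which corresponds to lm(height ~ date) on the pooled data; fit and pred are illustrative names.

```r
# Example data with the NA rows dropped
dframe <- data.frame(height = c(1, 2, NA, 4,
                                1.2, 2.5, 3.8, 4.4,
                                3, NA, 5, 7),
                     name = rep(c("A", "B", "C"), each = 4),
                     date = rep(c(1, 2, 3, 4), 3))
dframe <- dframe[!is.na(dframe$height), ]

# Ordinary least squares on all points, ignoring which name they belong to
fit <- lm(height ~ date, data = dframe)

# Fitted conditional mean at each date
pred <- predict(fit, newdata = data.frame(date = 1:4))

print(coef(fit))
print(pred)
```

The fitted values change by the same amount (the slope) between consecutive dates, unlike the per-date means, which are free to jump.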


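You can also use more complex regression functions. The code below shows a third-order polynomial and a B-spline with three degrees of freedom. In this case they are identical (the black curve for the third-order polynomial is hidden behind the red curve for the B-spline) because of the small number of time points, but in general (for instance, with a larger df, which adds interior knots to the spline) the two can differ. The point is that you can use linear regression to fit a wide variety of functions, depending on what you think is appropriate for your data and subject matter. (Another consideration here is that you have repeated measurements for each subject, so a proper model would take this into account by using a hierarchical model; see the lme4 and nlme packages.)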

ggplot(dframe, aes(date, height)) +
  geom_line(aes(group = name), color = "blue") +
  geom_point() +
  geom_smooth(method="lm", formula=y ~ poly(x, 3), colour="black", se=FALSE) +
  geom_smooth(method="lm", formula=y ~ splines::bs(x,df=3), colour="red", se=FALSE) +
  theme_bw()
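
The claim that the two curves coincide can be checked directly. The sketch below refits both models with lm() on the example data and compares their fitted values; fit_poly and fit_bs are illustrative names, and the splines package is assumed to be available (it ships with R).

```r
# Example data with the NA rows dropped
dframe <- data.frame(height = c(1, 2, NA, 4,
                                1.2, 2.5, 3.8, 4.4,
                                3, NA, 5, 7),
                     name = rep(c("A", "B", "C"), each = 4),
                     date = rep(c(1, 2, 3, 4), 3))
dframe <- dframe[!is.na(dframe$height), ]

# Third-order polynomial and B-spline with df = 3, as in the plot code
fit_poly <- lm(height ~ poly(date, 3), data = dframe)
fit_bs   <- lm(height ~ splines::bs(date, df = 3), data = dframe)

# Do the two fits give the same fitted values?
print(all.equal(unname(fitted(fit_poly)), unname(fitted(fit_bs))))

# With four coefficients and four distinct dates, the fit is saturated,
# so the fitted values should match the per-date means
print(all.equal(unname(fitted(fit_poly)), ave(dframe$height, dframe$date)))
```

Because each model has four coefficients and there are only four distinct dates, both curves pass through the per-date means, which is the same jagged set of values the original stat_summary line connects.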
