ggplot无法使用facet_wrap和group asthetic绘制流畅的游戏

时间:2018-10-08 05:38:07

标签: r ggplot2 facet-wrap gam

我正在尝试使用具有组外观以及facet_wrap的ggplot绘制多面板和多线图。但是,当一组数据点太少时,geom_smooth对于构面图中的所有线都失败。

plot1 <- ggplot(data=df1, 
                aes(x=Year, y=Mean, group=Group2, linetype=Group2, shape=Group2)) +  
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=0.2) +  
  geom_smooth(method = "gam", se=F, formula = y ~ s(x, k=3), size = 1, colour="black") + 
  geom_point(position=pd, size=2, fill="white") +  
  scale_x_continuous(limits=c(min(df1$Year-0.1), max(df1$Year+0.1)), 
                     breaks=seq(min(df1$Year),max(df1$Year),5)) +  
  facet_wrap(~Group1, scales = "free", ncol=2) +  
  theme_bw() + 
  theme(axis.text.x = element_text(),
        axis.title.x = element_blank(), 
        strip.background = element_blank(), 
        axis.line.x = element_line(colour="black"),
        axis.line.y = element_line(colour="black"), 
        panel.grid.minor = element_blank(), 
        panel.grid.major = element_blank(),
        panel.border = element_blank(), 
        panel.background = element_blank(),
        legend.position="top",
        legend.title = element_blank())
plot(plot1)

产生以下图。这只是汇总数据,以使其更容易。好像错误使ggplot停止计算该特定平面上的序列平滑一样。

数据

Year    Group1      Group2      Mean        SE
2011    Factor A    Factor C    30.62089116 3.672624771
2011    Factor A    Factor D    54.99066324 2.822405771
2011    Factor B    Factor C    30.48859003 3.748388489
2011    Factor B    Factor D    45.70410611 4.284244405
2017    Factor A    Factor C    33.68256601 4.030964172
2017    Factor A    Factor D    53.43496462 4.687042033
2017    Factor B    Factor C    23.08799875 5.17753488
2001    Factor A    Factor C    23.79166667 2.837795432
2001    Factor A    Factor D    23.75925926 3.688185081
2001    Factor B    Factor C    29.05555556 4.08597798
2001    Factor B    Factor D    28.13333333 7.877429079
2008    Factor A    Factor C    23.3        2.383624691
2008    Factor A    Factor D    28.83333333 2.750959429
2008    Factor B    Factor C    34.01666667 5.340999698

并绘制

plot 1

很显然,有足够的数据可以为组factorB中的factorC线画一条平滑线。任何想法?

2 个答案:

答案 0 :(得分:2)

我认为这很棘手。经过对StatSmooth的{​​{3}}进行测试和阅读后,我总结了以下发现:

观察

    如果任何数据组对geom_smooth()的观察太少,则
  1. method = "gam"无法在绘图面板中绘制 all 条平滑线strong> AND formula = y ~ s(x, k = 3);
  2. 如果该图面分为多个面板,则仅会影响具有不良数据组的面板;
  3. formula = y ~ x(即默认公式)不会发生这种情况;
  4. 使用默认公式的其他某些方法(例如"lm""glm")不会发生这种情况,但是{strong>会发生的情况是method = "loess"
  5. 如果数据组只有1个观察值,则不会发生这种情况。

我们可以用一些简化的代码重现上面的内容:

# create sample data
n <- 30
set.seed(567)
df.1 <- data.frame( # there is only 1 observation for group == B
  x = rnorm(n), y = rnorm(n),
  group = c(rep("A", n - 1), rep("B", 1)),
  facet = sample(c("X", "Y"), size = n, replace = TRUE))    
set.seed(567)
df.2 <- data.frame( # there are 2 observations for group == B
  x = rnorm(n), y = rnorm(n),
  group = c(rep("A", n - 2), rep("B", 2)),
  facet = sample(c("X", "Y"), size = n, replace = TRUE))

# create base plot
p <- ggplot(df.2, aes(x = x, y = y, color = group)) + 
  geom_point() + theme_bw()

# problem: no smoothed line at all in the entire plot
p + geom_smooth(method = "gam", formula = y ~ s(x, k = 3))

# problem: no smoothed line in the affected panel
p + facet_wrap(~ facet) + 
  geom_smooth(method = "gam", formula = y ~ s(x, k = 3))

# no problem with default formula: smoothed lines in both facet panels
p + facet_wrap(~ facet) + geom_smooth(method = "gam")

# no problem with lm / glm, but problem with loess
p + facet_wrap(~ facet) + geom_smooth(method = "lm")
p + facet_wrap(~ facet) + geom_smooth(method = "glm")
p + facet_wrap(~ facet) + geom_smooth(method = "loess")

# no problem if there's only one observation (instead of two)
p %+% df.1 + geom_smooth(method = "gam", formula = y ~ s(x, k = 3))
p %+% df.1 + facet_wrap(~ facet) + 
  geom_smooth(method = "gam", formula = y ~ s(x, k = 3))

观察值1和2的解释

我认为问题出在StatSmooth的{​​{1}}函数的最后两行。第一行针对由compute_group映射指定的每个组在数据帧上调用模型函数(例如stats::glmstats::loessmgcv::gam),而第二行则调用以下函数之一aes(group = ...)周围的包装器,以获取模型的平滑值(以及置信区间,如果适用)。

stats::predict()

当参数model <- do.call(method, c(base.args, method.args)) predictdf(model, xseq, se, level) 用于只有两个观察值的数据帧时,将发生以下情况:

method = "gam", formula = y ~ s(x, k = 3)
  

smooth.construct.tp.smooth.spec(object,dk $ data,dk $ knots)中的错误:   一个字词的唯一协变量组合少于指定的最大值   自由度

model <- do.call(mgcv::gam, args = list(formula = y ~ s(x, k = 3), data = df.2 %>% filter(group == "B" & facet == "X"))) 是定义为承担model结果的对象,甚至尚未创建。代码do.call(...)的最后一行将引发错误,因为predictdf(...)不存在。 没有方面,这会影响model进行的所有计算,并且StatSmooth不会收到可用于在其图层中创建任何几何图形的可用数据。 使用构面,可以对每个构面分别进行上述计算,因此仅影响有问题的构面。

观察值3和4的解释:

此外,如果我们未指定替换默认geom_smooth()的公式,则会从y ~ x / gam / {{1 }},可以将其传递给ggplot2的未导出的lm函数,以获取预测值的数据帧:

glm

predictdf也将返回有效的对象,尽管带有大量警告。但是,将其传递给model <- do.call(mgcv::gam, # or stats::lm, stats::glm args = list(formula = y ~ x, data = df.2 %>% filter(group == "B" & facet == "X"))) result <- ggplot2:::predictdf( model, xseq = seq(-2, 1.5, length.out = 80), # pseudo range of x-axis values se = FALSE, level = 0.95) # default SE / level parameters 会导致错误:

loess
  

predLoess(object $ y,object $ x,newx = if(is.null(newdata))中的错误   object $ x else if(is.data.frame(newdata))   as.matrix(model.frame(delete.response(terms(object)),:NA / NaN / Inf   在外部函数调用中(arg 5)

观察说明5:

predictdf的{​​{1}}函数以以下内容开头:

model <- do.call(stats::loess,
                 args = list(formula = y ~ x,
                             data = df.2 %>% filter(group == "B" & facet == "X")))

result <- ggplot2:::predictdf(
  model, 
  xseq = seq(-2, 1.5, length.out = 80), # pseudo range of x-axis values
  se = FALSE, level = 0.95) # default SE / level parameters

换句话说,如果指定组中只有1个观察值,则StatSmooth将立即返回一个空白数据帧。因此,它将永远不会到达代码的后续部分以引发任何错误。

解决方法:

我们可以精确定位compute_group代码(请参阅带注释和注释掉的部分):

if (length(unique(data$x)) < 2) {
      # Not enough data to perform fit
      return(data.frame())
    }

定义一个使用此版本的新统计图层:

StatSmooth

结果:

我们可以像以前一样处理相同的情况,将compute_group替换为new.compute_group <- function( data, scales, method = "auto", formula = y~x, se = TRUE, n = 80, span = 0.75, fullrange = FALSE, xseq = NULL, level = 0.95, method.args = list(), na.rm = FALSE) { if (length(unique(data$x)) < 2) return(data.frame()) if (is.null(data$weight)) data$weight <- 1 if (is.null(xseq)) { if (is.integer(data$x)) { if (fullrange) { xseq <- scales$x$dimension() } else { xseq <- sort(unique(data$x)) } } else { if (fullrange) { range <- scales$x$dimension() } else { range <- range(data$x, na.rm = TRUE) } xseq <- seq(range[1], range[2], length.out = n) } } if (identical(method, "loess")) method.args$span <- span if (is.character(method)) method <- match.fun(method) base.args <- list(quote(formula), data = quote(data), weights = quote(weight)) # if modelling fails, return empty data frame # model <- do.call(method, c(base.args, method.args)) model <- try(do.call(method, c(base.args, method.args))) if(inherits(model, "try-error")) return(data.frame()) # if modelling didn't fail, but prediction returns NA, # also return empty data frame # predictdf(model, xseq, se, level) pred <- try(ggplot2:::predictdf(model, xseq, se, level)) if(inherits(pred, "try-error")) return(data.frame()) return(pred) } ,并验证在每种情况下都可以看到平滑的geom图层(请注意,有些图层仍会产生错误消息):

# same as stat_smooth() except that it uses stat = StatSmooth2, rather 
# than StatSmooth
stat_smooth_local <- function(
  mapping = NULL, data = NULL, geom = "smooth", position = "identity", ...,
  method = "auto", formula = y ~ x, se = TRUE, n = 80, span = 0.75,
  fullrange = FALSE, level = 0.95, method.args = list(), na.rm = FALSE,
  show.legend = NA, inherit.aes = TRUE) {
  layer(
    data = data, mapping = mapping, stat = StatSmooth2,
    geom = geom, position = position, show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      method = method, formula = formula, se = se, n = n,
      fullrange = fullrange, level = level, na.rm = na.rm,
      method.args = method.args, span = span, ...
    )
  )
}

# inherit from StatSmooth
StatSmooth2 <- ggproto(
  "StatSmooth2", ggplot2::StatSmooth,
  compute_group = new.compute_group
)

current GH code

答案 1 :(得分:2)

一种非常简单的处理方法是将导致在传递给geom_smooth的数据中出现问题的行子集化:

library(tidyverse)

df1 <- data_frame(
    Year = c(2011L, 2011L, 2011L, 2011L, 2017L, 2017L, 2017L, 2001L, 2001L, 2001L, 2001L, 2008L, 2008L, 2008L),
    Group1 = c("Factor A", "Factor A", "Factor B", "Factor B", "Factor A", "Factor A", "Factor B", "Factor A", "Factor A", "Factor B", "Factor B", "Factor A", "Factor A", "Factor B"),
    Group2 = c("Factor C", "Factor D", "Factor C", "Factor D", "Factor C", "Factor D", "Factor C", "Factor C", "Factor D", "Factor C", "Factor D", "Factor C", "Factor D", "Factor C"),
    Mean = c(30.62089116, 54.99066324, 30.48859003, 45.70410611, 33.68256601, 53.43496462, 23.08799875, 23.79166667, 23.75925926, 29.05555556, 28.13333333, 23.3, 28.83333333, 34.01666667),
    SE = c(3.672624771, 2.822405771, 3.748388489, 4.284244405, 4.030964172, 4.687042033, 5.17753488, 2.837795432, 3.688185081, 4.08597798, 7.877429079, 2.383624691, 2.750959429, 5.340999698)
)

ggplot(df1, aes(Year, Mean, color = Group2)) +  
    geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE)) +  
    geom_smooth(data = df1 %>% group_by(Group1, Group2) %>% filter(n() > 2),    # subset
                method = "gam", formula = y ~ s(x, k=3)) + 
    geom_point() + 
    facet_wrap(~Group1)