Question

是否有可能在ggplot2调用中过滤掉具有少量观察数据的子集？

例如，请采用以下图表：qplot(price,data=diamonds,geom="density",colour=cut)

Density plot

情节有点忙，我希望用少量观察值排除cut值，即

> xtabs(~cut,diamonds)
cut
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

Fair因素的Good和cut质量。

我想要一个能够适应任意数据集的解决方案，如果可能的话，不仅可以选择阈值数量的观测值，而且可以选择前3个。

Answer 1

ggplot(subset(diamonds, cut %in% arrange(count(diamonds, .(cut)), desc(freq))[1:3,]$cut),
  aes(price, colour=cut)) + 
  geom_density() + facet_grid(~cut)

count将每个元素计入data.frame。
arrange根据指定的列命令data.frame。
desc启用逆序排序。
最后通过%in%将其剪切包含在前3中的行进行子集化。

Answer 2

这是我的看法。首先创建一个返回更多obs类别的函数。

firstx <- function (category, data, x = 1:3) {
  tab <- xtabs(~category, data)

  dimnames(tab)$category[order(tab, decreasing = TRUE)[x]]
}

#Then use subset to subset the data and droplevels to drop unused levels
#so they don't clutter the legend.
ggplot(droplevels(subset(diamonds, cut %in% firstx(cut, diamonds))), 
       aes(price, color = cut)) + geom_density()

我希望有所帮助。

Answer 3

这似乎要求编写自己的子集函数，可能是这样的：

mySubset <- function(dat,largestK=3,thresh=NULL){
   if (is.null(thresh)){
      tbl <- sort(table(dat)) 
      return(dat %in% tail(names(tbl),largestK))
   }
   else{
      return(dat >= thresh)
   }
}

这可以在ggplot调用中使用，如下所示：

ggplot(diamonds[mySubset(diamonds$cut),],...)

此代码不涉及从因子中删除级别，因此请注意这一点。出于这个原因，我通常将分类变量保留为字符，除非我绝对需要它们进行排序。

Answer 4

## Top 3 cuts
tmp <- names(sort(summary(diamonds$cut), decreasing = T))[1:3]
tmp <- droplevels(subset(diamonds, cut == tmp))
ggplot(tmp, aes(price, color=cut)) + geom_density()

enter image description here

但你考虑过分面吗？

ggplot(diamonds, aes(price, color=cut)) + geom_density() + facet_grid(~cut)

enter image description here

按观察次数过滤ggplot2密度图

4 个答案: