Question

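Because of its size, I was forced to split a large file into chunks. I want a histogram of one column across all of the files, so I have to build a histogram for each chunk and then add the resulting histograms together bin by bin. The histograms are stored in a list, like this: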

for (i in 1:8) {
    dataset <- read.csv(capture.output(cat("split1/", filelist[i], sep = "")))
    dataset.hist[[i]] <- ggplot(dataset, aes(x = Value)) +
        geom_histogram(breaks = seq(1, 200, by = 1), aes(fill = ..count..))
}

I tried to add them like this:

testHist <- dataset.hist[[1]] + dataset.hist[[2]]

and got the following error message:

Error in p + o : non-numeric argument to binary operator
In addition: Warning message:
Incompatible methods ("+.gg", "Ops.data.frame") for "+"

I searched Google and read the ggplot and geom_histogram help pages without gaining any new insight. Can anyone suggest another approach?

Answer 1

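It is better to compute the bin counts separately for each file and then plot a single histogram. You can use hist (tapply would also work here) to count the occurrences of "Value" in each file, and then aggregate the results into one data.frame.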

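## get all files in directory split1
res <- sapply(list.files("split1", full.names = TRUE),
              function(x) {
                dat <- read.csv(x)
                ## EDIT: remove data outside the range
                dat <- dat[dat$Value > 0 & dat$Value <= 200, ]
                ## 200 unit-width bins, so the counts line up with seq(200) below
                counts <- hist(dat$Value, breaks = seq(0, 200), plot = FALSE)$counts
                rm(dat)
                ## return the numeric counts (ending on rm() would return NULL)
                counts
              })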

## aggregate all counts and create a single data.frame
dat <- data.frame(Value=rowSums(res),
                breaks = seq(200))

## plot the histogram
ggplot(dat) + 
    geom_bar(aes(x=breaks,y=Value),stat='identity')
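
The answer mentions tapply as an alternative to hist. As a rough sketch of that kind of alternative, the snippet below uses cut() and table() instead (a swapped-in technique, not tapply itself), because table() keeps empty bins as zeros. The simulated `chunks` list and the helper `count_chunk` are hypothetical stand-ins for the real files and are not part of the original answer.

```r
## Hypothetical stand-in for the split files: three small chunks with a Value column
set.seed(1)
chunks <- lapply(1:3, function(i) data.frame(Value = runif(50, min = 0, max = 200)))

## Fixed bin edges shared by every chunk: (0,1], (1,2], ..., (199,200]
edges <- seq(0, 200, by = 1)

## Count per bin with cut() + table(); cut() returns a factor with one level
## per bin, so table() reports empty bins as 0 instead of dropping them
count_chunk <- function(dat) {
  as.vector(table(cut(dat$Value, breaks = edges)))
}

## One column of counts per chunk, then sum across chunks bin by bin
res <- sapply(chunks, count_chunk)
total <- rowSums(res)
```

Because every chunk is binned against the same `edges`, the columns of `res` are aligned and `rowSums(res)` is the bin-by-bin sum the question asks for.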

Answer 2

I'm not sure whether this should be a separate answer or a comment on agstudy's post.

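@agstudy: I'm posting a small variation in case anyone else tries to sum histograms this way, since I had to make a few changes. The problem I ran into was that the whole hist objects were returned into res, which made res a complicated structure. To avoid this, I changed the code inside the sapply call to return only the $counts field of the hist object. This lets the later aggregation run smoothly, because res then contains only the numeric $counts vectors. Just FYI. Thanks again, everyone, for the help.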

## pattern is a regular expression, not a glob, so match the extension explicitly
res <- sapply(list.files("split1/", pattern = "\\.csv$", full.names = TRUE),
          function(x) {
            dat <- read.csv(x)
            ## keep only values strictly inside the histogram range
            dat.clean <- dat$Value[which(dat$Value > 0 & dat$Value < 200)]
            dat.counts <- hist(dat.clean, breaks = seq(0, 200, by = 1), plot = FALSE)
            rm(dat, dat.clean)
            # return the $counts field from hist to avoid complicating the list res
            dat.counts$counts
          })

## aggregate all counts and create a single data.frame
dat <- data.frame(Value= rowSums(res),
              breaks = seq(200))

## plot the histogram
ggplot(dat) + 
  geom_bar(aes(x=breaks,y=Value),stat='identity')
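
For anyone who wants to convince themselves that summing per-chunk counts gives the same result as one histogram over all the data, here is a minimal sketch. It assumes simulated chunks in place of the CSV files in split1/ (the `chunks` list and the exponential test data are made up for illustration) and keeps a running total instead of building the full res matrix.

```r
## Hypothetical chunks standing in for the CSV files in split1/
set.seed(42)
chunks <- lapply(1:4, function(i) data.frame(Value = rexp(1000, rate = 1 / 40)))

edges <- seq(0, 200, by = 1)
total <- numeric(length(edges) - 1)   # running per-bin total, 200 bins

## Add each chunk's counts to the running total, one chunk at a time
for (dat in chunks) {
  v <- dat$Value[which(dat$Value > 0 & dat$Value < 200)]
  total <- total + hist(v, breaks = edges, plot = FALSE)$counts
}

## Sanity check: a histogram of all values at once gives the same counts
all_values <- unlist(lapply(chunks, `[[`, "Value"))
all_values <- all_values[all_values > 0 & all_values < 200]
full <- hist(all_values, breaks = edges, plot = FALSE)$counts
stopifnot(all(total == full))
```

The check only holds because every chunk uses identical breaks; histograms built with different bin edges cannot be added bin by bin.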

How to add N ggplot histograms bin by bin
