Generate a summary table from bins of a plot

Time: 2016-05-13 14:58:53

Tags: r summary

I have a dataset of the form:

library(ggplot2)

d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")
ggplot(d, aes(x=X, y=Y)) + geom_line()

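I am trying to generate a summary table that splits the Y variable into equal 10% groups and produces summary statistics of X for each bin. This is what I would like my result to look like: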

Y Group    X Group
0-10%      {Range1: 10-30%, mean1, median1, sd1} {Range2: 85-100%, mean2, median2, sd2}
10-20%     ... 
20-30%     ...
30-40%     ...
40-50%     ...    

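The number of X ranges is not always two: the 20-30% group of Y has three X ranges, and the 40-50% group has one.

I have many large datasets on which this has to be implemented. The data above is only there to reproduce the problem. My actual data may have many inflection points, because this code has to run on many combinations of X and Y.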

3 Answers:

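Answer 0 (score: 1)

The output format is not the same as yours.

But this is a close solution, and you can easily reformat it to your liking. It looks like you are grouping Y into 10 groups but are not sure about X. I used 10 groups for X as well.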

d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")

library(dplyr)

# Assign each row to a decile of X and a decile of Y
d$x.decile <- ntile(d$X, 10)
d$y.decile <- ntile(d$Y, 10)

# Summarise X within each (Y decile, X decile) combination
summary <- data.frame(d %>%
  group_by(y.decile, x.decile) %>%
  summarise(mean=mean(X), median=median(X), min=min(X), max=max(X), sd=sd(X)))

> summary
   y.decile x.decile  mean median  min  max          sd
1         1        2 0.175  0.175 0.15 0.20 0.018708287
2         1        3 0.210  0.210 0.21 0.21         NaN
3         1       10 0.990  0.990 0.98 1.00 0.010000000
4         2        2 0.135  0.135 0.13 0.14 0.007071068
5         2        3 0.235  0.235 0.22 0.25 0.012909944
6         2       10 0.955  0.955 0.94 0.97 0.012909944
7         3        1 0.095  0.095 0.09 0.10 0.007071068
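
If what you want is the contiguous X ranges inside each Y decile rather than X deciles, one possible variation is to number the runs of consecutive rows that share a Y decile and summarise each run. This is only a sketch: `runs`, `run.id` and the `x.*` column names are made up here, and it assumes the rows can be ordered by X and that `ntile` deciles of Y are the bins you want.

```r
library(dplyr)

d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)

runs <- d %>%
  arrange(X) %>%
  mutate(
    y.decile = ntile(Y, 10),
    # a new run starts whenever the Y decile changes from one row to the next
    run.id = cumsum(y.decile != dplyr::lag(y.decile, default = first(y.decile))) + 1
  ) %>%
  group_by(y.decile, run.id) %>%
  # one row per contiguous stretch of X that stays inside the same Y decile
  summarise(n = n(), x.min = min(X), x.max = max(X),
            x.mean = mean(X), x.median = median(X), x.sd = sd(X)) %>%
  ungroup() %>%
  as.data.frame()

runs
```

Each Y decile then gets as many rows as it has separate X stretches, so the count does not have to be fixed in advance. Runs with a single row will have a missing `x.sd`.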

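Answer 1 (score: 1)

You can use melt and dcast from the reshape2 package to get the format you want.

In the code below I cut the data into 10 Y groups and 2 X groups, just to keep the width of the output reasonable. Change the 2 to 10 in the ntile function to get actual deciles of X. Also, I did not include every summary item, but hopefully the code below will guide you in adding the rest.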

library(dplyr)
library(reshape2)

sm = d %>% group_by(`Y decile`=ntile(Y,10), X.decile=ntile(X,2)) %>%
  summarise(`X decile` = paste0("{Count: ", n(), ", Range: ", min(X),"-",max(X),", Median: ",median(X),"}"))

sm %>% melt(id.var=c("Y decile","X.decile")) %>%
  dcast(`Y decile` ~ variable + X.decile, value.var="value", fill="")
   Y decile                                  X decile_1                                   X decile_2
1         1  {Count: 7, Range: 0.15-0.21, Median: 0.18}      {Count: 3, Range: 0.98-1, Median: 0.99}
2         2 {Count: 6, Range: 0.13-0.25, Median: 0.225}  {Count: 4, Range: 0.94-0.97, Median: 0.955}
3         3  {Count: 7, Range: 0.09-0.28, Median: 0.12}   {Count: 3, Range: 0.91-0.93, Median: 0.92}
4         4 {Count: 6, Range: 0.06-0.31, Median: 0.185}   {Count: 4, Range: 0.87-0.9, Median: 0.885}
5         5 {Count: 8, Range: 0.02-0.35, Median: 0.185}  {Count: 2, Range: 0.85-0.86, Median: 0.855}
6         6  {Count: 5, Range: 0.01-0.39, Median: 0.37}    {Count: 5, Range: 0.8-0.84, Median: 0.82}
7         7   {Count: 5, Range: 0.4-0.44, Median: 0.42}   {Count: 5, Range: 0.75-0.79, Median: 0.77}
8         8  {Count: 5, Range: 0.45-0.49, Median: 0.47}    {Count: 5, Range: 0.7-0.74, Median: 0.72}
9         9     {Count: 1, Range: 0.5-0.5, Median: 0.5}   {Count: 9, Range: 0.51-0.69, Median: 0.65}
10       10                                             {Count: 10, Range: 0.55-0.64, Median: 0.595}
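Actually, melt is not needed here. You can do the following instead; the extra line at the end is only there to get more descriptive column names.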

sm = d %>% group_by(`Y decile`=ntile(Y,10), X.decile=ntile(X,2)) %>%
  summarise(`X decile` = paste0("{N: ", n(), ", Range: ", min(X),"-",max(X),", Median: ",median(X),"}")) %>% 
  dcast(`Y decile` ~ X.decile, value.var="X decile", fill="") %>%
  setNames(., c(names(.)[1], paste0("X decile ", names(.)[-1])))
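
Note that `ntile` splits by rank into groups of equal size (each Y decile in the output above holds 10 rows), so it does not give bands of equal width in Y. If equal-width bands are what "0-10%, 10-20%" means in the question, which is an assumption on my part, base R's `cut` is one option. `y.band` is a column name invented for this sketch.

```r
d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)

# breaks = 10 divides the range of Y into 10 intervals of equal length
# (cut widens the two outer limits slightly so min(Y) and max(Y) are included);
# labels = FALSE returns the band number 1..10 instead of a factor
d$y.band <- cut(d$Y, breaks = 10, labels = FALSE)

# Unlike ntile, the bands do not hold the same number of rows
table(d$y.band)
```

You could then use `y.band` in `group_by` in place of `ntile(Y,10)`.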

Answer 2 (score: 0)

The quantile and aggregate functions can help you here.

# Create data frame
d <- data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,- 0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) <- c("X","Y")

# Define bins
bins <- quantile(d$Y, seq(0.1,1,length.out=10))

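# Create indicator variable for which bin each Y belongs in
# (rows where Y equals max(Y) are not below any bin edge, so they get NA
# and are then dropped by aggregate)
ag <- c()
for (i in 1:nrow(d)) {ag[i] <- which(d$Y[i] < bins)[1]}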

# Compute summary statistics 
means <- aggregate(d$X, by=list(ag), mean)
medians <- aggregate(d$X, by=list(ag), median)
variances <- aggregate(d$X, by=list(ag), var)

# Put them all into a new data frame
data.frame(group=(1:10),mean=means[,2], median=medians[,2], variance=variances[,2])

##   group      mean median    variance
##1      1 0.4533333  0.200 0.162250000
##2      2 0.4709091  0.240 0.148969091
##3      3 0.3990000  0.265 0.134543333
##4      4 0.4650000  0.305 0.139583333
##5      5 0.3525000  0.325 0.114278571
##6      6 0.4983333  0.385 0.097178788
##7      7 0.5950000  0.595 0.034250000
##8      8 0.5950000  0.595 0.017583333
##9      9 0.5950000  0.595 0.006472222
##10    10 0.5950000  0.595 0.001171429
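
The loop can probably be replaced by a vectorised call. The sketch below uses base R's `findInterval`, which counts how many bin edges are less than or equal to each Y, so adding one gives the same group as the loop. With `rightmost.closed = TRUE`, rows where Y equals the largest edge go into the last group instead of becoming NA. The names `ag.loop` and `ag.vec` are only for this comparison.

```r
d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)
bins <- quantile(d$Y, seq(0.1, 1, length.out = 10))

# Original loop, kept for comparison: first bin edge strictly above Y
ag.loop <- integer(nrow(d))
for (i in seq_len(nrow(d))) ag.loop[i] <- which(d$Y[i] < bins)[1]

# Vectorised version: (number of edges <= Y) + 1;
# rightmost.closed = TRUE assigns Y == max(bins) to group 10 rather than NA
ag.vec <- findInterval(d$Y, bins, rightmost.closed = TRUE) + 1L

# Same aggregation as before, now including the rows with the largest Y
aggregate(d$X, by = list(group = ag.vec), mean)
```

Because the rows with the largest Y are now kept, the figures for group 10 will differ from the table above.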