I have a tabular dataset:
d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")
library(ggplot2)
ggplot(d, aes(x=X, y=Y)) + geom_line()
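I am trying to generate a summary table that splits the Y variable into equal 10% groups and produces summary statistics of X for each bin. This is what I would like the result to look like: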
Y Group X Group
0-10% {Range1: 10-30%, mean1, median1, sd1} {Range2: 85-100%, mean2, median2, sd2}
10-20% ...
20-30% ...
30-40% ...
40-50% ...
There are not always two X ranges: the 20-30% group of Y has three X ranges, and the 40-50% group has one.
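I have many large datasets to apply this to. The data above is only there to reproduce the problem. My actual data may have many inflection points, because this code has to run on many combinations of X and Y.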
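Answer 0 (score: 1)
The output format is not the same as yours, but this is a close solution, and you can easily reformat it to your liking. It looks like you are grouping Y into 10 groups but have not settled on the grouping for X, so I used 10 groups for X as well.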
d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")
library(dplyr)
d$x.decile<-ntile(d$X,10)
d$y.decile<-ntile(d$Y,10)
summary<-data.frame(d%>%group_by(y.decile, x.decile)%>%summarise(mean=mean(X),median=median(X), min=min(X), max=max(X), sd=sd(X)))
> summary
  y.decile x.decile  mean median  min  max          sd
1        1        2 0.175  0.175 0.15 0.20 0.018708287
2        1        3 0.210  0.210 0.21 0.21         NaN
3        1       10 0.990  0.990 0.98 1.00 0.010000000
4        2        2 0.135  0.135 0.13 0.14 0.007071068
5        2        3 0.235  0.235 0.22 0.25 0.012909944
6        2       10 0.955  0.955 0.94 0.97 0.012909944
7        3        1 0.095  0.095 0.09 0.10 0.007071068
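Fixed X deciles do not quite give the contiguous X ranges described in the question. A minimal base R sketch of one way to get them is to label each run of consecutive rows that stay in the same Y bin and summarise X per run. It rests on two assumptions that are not in the question: that the "10% groups" are ten equal-width bins over the range of Y (via `cut(..., breaks = 10)`), and that the rows are already sorted by X. The column names `y_bin` and `run` are made up for illustration.

```r
# Rebuild the example data from the question
d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)

# Assumption: ten equal-width bins over the observed range of Y
d$y_bin <- cut(d$Y, breaks = 10, labels = FALSE)

# Assumption: rows are sorted by X, so a new run starts whenever the bin changes
d$run <- cumsum(c(TRUE, diff(d$y_bin) != 0))

# One summary row per contiguous X range
runs <- do.call(rbind, lapply(split(d, d$run), function(g) {
  data.frame(y_bin = g$y_bin[1], x_min = min(g$X), x_max = max(g$X),
             mean = mean(g$X), median = median(g$X), sd = sd(g$X), n = nrow(g))
}))

# Order by Y bin, then by position along X
runs <- runs[order(runs$y_bin, runs$x_min), ]
print(runs, row.names = FALSE)
```

Each Y bin then gets as many rows as the curve has separate passes through it, so the number of X ranges per bin does not need to be known in advance. Runs with a single row have an `sd` of `NA`.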
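Answer 1 (score: 1)
You can use melt and dcast from the reshape2 package to get the desired format.
In the code below, I cut the data into 10 Y groups and 2 X groups, just to keep the width of the output reasonable. Change the 2 to 10 in the ntile call to get actual deciles of X. I also have not included every summary item, but hopefully the code below will guide you in adding the others.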
library(dplyr)
library(reshape2)
sm = d %>% group_by(`Y decile`=ntile(Y,10), X.decile=ntile(X,2)) %>%
summarise(`X decile` = paste0("{Count: ", n(), ", Range: ", min(X),"-",max(X),", Median: ",median(X),"}"))
sm %>% melt(id.var=c("Y decile","X.decile")) %>%
dcast(`Y decile` ~ variable + X.decile, value.var="value", fill="")
   Y decile  X decile_1                                    X decile_2
1  1         {Count: 7, Range: 0.15-0.21, Median: 0.18}    {Count: 3, Range: 0.98-1, Median: 0.99}
2  2         {Count: 6, Range: 0.13-0.25, Median: 0.225}   {Count: 4, Range: 0.94-0.97, Median: 0.955}
3  3         {Count: 7, Range: 0.09-0.28, Median: 0.12}    {Count: 3, Range: 0.91-0.93, Median: 0.92}
4  4         {Count: 6, Range: 0.06-0.31, Median: 0.185}   {Count: 4, Range: 0.87-0.9, Median: 0.885}
5  5         {Count: 8, Range: 0.02-0.35, Median: 0.185}   {Count: 2, Range: 0.85-0.86, Median: 0.855}
6  6         {Count: 5, Range: 0.01-0.39, Median: 0.37}    {Count: 5, Range: 0.8-0.84, Median: 0.82}
7  7         {Count: 5, Range: 0.4-0.44, Median: 0.42}     {Count: 5, Range: 0.75-0.79, Median: 0.77}
8  8         {Count: 5, Range: 0.45-0.49, Median: 0.47}    {Count: 5, Range: 0.7-0.74, Median: 0.72}
9  9         {Count: 1, Range: 0.5-0.5, Median: 0.5}       {Count: 9, Range: 0.51-0.69, Median: 0.65}
10 10                                                      {Count: 10, Range: 0.55-0.64, Median: 0.595}
melt is not actually needed here. You can do the following instead; the extra line at the end just gives the columns more descriptive names.
sm = d %>% group_by(`Y decile`=ntile(Y,10), X.decile=ntile(X,2)) %>%
  summarise(`X decile` = paste0("{N: ", n(), ", Range: ", min(X),"-",max(X),", Median: ",median(X),"}")) %>%
  dcast(`Y decile` ~ X.decile, value.var="X decile", fill="") %>%
  setNames(., c(names(.)[1], paste0("X decile ", names(.)[-1])))
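As an aside, the same wide layout can probably be produced without reshape2 by using `pivot_wider` from tidyr. This is only a sketch of that alternative, not part of the approach above. The names `y_decile`, `x_group`, `cell` and the `"X group "` prefix are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)

wide <- d %>%
  # 10 Y groups and 2 X groups, as in the code above
  group_by(y_decile = ntile(Y, 10), x_group = ntile(X, 2)) %>%
  # One text cell per (Y group, X group) combination
  summarise(cell = paste0("{N: ", n(), ", Range: ", min(X), "-", max(X),
                          ", Median: ", median(X), "}"),
            .groups = "drop") %>%
  # Spread the X groups into columns; empty combinations become ""
  pivot_wider(names_from = x_group, values_from = cell,
              names_prefix = "X group ", values_fill = "")

print(wide)
```

Here `names_prefix` takes over the job of the final `setNames` line, and `values_fill = ""` plays the role of `fill=""` in `dcast`.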
Answer 2 (score: 0)
The quantile and aggregate functions can help you here.
# Create data frame
d <- data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,- 0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) <- c("X","Y")
# Define bins
bins <- quantile(d$Y, seq(0.1,1,length.out=10))
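# Create indicator variable for which bin each Y belongs in
# Note: the comparison is strict, so rows where Y equals max(Y) match no bin,
# get NA, and are dropped by aggregate() below
ag <- c()
for (i in 1:nrow(d)) {ag[i] <- which(d$Y[i] < bins)[1]}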
# Compute summary statistics
means <- aggregate(d$X, by=list(ag), mean)
medians <- aggregate(d$X, by=list(ag), median)
variances <- aggregate(d$X, by=list(ag), var)
# Put them all into a new data frame
data.frame(group=(1:10),mean=means[,2], median=medians[,2], variance=variances[,2])
##    group      mean median    variance
## 1      1 0.4533333  0.200 0.162250000
## 2      2 0.4709091  0.240 0.148969091
## 3      3 0.3990000  0.265 0.134543333
## 4      4 0.4650000  0.305 0.139583333
## 5      5 0.3525000  0.325 0.114278571
## 6      6 0.4983333  0.385 0.097178788
## 7      7 0.5950000  0.595 0.034250000
## 8      8 0.5950000  0.595 0.017583333
## 9      9 0.5950000  0.595 0.006472222
## 10    10 0.5950000  0.595 0.001171429
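The bin assignment can also be done without a loop. The sketch below is a variation on the code above, not a drop-in replacement. It assumes right-closed decile bins that include both min(Y) and max(Y), so the numbers will differ somewhat from the output shown. The names `breaks`, `group` and `stats` are illustrative.

```r
# Rebuild the example data from the question
d <- data.frame(
  X = seq(0.01, 1, by = 0.01),
  Y = c(seq(0.27, 0.1, -0.01), seq(0.1, 0.5, 0.01), seq(0.5, 0.1, -0.01))
)

# Decile break points of Y, from the 0% to the 100% quantile
breaks <- quantile(d$Y, probs = seq(0, 1, length.out = 11))

# Assumption: right-closed bins; include.lowest = TRUE keeps min(Y) in bin 1
d$group <- cut(d$Y, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Mean, median and variance of X per bin in a single aggregate() call
stats <- aggregate(X ~ group, data = d,
                   FUN = function(x) c(mean = mean(x), median = median(x),
                                       variance = var(x)))

# aggregate() returns the statistics as a matrix column; flatten it
stats <- do.call(data.frame, stats)
print(stats)
```

Because the breaks span the full range of Y, every row lands in a bin and none are silently dropped.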