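I am trying to build a pivot table in R for my Excel dataset. I need to group one numeric column, Weight, which ranges from 70 to 100, into bins. Each weight has an associated price. For each weight category I need the mean, minimum, and maximum price and the number of products. The data has about 3000 observations of 25 variables; Weight and Price are two of them. An excerpt of the data: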
Weight Price Order No. Date_Ordered Invoiced_Date Region
85 $2300
78 $5600
100 $3490
95 $2450
90 $5890
I am looking for something like:
Weight Count Mean(Price) Min(Price) Max(Price)
70-75(including 75)
75-80
80-85
85-90
90-95
95-100
I am able to get the count, but I can't get the mean, min, and max for each weight category:
#Import the dataset
dataset = read.xlsx('Product_Data.xlsx')
gdataset <- group_by(dataset, Weight)
attach(gdataset)
periods <- seq(from = 70, to = 100, by = 5)
snip <- cut(Weight, breaks = periods, right = TRUE, include.lowest = TRUE)
report <- cbind(table(snip))
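Answer 0 (score: 2)
Your data is a bit sparse, so I'll generate my own for this answer. I'm ignoring the other columns, but their presence in the data shouldn't affect anything.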
set.seed(2)
n <- 100
dat <- data.frame(
Weight = sample(100, size=n, replace=TRUE),
Price = sample(9999, size=n, replace=TRUE)
)
head(dat)
# Weight Price
# 1 19 2010
# 2 71 4276
# 3 58 9806
# 4 17 8289
# 5 95 2870
# 6 95 5959
The first thing to realize is that you need to group the data into bins. In R, this is easily done with cut.
bins <- seq(0, 100, by=5)
dat$WeightBin <- cut(dat$Weight, breaks = bins)
head(dat)
# Weight Price WeightBin
# 1 19 2010 (15,20]
# 2 71 4276 (70,75]
# 3 58 9806 (55,60]
# 4 17 8289 (15,20]
# 5 95 2870 (90,95]
# 6 95 5959 (90,95]
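One detail worth checking for the 70-100 range in the question: by default cut builds intervals that are open on the left, so a weight exactly equal to the lowest break is not assigned to any bin. Here is a minimal sketch of the difference. The weight vector is made up for illustration (70 is added on purpose to hit the lower edge), and the names `default_bins` and `closed_bins` are just placeholders.

```r
# Breaks covering the 70-100 range from the question.
breaks <- seq(70, 100, by = 5)

# Illustrative weights; 70 is included only to exercise the lower edge.
w <- c(70, 75, 78, 85, 100)

# Default behaviour: intervals are (a, b], so 70 falls outside every bin (NA).
default_bins <- cut(w, breaks = breaks)
print(default_bins)

# include.lowest = TRUE closes the first interval, which becomes [70,75].
closed_bins <- cut(w, breaks = breaks, include.lowest = TRUE)
print(closed_bins)
```

With `include.lowest = TRUE`, both 70 and 75 land in the first bin, which matches the "70-75 (including 75)" row in the desired output.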
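Now we split it into groups and run a simple summary function on each group, binding the per-group results back into a single table (strictly a matrix; wrap the call in as.data.frame() if you need a data.frame):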
do.call(rbind, by(dat$Price, dat$WeightBin, function(x) {
setNames(
sapply(c(length, mean, min, max), function(f) f(x)),
c("Count", "Mean(Price)", "Min(Price)", "Max(Price)")
)
}))
# Count Mean(Price) Min(Price) Max(Price)
# (0,5] 5 3919.000 1822 9536
# (5,10] 3 4287.000 1782 5690
# (10,15] 5 5402.200 2739 8989
# (15,20] 11 5192.545 1183 9192
# (20,25] 3 2868.667 137 7363
# (25,30] 6 6594.500 2855 9657
# (30,35] 5 2960.200 777 7486
# (35,40] 6 4937.000 850 9749
# (40,45] 7 5986.000 1307 9527
# (45,50] 4 5957.750 1475 9754
# (50,55] 3 3077.333 1287 4786
# (55,60] 4 4285.500 247 9806
# (60,65] 3 2633.000 450 6656
# (65,70] 4 4244.250 369 9038
# (70,75] 3 2616.333 652 4276
# (75,80] 5 7183.800 3734 8537
# (80,85] 6 4273.667 229 9788
# (85,90] 6 6659.000 1388 9637
# (90,95] 4 4301.750 2870 5959
# (95,100] 7 3967.857 872 8727
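If you want the bin labels as a regular column rather than as row names, the matrix can be converted afterwards. The following is only a sketch of that step. It uses the five Weight/Price pairs from the question's excerpt with the "$" stripped by hand, and the names `dataset`, `stats`, and `report` are placeholders rather than anything from your real workbook.

```r
# Stand-in for the imported data: the five rows shown in the question,
# with Price already numeric (an assumption about how the sheet is read).
dataset <- data.frame(
  Weight = c(85, 78, 100, 95, 90),
  Price  = c(2300, 5600, 3490, 2450, 5890)
)
dataset$WeightBin <- cut(dataset$Weight, breaks = seq(70, 100, by = 5),
                         include.lowest = TRUE)

# Split Price by bin and summarise each group; rbind yields a numeric matrix.
# Bins with no rows come back as NULL from by() and are dropped by rbind.
stats <- do.call(rbind, by(dataset$Price, dataset$WeightBin, function(x) {
  c(Count = length(x), Mean = mean(x), Min = min(x), Max = max(x))
}))

# Turn the matrix into a data.frame and move the bin labels into a column.
report <- cbind(WeightBin = rownames(stats), as.data.frame(stats))
rownames(report) <- NULL
print(report)
```

Note that empty bins (here 70-75) do not appear in the result at all, which may or may not be what you want in a report.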
dplyr
I infer from the presence of group_by that you intend to use dplyr. Here is an alternative way to get similar results (starting from my original data):
library(dplyr)
dat %>%
group_by(Bin = cut(Weight, seq(0, 100, by=5))) %>%
summarize(
Count = n(),
Mean = mean(Price),
Min = min(Price),
Max = max(Price)
)
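Applied to the bins in your question, the same pipeline might look like the sketch below. I'm assuming here that Price could arrive as text such as "$2300", which I can't tell from the excerpt; if it is already numeric, the `gsub` step is harmless but unnecessary. The `dataset` frame just reuses the five rows you posted, and `report` is a placeholder name.

```r
library(dplyr)

# Stand-in for the imported data: the five rows from the question,
# with Price kept as "$"-prefixed text (an assumption, see above).
dataset <- data.frame(
  Weight = c(85, 78, 100, 95, 90),
  Price  = c("$2300", "$5600", "$3490", "$2450", "$5890")
)

report <- dataset %>%
  mutate(
    # Strip currency symbols and thousands separators before converting.
    Price = as.numeric(gsub("[$,]", "", Price)),
    # Bins over 70-100, with the first interval closed so 70 is included.
    Bin = cut(Weight, breaks = seq(70, 100, by = 5), include.lowest = TRUE)
  ) %>%
  group_by(Bin) %>%
  summarize(
    Count = n(),
    Mean  = mean(Price),
    Min   = min(Price),
    Max   = max(Price)
  )

print(report)
```

As with the base R version, bins that contain no rows are left out of the result by default.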