Creating a pivot table and grouping a field in R

Time: 2018-01-22 22:07:48

Tags: r excel algorithm

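I am trying to use R to build a pivot table for my Excel dataset. I need to group the numbers in a column called Weight, which ranges from 70 to 100. Each weight has an associated price. For each weight category I need the mean, maximum, and minimum (of Price, as in the table below) and the number of products. There are about 3000 observations of 25 variables; Weight and Price are two of them. Data excerpt: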

Weight   Price   Order No.   Date_Ordered   Invoiced_Date   Region  
85       $2300 
78       $5600
100      $3490
95       $2450
90       $5890

I am looking for something like:
    Weight                       Count    Mean(Price)   Min(Price)   Max(Price)
70-75(including 75)     
75-80
80-85
85-90
90-95
95-100

I am able to get the count, but I can't get the mean, minimum, and maximum for each weight category:

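library(dplyr)  # provides group_by

# Import the dataset (read.xlsx comes from an Excel-reading package such as openxlsx)
dataset = read.xlsx('Product_Data.xlsx')
gdataset <- group_by(dataset, Weight)
attach(gdataset)
# Bin edges: 70, 75, ..., 100
periods <- seq(from = 70, to = 100, by = 5)
# Assign each weight to its bin; the lowest bin also includes 70
snip <- cut(Weight, breaks = periods, right = TRUE, include.lowest = TRUE)
# Counts per bin
report <- cbind(table(snip))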

1 answer:

Answer 0 (score: 2)

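Your data is a bit sparse, so I'll generate my own for this answer. I'll ignore the other columns, but their presence in the data shouldn't affect anything.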

set.seed(2)
n <- 100
dat <- data.frame(
  Weight = sample(100, size=n, replace=TRUE),
  Price = sample(9999, size=n, replace=TRUE)
)
head(dat)
#   Weight Price
# 1     19  2010
# 2     71  4276
# 3     58  9806
# 4     17  8289
# 5     95  2870
# 6     95  5959

Base R

The first thing to realize is that you need to group your data into bins. In R, this is easily done with cut.

bins <- seq(0, 100, by=5)
dat$WeightBin <- cut(dat$Weight, breaks = bins)
head(dat)
#   Weight Price WeightBin
# 1     19  2010   (15,20]
# 2     71  4276   (70,75]
# 3     58  9806   (55,60]
# 4     17  8289   (15,20]
# 5     95  2870   (90,95]
# 6     95  5959   (90,95]
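One detail worth checking when adapting this to the 70-100 range from the question: by default cut builds intervals that are open on the left, so a weight of exactly 70 would fall outside the first bin. A minimal sketch, using made-up weights (the vector w is hypothetical, chosen only to sit on the bin edges):

```r
# Hypothetical weights placed on and around the bin edges
w <- c(69, 70, 75, 76, 100)
breaks <- seq(70, 100, by = 5)

# Default: intervals are (a, b], so 70 itself is not in any bin and becomes NA
default_bins <- cut(w, breaks = breaks)

# include.lowest = TRUE closes the first interval, giving [70,75]
closed_bins <- cut(w, breaks = breaks, include.lowest = TRUE)

# Side-by-side view; values outside 70-100 (here 69) are NA in both columns
print(data.frame(w, default_bins, closed_bins))
```

Values outside the range of the breaks come back as NA, so it may be worth checking for NA bins before summarizing.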

Now we split the data into groups, run a simple summary function on each group, and re-wrap the results into a data.frame:

do.call(rbind, by(dat$Price, dat$WeightBin, function(x) {
  setNames(
    sapply(c(length, mean, min, max), function(f) f(x)),
    c("Count", "Mean(Price)", "Min(Price)", "Max(Price)")
  )
}))
#          Count Mean(Price) Min(Price) Max(Price)
# (0,5]        5    3919.000       1822       9536
# (5,10]       3    4287.000       1782       5690
# (10,15]      5    5402.200       2739       8989
# (15,20]     11    5192.545       1183       9192
# (20,25]      3    2868.667        137       7363
# (25,30]      6    6594.500       2855       9657
# (30,35]      5    2960.200        777       7486
# (35,40]      6    4937.000        850       9749
# (40,45]      7    5986.000       1307       9527
# (45,50]      4    5957.750       1475       9754
# (50,55]      3    3077.333       1287       4786
# (55,60]      4    4285.500        247       9806
# (60,65]      3    2633.000        450       6656
# (65,70]      4    4244.250        369       9038
# (70,75]      3    2616.333        652       4276
# (75,80]      5    7183.800       3734       8537
# (80,85]      6    4273.667        229       9788
# (85,90]      6    6659.000       1388       9637
# (90,95]      4    4301.750       2870       5959
# (95,100]     7    3967.857        872       8727
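The same split-and-summarize step can also be written with aggregate, which some may find easier to read than do.call(rbind, by(...)). This is only a sketch of an alternative, not the approach used above; the data frame toy and its values are invented for illustration:

```r
# Hypothetical data: eight products with weights between 70 and 100
toy <- data.frame(
  Weight = c(72, 75, 78, 85, 90, 88, 100, 95),
  Price  = c(100, 300, 500, 200, 400, 600, 700, 900)
)

# Bin the weights; include.lowest = TRUE keeps a weight of exactly 70 in the first bin
toy$WeightBin <- cut(toy$Weight, breaks = seq(70, 100, by = 5), include.lowest = TRUE)

# One row per bin; the summary function returns a named vector,
# which aggregate stores as a matrix column called Price
res <- aggregate(
  Price ~ WeightBin,
  data = toy,
  FUN = function(x) c(Count = length(x), Mean = mean(x), Min = min(x), Max = max(x))
)

# Flatten the matrix column into ordinary columns (Price.Count, Price.Mean, ...)
flat <- do.call(data.frame, res)
print(flat)
```

Note that the formula interface of aggregate drops bins that contain no rows, whereas table in the question's code lists them with a count of zero.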

dplyr

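From the presence of group_by, I infer that you intend to use dplyr. Here is an alternative way to get a similar result (starting from my original data):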

library(dplyr)
dat %>%
  group_by(Bin = cut(Weight, seq(0, 100, by=5))) %>%
  summarize(
    Count = n(),
    Mean = mean(Price),
    Min = min(Price),
    Max = max(Price)
  )
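Applied to the 70-100 range from the question, the same pipeline might look like the sketch below. It uses the five rows from the question's data excerpt and assumes Price is stored as a plain number (that is, the dollar sign is only Excel formatting); the names excerpt and report are made up for this example.

```r
library(dplyr)

# The five rows from the question's excerpt, with Price assumed to be numeric
excerpt <- data.frame(
  Weight = c(85, 78, 100, 95, 90),
  Price  = c(2300, 5600, 3490, 2450, 5890)
)

report <- excerpt %>%
  # include.lowest = TRUE so that a weight of exactly 70 lands in [70,75]
  mutate(Bin = cut(Weight, breaks = seq(70, 100, by = 5), include.lowest = TRUE)) %>%
  # Drop weights outside 70-100, which cut marks as NA
  filter(!is.na(Bin)) %>%
  group_by(Bin) %>%
  summarize(
    Count = n(),
    Mean  = mean(Price),
    Min   = min(Price),
    Max   = max(Price)
  )

print(report)
```

With this excerpt the [70,75] bin has no rows, so by default it does not appear in the result at all.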