假设我有以下数据
ID Category Price Month
1 X 2 1
1 X 2 2
1 X 2 3
1 X 2 4
2 X 3 1
2 X 3 2
2 X 3 3
2 X 3 4
3 X 1 1
3 X 1 2
3 X 1 3
3 X 1 4
4 X 10 1
4 X 10 2
4 X 10 3
4 X 10 4
5 Y 5 1
5 Y 5 2
5 Y 5 3
5 Y 5 4
6 Y 2 1
6 Y 2 2
6 Y 2 3
6 Y 2 4
7 Y 1 1
7 Y 1 2
7 Y 1 3
7 Y 1 4
8 Y 10 1
8 Y 10 2
8 Y 10 3
8 Y 10 4
特定类别的产品价格不同,有些价格低,价格高。我想要一个新变量" Price Level
"这表明该产品是低价产品,中价产品还是高价产品。
级别定义如下。 它将特定类别中所有产品的价格分为4个百分位数。
所以表格看起来像这样
ID Category Price Month Price Level
1 X 4 1 Medium
1 X 4 2 Medium
1 X 4 3 Medium
1 X 4 4 Medium
2 X 3 1 Medium
2 X 3 2 Medium
2 X 3 3 Medium
2 X 3 4 Medium
3 X 1 1 Low
3 X 1 2 Low
3 X 1 3 Low
3 X 1 4 Low
4 X 10 1 High
4 X 10 2 High
4 X 10 3 High
4 X 10 4 High
5 Y 5 1 Medium
5 Y 5 2 Medium
5 Y 5 3 Medium
5 Y 5 4 Medium
6 Y 2 1 Low
6 Y 2 2 Low
6 Y 2 3 Low
6 Y 2 4 Low
7 Y 1 1 Low
7 Y 1 2 Low
7 Y 1 3 Low
7 Y 1 4 Low
8 Y 10 1 Low
8 Y 10 2 Low
8 Y 10 3 Low
8 Y 10 4 Low
答案 0 :(得分:1)
您可以lapply
在{。} split
之间Category
,并在每个组中调用cut
和quantile
。 data.frame
和do.call(rbind,
将数据重新组合回单个data.frame:
do.call(rbind, lapply(split(df, df$Category), function(x){
data.frame(x, Price_Level = cut(x$Price,
quantile(x$Price, probs = c(0, .25, .75, 1)),
labels = c('Low', 'Medium', 'High'),
include.lowest = TRUE))
}))
# ID Category Price Month Price_Level
# 1 1 X 2 1 Medium
# 2 1 X 2 2 Medium
# 3 1 X 2 3 Medium
# 4 1 X 2 4 Medium
# 5 2 X 3 1 Medium
# 6 2 X 3 2 Medium
# 7 2 X 3 3 Medium
# 8 2 X 3 4 Medium
# 9 3 X 1 1 Low
# 10 3 X 1 2 Low
# 11 3 X 1 3 Low
# 12 3 X 1 4 Low
# 13 4 X 10 1 High
# 14 4 X 10 2 High
# 15 4 X 10 3 High
# 16 4 X 10 4 High
# 17 5 Y 5 1 Medium
# 18 5 Y 5 2 Medium
# 19 5 Y 5 3 Medium
# 20 5 Y 5 4 Medium
# 21 6 Y 2 1 Medium
# 22 6 Y 2 2 Medium
# 23 6 Y 2 3 Medium
# 24 6 Y 2 4 Medium
# 25 7 Y 1 1 Low
# 26 7 Y 1 2 Low
# 27 7 Y 1 3 Low
# 28 7 Y 1 4 Low
# 29 8 Y 10 1 High
# 30 8 Y 10 2 High
# 31 8 Y 10 3 High
# 32 8 Y 10 4 High
如果您只想返回一个列,但又不想担心分组搞乱您的订单,可以使用等效的
factor(ave(df$Price, df$Category, FUN = function(x){
cut(x,
quantile(x, probs = c(0, .25, .75, 1)),
include.lowest = TRUE)
}), levels = c(1, 2, 3), labels = c('Low', 'Medium', 'High'))
使用dplyr
稍微不那么丑陋的版本:
library(dplyr)
df %>% group_by(Category) %>% mutate(Price_Level = cut(Price,
quantile(Price, c(0, .25, .75, 1)),
labels = c('Low', 'Medium', 'High'),
include.lowest = TRUE))
答案 1 :(得分:0)
我们可以使用data.table
library(data.table)
setDT(df)[, Price_Level := cut(Price,
quantile(Price, c(0, .25, .75, 1)),
labels = c('Low', 'Medium', 'High'),
include.lowest = TRUE), by = Category]