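Unfortunately, I have another problem that I can't solve on my own: I want to list the size that each customer ordered most often. If two or more sizes are tied for the highest count, "-" should be written instead.
I already tried it with data.table, but I'm still stuck on how to solve it ;)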
setDT(DB)[, `:=` (mostorderedsize = .N), by='customerID,size']
Expected result:
mostorderedsize = c("m", "-", 42, "m", "m", 42, "-", "-", "m", "m")
Data:
DB <- data.frame(orderID = c(1,2,3,4,5,6,7,8,9,10),
                 orderDate = c("1.1.14","1.1.14","1.1.14","1.1.14","2.1.14", "2.1.14","2.1.14","2.1.14","2.1.14","2.1.14"),
                 itemID = c(2,3,2,5,12,4,2,3,1,5),
                 size = c("m", "l", 42, "xxl", "m", 42, 39, "m", "xl", 44),
                 customerID = c(1, 2, 3, 1, 1, 3, 2, 2, 1, 1),
                 ItemReturned = c(0, 0, 0, 1, 1, 0, 1, 0, 0, 0))
I hope you can tell me what is wrong, or show me another way to solve the problem.
Answer 0 (score: 1)
Using base R: use tapply to loop over the customerID groups, and table to count how often each size occurs within a group.
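tmp <- with(DB, tapply(size, customerID, function(x) {
  tbl <- table(x)                  # frequency of each size for this customer
  most <- which(tbl == max(tbl))   # positions of the most frequent size(s)
  # more than one size shares the maximum: report a tie as '-'
  if (length(most) > 1) return('-') else return(names(tbl)[most])
}))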
The result has one entry per customer, so it can then be indexed by the customerID column to broadcast it back to every row:
# index by name rather than by position, so this also works when the IDs are not 1, 2, 3, ...
DB$mostorderedsize <- tmp[as.character(DB$customerID)]
DB
   orderID orderDate itemID size customerID ItemReturned mostorderedsize
1        1    1.1.14      2    m          1            0               m
2        2    1.1.14      3    l          2            0               -
3        3    1.1.14      2   42          3            0              42
4        4    1.1.14      5  xxl          1            1               m
5        5    2.1.14     12    m          1            1               m
6        6    2.1.14      4   42          3            0              42
7        7    2.1.14      2   39          2            1               -
8        8    2.1.14      3    m          2            0               -
9        9    2.1.14      1   xl          1            0               m
10      10    2.1.14      5   44          1            0               m
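Since the question started from data.table, the same idea can probably be written there as well. The following is only a sketch and not part of the original answer: it assumes the data.table package is installed, and the helper name most_ordered is made up for illustration. It applies the same table-and-tie check per customerID group and lets `:=` recycle the single result across the rows of each group.

```r
library(data.table)

# Same sample data as in the question
DB <- data.frame(orderID = c(1,2,3,4,5,6,7,8,9,10),
                 orderDate = c("1.1.14","1.1.14","1.1.14","1.1.14","2.1.14",
                               "2.1.14","2.1.14","2.1.14","2.1.14","2.1.14"),
                 itemID = c(2,3,2,5,12,4,2,3,1,5),
                 size = c("m", "l", 42, "xxl", "m", 42, 39, "m", "xl", 44),
                 customerID = c(1, 2, 3, 1, 1, 3, 2, 2, 1, 1),
                 ItemReturned = c(0, 0, 0, 1, 1, 0, 1, 0, 0, 0))
setDT(DB)  # convert to a data.table by reference

# Hypothetical helper: the most frequent value of x, or "-" on a tie
most_ordered <- function(x) {
  tbl <- table(x)
  top <- names(tbl)[tbl == max(tbl)]
  if (length(top) > 1) "-" else top
}

# One value per customer, recycled to every row of that customer's group
DB[, mostorderedsize := most_ordered(size), by = customerID]
```

The attempt in the question groups by both customerID and size, so .N there is the count of each customer/size pair rather than the winning size; grouping by customerID alone and deciding inside the group is what makes the tie check possible.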