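Using R, I need to build a report of the top 2 employees with the highest expenses in each department, plus an "Others" row for the remaining employees of that department. For example, I need a report like this: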
Dept. EmployeeId Expense
Marketing 12345 100
Marketing 12346 90
Marketing Others 200
Sales 12347 50 <-- There's just one employee with expenses
Research 12348 2000
Research 12349 900
Research Others 10000
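In other words, I need to summarize the data, focusing on the top 2 employees by expenses. The Expense column must still add up to the company's total expenses.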
employeIds <- sample(1000:9999, 20)
depts <- sample(c('Sales', 'Marketing', 'Research'), 20, replace = TRUE)
expenses <- sample(1:1000, 20, replace = TRUE)
df <- data.frame(employeIds, depts, expenses)
# Based on that data, how do I build a table with the top 2 employees with the most expenses in each department, including an "Other" employee per department.
I'm new to R and don't know how to approach this. In SQL I would use the RANK() function and a JOIN, but that is not an option here.
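Answer 0 (score: 4)
Here is a data.table solution:
Create the data: I've also included a case where "Others" does not occur (the number of entries for that department satisfies 1 <= entries <= 2).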
set.seed(45)
employeIds <- sample(1000:9999, 20)
depts <- sample(c('Sales', 'Marketing', 'Research'), 20, replace = TRUE)
expenses <- sample(1:1000, 20, replace = TRUE)
df <- data.frame(employeIds, depts, expenses)
df <- df[-c(6,10,12,18,19), ]
The data.table solution:
library(data.table)
# Keying by (depts, expenses) sorts expenses in increasing order within each department
dt <- data.table(df, key=c("depts", "expenses"))
k <- 2
dt[, if(.N > k) {
    # TRUE for the last k rows of the group, i.e. the k largest expenses
    idx <- seq_len(.N) > .N - k
    list(EmployeeIds = c(employeIds[idx], "Others"),
         Expenses = c(expenses[idx], sum(expenses[!idx])))
} else {
    list(EmployeeIds = as.character(employeIds), Expenses = expenses)
}, by = depts]
# depts EmployeeIds Expenses
# 1: Marketing 4870 567
# 2: Marketing 3167 591
# 3: Marketing Others 2285
# 4: Research 5989 878
# 5: Research 9667 930
# 6: Research Others 1301
# 7: Sales 6700 129
# 8: Sales 3857 714
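The idea: the first step, creating dt with key = depts, expenses, ensures that expenses is sorted in increasing order within each department. Then, depending on the number of entries in each dept, we either create an "Others" entry or we don't.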
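To make the indexing step concrete, here is a minimal base R sketch on a single made-up department. The vector `toy_expenses` and its values are hypothetical, and `seq_len(n) > n - k` is just one simple way to flag the last k positions of a sorted vector.

```r
# Hypothetical expenses for one department, already sorted in increasing
# order (which is what keying on depts, expenses provides within a group).
toy_expenses <- c(10, 20, 30, 40, 50)
k <- 2
n <- length(toy_expenses)

# TRUE for the last k positions, i.e. the k largest values.
idx <- seq_len(n) > n - k

top    <- toy_expenses[idx]        # the k largest expenses
others <- sum(toy_expenses[!idx])  # everything else, collapsed into one number

# Collapsing must not change the department total.
stopifnot(sum(top) + others == sum(toy_expenses))
```

Because the rows are sorted in increasing order, the top spenders come out lowest first, which matches the row order in the output shown above.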
Answer 1 (score: 2)
Probably not the most elegant, but here is a solution:
func <- function(data) {
    # Total expenses per employee within this department
    data1 <- aggregate(data$expenses, list(employeIds=data$employeIds), sum)
    # Rank on the negated totals so that ranks 1 and 2 are the highest spenders;
    # rank without ties.method = "first" will screw things up with identical values
    data1$employeIds[!(rank(-data1$x, ties.method="first") %in% 1:2)] <- 'Others'
    # Sum again so that all 'Others' rows collapse into a single row
    aggregate(data.frame(expenses=data1$x), list(employeIds=data1$employeIds), sum)
}
do.call(rbind, by(df, df$depts, func))
Answer 2 (score: 2)
Another data.table approach (probably closer to the SQL style you know):
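library(data.table)
dt <- data.table(employeIds, depts, expenses)
# ties.method = "first" keeps exactly two rows at rank <= 2 even when expenses tie
dt[, rank := rank(-expenses, ties.method = "first"), by = depts][,
    list("Expenses" = sum(expenses)),
    keyby = list(depts, "Employee" = ifelse(rank <= 2, employeIds, "Other"))
]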
depts Employee Expenses
1: Marketing 6988 986
2: Marketing 7011 940
3: Marketing Other 2614
4: Research 2434 763
5: Research 9852 731
6: Research Other 3397
7: Sales 3120 581
8: Sales 6069 868
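For readers without data.table, the same rank-then-aggregate idea can be sketched in base R with `ave()` and `aggregate()`. This is only an illustration under assumptions: the data frame `toy` and its values are made up, and ties are broken by position with `ties.method = "first"`.

```r
# Hypothetical data, not the question's random sample.
toy <- data.frame(
  employeIds = c(1L, 2L, 3L, 4L, 5L, 6L),
  depts      = c("A", "A", "A", "A", "B", "B"),
  expenses   = c(100, 90, 150, 50, 30, 70)
)

# Rank within each department, highest expense first
# (roughly what RANK() OVER (PARTITION BY ...) would give in SQL).
toy$rank <- ave(-toy$expenses, toy$depts,
                FUN = function(x) rank(x, ties.method = "first"))

# Keep the ID for the top 2 and label everyone else "Other".
toy$Employee <- ifelse(toy$rank <= 2, as.character(toy$employeIds), "Other")

# Sum expenses per (department, label) and sort for readability.
report <- aggregate(expenses ~ depts + Employee, data = toy, FUN = sum)
report <- report[order(report$depts, report$Employee), ]
print(report)
```

In this toy data, department A keeps employees 3 and 1 and folds the other two into a single "Other" row, while department B has only two employees and therefore gets no "Other" row.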
Answer 3 (score: 1)
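# Split by department, then order and label the rows within each one
df <- split(df, df$depts)
df <- lapply(df, FUN=function(x){
    x <- x[order(x$expenses, decreasing=TRUE), ]       # highest expenses first
    x$total.expenses <- sum(x$expenses)                # department total
    x$group <- seq_len(nrow(x))                        # position within the department
    x$group <- ifelse(x$group <= 2, x$group, "Other")  # keep 1 and 2, label the rest "Other"
    x
})
df <- do.call(rbind, df)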
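The code above labels each row but stops short of collapsing the "Other" rows into one line per department. One possible way to finish the job with the same split/lapply pattern is sketched below. The helper name `top_n_with_others`, the data frame `toy`, and all IDs and amounts are hypothetical and chosen only for illustration.

```r
# Hypothetical data: IDs and amounts are made up.
toy <- data.frame(
  employeIds = c(12345L, 12346L, 20001L, 20002L, 12347L, 12348L, 12349L, 20003L),
  depts      = c("Marketing", "Marketing", "Marketing", "Marketing",
                 "Sales", "Research", "Research", "Research"),
  expenses   = c(100, 90, 80, 70, 50, 2000, 900, 500)
)

# Build the report rows for a single department.
top_n_with_others <- function(x, n = 2) {
  x    <- x[order(x$expenses, decreasing = TRUE), ]
  top  <- head(x, n)    # up to n highest spenders
  rest <- tail(x, -n)   # everyone else (zero rows if nrow(x) <= n)
  out  <- data.frame(Dept       = top$depts,
                     EmployeeId = as.character(top$employeIds),
                     Expense    = top$expenses)
  if (nrow(rest) > 0) {
    # Collapse the remaining employees into a single "Others" row.
    out <- rbind(out, data.frame(Dept       = x$depts[1],
                                 EmployeeId = "Others",
                                 Expense    = sum(rest$expenses)))
  }
  out
}

report <- do.call(rbind, lapply(split(toy, toy$depts), top_n_with_others))
rownames(report) <- NULL
print(report)
```

Departments with two or fewer employees, such as Sales here, get no "Others" row, and the Expense column still adds up to the overall total.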