I am trying to replicate SUMIFS functionality in R. I have two data frames.
Data frame 1
employeeCompSummary
ID employeeGroup
1093 Bargaining Unit
1093 Management
1093 Non-Union
55 Bargaining Unit
55 Management
55 Non-Union
Data frame 2
allReported
ID employeeGroup statBenefits regularWages
1093 Management 500.00 10000.00
1093 Management 200.00 60000.00
1093 Bargaining Unit 100.00 20000.00
1093 Bargaining Unit 150.00 30000.00
1093 Non-Union 500.00 60000.00
55 Bargaining Unit 750.00 65000.00
55 Bargaining Unit 500.00 75000.00
55 Management 250.00 45000.00
55 Management 850.00 90000.00
I am trying to sum statBenefits (and later regularWages) to create a new table with the following result:
ID employeeGroup statBenefits
1093 Bargaining Unit 250.00
1093 Management 700.00
1093 Non-Union 500.00
55 Bargaining Unit 1250.00
55 Management 1100.00
55 Non-Union 0.00
I tried the following:
library(data.table)
setDT(allReported)[, list(total=sum(statbenefits)), list(employeeCompSummary, employeeGroup)]
and got the following error:
Error in `[.data.table`(setDT(allReported), , list(total = sum(statbenefits)), : column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
I also tried:
sumTest <- aggregate(allReported, by = list(employeeCompSummary), sum)
and got the following error:
Error in aggregate.data.frame(allReported, by = list(employeeCompSummary), : arguments must have same length
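Any help would be greatly appreciated. I looked at other questions that seemed to address this, but could not find an answer that worked. I will be doing this task in several places, so I would like to know whether anyone knows a simple technique. As always, thanks in advance to the wonderful community on Stack Overflow.
Edit: dput() of the two example tables: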
allReported <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))
employeeCompSummary <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Answer 0 (score: 2)
Edited based on your comment: one way is to use data.table like this
library(data.table)
dt1 <- data.table(structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55),
employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)),
row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))
dt2 <- data.table(structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))
dt1[dt2][, lapply(.SD, sum), .SDcols = c("statBenefits", "regularWages"), by = c("ID", "employeeGroup")]
which gives
ID employeeGroup statBenefits regularWages
1: 55 Bargaining Unit 1250 140000
2: 55 Management 1100 135000
3: 55 Non-Union NA NA
4: 1093 Bargaining Unit 250 50000
5: 1093 Management 700 70000
6: 1093 Non-Union 500 60000
You can replace the NA values with 0 afterwards.
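The NA replacement mentioned above could look roughly like the sketch below. It rebuilds the same example data with plain `data.table()` calls so that it runs on its own; the name `res` is only an illustrative choice and does not appear in the answer.

```r
library(data.table)

# Same example data as in the answer, rebuilt here so the sketch is self-contained.
dt1 <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850),
  regularWages = c(10000, 60000, 20000, 30000, 60000, 65000, 75000, 45000, 90000),
  key = c("ID", "employeeGroup")
)
dt2 <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union"),
  key = c("ID", "employeeGroup")
)

# Keyed join followed by grouped sums, as in the answer.
# `res` is a hypothetical name for the result.
res <- dt1[dt2][, lapply(.SD, sum),
                .SDcols = c("statBenefits", "regularWages"),
                by = c("ID", "employeeGroup")]

# Replace the NA produced by the unmatched (55, Non-Union) row with 0, by reference.
res[is.na(statBenefits), statBenefits := 0]
res[is.na(regularWages), regularWages := 0]
res
```

Alternatively, calling `lapply(.SD, sum, na.rm = TRUE)` in the grouping step should give 0 for the unmatched group directly, because the sum of an empty set of values is 0.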
Answer 1 (score: 2)
I would do...
library(data.table)
# don't use setDT, since who knows if it works on tibbeldies
ar = data.table(allReported)
ecs = data.table(employeeCompSummary)
ecs[, total := ar[.SD, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI][, V1]]
ID employeeGroup total
1: 1093 Bargaining Unit 250
2: 1093 Management 700
3: 1093 Non-Union 500
4: 55 Bargaining Unit 1250
5: 55 Management 1100
6: 55 Non-Union NA
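Although the OP asked for a new table, this code adds the column to ecs instead. The new table and ecs would contain the same set of rows, so carrying both around seems like wasted effort. Dropping the column later is trivial.
If you want to see how this "update join" works, try working backwards...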
ar[ecs, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI]
# or
ar[ecs, on=.(ID, employeeGroup)]
Note that .SD == ecs in the original code. See ?.SD.
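If a separate table is wanted after all, the same join can be run without the update step. The following is only a sketch under the assumption that the example data matches the dput() in the question; `totals` is a hypothetical name, and `na.rm = TRUE` is added here so that the unmatched group sums to 0 instead of NA.

```r
library(data.table)

# Example data rebuilt from the question's dput() so the sketch runs on its own.
ar <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850)
)
ecs <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union")
)

# New table: one row per row of ecs, with the sum of the matching rows of ar.
# `totals` is a hypothetical name; na.rm = TRUE is an addition to the answer's code.
totals <- ar[ecs, on = .(ID, employeeGroup),
             .(total = sum(x.statBenefits, na.rm = TRUE)), by = .EACHI]
totals

# The update join from the answer, followed by dropping the added column again.
ecs[, total := ar[.SD, on = .(ID, employeeGroup), sum(x.statBenefits), by = .EACHI][, V1]]
ecs[, total := NULL]
```

Assigning `NULL` with `:=` removes the column by reference, which is what makes dropping it later cheap.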
Answer 2 (score: 1)
You can do this with the dplyr and magrittr (for %>%) packages:
library(dplyr)
library(magrittr)
df1 <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))
result <- left_join(df1, df2, by = c("ID", "employeeGroup")) %>%
  group_by(ID, employeeGroup) %>%
  summarize(
    statBenefits = sum(statBenefits, na.rm = TRUE),
    regularWages = sum(regularWages, na.rm = TRUE)
  )
result