R中跨数据帧的条件求和

时间:2018-09-19 21:46:31

标签: r data.table aggregate

我正在尝试在R中复制SUMIFS功能。我有两个数据帧。

数据帧1

allReported

ID       employeeGroup
1093     Bargaining Unit
1093     Management
1093     Non-Union
55       Bargaining Unit
55       Management
55       Non-Union

数据帧2

employeeCompSummary

ID       employeeGroup      statBenefits    regularWages
1093     Management         500.00          10000.00
1093     Management         200.00          60000.00
1093     Bargaining Unit    100.00          20000.00
1093     Bargaining Unit    150.00          30000.00
1093     Non-Union          500.00          60000.00
55       Bargaining Unit    750.00          65000.00
55       Bargaining Unit    500.00          75000.00
55       Management         250.00          45000.00
55       Management         850.00          90000.00

我正在尝试将statBenefits(然后是以后的正常工资)加起来以创建一个新表,该表将产生以下结果:

ID       employeeGroup          statBenefits
1093     Bargaining Unit        250.00
1093     Management             700.00
1093     Non-Union              500.00
55       Bargaining Unit        1250.00
55       Management             1100.00
55       Non-Union              0.00

我尝试了以下方法:

library(data.table)
setDT(allReported)[, list(total=sum(statbenefits)), list(employeeCompSummary, employeeGroup)]

,并出现以下错误:

Error in `[.data.table`(setDT(allReported), , list(total = sum(statbenefits)),  :   column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

我也尝试过:

sumTest <- aggregate(allReported, by = list(employeeCompSummary), sum)

,并出现以下错误:

**Error in aggregate.data.frame(allReported, by = list(employeeCompSummary),  :   arguments must have same length**

任何人都能提供的帮助将不胜感激。我查看了其他似乎可以解决此问题的问题,但未能找到有效的答案。我将在多个方面完成此任务,因此我想知道是否有人知道这种简单的技术。与往常一样,在此先感谢Stack Overflow上的精彩社区。

编辑两个示例表的dput():

allReported <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

employeeCompSummary <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

 . 

3 个答案:

答案 0 :(得分:2)

根据您的评论进行编辑:一种方法是使用data.table这样的方式

library(data.table)
dt1 <- data.table(structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), 
               employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), 
          row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))

dt2 <- data.table(structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), 
          row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))



dt1[dt2][, lapply(.SD, sum), .SDcols = c("statBenefits", "regularWages"), by = c("ID", "employeeGroup")]

给出

ID   employeeGroup statBenefits regularWages
1:   55 Bargaining Unit         1250       140000
2:   55      Management         1100       135000
3:   55       Non-Union           NA           NA
4: 1093 Bargaining Unit          250        50000
5: 1093      Management          700        70000
6: 1093       Non-Union          500        60000

您以后可以将NA值替换为0

答案 1 :(得分:2)

我愿意...

library(data.table)

# don't use setDT, since who knows if it works on tibbeldies
ar = data.table(allReported)
ecs = data.table(employeeCompSummary)

ecs[, total := ar[.SD, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI][, V1]]

     ID   employeeGroup total
1: 1093 Bargaining Unit   250
2: 1093      Management   700
3: 1093       Non-Union   500
4:   55 Bargaining Unit  1250
5:   55      Management  1100
6:   55       Non-Union    NA

即使OP请求了新表,此代码也会将列添加到ecs中。新表和ecs之间的行集是相同的,因此携带这两个行似乎浪费了精力。稍后删除列很简单。

如果您想知道此“更新联接”的工作原理,请尝试向后工作...

ar[ecs, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI]

# or

ar[ecs, on=.(ID, employeeGroup)]

注意原始代码中的.SD == ecs。参见?.SD

答案 2 :(得分:1)

您可以使用dplyrmagrittr(对于%>%)包来做到这一点-

library(dplyr)
library(magrittr)

df1 <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

result <- left_join(df1, df2, by = c("ID", "employeeGroup")) %>%
  group_by(ID, employeeGroup) %>%
  summarize(
    statBenefits = sum(statBenefits, na.rm = T),
    regularWages = sum(regularWages, na.rm = T)
  )
result