Here is the setup (it really isn't that complicated...):
Table JobTitles:
| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A | A1 | 1 | 5 |
| A | A2 | 6 | 10 |
| A | A3 | 11 | 15 |
| B | B1 | 2 | 4 |
| B | B2 | 5 | 7 |
| B | B3 | 8 | 11 |
| C | C1 | 5 | 12 |
| C | C2 | 13 | 14 |
| C | C3 | 15 | 18 |
Table Transactions:
| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A | 2 | 5 |
| A | 3 | 10 |
| A | 12 | 5 |
| A | 12 | 10 |
| B | 3 | 5 |
| B | 3 | 10 |
| B | 10 | 5 |
| C | 16 | 10 |
| C | 17 | 5 |
| C | 17 | 10 |
| C | 17 | 5 |
Desired output:
| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A | A1 | 1 | 5 | 15 |
| A | A2 | 6 | 10 | 0 |
| A | A3 | 11 | 15 | 15 |
| B | B1 | 2 | 4 | 15 |
| B | B2 | 5 | 7 | 0 |
| B | B3 | 8 | 11 | 5 |
| C | C1 | 5 | 12 | 0 |
| C | C2 | 13 | 14 | 0 |
| C | C3 | 15 | 18 | 30 |
This SQL gives me the desired output:
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
Transactions t
on jt.PersonId = t.PersonId and
t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
The tables in R:
JobTitles <- structure(list(PersonID = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), JobTitle = structure(1:9, .Label = c("A1",
"A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"), class = "factor"),
StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L), EndDate = c(5L,
10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)), .Names = c("PersonID",
"JobTitle", "StartDate", "EndDate"), class = "data.frame", row.names = c(NA,
-9L))
Transactions <- structure(list(PersonID = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L,
17L), Amt = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L,
5L)), .Names = c("PersonID", "TransDate", "Amt"), class = "data.frame", row.names = c(NA,
-11L))
How can I translate the SQL into working dplyr code? I am stuck at the left_join:
left_join(JobTitles, Transactions,
by = c("PersonID" = "PersonID",
"StartDate" < "TransDate",
"EndDate" >= "TransDate"))
# Error: cannot join on columns 'TRUE' x '' : index out of bounds
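A likely reason for this error, sketched below in base R: the `by` argument is an ordinary character vector, so the two comparisons are evaluated as plain string comparisons before left_join ever sees them. The variable name `by_arg` is only illustrative.

```r
# The comparisons inside c() are evaluated immediately as string comparisons,
# so `by` ends up holding the text "TRUE"/"FALSE" rather than a join condition.
by_arg <- c("PersonID" = "PersonID",
            "StartDate" < "TransDate",
            "EndDate" >= "TransDate")

# Two of the three elements are unnamed; their values are coerced logicals.
print(names(by_arg))
print(unname(by_arg))
```

The unnamed element "TRUE" is then treated as a column name, which would explain the 'TRUE' x '' in the error message.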
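Answer 0 (score: 4)
Similar to @zx8754's idea, I came up with the following. I tried to use between() because that is what the SQL script uses, but the result is essentially the same as @zx8754's. I then did the aggregation and got an intermediate result (i.e., foo). Finally, I merged it with two columns from JobTitles (i.e., PersonID and JobTitle) so that job titles without any transactions are kept, which gives the expected result.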
library(dplyr)

# Join every transaction to every job title of the same person,
# then keep only transactions that fall within the job's date range
foo <- left_join(JobTitles, Transactions, by = "PersonID") %>%
  rowwise() %>%
  mutate(check = between(TransDate, StartDate, EndDate)) %>%
  filter(check) %>%
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt))
### Merge the master frame including all combs of PersonID and JobTitle, and foo
foo2 <- left_join(JobTitles[,c(1,2)], foo, by = c("PersonID", "JobTitle"))
### NA to 0
foo2$total[is.na(foo2$total)] <- 0
# PersonID JobTitle total
#1 A A1 15
#2 A A2 0
#3 A A3 15
#4 B B1 15
#5 B B2 0
#6 B B3 5
#7 C C1 0
#8 C C2 0
#9 C C3 30
Or slightly shorter, in a single pipe:
left_join(JobTitles, Transactions, by = "PersonID") %>%
  # inclusive bounds, to match SQL's BETWEEN
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[,c(1,2)], ., by = c("PersonID", "JobTitle")) %>%
  mutate(total = replace(total, is.na(total), 0))
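As a further alternative, newer dplyr versions can express the range condition directly in the join. The following is only a sketch, assuming dplyr 1.1.0 or later (where join_by() is available). The tables are rebuilt with data.frame() so the block runs on its own, and the name `result` is just illustrative.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which provides join_by()

# Same data as in the question, rebuilt so this block is self-contained
JobTitles <- data.frame(
  PersonID  = rep(c("A", "B", "C"), each = 3),
  JobTitle  = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"),
  StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L),
  EndDate   = c(5L, 10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)
)
Transactions <- data.frame(
  PersonID  = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L, 17L),
  Amt       = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L)
)

result <- JobTitles %>%
  # Non-equi left join: same person, and TransDate within [StartDate, EndDate]
  left_join(Transactions,
            by = join_by(PersonID, StartDate <= TransDate, EndDate >= TransDate)) %>%
  group_by(PersonID, JobTitle, StartDate, EndDate) %>%
  # Job titles without transactions have Amt = NA after the join; na.rm turns them into 0
  summarise(Amt = sum(Amt, na.rm = TRUE), .groups = "drop")

print(result)
```

Because the left join keeps unmatched job titles, this version needs no second join back to JobTitles, and it also retains the StartDate and EndDate columns from the desired output.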