Translating SQL to dplyr: left join with multiple conditions

Asked: 2014-12-09 15:54:56

Tags: r dplyr

Here is the setup (it's really not that complicated...):

JobTitles

| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A        | A1       | 1         | 5       |
| A        | A2       | 6         | 10      |
| A        | A3       | 11        | 15      |
| B        | B1       | 2         | 4       |
| B        | B2       | 5         | 7       |
| B        | B3       | 8         | 11      |
| C        | C1       | 5         | 12      |
| C        | C2       | 13        | 14      |
| C        | C3       | 15        | 18      |

Transactions

| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A        | 2         | 5   |
| A        | 3         | 10  |
| A        | 12        | 5   |
| A        | 12        | 10  |
| B        | 3         | 5   |
| B        | 3         | 10  |
| B        | 10        | 5   |
| C        | 16        | 10  |
| C        | 17        | 5   |
| C        | 17        | 10  |
| C        | 17        | 5   |

Desired output

| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A        | A1       | 1         | 5       | 15  |
| A        | A2       | 6         | 10      | 0   |
| A        | A3       | 11        | 15      | 15  |
| B        | B1       | 2         | 4       | 15  |
| B        | B2       | 5         | 7       | 0   |
| B        | B3       | 8         | 11      | 5   |
| C        | C1       | 5         | 12      | 0   |
| C        | C2       | 13        | 14      | 0   |
| C        | C3       | 15        | 18      | 30  |

This SQL gives me the desired output:

select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
     Transactions t
     on jt.PersonId = t.PersonId and
        t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;

The tables in R:

JobTitles <- structure(list(PersonID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), JobTitle = structure(1:9, .Label = c("A1", 
"A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"), class = "factor"), 
    StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L), EndDate = c(5L, 
    10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)), .Names = c("PersonID", 
"JobTitle", "StartDate", "EndDate"), class = "data.frame", row.names = c(NA, 
-9L))
Transactions <- structure(list(PersonID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L, 
    17L), Amt = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 
    5L)), .Names = c("PersonID", "TransDate", "Amt"), class = "data.frame", row.names = c(NA, 
-11L))

How can I translate the SQL into working dplyr code? I'm stuck at the left_join:

left_join(JobTitles, Transactions, 
          by = c("PersonID" = "PersonID", 
                 "StartDate" < "TransDate",
                 "EndDate" >= "TransDate"))
# Error: cannot join on columns 'TRUE' x '' : index out of bounds
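
A note on the error message: in the call above, `"StartDate" < "TransDate"` is most likely evaluated by R as an ordinary string comparison before `left_join()` ever sees it, so `by` ends up holding the literal values `"TRUE"` and `"FALSE"` instead of join conditions. A minimal base R sketch that shows what the `by` vector contains:

```r
# Build the same `by` argument that was passed to left_join() above.
# The comparisons are between string literals, not between columns.
by <- c("PersonID" = "PersonID",
        "StartDate" < "TransDate",   # "S..." sorts before "T...", so TRUE
        "EndDate" >= "TransDate")    # "E..." sorts before "T...", so FALSE

# c() coerces the logicals to character; only the first element has a name.
print(by)
print(names(by))
```

A character `by` is read as a set of column names, so dplyr then goes looking for a column called `TRUE`, which is consistent with the error shown.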

1 answer:

Answer 0 (score: 4)

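Similar to @zx8754's idea, I came up with the following. I tried using between(), since that is what the SQL script uses, but the result is essentially the same as @zx8754's. I then did the aggregation and stored the result (i.e. foo). Finally, I merged it with two columns from JobTitles (PersonID and JobTitle) so that job titles without any matching transaction are kept, which gives the expected result.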

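library(dplyr)

### Join on PersonID only, then keep the rows whose TransDate falls inside the job period
foo <- left_join(JobTitles, Transactions, by = "PersonID") %>%
       rowwise() %>%
       mutate(check = between(TransDate, StartDate, EndDate)) %>%
       filter(check == TRUE) %>%
       group_by(PersonID, JobTitle) %>%
       summarise(total = sum(Amt))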

### Merge the master frame (all combinations of PersonID and JobTitle) with foo
foo2 <- left_join(JobTitles[,c(1,2)], foo)

### NA to 0 
foo2$total[is.na(foo2$total)] <- 0

#  PersonID JobTitle total
#1        A       A1    15
#2        A       A2     0
#3        A       A3    15
#4        B       B1    15
#5        B       B2     0
#6        B       B3     5
#7        C       C1     0
#8        C       C2     0
#9        C       C3    30

Or, slightly shorter and in a single pipeline:

left_join(JobTitles, Transactions) %>%
  # SQL BETWEEN is inclusive on both ends, so use >= and <=
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[,c(1,2)], .) %>%
  mutate(total = replace(total, is.na(total), 0))
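
For readers on a newer dplyr: since version 1.1.0, `join_by()` accepts inequality conditions, so the SQL can be mirrored more directly. The following is a sketch under that assumption and is not part of the original answer. The sample data is rebuilt with `data.frame()` (character instead of factor columns) only to keep the block self-contained, and `result` is a name chosen here for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() is available

# Same data as in the question, rebuilt so the block runs on its own.
JobTitles <- data.frame(
  PersonID  = rep(c("A", "B", "C"), each = 3),
  JobTitle  = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"),
  StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L),
  EndDate   = c(5L, 10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)
)
Transactions <- data.frame(
  PersonID  = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L, 17L),
  Amt       = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L)
)

result <- JobTitles %>%
  # Left-hand names refer to JobTitles, right-hand names to Transactions.
  # The two inequalities together are the inclusive SQL BETWEEN.
  left_join(Transactions,
            by = join_by(PersonID,
                         StartDate <= TransDate,
                         EndDate >= TransDate)) %>%
  group_by(PersonID, JobTitle, StartDate, EndDate) %>%
  # Unmatched job titles carry Amt = NA; na.rm = TRUE turns their sum into 0,
  # which plays the role of coalesce(sum(amt), 0).
  summarise(Amt = sum(Amt, na.rm = TRUE), .groups = "drop")

print(result)
```

Because the left join keeps every row of JobTitles, the second join back onto `JobTitles[,c(1,2)]` is no longer needed in this variant, and StartDate and EndDate stay in the output as in the desired table.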