如何使用不精确的匹配标识符匹配两个data.frames(一个标识符必须在另一个标识符的范围内)

时间:2011-11-04 15:05:27

标签: r data.table

我有以下匹配问题:我有两个data.frames,一个每月观察一次(每个公司ID),另一个每季度观察一次(每个公司ID;注意该季度意味着财政季度;因此1Q = 1月,2月,3月不一定正确,而且财政季度不一定是3个月)。

对于每个月和公司,我希望得到该季度的正确价值。因此,几个月的四分之一具有相同的价值。作为示例,请参阅以下代码:

monthlyData <- data.frame(ID = rep(c("A", "B"), each = 5),
                  Month = rep(1:5, times = 2),
                  MonValue = 1:10)
monthlyData
   ID Month MonValue
1   A     1        1
2   A     2        2
3   A     3        3
4   A     4        4
5   A     5        5
6   B     1        6
7   B     2        7
8   B     3        8
9   B     4        9
10  B     5       10

#Quarterly data, i.e. the value of every quarter has to be matched to several months in d1
#However, I want to match fiscal quarters, which means that one quarter is not necessarily 3 month long
qtrData <- data.frame(ID = rep(c("A", "B"), each = 2),
                  startMonth = c(1, 4, 1, 3),
                  endMonth   = c(3, 5, 2, 5),
                  QTRValue   = 1:4)
qtrData
  ID startMonth endMonth QTRValue
1  A          1        3        1
2  A          4        5        2
3  B          1        2        3
4  B          3        5        4

#Desired output
   ID Month MonValue QTRValue
1   A     1        1        1
2   A     2        2        1
3   A     3        3        1
4   A     4        4        2
5   A     5        5        2
6   B     1        6        3
7   B     2        7        3
8   B     3        8        4
9   B     4        9        4
10  B     5       10        4

注意:这个问题在几个月前发布在R-help上,但我没有得到任何答案,我自己找到了解决方案(参见R-help)。然而,现在,我在stackoverflow上发布了一个问题,我对data.table提出了这个问题的问题,Andrie让我再次发布这个问题,因为他显然有一个很好的解决方案(见Question on SO

更新:请参阅Matthew Dowle的评论:真实数据看起来如何?

这个数据更加真实。我添加了几行,但唯一更改的主要部分是endMonth中的列qtrData。更确切地说,startMonth不一定是上一季度的endMonth再加上一个月。因此,使用roll选项,我认为你需要另一行代码(如果没有,你会得到20行,但是使用Andrie的解决方案,这是所需的一行,你得到17行)。如果我在这里不遗漏任何东西,那么就没有任何性能差异了。

monthlyData_new <- data.table(ID = rep(c("A", "B"), each = 10),
                  Month = rep(1:10, times = 2),
                  MonValue = 1:20)

qtrData_new <- data.table(ID = rep(c("A", "B"), each = 3),
                  startMonth = c(1, 4, 7, 1, 3, 8),
                  endMonth   = c(3, 5, 10, 2, 5, 10),
                  QTRValue   = 1:6)

setkey(qtrData_new, ID)
setkey(monthlyData_new, ID)

qtrData1 <- qtrData_new
setkey(qtrData1, ID, startMonth)
monthlyData1 <- monthlyData_new
setkey(monthlyData1, ID, Month)

withTable1 <- function(){
  xx <- qtrData1[monthlyData1, roll=TRUE]
  xx <- xx[startMonth <= endMonth]

}

withTable2 <- function(){
  yy <- monthlyData_new[qtrData_new][Month >= startMonth & Month <= endMonth]

}

benchmark(withTable1, withTable2, replications=1e6)
        test replications elapsed relative user.self sys.self user.child sys.child
1 withTable1      1000000   4.244 1.028599     4.232    0.008          0         0
2 withTable2      1000000   4.126 1.000000     4.096    0.028          0         0

2 个答案:

答案 0 :(得分:4)

试试这个:

mD = data.table(monthlyData, key="ID,Month")
qD = data.table(qtrData,key="ID,startMonth")
qD[mD,roll=TRUE]
      ID startMonth endMonth QTRValue MonValue
 [1,]  A          1        3        1        1
 [2,]  A          2        3        1        2
 [3,]  A          3        3        1        3
 [4,]  A          4        5        2        4
 [5,]  A          5        5        2        5
 [6,]  B          1        2        3        6
 [7,]  B          2        2        3        7
 [8,]  B          3        5        4        8
 [9,]  B          4        5        4        9
[10,]  B          5        5        4       10

那应该快得多。

编辑:回答有问题的后续编辑。一种方法是使用NA存储缺失月份的位置。我发现更容易看到一个时间序列列(不规则的间隙和NA),而不是两个制作一系列范围。

> mD <- data.table(ID = rep(c("A", "B"), each = 10),
+                  Month = rep(1:10, times = 2),
+                  MonValue = 1:20,  key="ID,Month")
>                  
> qD <- data.table(ID = rep(c("A", "B"), each = 4),
+                   Month = c(1,4,6,7, 1,3,6,8),
+                   QtrValue = c(1,2,NA,3, 4,5,NA,6),
+                   key="ID,Month")
>                   
> mD
      ID Month MonValue
 [1,]  A     1        1
 [2,]  A     2        2
 [3,]  A     3        3
 [4,]  A     4        4
 [5,]  A     5        5
 [6,]  A     6        6
 [7,]  A     7        7
 [8,]  A     8        8
 [9,]  A     9        9
[10,]  A    10       10
[11,]  B     1       11
[12,]  B     2       12
[13,]  B     3       13
[14,]  B     4       14
[15,]  B     5       15
[16,]  B     6       16
[17,]  B     7       17
[18,]  B     8       18
[19,]  B     9       19
[20,]  B    10       20
> qD
     ID Month QtrValue
[1,]  A     1        1
[2,]  A     4        2
[3,]  A     6       NA     # missing for 1 month  (6)
[4,]  A     7        3
[5,]  B     1        4
[6,]  B     3        5
[7,]  B     6       NA     # missing for 2 months (6 and 7)
[8,]  B     8        6
> qD[mD,roll=TRUE]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     6       NA        6
 [7,]  A     7        3        7
 [8,]  A     8        3        8
 [9,]  A     9        3        9
[10,]  A    10        3       10
[11,]  B     1        4       11
[12,]  B     2        4       12
[13,]  B     3        5       13
[14,]  B     4        5       14
[15,]  B     5        5       15
[16,]  B     6       NA       16
[17,]  B     7       NA       17
[18,]  B     8        6       18
[19,]  B     9        6       19
[20,]  B    10        6       20
> qD[mD,roll=TRUE][!is.na(QtrValue)]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     7        3        7
 [7,]  A     8        3        8
 [8,]  A     9        3        9
 [9,]  A    10        3       10
[10,]  B     1        4       11
[11,]  B     2        4       12
[12,]  B     3        5       13
[13,]  B     4        5       14
[14,]  B     5        5       15
[15,]  B     8        6       18
[16,]  B     9        6       19
[17,]  B    10        6       20

答案 1 :(得分:3)

以下是使用Base R和data.table的两种解决方案。由于data.table解决方案比基本R快约30%,并且更容易阅读,因此我建议使用data.table


基础R

由于您表示希望提高效率,我使用vapply

matchData <- function(id, month, data=d2){
  vapply(seq_along(id), 
      function(i)which(
            id[i]==data$ID & 
                month[i] >= data$startMonth & 
                month[i] <= data$endMonth),
      FUN.VALUE=1,
      USE.NAMES=FALSE
      )
}


within(monthlyData, 
    Value <- qtrData$QTRValue[matchData(
               monthlyData$ID, monthlyData$Month, qtrData)]
)

   ID Month MonValue Value
1   A     1        1     1
2   A     2        2     1
3   A     3        3     1
4   A     4        4     2
5   A     5        5     2
6   B     1        6     3
7   B     2        7     3
8   B     3        8     4
9   B     4        9     4
10  B     5       10     4

data.table

并使用data.table演示如何执行此操作:

mD <- data.table(monthlyData, key="ID")
qD <- data.table(qtrData, key="ID")
mD[qD][Month>=startMonth & Month<=endMonth]


      ID Month MonValue startMonth endMonth QTRValue
 [1,]  A     1        1          1        3        1
 [2,]  A     2        2          1        3        1
 [3,]  A     3        3          1        3        1
 [4,]  A     4        4          4        5        2
 [5,]  A     5        5          4        5        2
 [6,]  B     1        6          1        2        3
 [7,]  B     2        7          1        2        3
 [8,]  B     3        8          3        5        4
 [9,]  B     4        9          3        5        4
[10,]  B     5       10          3        5        4

基准

我很好奇这两种方法的比较:

library(rbenchmark)

withBase <- function(){
  xx <- within(monthlyData, 
      Value <- qtrData$QTRValue[matchData(monthlyData$ID, monthlyData$Month, qtrData)])

}

withTable <- function(){
  yy <- mD[qD][Month>=startMonth & Month<=endMonth]

}

benchmark(withBase, withTable, replications=1e6)

       test replications elapsed relative user.self sys.self user.child
1  withBase      1000000   10.09 1.296915      7.65     0.21         NA
2 withTable      1000000    7.78 1.000000      6.38     0.16         NA