Collapsing yearly data into matches

Time: 2015-04-14 15:19:05

Tags: r data.table

I have a longitudinal data.table of matches between different A and B, together with the payment flows within those matches.

           A year        B   payment start_global end_global
 1: 51557094 2002 65122111  80.39000         TRUE      FALSE
 2: 51557094 2003 65122111   9.74000        FALSE      FALSE
 3: 51557094 2004 65122111   7.85000        FALSE      FALSE
 4: 51557094 2005 65122111  97.16000        FALSE      FALSE
 5: 51557094 2006 65122111  48.22000        FALSE      FALSE
 6: 51557094 2007 65122111  91.24000        FALSE      FALSE
 7: 51557094 2008 65122111   9.35000        FALSE      FALSE
 8: 51557094 2009 65122111  13.15000        FALSE      FALSE
 9: 51557094 2010 65122111   3.46000        FALSE       TRUE
10: 51557133 1998 65142845  60.43981         TRUE      FALSE
11: 51557133 1999 65142845 111.60000        FALSE       TRUE
12: 51557133 1997 65224333  21.03455         TRUE       TRUE
13: 51557133 2000 65224333 144.17000         TRUE      FALSE
14: 51557133 2001 65224333 102.52000        FALSE      FALSE
15: 51557133 2002 65224333   5.79000        FALSE      FALSE
16: 51557133 2003 65224333   8.48000        FALSE      FALSE
17: 51557133 2004 65224333  68.16000        FALSE      FALSE
18: 51557133 2005 65224333  29.36000        FALSE       TRUE

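I have added the indicators start_global and end_global, which mark where a match begins and where it ends (derived from whether the particular A-B link exists in the preceding year and in the following year).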
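For readers wondering how such indicators could be derived, here is a minimal sketch. It assumes one row per A-B pair per year and treats a match as a run of consecutive years. The table `links` and its values are made up for illustration; only the column names `start_global` and `end_global` come from the question.

```r
library(data.table)

# Hypothetical toy data: one row per (A, B, year) in which the link exists.
links <- data.table(
  A    = c(1L, 1L, 1L, 1L, 1L, 2L),
  B    = c(9L, 9L, 9L, 9L, 9L, 9L),
  year = c(2000L, 2001L, 2003L, 2004L, 2005L, 1997L)
)

# Sort by pair and year so that diff() compares neighbouring years.
setkey(links, A, B, year)

# A row starts a match if it is the first row of its pair or the previous
# year is missing; it ends a match if it is the last row of its pair or the
# next year is missing.
links[, start_global := c(TRUE, diff(year) != 1L), by = list(A, B)]
links[, end_global   := c(diff(year) != 1L, TRUE), by = list(A, B)]

print(links)
```

Here the pair (1, 9) has a gap in 2002, so 2001 is flagged as an end and 2003 as a new start.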

I now need to compute the match length and the average payment for each A-B link. That is, my expected output would look something like

         A        B  payment start  end
1 51557094 65122111 40.06222  2002 2010 

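In pandas, I would just do a simple groupby and compute it there. How would I do this in R?

Note: actual matches, not just A-B combinations

In my data there can be several matches between the same A and B, with the link terminating in between. In that case I want one row per match, i.e. one per (start_global, end_global) pair.

That is, if I have the following data: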

13: 51557133 2000 65224333 144.17000         TRUE      FALSE
14: 51557133 2001 65224333 102.52000        FALSE       TRUE
16: 51557133 2003 65224333   8.48000         TRUE      FALSE
17: 51557133 2004 65224333  68.16000        FALSE      FALSE
18: 51557133 2005 65224333  29.36000        FALSE       TRUE

I want this to become

         A        B   payment start  end
1 51557133 65224333 123.34500  2000 2001
2 51557133 65224333  35.33333  2003 2005

rather than

         A        B payment start  end
1 51557133 65224333  70.538  2000 2005

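Note: no dplyr

I will be using this on a secure server where installing additional packages is very cumbersome and close to impossible. plyr and data.table are already available on that server, so I would prefer a solution that does not require installing other packages.

For completeness, here is the list of allowed packages: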

MASS        devtools    gtable      munsell     reshape2
RColorBrewer    dichromat   haven       packrat     rstudio
Rcpp        digest      labeling    plyr        scales
colorspace  foreign     mFilter     proto       stringr
data.table  ggplot2     manipulate  reshape     yaml

1 Answer:

Answer 0: (score: 2)

Using base R:

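# Number the matches within each A-B pair: the running count of start flags
# goes up by one at every match start (rows must be sorted by year).
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
cbind(
  aggregate(payment ~ A + B + startCount, data = df1, FUN = mean)[, -3],  # drop startCount
  start = aggregate(year ~ A + B + startCount, data = df1, FUN = min)[, 4],
  end   = aggregate(year ~ A + B + startCount, data = df1, FUN = max)[, 4]
)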

Using the dplyr package:

library(dplyr)
df1 %>%
  group_by(A, B) %>%
  # running count of match starts = match id within each A-B pair
  mutate(startCount = cumsum(start_global)) %>%
  group_by(A, B, startCount) %>%
  summarise(
    payment = mean(payment),
    start = min(year),
    end = max(year)
  ) %>%
  select(-startCount)

Using data.table:

library(data.table)
# match id within each A-B pair (rows must be sorted by year)
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
result <- df1[, list(payment = mean(payment), start = min(year), end = max(year)),
              by = list(A, B, startCount)]
result[, startCount := NULL]  # drop the helper column
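A variant that stays entirely within data.table syntax is sketched below, assuming the rows are sorted by year within each A-B pair. It also adds the match length that the question asks for. The column names `spell` and `length` and the small table `dt` are illustrative and not part of the original answer.

```r
library(data.table)

# Toy data with the columns the approach needs (values are illustrative).
dt <- data.table(
  A            = c(1L, 1L, 1L, 1L, 1L),
  B            = c(9L, 9L, 9L, 9L, 9L),
  year         = c(2000L, 2001L, 2003L, 2004L, 2005L),
  payment      = c(10, 20, 30, 40, 50),
  start_global = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# cumsum() over the start flags goes up by one at each match start,
# so it numbers the matches within each A-B pair.
dt[, spell := cumsum(start_global), by = list(A, B)]

res <- dt[, list(
  payment = mean(payment),
  start   = min(year),
  end     = max(year),
  length  = max(year) - min(year) + 1L  # number of years the match spans
), by = list(A, B, spell)]

res[, spell := NULL]  # drop the helper id
print(res)
```

Adding the helper column by reference with `:=` avoids the `ave()` call, but it modifies `dt` in place, so copy the table first if the original must stay untouched.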

Output:

Source: local data table [4 x 5]
Groups: A, B

         A        B  payment start  end
1 51557094 65122111 40.06222  2002 2010
2 51557133 65142845 86.01990  1998 1999
3 51557133 65224333 21.03455  1997 1997
4 51557133 65224333 59.74667  2000 2005
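The sample data contains no interrupted match, so the output above does not exercise the case the question emphasises. As a quick check, the sketch below runs the base R approach on the five rows from the question's second example. The data frame name `ex` is just for illustration.

```r
# The five rows from the question's interrupted-match example.
ex <- data.frame(
  A            = rep(51557133L, 5),
  year         = c(2000L, 2001L, 2003L, 2004L, 2005L),
  B            = rep(65224333L, 5),
  payment      = c(144.17, 102.52, 8.48, 68.16, 29.36),
  start_global = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Running count of match starts within the A-B pair: 1 1 2 2 2.
ex$startCount <- ave(as.integer(ex$start_global), ex$A, ex$B, FUN = cumsum)

out <- cbind(
  aggregate(payment ~ A + B + startCount, data = ex, FUN = mean)[, -3],
  start = aggregate(year ~ A + B + startCount, data = ex, FUN = min)[, 4],
  end   = aggregate(year ~ A + B + startCount, data = ex, FUN = max)[, 4]
)
print(out)
```

The result has two rows, 2000 to 2001 and 2003 to 2005, with mean payments of 123.345 and roughly 35.33, which is what the question asks for.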

Benchmark: data.table is by far the fastest:

Unit: milliseconds
      expr      min        lq      mean    median       uq       max neval
      BASE 5.808398 22.212135 32.391813 26.293450 34.08702 325.40491  1000
     DPLYR 4.352663 17.011435 25.892872 20.931953 27.37157 177.39900  1000
 DATATABLE 1.067853  4.139477  6.326194  4.987943  6.75672  85.24855  1000

Data used:

df1 <- 
structure(list(A = c(51557094L, 51557094L, 51557094L, 51557094L, 
51557094L, 51557094L, 51557094L, 51557094L, 51557094L, 51557133L, 
51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 
51557133L, 51557133L), year = c(2002L, 2003L, 2004L, 2005L, 2006L, 
2007L, 2008L, 2009L, 2010L, 1998L, 1999L, 1997L, 2000L, 2001L, 
2002L, 2003L, 2004L, 2005L), B = c(65122111L, 65122111L, 65122111L, 
65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 
65142845L, 65142845L, 65224333L, 65224333L, 65224333L, 65224333L, 
65224333L, 65224333L, 65224333L), payment = c(80.39, 9.74, 7.85, 
97.16, 48.22, 91.24, 9.35, 13.15, 3.46, 60.43981, 111.6, 21.03455, 
144.17, 102.52, 5.79, 8.48, 68.16, 29.36), start_global = c(TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, 
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), end_global = c(FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)), .Names = c("A", 
"year", "B", "payment", "start_global", "end_global"), class = c("data.table", 
"data.frame"), row.names = c(NA, -18L))