I have a longitudinal data.table of matches. It contains matches between different A and B, together with the payment flows within those matches.
A year B payment start_global end_global
1: 51557094 2002 65122111 80.39000 TRUE FALSE
2: 51557094 2003 65122111 9.74000 FALSE FALSE
3: 51557094 2004 65122111 7.85000 FALSE FALSE
4: 51557094 2005 65122111 97.16000 FALSE FALSE
5: 51557094 2006 65122111 48.22000 FALSE FALSE
6: 51557094 2007 65122111 91.24000 FALSE FALSE
7: 51557094 2008 65122111 9.35000 FALSE FALSE
8: 51557094 2009 65122111 13.15000 FALSE FALSE
9: 51557094 2010 65122111 3.46000 FALSE TRUE
10: 51557133 1998 65142845 60.43981 TRUE FALSE
11: 51557133 1999 65142845 111.60000 FALSE TRUE
12: 51557133 1997 65224333 21.03455 TRUE TRUE
13: 51557133 2000 65224333 144.17000 TRUE FALSE
14: 51557133 2001 65224333 102.52000 FALSE FALSE
15: 51557133 2002 65224333 5.79000 FALSE FALSE
16: 51557133 2003 65224333 8.48000 FALSE FALSE
17: 51557133 2004 65224333 68.16000 FALSE FALSE
18: 51557133 2005 65224333 29.36000 FALSE TRUE
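I have added the indicators start_global and end_global, which mark where a match starts and where it ends (determined by whether a row for the same A-B link exists in the preceding year and in the following year, respectively).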
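For reference, flags of this kind can be derived by sorting the rows and comparing each one with its neighbour. The following is only a minimal base R sketch on five rows taken from the table above; the names df and prev_same are illustrative assumptions, and this is not necessarily how the flags in the table were produced.

```r
# Five rows from the table above (rows 10 to 14), without the flags.
df <- data.frame(
  A    = 51557133L,
  year = c(1998L, 1999L, 1997L, 2000L, 2001L),
  B    = c(65142845L, 65142845L, 65224333L, 65224333L, 65224333L)
)

# Sort so that consecutive years of the same A-B link are adjacent.
df <- df[order(df$A, df$B, df$year), ]
n <- nrow(df)

# TRUE when the previous row is the same A-B link in the year before.
prev_same <- c(FALSE,
               df$A[-1] == df$A[-n] &
               df$B[-1] == df$B[-n] &
               df$year[-1] == df$year[-n] + 1L)

# A match starts where there is no such predecessor, and ends where the
# next row is not a successor of the current one.
df$start_global <- !prev_same
df$end_global   <- !c(prev_same[-1], FALSE)

# Note: the last row is flagged as an end here only because the sample
# is cut off after 2001; in the full table that match runs until 2005.
print(df)
```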
I now need to compute the match length and the average payment for each A-B link. That is, my expected output would look something like
A B payment start end
1 51557094 65122111 40.06222 2002 2010
In pandas, I would just do a simple groupby and do the calculation there. How would I go about this in R?
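In my data, there may be several matches between the same A and B, with the link terminating in between. If that is the case, I want several matches in the output, one for each (start_global, end_global) pair.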
That is, if I have the following data:
13: 51557133 2000 65224333 144.17000 TRUE FALSE
14: 51557133 2001 65224333 102.52000 FALSE TRUE
16: 51557133 2003 65224333 8.48000 TRUE FALSE
17: 51557133 2004 65224333 68.16000 FALSE FALSE
18: 51557133 2005 65224333 29.36000 FALSE TRUE
I would like this to become
A B payment start end
1 51557133 65224333 123.34500 2000 2001
2 51557133 65224333 35.33333 2003 2005
rather than
A B payment start end
1 51557133 65224333 70.538 2000 2005
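I will be using this on a secure server where installing additional packages is very cumbersome and close to impossible. I already have plyr and data.table on that server, so I would prefer a solution that does not require installing any other packages.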
For completeness, here is the list of allowed packages:
MASS devtools gtable munsell reshape2
RColorBrewer dichromat haven packrat rstudio
Rcpp digest labeling plyr scales
colorspace foreign mFilter proto stringr
data.table ggplot2 manipulate reshape yaml
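Answer 0 (score: 2)
Using base R:
# Number the matches within each A-B pair: the running count of start_global
# goes up by one on every row that starts a new match.
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
# Aggregate per (A, B, startCount) and drop the helper column from the result.
cbind(
  aggregate(payment ~ A + B + startCount, data = df1, FUN = mean)[, -3],
  start = aggregate(year ~ A + B + startCount, data = df1, FUN = min)[, 4],
  end = aggregate(year ~ A + B + startCount, data = df1, FUN = max)[, 4]
)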
Using the dplyr package:
library(dplyr)
df1 %>%
  group_by(A, B) %>%
  # Running count of starts numbers the matches within each A-B pair
  mutate(startCount = cumsum(start_global)) %>%
  group_by(A, B, startCount) %>%
  summarise(
    payment = mean(payment),
    start = min(year),
    end = max(year)
  ) %>%
  select(-startCount)
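Using data.table:
library(data.table)
# Helper column: running count of starts within each A-B pair
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
# One row per match: mean payment plus first and last year
result <- df1[, list(payment = mean(payment), start = min(year), end = max(year)), by = list(A, B, startCount)]
result[, startCount := NULL]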
Output:
Source: local data table [4 x 5]
Groups: A, B
A B payment start end
1 51557094 65122111 40.06222 2002 2010
2 51557133 65142845 86.01990 1998 1999
3 51557133 65224333 21.03455 1997 1997
4 51557133 65224333 59.74667 2000 2005
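The sample data above contains no A-B pair that stops and later resumes after an explicit end, so the output does not show the split case from the question. As a check, here is a minimal base R sketch that applies the same running-count idea to the five rows given in the question. The names ex, spell and res are illustrative assumptions, and split/lapply is used here instead of aggregate only to keep the example short.

```r
# The five rows from the question: one A-B pair with two separate matches.
ex <- data.frame(
  A = 51557133L,
  year = c(2000L, 2001L, 2003L, 2004L, 2005L),
  B = 65224333L,
  payment = c(144.17, 102.52, 8.48, 68.16, 29.36),
  start_global = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  end_global = c(FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Running count of starts within each A-B pair: 1, 1, 2, 2, 2
ex$spell <- ave(as.integer(ex$start_global), ex$A, ex$B, FUN = cumsum)

# Summarise each (A, B, spell) group into a single row.
groups <- split(ex, ex[c("A", "B", "spell")], drop = TRUE)
res <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    A = g$A[1],
    B = g$B[1],
    payment = mean(g$payment),
    start = min(g$year),
    end = max(g$year)
  )
}))
row.names(res) <- NULL

# Two rows: payment 123.345 for 2000-2001 and 35.33333 for 2003-2005
print(res)
```

Because the grouping key is the count of start_global rather than the A-B pair alone, the gap in 2002 produces two rows, as requested in the question.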
Benchmark:
data.table is by far the fastest:
Unit: milliseconds
expr min lq mean median uq max neval
BASE 5.808398 22.212135 32.391813 26.293450 34.08702 325.40491 1000
DPLYR 4.352663 17.011435 25.892872 20.931953 27.37157 177.39900 1000
DATATABLE 1.067853 4.139477 6.326194 4.987943 6.75672 85.24855 1000
Data used:
df1 <-
structure(list(A = c(51557094L, 51557094L, 51557094L, 51557094L,
51557094L, 51557094L, 51557094L, 51557094L, 51557094L, 51557133L,
51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 51557133L,
51557133L, 51557133L), year = c(2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 1998L, 1999L, 1997L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L), B = c(65122111L, 65122111L, 65122111L,
65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 65122111L,
65142845L, 65142845L, 65224333L, 65224333L, 65224333L, 65224333L,
65224333L, 65224333L, 65224333L), payment = c(80.39, 9.74, 7.85,
97.16, 48.22, 91.24, 9.35, 13.15, 3.46, 60.43981, 111.6, 21.03455,
144.17, 102.52, 5.79, 8.48, 68.16, 29.36), start_global = c(TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), end_global = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)), .Names = c("A",
"year", "B", "payment", "start_global", "end_global"), class = c("data.table",
"data.frame"), row.names = c(NA, -18L))