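I wrote a for loop that runs some checks and returns 0 or 1 depending on the result. However, on a large dataset it takes a very long time (I left it running overnight and it was still going in the morning). Any ideas on how to make this more efficient with dplyr or other tools? Thanks.
Here is some test data: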
tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971",
"1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971",
"1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972",
"1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973",
"1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973",
"1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974",
"1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975",
"1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975"
), datadate = c(19711231L, 19710129L, 19710226L, 19710331L, 19710430L,
19710528L, 19710630L, 19710730L, 19710831L, 19710930L, 19711029L,
19711130L, 19721231L, 19720131L, 19720229L, 19720330L, 19720428L,
19720531L, 19720630L, 19720731L, 19720831L, 19720929L, 19721031L,
19721130L, 19721229L, 19731231L, 19730131L, 19730228L, 19730330L,
19730430L, 19730531L, 19730629L, 19730731L, 19730831L, 19730928L,
19731031L, 19731130L, 19741231L, 19740131L, 19740228L, 19740329L,
19740430L, 19740531L, 19740628L, 19740731L, 19740830L, 19740930L,
19741031L, 19741129L, 19751231L, 19750131L, 19750228L, 19750331L,
19750430L, 19750530L, 19750630L, 19750731L, 19750829L, 19750930L,
19751031L), month = c("12", "01", "02", "03", "04", "05", "06",
"07", "08", "09", "10", "11", "12", "01", "02", "03", "04", "05",
"06", "07", "08", "09", "10", "11", "12", "12", "01", "02", "03",
"04", "05", "06", "07", "08", "09", "10", "11", "12", "01", "02",
"03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "01",
"02", "03", "04", "05", "06", "07", "08", "09", "10")), .Names = c("cusip",
"fyear", "datadate", "month"), row.names = c(NA, -60L), class = c("tbl_df",
"tbl", "data.frame"))
For loop:
library(dplyr)
tdata$check <- 0  # default: check not passed
for (i in min(tdata$cusip):max(tdata$cusip)) {
  for (j in min(tdata$fyear):max(tdata$fyear)) {
    # rows for this cusip in the four fiscal years before year j
    monthcheck <- filter(tdata, cusip == i & fyear %in% (j - 1:4))
    # pass when those rows cover at least 40% of 60 months
    if (nrow(monthcheck) / 60 >= 0.4) tdata$check[tdata$cusip == i & tdata$fyear == j] <- 1
  }
}
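This returns 1 for 1973-1975, because the check passes for those years. Is there a way to make this for loop more efficient, since it takes a long time to run on a large dataset?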
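Edit: explanation of the for loop
For each unique ID (cusip) and each year (fyear), select the data for the previous 4 years (with filter), then count the observations and check whether the count is at least 40% of 60. If it is, assign 1 to tdata$check for that cusip and year.
The idea is to make sure each unique ID has at least 24 monthly observations in the preceding years.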
Answer 0 (score: 2)
A solution using grouping and a lagged cumulative sum:
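library(dplyr)

tdata %>%
  group_by(cusip, fyear) %>%
  # one row per cusip-year: observation count and its share of 60 months
  summarise(number = n(), share = n() / 60) %>%
  # the result is still grouped by cusip, so lag() and cumsum() run within each cusip
  mutate(cum_y  = lag(cumsum(share), default = 0),  # share accumulated before this year
         cum_y4 = lag(cum_y, 4, default = 0),       # the same running total four rows earlier
         last4  = cum_y - cum_y4,                   # share over the four preceding years
         check  = as.numeric(last4 >= 0.4)) %>%     # 1 if the threshold is met, else 0
  select(cusip, fyear, last4, check)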
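Explanation:
Group by cusip and fyear, count the observations, and compute share, each year's fraction of 60 months.
cum_y is the lagged cumulative sum of share, i.e. the share accumulated before the current year.
cum_y4 is cum_y lagged by a further four years.
last4 is the difference between cum_y and cum_y4, i.e. the share accumulated over the four preceding years.
check tests whether last4 reaches the 0.4 threshold.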
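To make the lagged cumulative sum concrete, here is a small sketch on a bare vector. It works with raw counts and the equivalent threshold of 24 (0.4 * 60) instead of shares, which is a substitution on my part to avoid floating-point comparisons. The first five counts are the per-year row counts of the test data; the sixth year is a hypothetical addition so that the four-year window actually drops something.

```r
library(dplyr)

# Observations per fiscal year, 1971 to 1975, as in the test data,
# plus a hypothetical 1976 with 12 observations (not in the original data).
n_obs <- c(12, 13, 12, 12, 11, 12)

# Running total of observations strictly before each year.
cum_before <- lag(cumsum(n_obs), default = 0)

# The same running total as it stood four years earlier.
cum_before4 <- lag(cum_before, 4, default = 0)

# Observations in the four preceding years only.
last4_n <- cum_before - cum_before4

# 1 when at least 24 observations (40% of 60) precede the year.
check <- as.numeric(last4_n >= 24)

print(data.frame(year = 1971:1976, n_obs, last4_n, check))
```

For the hypothetical 1976, the window covers 1972 to 1975 (13 + 12 + 12 + 11 = 48), so the 12 observations from 1971 no longer count.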
To join the result back to the variables in the original data:
... %>%
left_join(tdata, by = c("cusip", "fyear"))
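One caveat worth noting: lagging by four rows equals "four years back" only if every cusip has a row for each consecutive fiscal year. The sketch below is one possible way to handle gaps, not part of the answer above. It assumes fyear can be converted to an integer, and `prev4_count` and the `toy` data are hypothetical names and values made up for illustration. The per-year lookup runs on the summarised table, which has one row per cusip-year, so it stays cheap even when the raw data is large.

```r
library(dplyr)

# Hypothetical helper: for each year, sum the counts recorded in the four
# preceding calendar years, whether or not some of those years are missing.
prev4_count <- function(year, n) {
  vapply(year, function(y) as.numeric(sum(n[year %in% (y - 1:4)])), numeric(1))
}

# Toy data (not the question's): one cusip with fiscal years 1974-1976 absent.
toy <- tibble(
  cusip = 1,
  fyear = rep(c("1971", "1972", "1973", "1977"), times = c(12, 13, 12, 12))
)

yearly <- toy %>%
  count(cusip, fyear) %>%                       # one row per cusip-year, count in n
  group_by(cusip) %>%
  mutate(last4_n = prev4_count(as.integer(fyear), n),
         check   = as.numeric(last4_n >= 24)) %>%
  ungroup()

# Attach the yearly check to every original row.
result <- left_join(toy, yearly, by = c("cusip", "fyear"))
```

Here 1977 gets check = 0, because only the 12 observations from 1973 fall inside 1973 to 1976, whereas a four-row lag would also have counted 1971 and 1972.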