我有一个随时间变化的公司员工数据集,看起来像这样
data.table(firm = c(rep("A", 8), rep("B", 8)),
employee = c(1, 2, 3, 4, 1, 2, 3, NA, 5, 6, NA, NA, 5, 6, 7, 8),
year = c(rep(1, 4), rep(2, 4)))
firm employee_id year
A 1 1
A 2 1
A 3 1
A 4 1
A 1 2
A 2 2
A 3 2
A NA 2
B 5 1
B 6 1
B NA 1
B NA 1
B 5 2
B 6 2
B 7 2
B 8 2
我想计算每家公司的年度== 1的员工百分比= = 2。
输出应该是这样的
firm year continued_employees
A 2 0.75
B 2 1
我可以使用
在每年的循环中完成 sum(employee_id[year==1] %in% employee_id[year==2]) / length(employee_id[year==1])
但我有大约4万家公司和10年的观察。有关如何使用dplyr
或data.table
语法进行操作的任何想法?
答案 0 :(得分:2)
这是一种不那么漂亮的data.table
方法,可以用于任何数量的公司和年份:
years <- head(sort(unique(dt$year)), -1)
setNames(lapply(years, function(y) {
dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
!is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][,
.(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
#$`Year-1`
# firm continued year
#1: A 0.75 2
#2: B 1.00 2
由于您的样本数据只有两年,因此只返回一个列表元素。
答案 1 :(得分:1)
这是一种使用一种带有转移年份的自我加入的方法:
library(data.table)
options(datatable.print.class = TRUE)
# self join with shifted year
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][]
# aggregate
DT[!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
# beautify result
year <= max(DT$year)]
firm year V1 <char> <num> <num> 1: A 2 0.75 2: B 2 1.00
第一个表达式修改DT
以表示继续雇员:
firm employee year cont <char> <num> <num> <lgcl> 1: A 1 1 TRUE 2: A 2 1 TRUE 3: A 3 1 TRUE 4: A 4 1 NA 5: A 1 2 NA 6: A 2 2 NA 7: A 3 2 NA 8: A NA 2 NA 9: B 5 1 TRUE 10: B 6 1 TRUE 11: B NA 1 NA 12: B NA 1 NA 13: B 5 2 NA 14: B 6 2 NA 15: B 7 2 NA 16: B 8 2 NA
shift()
或者,shift()
函数可用于计算cont
列。聚合部分与上面的连接方法相同。 shift()
要求确保按年份排序数据。
DT[order(year), cont := shift(year, type = "lead") == year + 1, by = .(firm, employee)][
!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
year <= max(DT$year)]
在撰写本文时,除了OP自己尝试使用循环之外,还提出了三种方法:
shift()
基准不考虑Jean Vuda的答案,因为它仅限于2年。
根据OP,生产数据集包括40 k公司和10年的数据。对于实际的基准测试,会创建一个类似大小的样本数据集:
n_firm <- 40000L
max_employee <- 10L
fluctuation_rate <- 0.2
n_year <- 10L
start_year <- 2001L
DT0 <- CJ(firm = sprintf("%06i", seq_len(n_firm)),
employee = seq_len(max_employee),
year = seq(start_year, length.out = n_year))
set.seed(123L)
n_row <- nrow(DT0)
DT0[sample.int(n_row, fluctuation_rate * n_row), employee := NA]
样本数据集由4 M行组成,在从长格式转换为宽格式后可以最佳地显示:
dcast(DT0[!is.na(employee)], firm + employee ~ year)
Using 'year' as value column. Use 'value.var' to override firm employee 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> 1: 000001 1 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2: 000001 2 2001 2002 2003 NA 2005 2006 2007 NA 2009 NA 3: 000001 3 2001 2002 NA NA 2005 2006 2007 2008 2009 2010 4: 000001 4 2001 NA NA NA 2005 2006 2007 2008 NA 2010 --- 399996: 040000 6 2001 2002 NA 2004 2005 NA NA NA 2009 2010 399997: 040000 7 NA 2002 NA NA 2005 2006 2007 2008 2009 2010 399998: 040000 8 2001 2002 2003 NA NA NA 2007 NA NA 2010 399999: 040000 9 2001 2002 2003 NA 2005 2006 2007 2008 2009 NA 400000: 040000 10 2001 2002 2003 NA NA 2006 2007 2008 2009 2010
对于基准测试,使用microbenchmark
包,因为可以传递检查函数以验证结果是否相同:
my_check <- function(values) {
values <- lapply(values, function(x) x[, dcast(.SD, firm ~ year, value.var = "continued")])
all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
基准代码:
microbenchmark::microbenchmark(
dd = {
dt <- copy(DT0)
years <- head(sort(unique(dt$year)), -1)
rbindlist(
setNames(lapply(years, function(y) {
dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
!is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
, .(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
)
},
join = {
DT <- copy(DT0)
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
shift = {
DT <- copy(DT0)
DT[order(year), cont := shift(year, type = "lead") == year + 1,
by = .(firm, employee)][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
check = my_check,
times = 3L
)
基准测试结果显示, join 方法比 shift 方法快4倍,比docendo discimus方法快8倍。
Unit: seconds expr min lq mean median uq max neval cld dd 11.756114 11.919959 12.083042 12.083805 12.246506 12.409207 3 c join 1.054293 1.239829 1.303971 1.425366 1.428810 1.432254 3 a shift 6.105725 6.105906 6.148136 6.106087 6.169342 6.232596 3 b
答案 2 :(得分:0)
这是一种略有不同的方法:
dt<-dat[,list(all=.(unique(employee))), by=list(year,firm)]
dt<-dt[,list(year1=sapply(list(all),`[`,1),
year2=sapply(list(all),`[`,2)), by=firm]
dt[,uniqueN(mapply(intersect, year1, year2))/uniqueN(na.omit(unlist(year1))),by=firm]