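The indexing issue I described here has been resolved in the development version of data.table, 1.9.7.
My question is about what I am doing wrong when passing data to my own function and returning data from it.
As described in the other question, I want to keep only the longest continuous sequence of years for each gvkey; if several sequences tie for the longest, I want the most recent one.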
DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]
Here I get the expected result (data.table v1.9.7):
DT[, step.idx := 0] # initialize
DT[gap >=2 , step.idx := 1] # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ , seq.lengths := .N, by=.(gvkey,step.idx)] # length of each sequence
DT[, keep.seq := 1*(seq.lengths == max(seq.lengths)), by = gvkey] # each firm's longest sequence
DT[keep.seq==1, keep.seq := c(rep(0, (.N-max(seq.lengths))), rep(1, max(seq.lengths))), by = gvkey]
#' expected results:
DT.out <- DT[keep.seq==1] # 23
DT.out[keep.seq==0, .N] # 0
nrow(DT.out)# [1] 149
When I try essentially the same procedure with my own function, I get extra keep.seq==0 cases. My question is why this does not give the same result as the code above:
find.seq.keep <- function(g){
  step.idx = rep(0, length(g))
  step.idx[g>=2] = 1
  step.idx = cumsum(step.idx)
  N.seq = length(unique(step.idx))
  seq.lengths = as.vector(unlist(tapply(step.idx, step.idx,
                                        function(x) rep(length(x), length(x)))))
  keep.seq = 1*(seq.lengths == max(seq.lengths))
  if(length(keep.seq[keep.seq == 1]) > max(seq.lengths)){
    N.max = max(seq.lengths)
    N.1s = length(keep.seq[keep.seq==1])
    keep.seq[keep.seq==1] = c(rep(0, (N.1s-N.max)), rep(1, N.max))
  }
  return(as.list(keep.seq))
}
DT[,keep.seqF := find.seq.keep(gap), by = gvkey]
Dropping the rows works, but some rows are wrongly flagged for removal (false positives):
DT.outF <- DT[keep.seqF==1]
DT.outF[keep.seqF==0, .N] # 0
nrow(DT.outF) # 141 (<149 = nrow(DT.out) !!)
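I would like to get my own function working so that I can keep using version 1.9.6 (which is easier to share with colleagues), at least until 1.9.7 is on CRAN. Now that Frank has provided a solution to my problem, I would like to better understand what happens in the j argument when I call find.seq.keep.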
=======
**Reproducible example data**
DT <- data.table(
gvkey = c(1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681,
1681, 1681, 1681, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914,
1914, 1914, 1914, 1914, 1914, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2164, 2164, 2164, 2164,
2164, 2164, 2164, 2164, 2164, 2164, 2164, 2164, 2185, 2185, 2185, 2185, 2185,
2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185,
2185, 2185, 2185),
fyear = c(1983, 1984, 1985, 1986, 1987, 1988, 1989, 1997, 1998, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 1983, 1984, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978,
1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2007, 2008,
2009, 2010, 2011, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973,
1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,
1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 1978, 1979, 1980, 1981,
1982, 1983, 1984, 1985, 1986, 1989, 1990, 1991, 1970, 1971, 1972, 1973, 1974,
1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
1988, 1994, 1995))
setkey(DT, gvkey, fyear)
Answer 0 (score: 3)
I'm not sure why your function doesn't work, but here is another approach:
DT[, g := cumsum( fyear - shift(fyear, fill=fyear[1L]-1L) != 1L ), by=gvkey]
keep = DT[,
.(len = .N), by=.(gvkey, g)][,
.( g = g[tail(which(len == max(len)), 1)]), by=gvkey]
DT.out = DT[keep, on=names(keep)]
DT.out[, .N] # 149, as expected
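How it works:
g is the ID of each run of consecutive years within a gvkey. len is the length of each run. g[tail(which(len == max(len)), 1)] is the ID of the longest run, with ties broken by taking the latest one. DT[keep, on=names(keep)] is a merge that subsets DT to the (gvkey, g) pairs in keep.
If, for some reason, you want a base R function to do this...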
tag.long.seq = function(x){
  g = cumsum(c(1L, diff(x) > 1L))
  len = tapply(g, g, FUN = length)
  w = tail(which(len == max(len)), 1L)
  ave(g, g, FUN = function(z) z[1] == w)
}
DT[, keepem := tag.long.seq(fyear), by=gvkey]
DT[(keepem==1L), .N] # 149 again
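To make the run-tagging idea above concrete, here is a minimal base R sketch that walks through the same steps on the years of a single firm (gvkey 1681 from the example data). It only illustrates the logic and is not a replacement for the code above; the intermediate names fyear, g, len and w mirror those used in tag.long.seq.

```r
# Years for one firm, copied from gvkey 1681 in the example data
fyear <- c(1983:1989, 1997:1998, 2008:2014)

# Run ID: starts at 1 and increases by one whenever the year jumps by more than 1
g <- cumsum(c(1L, diff(fyear) > 1L))

# Length of each run, indexed by run ID (here: 7, 2 and 7 years)
len <- tapply(g, g, FUN = length)

# Among the runs with the maximum length, take the last (most recent) one
w <- tail(which(len == max(len)), 1L)

# Years belonging to the run that is kept
fyear[g == w]
# 2008 2009 2010 2011 2012 2013 2014
```

The first and third runs tie at seven years, and tail(..., 1L) selects the later one. Note also that tag.long.seq returns a plain vector with one value per row, which is the shape := expects when assigning a single column. find.seq.keep in the question returns as.list(keep.seq) instead, and since data.table reads a list returned from j as one item per column, that difference may be worth checking.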