A function I call in data.table j does not return the expected results

Asked: 2016-03-27 16:11:45

Tags: r data.table

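The indexing problem I described here has been resolved by the development version of data.table, 1.9.7.

My question is about understanding what I am doing wrong when passing data to my own function and returning data from it.

As stated in the other question, I want to keep only the longest run of consecutive years for each gvkey; if several runs tie for the longest, keep the most recent one.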

 DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
 DT[, gap := fyear - fyear.lag]
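
To make the gap column concrete, here is a small sketch on invented data (the table toy and its values are assumptions for illustration, not the data from this question). It shows that the first row of every gvkey gets an NA gap, because shift has nothing to lag from there.

```r
library(data.table)

# Invented toy data: firm 1 skips from 2001 to 2005, firm 2 has no gap
toy <- data.table(gvkey = c(1, 1, 1, 2, 2),
                  fyear = c(2000, 2001, 2005, 1990, 1991))

# Same two steps as above, applied to the toy table
toy[, fyear.lag := shift(fyear, n = 1L, type = "lag"), by = gvkey]
toy[, gap := fyear - fyear.lag]

toy$gap   # NA 1 4 NA 1
```

Any code that later compares gap with a threshold has to cope with those leading NA values.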

Here I get the expected results (data.table v1.9.7):

DT[,         step.idx := 0]    # initialize
DT[gap >=2 , step.idx := 1]    # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ ,  seq.lengths := .N,  by=.(gvkey,step.idx)]      # length of each sequence
DT[,   keep.seq := 1*(seq.lengths == max(seq.lengths)), by = gvkey]        # each firm's longest sequence
DT[keep.seq==1,  keep.seq := c(rep(0, (.N-max(seq.lengths))), rep(1, max(seq.lengths))), by = gvkey] 

 #' expected results:
 DT.out <- DT[keep.seq==1] # 23
 DT.out[keep.seq==0, .N] # 0 
 nrow(DT.out)#   [1] 149

When I try essentially the same procedure with my own function, I get extra keep.seq==0 cases. My question is why I cannot get the same result as above.

find.seq.keep <- function(g){
    step.idx = rep(0, length(g))
    step.idx[g>=2] = 1
    step.idx = cumsum(step.idx)
    N.seq = length(unique(step.idx))

    seq.lengths = as.vector(unlist(tapply(step.idx, step.idx,
                     function(x) rep(length(x), length(x)))))
    keep.seq = 1*(seq.lengths == max(seq.lengths))
    if(length(keep.seq[keep.seq == 1]) > max(seq.lengths)){
      N.max = max(seq.lengths)
      N.1s  = length(keep.seq[keep.seq==1])
      keep.seq[keep.seq==1] = c(rep(0, (N.1s-N.max)), rep(1, N.max))
    }
return(as.list(keep.seq))
}
DT[,keep.seqF := find.seq.keep(gap), by = gvkey]

Dropping rows works, but some false positives get dropped as well:

   DT.outF <- DT[keep.seqF==1]
   DT.outF[keep.seqF==0, .N]  # 0
   nrow(DT.outF)   # 141 (<149 = nrow(DT.out)  !!)

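I would like to get my own function working so that I can keep using version 1.9.6 (easier to share with colleagues), at least until 1.9.7 is on CRAN. Now that Frank has provided a solution to my problem, I would also like a better grasp of what the argument to find.seq.keep contains when I call it in j.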
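
A hedged observation on the 141 rows: with by = gvkey, the gap argument reaches the function as a plain numeric vector holding that group's rows, starting with the NA produced by shift. In the example data, 141 is exactly the combined size of the four gvkeys whose first row belongs to the kept run (50 + 58 + 12 + 21). That is what one would see if only the first element of the returned list were used and recycled across each group, which would fit data.table treating a list in j as a list of columns. This reading is inferred from the numbers and is not confirmed in the thread. The sketch below, on invented toy data, returns the atomic vector instead of as.list(keep.seq); the name find.seq.keep.vec is hypothetical.

```r
library(data.table)

# Same idea as find.seq.keep, lightly restructured, but returning an atomic
# vector with one value per row of the group. Hypothetical name.
find.seq.keep.vec <- function(g) {
  step.idx <- rep(0, length(g))
  step.idx[!is.na(g) & g >= 2] <- 1       # a jump of 2+ years starts a new run
  step.idx <- cumsum(step.idx)            # run ID within the group
  seq.lengths <- ave(step.idx, step.idx, FUN = length)  # run length, per row
  keep.seq <- 1 * (seq.lengths == max(seq.lengths))
  n.max <- max(seq.lengths)
  n.1s  <- sum(keep.seq == 1)
  if (n.1s > n.max) {
    # several runs tie for the longest: keep only the most recent one
    keep.seq[keep.seq == 1] <- c(rep(0, n.1s - n.max), rep(1, n.max))
  }
  keep.seq
}

# Invented toy data: firm 1 has two runs of length 2,
# firm 2 has a run of 3 followed by a run of 1.
toy <- data.table(gvkey = c(1, 1, 1, 1, 2, 2, 2, 2),
                  fyear = c(2000, 2001, 2005, 2006, 1990, 1991, 1992, 1995))
toy[, gap := fyear - shift(fyear, n = 1L, type = "lag"), by = gvkey]
toy[, keep.seqF := find.seq.keep.vec(gap), by = gvkey]

toy$keep.seqF   # 0 0 1 1 1 1 1 0
```

Returning a vector of length .N per group is the usual shape for the right-hand side of := with by, so this variant does not depend on how a list return value is interpreted.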

=======

**Reproducible example data**

DT <- data.table(
   gvkey =  c(1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 
              1681, 1681, 1681, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 
              1914, 1914, 1914, 1914, 1914, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2164, 2164, 2164, 2164, 
              2164, 2164, 2164, 2164, 2164, 2164, 2164, 2164, 2185, 2185, 2185, 2185, 2185, 
              2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 
              2185, 2185, 2185),
   fyear = c(1983, 1984, 1985, 1986, 1987, 1988, 1989, 1997, 1998, 2008, 2009, 2010, 2011, 
             2012, 2013, 2014, 1983, 1984, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 
             2001, 2002, 2003, 2004, 2005, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
             1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 
             1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
             1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2007, 2008, 
             2009, 2010, 2011, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
             1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973,
             1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 
             1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
             2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 1978, 1979, 1980, 1981, 
             1982, 1983, 1984, 1985, 1986, 1989, 1990, 1991, 1970, 1971, 1972, 1973, 1974,
             1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 
             1988, 1994, 1995))

setkey(DT, gvkey, fyear)

1 answer:

Answer 0 (score: 3)

I'm not sure why your function doesn't work, but here is another approach:

DT[, g := cumsum( fyear - shift(fyear, fill=fyear[1L]-1L) != 1L ), by=gvkey]
keep = DT[, 
  .(len = .N), by=.(gvkey, g)][, 
  .( g = g[tail(which(len == max(len)), 1)]), by=gvkey]

DT.out = DT[keep, on=names(keep)]

DT.out[, .N] # 149, as expected

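How it works:

  • g is the ID of each run within a gvkey.
  • len is the length of each run.
  • g[tail(which(len == max(len)), 1)] is the g of the longest run, breaking ties by taking the latest.
  • DT[keep, on=names(keep)] is a merge that subsets DT to the (gvkey, g) pairs in keep.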
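
The steps in this list can be traced on a small invented table (toy, runs and toy.out are names assumed for illustration). Firm 1 has two runs of equal length, so the tie-break is visible.

```r
library(data.table)

# Invented toy data: firm 1 has runs 2000-2001 and 2005-2006 (a tie),
# firm 2 has runs 1990-1992 and 1995.
toy <- data.table(gvkey = c(1, 1, 1, 1, 2, 2, 2, 2),
                  fyear = c(2000, 2001, 2005, 2006, 1990, 1991, 1992, 1995))
setkey(toy, gvkey, fyear)

# g: run ID within each gvkey; it increases whenever the yearly step is not 1
toy[, g := cumsum(fyear - shift(fyear, fill = fyear[1L] - 1L) != 1L), by = gvkey]

# len: number of rows in each (gvkey, g) run
runs <- toy[, .(len = .N), by = .(gvkey, g)]

# per gvkey, the g of the longest run; tail(..., 1) takes the latest on ties
keep <- runs[, .(g = g[tail(which(len == max(len)), 1)]), by = gvkey]

# join: keep only the rows of toy whose (gvkey, g) pair appears in keep
toy.out <- toy[keep, on = names(keep)]

toy.out$fyear   # 2005 2006 1990 1991 1992
```

Printing runs and keep before the join is a convenient way to check which run was selected for each firm.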

If, for some reason, you want a base R function that does this...

tag.long.seq = function(x){
    g    = cumsum(c(1L, diff(x) > 1L))
    len  = tapply(g, g, FUN = length)
    w    = tail(which(len == max(len)), 1L)

    ave(g, g, FUN = function(z) z[1] == w)    
}

DT[, keepem := tag.long.seq(fyear), by=gvkey]

DT[(keepem==1L), .N] # 149 again
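
As a quick check that needs no data.table, the same function can be run on a single hand-made vector of years (the values are invented for illustration). The definition is repeated here only so that the snippet runs on its own.

```r
# Copied from the answer above, with comments added
tag.long.seq <- function(x) {
  g   <- cumsum(c(1L, diff(x) > 1L))        # run ID: increments after each gap
  len <- tapply(g, g, FUN = length)         # length of each run
  w   <- tail(which(len == max(len)), 1L)   # latest among the longest runs
  ave(g, g, FUN = function(z) z[1] == w)    # 1 for rows in run w, 0 otherwise
}

# Three runs: 2000-2001, 2005-2006, 2010; the first two tie at length 2
years <- c(2000, 2001, 2005, 2006, 2010)
tag.long.seq(years)   # 0 0 1 1 0
```

The later of the two tied runs is flagged, which matches the tie-break rule stated in the question.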