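Selected rows in data.table not deleted the first time (have to delete twice)

Time: 2016-03-26 17:34:03

Tags: r data.table

I'm getting some strange behavior with data.table in R. I want to keep only a certain subset of rows, e.g. DT <- DT[max.seq == 1], which (I think) has always worked fine in the past. With this particular data set, though, I can't tell whether the problem is my code or some data.table feature I'm misunderstanding.

It seems that the command that deletes the rows I don't want has to be run twice to work properly.

Specifically, I'm trying to remove non-continuous firm-level time series by keeping only the longest continuous sequence for each firm (or the most recent sequence, if several sequences share the maximum length).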
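To make the goal concrete, here is a minimal sketch on a made-up toy table. The names toy, firm, year, run.id, run.len and kept are illustrative assumptions and do not appear in the real data; this is one possible way to express "longest run per firm, latest run on ties", not necessarily the approach used further down.

```r
library(data.table)

# Hypothetical toy data: firm 1 has two runs of equal length
# (2000-2001 and 2005-2006); firm 2 has a run of 3 and a run of 1.
toy <- data.table(
  firm = c(1, 1, 1, 1, 2, 2, 2, 2),
  year = c(2000, 2001, 2005, 2006, 2000, 2001, 2002, 2010)
)

# A new run starts wherever the gap to the previous year exceeds 1.
toy[, run.id := cumsum(c(TRUE, diff(year) > 1)), by = firm]

# Length of each run within each firm.
toy[, run.len := .N, by = .(firm, run.id)]

# Per firm, keep the longest run; taking max() over the candidate run.id
# values breaks ties in favour of the most recent run.
kept <- toy[, .SD[run.id == max(run.id[run.len == max(run.len)])], by = firm]
print(kept)
```

With this toy input, firm 1 keeps 2005-2006 (the later of the two tied runs) and firm 2 keeps 2000-2002.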

======

Here is a subset of the data I'm working with:

library(data.table)
DT <- data.table(
       gvkey =  c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221, 7392, 7392, 7392, 7392, 7392, 
                  7392, 7392, 7392, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 
                  8344, 8344, 10589, 10589, 10589, 10589, 11759, 11759, 12675, 12675, 12675, 12675, 
                  12675, 12675, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 
                  1312, 1312, 13910, 13910, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 
                  17286, 17286, 17286, 17286, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 
                  2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 
                  2090, 2090, 2090, 2090, 2090, 2090, 2212, 2212, 2212),
       fyear =  c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993, 1975, 1976, 1977, 1978, 1983, 
                  1984, 1985, 1986, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993, 
                  1994, 1995, 1978, 1979, 1983, 1984, 1984, 1988, 1985, 1986, 1987, 2001, 2002, 
                  2003, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 
                  1986, 1986, 1989, 1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004, 
                  2005, 2006, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 
                  1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 
                  1980, 1981, 1982, 1983, 1982, 1983, 1984))

setkey(DT, gvkey)

===========

I then run the following commands to create a binary variable (max.seq) that is 1 for every row belonging to the longest sequence of each firm (i.e. gvkey), and then use one.segment to keep the most recent sequence where necessary.

DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]

DT[,  step.idx := 0]    # initialize
DT[gap >=2, step.idx := 1]    # 1's at each multi-year jump
DT[,        step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ ,  seq.lengths := .N,  by=.(gvkey,step.idx)]      # length of each sequence
DT[,   max.seq := max(seq.lengths), by = gvkey]       # each firm's longest sequence

DT <- DT[max.seq == seq.lengths]  # Keep only the longest sequence(s)

Now, this is not the most efficient approach, because I make a copy above when dropping the non-longest time series, and then again when I keep the most recent of the equal-length longest series, but I don't think that should affect my functionality problem.

DT[, one.segment := 1*(max.seq == .N), by= gvkey] # 0 if multiple series remain

DT[one.segment == 0,  # make the last max.seq elements 1, leave the rest as 0
    one.segment := c(rep(0, (.N-max.seq[1])), rep(1, max.seq[1])), by=gvkey]

Edited to report complete output

I start with

nrow(DT)  # [1] 98
DT[one.segment ==0, .N]  # [1] 14

Then I keep only the one.segment==1 rows.

DT.out <- DT[one.segment == 1] # Finished! ... or am I?

I should now have no cases left with one.segment == 0, but I do.

nrow(DT.out)  # [1] 76
DT.out[one.segment ==0, .N]  # [1] 13

However, if I run the row-deletion command again, the problem is resolved (both for this example and for my full data set, where nrow(DT)>35000).

DT.out2 <- DT.out[one.segment == 1]
nrow(DT.out2)  # [1] 63
DT.out[one.segment ==0, .N]  # [1] 0

What am I missing?

Thanks!

**Output**

> DT.out
    gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
 1:  1312  1974        NA  NA        0          13      13           1
 2:  1312  1975      1974   1        0          13      13           1
 3:  1312  1976      1975   1        0          13      13           1
 4:  1312  1977      1976   1        0          13      13           1
 5:  1312  1978      1977   1        0          13      13           1
 6:  1312  1979      1978   1        0          13      13           1
 7:  1312  1980      1979   1        0          13      13           1
 8:  1312  1981      1980   1        0          13      13           1
 9:  1312  1982      1981   1        0          13      13           1
10:  1312  1983      1982   1        0          13      13           1
11:  1312  1984      1983   1        0          13      13           1
12:  1312  1985      1984   1        0          13      13           1
13:  1312  1986      1985   1        0          13      13           1
14:  2090  1956        NA  NA        0          28      28           1
15:  2090  1957      1956   1        0          28      28           1
16:  2090  1958      1957   1        0          28      28           1
17:  2090  1959      1958   1        0          28      28           1
18:  2090  1960      1959   1        0          28      28           1
19:  2090  1961      1960   1        0          28      28           1
20:  2090  1962      1961   1        0          28      28           1
21:  2090  1963      1962   1        0          28      28           1
22:  2090  1964      1963   1        0          28      28           1
23:  2090  1965      1964   1        0          28      28           1
24:  2090  1966      1965   1        0          28      28           1
25:  2090  1967      1966   1        0          28      28           1
26:  2090  1968      1967   1        0          28      28           1
27:  2090  1969      1968   1        0          28      28           1
28:  2090  1970      1969   1        0          28      28           1
29:  2090  1971      1970   1        0          28      28           1
30:  2090  1972      1971   1        0          28      28           1
31:  2090  1973      1972   1        0          28      28           1
32:  2090  1974      1973   1        0          28      28           1
33:  2090  1975      1974   1        0          28      28           1
34:  2090  1976      1975   1        0          28      28           1
35:  2090  1977      1976   1        0          28      28           1
36:  2090  1978      1977   1        0          28      28           1
37:  2090  1979      1978   1        0          28      28           1
38:  2090  1980      1979   1        0          28      28           1
39:  2090  1981      1980   1        0          28      28           1
40:  2090  1982      1981   1        0          28      28           1
41:  2090  1983      1982   1        0          28      28           1
42:  2212  1982        NA  NA        0           3       3           1
43:  2212  1983      1982   1        0           3       3           1
44:  2212  1984      1983   1        0           3       3           1
45:  8344  1990      1987   3        1           6       6           1
46:  8344  1991      1990   1        1           6       6           1
47:  8344  1992      1991   1        1           6       6           1
48:  8344  1993      1992   1        1           6       6           1
49:  8344  1994      1993   1        1           6       6           1
50:  8344  1995      1994   1        1           6       6           1
51: 10589  1978        NA  NA        0           2       2           0
52: 10589  1979      1978   1        0           2       2           0
53: 10589  1983      1979   4        1           2       2           1
54: 10589  1984      1983   1        1           2       2           1
55: 11759  1984        NA  NA        0           1       1           0
56: 11759  1988      1984   4        1           1       1           1
57: 12675  1985        NA  NA        0           3       3           0
58: 12675  1986      1985   1        0           3       3           0
59: 12675  1987      1986   1        0           3       3           0
60: 12675  2001      1987  14        1           3       3           1
61: 12675  2002      2001   1        1           3       3           1
62: 12675  2003      2002   1        1           3       3           1
63: 13910  1986        NA  NA        0           1       1           0
64: 13910  1989      1986   3        1           1       1           1
65: 17286  1989        NA  NA        0           6       6           0
66: 17286  1990      1989   1        0           6       6           0
67: 17286  1991      1990   1        0           6       6           0
68: 17286  1992      1991   1        0           6       6           0
69: 17286  1993      1992   1        0           6       6           0
70: 17286  1994      1993   1        0           6       6           0
71: 17286  2001      1994   7        1           6       6           1
72: 17286  2002      2001   1        1           6       6           1
73: 17286  2003      2002   1        1           6       6           1
74: 17286  2004      2003   1        1           6       6           1
75: 17286  2005      2004   1        1           6       6           1
76: 17286  2006      2005   1        1           6       6           1
    gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
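One thing that might be worth ruling out is data.table's automatic secondary indexing. This is only a hypothesis; nothing in the post confirms it. A subset such as DT[one.segment == 0] can build an index on one.segment, and if such an index did not reflect a later by-reference rewrite of that column, subsequent subsets on it could return unexpected rows. The sketch below uses a hypothetical toy table (toy, id and flag are made-up names) and compares a subset done inside data.table with the same condition evaluated on a plain vector, which cannot use an index.

```r
library(data.table)

# Hypothetical toy table; flag plays the role of one.segment.
toy <- data.table(id = 1:6, flag = c(0, 0, 0, 0, 1, 1))

# Subset on flag and rewrite flag by reference in the same call,
# mirroring the pattern used in the question.
toy[flag == 0, flag := c(0, 1, 0, 1)]

# List any secondary indices currently attached to the table.
indices(toy)

# Reference result from a plain vector comparison (no index involved).
expected <- which(toy$flag == 0)

# The same condition evaluated by data.table, returning row numbers.
via.dt <- toy[flag == 0, which = TRUE]

# These should agree; a mismatch would point at the index.
print(identical(expected, via.dt))

# Possible workarounds to try if they ever disagree:
setindex(toy, NULL)                    # drop all secondary indices
options(datatable.auto.index = FALSE)  # stop building them automatically
```

If the two results disagree on the real data, comparing the installed data.table version against the package NEWS would be a reasonable next step.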
