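I am getting some odd behavior with data.table in R. I want to keep only a subset of rows, e.g. DT <- DT[max.seq == 1], which (I think) has always worked fine in the past. With this particular data set, though, I cannot tell whether the problem is my code or some data.table feature I am misunderstanding.
It seems that the command that drops the rows I do not want has to be run twice before it works.
Specifically, I am trying to remove non-contiguous firm-level time series by keeping only the longest consecutive sequence for each firm (or, if several sequences share the maximum length, the most recent one).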
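To make the goal concrete, here is a minimal sketch of the intended result on a small table. It is not the code from this post: the names toy, run.id, run.len and keep are made up for illustration, and firm 1 is a hypothetical firm added so that one case has runs of unequal length. The rows for firms 10589 and 2212 come from the sample data in this post.

```r
library(data.table)

toy <- data.table(
  gvkey = c(10589, 10589, 10589, 10589, 2212, 2212, 2212, 1, 1, 1, 1),
  fyear = c(1978, 1979, 1983, 1984, 1982, 1983, 1984, 2000, 2002, 2003, 2004)
)
setkey(toy, gvkey, fyear)  # sort by firm and year so diff() compares adjacent years

# run.id goes up by one each time the year jumps by more than 1 within a firm
toy[, run.id := cumsum(c(TRUE, diff(fyear) > 1)), by = gvkey]

# run.len is the number of rows in each consecutive run
toy[, run.len := .N, by = .(gvkey, run.id)]

# Among each firm's longest runs, keep the one with the largest run.id (the latest)
keep <- toy[, .SD[run.id == max(run.id[run.len == max(run.len)])], by = gvkey]

print(keep[, .(gvkey, fyear)])
```

For firm 10589 both runs have length 2, so only the later one (1983 and 1984) should remain. For the hypothetical firm 1, the single year 2000 is dropped in favor of 2002 to 2004.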
======
Here is a subset of the data I am using:
library(data.table)
DT <- data.table(
gvkey = c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221, 7392, 7392, 7392, 7392, 7392,
7392, 7392, 7392, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344,
8344, 8344, 10589, 10589, 10589, 10589, 11759, 11759, 12675, 12675, 12675, 12675,
12675, 12675, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312,
1312, 1312, 13910, 13910, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286,
17286, 17286, 17286, 17286, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
2090, 2090, 2090, 2090, 2090, 2090, 2212, 2212, 2212),
fyear = c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993, 1975, 1976, 1977, 1978, 1983,
1984, 1985, 1986, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993,
1994, 1995, 1978, 1979, 1983, 1984, 1984, 1988, 1985, 1986, 1987, 2001, 2002,
2003, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
1986, 1986, 1989, 1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004,
2005, 2006, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
1980, 1981, 1982, 1983, 1982, 1983, 1984))
setkey(DT, gvkey)
===========
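I then run the following commands to create a binary variable (max.seq) that is 1 for every row belonging to the longest sequence of each firm (i.e. gvkey), and then do so again with one.segment to keep the most recent sequence where necessary.
This is not the most efficient approach: I make a copy when I drop the sequences that are not the longest, and then do so again when I keep the most recent of several equally long sequences. I do not think that affects the functionality problem, though.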
DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]
DT[, step.idx := 0] # initialize
DT[gap >=2, step.idx := 1] # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ , seq.lengths := .N, by=.(gvkey,step.idx)] # length of each sequence
DT[, max.seq := max(seq.lengths), by = gvkey] # each firm's longest sequence
DT <- DT[max.seq == seq.lengths] # Keep only the longest sequence(s)
DT[, one.segment := 1*(max.seq == .N), by= gvkey] # 0 if multiple series remain
DT[one.segment == 0, # make the last max.seq elements 1, leave the rest as 0
    one.segment := c(rep(0, (.N-max.seq[1])), rep(1, max.seq[1])), by=gvkey]
I start with
nrow(DT) # [1] 98
DT[one.segment ==0, .N] # [1] 14
and then keep only the one.segment == 1 rows.
DT.out <- DT[one.segment == 1] # Finished! ... or am I?
I should now have no cases with one.segment == 0 left, but I do.
nrow(DT.out) # [1] 76
DT.out[one.segment ==0, .N] # [1] 13
However, if I run the row-deletion command a second time, the problem is resolved (both for this example and for my full data set, where nrow(DT) > 35000).
DT.out2 <- DT.out[one.segment == 1]
nrow(DT.out2) # [1] 63
DT.out[one.segment ==0, .N] # [1] 0
What am I missing?
Thanks!
Edited to report the full output.
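One way to cross-check counts like the ones reported in this post, offered only as a sketch: compare what the data.table filter returns with a count taken from the plain column vector, which does not go through data.table's subsetting at all. The table chk and its values are hypothetical stand-ins, not the data from this post.

```r
library(data.table)

# Hypothetical stand-in for DT: two firms, some rows flagged 0 and some flagged 1
chk <- data.table(
  gvkey       = c(1, 1, 1, 1, 2, 2),
  one.segment = c(0, 0, 1, 1, 1, 1)
)

# Count computed on the plain column vector (base R, no data.table subsetting)
n.expected <- sum(chk$one.segment == 1)

# The same filter expressed through data.table
chk.out <- chk[one.segment == 1]

# If the filter did its job, the row counts match and no zeros remain
counts.match <- nrow(chk.out) == n.expected
zeros.left   <- sum(chk.out$one.segment == 0)

cat("counts match:", counts.match, "\n")
cat("zeros left:", zeros.left, "\n")
```

Running the same two comparisons on the real DT, right after one.segment is assigned, might help show whether the counts diverge at the filter or already in the flag column.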
**Output**
> DT.out
gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
1: 1312 1974 NA NA 0 13 13 1
2: 1312 1975 1974 1 0 13 13 1
3: 1312 1976 1975 1 0 13 13 1
4: 1312 1977 1976 1 0 13 13 1
5: 1312 1978 1977 1 0 13 13 1
6: 1312 1979 1978 1 0 13 13 1
7: 1312 1980 1979 1 0 13 13 1
8: 1312 1981 1980 1 0 13 13 1
9: 1312 1982 1981 1 0 13 13 1
10: 1312 1983 1982 1 0 13 13 1
11: 1312 1984 1983 1 0 13 13 1
12: 1312 1985 1984 1 0 13 13 1
13: 1312 1986 1985 1 0 13 13 1
14: 2090 1956 NA NA 0 28 28 1
15: 2090 1957 1956 1 0 28 28 1
16: 2090 1958 1957 1 0 28 28 1
17: 2090 1959 1958 1 0 28 28 1
18: 2090 1960 1959 1 0 28 28 1
19: 2090 1961 1960 1 0 28 28 1
20: 2090 1962 1961 1 0 28 28 1
21: 2090 1963 1962 1 0 28 28 1
22: 2090 1964 1963 1 0 28 28 1
23: 2090 1965 1964 1 0 28 28 1
24: 2090 1966 1965 1 0 28 28 1
25: 2090 1967 1966 1 0 28 28 1
26: 2090 1968 1967 1 0 28 28 1
27: 2090 1969 1968 1 0 28 28 1
28: 2090 1970 1969 1 0 28 28 1
29: 2090 1971 1970 1 0 28 28 1
30: 2090 1972 1971 1 0 28 28 1
31: 2090 1973 1972 1 0 28 28 1
32: 2090 1974 1973 1 0 28 28 1
33: 2090 1975 1974 1 0 28 28 1
34: 2090 1976 1975 1 0 28 28 1
35: 2090 1977 1976 1 0 28 28 1
36: 2090 1978 1977 1 0 28 28 1
37: 2090 1979 1978 1 0 28 28 1
38: 2090 1980 1979 1 0 28 28 1
39: 2090 1981 1980 1 0 28 28 1
40: 2090 1982 1981 1 0 28 28 1
41: 2090 1983 1982 1 0 28 28 1
42: 2212 1982 NA NA 0 3 3 1
43: 2212 1983 1982 1 0 3 3 1
44: 2212 1984 1983 1 0 3 3 1
45: 8344 1990 1987 3 1 6 6 1
46: 8344 1991 1990 1 1 6 6 1
47: 8344 1992 1991 1 1 6 6 1
48: 8344 1993 1992 1 1 6 6 1
49: 8344 1994 1993 1 1 6 6 1
50: 8344 1995 1994 1 1 6 6 1
51: 10589 1978 NA NA 0 2 2 0
52: 10589 1979 1978 1 0 2 2 0
53: 10589 1983 1979 4 1 2 2 1
54: 10589 1984 1983 1 1 2 2 1
55: 11759 1984 NA NA 0 1 1 0
56: 11759 1988 1984 4 1 1 1 1
57: 12675 1985 NA NA 0 3 3 0
58: 12675 1986 1985 1 0 3 3 0
59: 12675 1987 1986 1 0 3 3 0
60: 12675 2001 1987 14 1 3 3 1
61: 12675 2002 2001 1 1 3 3 1
62: 12675 2003 2002 1 1 3 3 1
63: 13910 1986 NA NA 0 1 1 0
64: 13910 1989 1986 3 1 1 1 1
65: 17286 1989 NA NA 0 6 6 0
66: 17286 1990 1989 1 0 6 6 0
67: 17286 1991 1990 1 0 6 6 0
68: 17286 1992 1991 1 0 6 6 0
69: 17286 1993 1992 1 0 6 6 0
70: 17286 1994 1993 1 0 6 6 0
71: 17286 2001 1994 7 1 6 6 1
72: 17286 2002 2001 1 1 6 6 1
73: 17286 2003 2002 1 1 6 6 1
74: 17286 2004 2003 1 1 6 6 1
75: 17286 2005 2004 1 1 6 6 1
76: 17286 2006 2005 1 1 6 6 1
gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment