Question

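I have panel data, and as you may notice, some individuals are missing observations for certain periods. For example, "C" is missing the data point for 2001, and "D" is missing 2002 and 2003.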

> mydata
    id year sales profit
 1:  A 2000  2000    200
 2:  A 2001  2050    245
 3:  A 2002  2100    290
 4:  A 2003  2150    335
 5:  B 2000  2200    380
 6:  B 2001  2250    425
 7:  B 2002  2300    470
 8:  B 2003  2350    515
 9:  C 2000  2400    560
10:  C 2002  2500    650
11:  C 2003  2550    695
12:  D 2000  2600    740
13:  D 2001  2650    785

I tried something like the following:

subset(mydata, year==c(2000:2003))

The result is shown below.

   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515
9:  C 2000  2400    560
Warning message:
In year == c(2000:2003) :
  longer object length is not a multiple of shorter object length

What I need is the data for only those entities that cover the complete period, from 2000 through 2003. In this case, it would look like this:

   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515

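Thank you in advance for your time and your answers. I would really appreciate it if the answer could be kept fairly simple, since I am a complete beginner who has only just started learning R.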

Answer 1

You could try the following:

library(data.table)
mydata[, ind := all(2000:2003 %in% year), id][(ind)]
#    id year sales profit  ind
# 1:  A 2000  2000    200 TRUE
# 2:  A 2001  2050    245 TRUE
# 3:  A 2002  2100    290 TRUE
# 4:  A 2003  2150    335 TRUE
# 5:  B 2000  2200    380 TRUE
# 6:  B 2001  2250    425 TRUE
# 7:  B 2002  2300    470 TRUE
# 8:  B 2003  2350    515 TRUE
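
As a side note, `:=` adds the helper column `ind` to `mydata` by reference. If that is not wanted, a minimal sketch of a variant that returns each group's rows only when the group passes the check could look like the following. The names `wanted` and `complete` are made up for illustration, and the table is rebuilt here only so that the snippet runs on its own.

```r
library(data.table)

# Same panel as in the question, rebuilt so this snippet is self-contained
mydata <- data.table(
  id     = rep(c("A", "B", "C", "D"), times = c(4, 4, 3, 2)),
  year   = c(2000:2003, 2000:2003, 2000L, 2002L, 2003L, 2000L, 2001L),
  sales  = c(2000L, 2050L, 2100L, 2150L, 2200L, 2250L, 2300L, 2350L,
             2400L, 2500L, 2550L, 2600L, 2650L),
  profit = c(200L, 245L, 290L, 335L, 380L, 425L, 470L, 515L,
             560L, 650L, 695L, 740L, 785L)
)

wanted <- 2000:2003  # the years every id must have (illustrative name)

# For each id, return the group's rows (.SD) only if all wanted years occur;
# groups that fail the test return NULL and are dropped from the result
complete <- mydata[, if (all(wanted %in% year)) .SD, by = id]
complete
```

Here `mydata` itself is left unchanged, so no `ind` column has to be removed afterwards.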

Using the "tidyverse":

library(tidyverse)
mydata %>% 
  group_by(id) %>% 
  filter(all(2000:2003 %in% year))
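
Both snippets above rely on `%in%` rather than `==`, and that difference is what caused the warning in the question. `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership. A small base R sketch using the years of id "C" (the variable names are illustrative only):

```r
year_c <- c(2000L, 2002L, 2003L)  # years observed for id "C"
wanted <- 2000:2003               # years a complete id must have

# `==` recycles year_c to length 4 and compares position by position
# (R also warns that the lengths do not match)
eq <- suppressWarnings(year_c == wanted)
eq
# [1]  TRUE FALSE FALSE FALSE

# `%in%` asks, for each wanted year, whether it occurs anywhere in year_c
present <- wanted %in% year_c
present
# [1]  TRUE FALSE  TRUE  TRUE

# all() collapses this to one TRUE/FALSE per group: "C" lacks 2001
all(present)
# [1] FALSE
```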

Sample data (this is how it should be shared in the future):

mydata <- structure(list(id = c("A", "A", "A", "A", "B", "B", "B", "B", 
    "C", "C", "C", "D", "D"), year = c(2000L, 2001L, 2002L, 2003L, 
    2000L, 2001L, 2002L, 2003L, 2000L, 2002L, 2003L, 2000L, 2001L
    ), sales = c(2000L, 2050L, 2100L, 2150L, 2200L, 2250L, 2300L, 
    2350L, 2400L, 2500L, 2550L, 2600L, 2650L), profit = c(200L, 245L, 
    290L, 335L, 380L, 425L, 470L, 515L, 560L, 650L, 695L, 740L, 785L
    )), .Names = c("id", "year", "sales", "profit"), row.names = c(NA, 
    13L), class = c("data.table", "data.frame"))

Answer 2

Consider base R's ave to compute a row count per id group, and keep only the records whose group has a year count equal to 4:

Data

txt = ' id year sales profit
 A 2000 2000 200
 A 2001 2050 245
 A 2002 2100 290
 A 2003 2150 335
 B 2000 2200 380
 B 2001 2250 425
 B 2002 2300 470
 B 2003 2350 515
 C 2000 2400 560
 C 2002 2500 650
 C 2003 2550 695
 D 2000 2600 740
 D 2001 2650 785'

df <- read.table(text=txt, header=TRUE)

Code

df$grp_cnt <- ave(df$year, df$id, FUN=length)
df <- transform(subset(df, df$grp_cnt == 4), grp_cnt = NULL)
df
#   id year sales profit
# 1  A 2000  2000    200
# 2  A 2001  2050    245
# 3  A 2002  2100    290
# 4  A 2003  2150    335
# 5  B 2000  2200    380
# 6  B 2001  2250    425
# 7  B 2002  2300    470
# 8  B 2003  2350    515
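
Counting rows per id works for this data, but it assumes that no id has duplicate years or years outside 2000 to 2003. If that cannot be guaranteed, a base R sketch that checks membership instead of length might look like this. The id "E" is a hypothetical extra entity added only to show the difference, and `wanted`, `is_complete`, `complete_ids`, `panel`, and `result` are illustrative names.

```r
# Reduced copy of the question's panel plus a hypothetical id "E"
# that has four rows but a duplicated year and no 2003
panel <- data.frame(
  id   = rep(c("A", "B", "C", "D", "E"), times = c(4, 4, 3, 2, 4)),
  year = c(2000:2003, 2000:2003,
           2000L, 2002L, 2003L,
           2000L, 2001L,
           2000L, 2000L, 2001L, 2002L)
)

wanted <- 2000:2003

# One TRUE/FALSE per id: does the id contain every wanted year?
is_complete  <- tapply(panel$year, panel$id, function(y) all(wanted %in% y))
complete_ids <- names(is_complete)[is_complete]

# Keep only the rows belonging to complete ids
result <- panel[panel$id %in% complete_ids, ]
result
```

A plain row count of 4 would have kept "E" as well, while the membership test drops it.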

Answer 3

For the sake of completeness, here is also a solution that uses a join:

mydata[mydata[, uniqueN(year), by = id][V1 == 4L, .(id)], on = "id"]

   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515

Subsetting panel data with a complete time dimension
