我对R很新,我的问题如下:
我有一组面板数据按时间序列组织,如下所示(仅显示部分):
Week_Starting Team A Team B Team C Team D
2010-01-02 1 2 3 4
2010-01-09 2 40 1 5
2010-01-16 15 <NA> 4 11
2010-01-23 25 <NA> 7 18
2010-01-30 38 <NA> 9 29
2010-02-06 <NA> <NA> 12 34
2010-02-13 <NA> <NA> 16 40
2010-02-20 <NA> <NA> 20 <NA>
2010-02-27 <NA> <NA> 15 28
2010-03-06 <NA> <NA> 20 <NA>
2010-03-13 <NA> <NA> 24 <NA>
2010-03-20 <NA> <NA> 24 <NA>
2010-03-27 <NA> <NA> 21 <NA>
2010-04-03 <NA> <NA> 27 <NA>
2010-04-10 <NA> <NA> 24 <NA>
2010-04-17 <NA> <NA> 25 <NA>
2010-04-24 <NA> <NA> 35 <NA>
2010-05-01 <NA> <NA> 40 <NA>
2010-05-08 <NA> <NA> 32 <NA>
2010-05-15 <NA> <NA> <NA> <NA>
2010-05-22 <NA> <NA> 39 <NA>
例如,使用B队是没有意义的,因为有太多的缺失观察。排名系统不提供低于40的排名数据。因此,我希望通过删除没有最少8周连续观察的列(变量)来清理(例如,在此示例中为A,B和D组)。所以D不符合要求,因为2010-02-20开始的一周存在差距。请记住,我有超过1000列。
我试过这个&#34; Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R&#34;之前,但它没有给我我想要的东西,不幸的是我不够熟练修改代码以满足我的需要。
我可以想到的一些可能的解决方案:
1)对具有8个或更多连续观察的每个变量的部分进行子集
2)如果连续运行8个obs包含NA,则设置观察值= NA,然后删除仅具有NA的列,因为不符合8个最小周要求的列将只有NA值(我希望你得到我的意思是)
感谢先进的任何帮助,评论和其他建议! :)
编辑:
只是出于兴趣,如果数据以长格式组织,那么做同样的事情会更难吗?
#Using MrFlick's data frame
melt(dd,id="Week_Starting")
Week_Starting variable value
1 2010-01-02 Team_A 1
2 2010-01-09 Team_A 2
3 2010-01-16 Team_A 15
4 2010-01-23 Team_A 25
5 2010-01-30 Team_A 38
6 2010-02-06 Team_A NA
7 2010-02-13 Team_A NA
8 2010-02-20 Team_A NA
9 2010-02-27 Team_A NA
10 2010-03-06 Team_A NA
11 2010-03-13 Team_A NA
12 2010-03-20 Team_A NA
13 2010-03-27 Team_A NA
14 2010-04-03 Team_A NA
15 2010-04-10 Team_A NA
16 2010-04-17 Team_A NA
17 2010-04-24 Team_A NA
18 2010-05-01 Team_A NA
19 2010-05-08 Team_A NA
20 2010-05-15 Team_A NA
21 2010-05-22 Team_A NA
22 2010-01-02 Team_B 2
23 2010-01-09 Team_B 40
24 2010-01-16 Team_B NA
25 2010-01-23 Team_B NA
26 2010-01-30 Team_B NA
27 2010-02-06 Team_B NA
28 2010-02-13 Team_B NA
29 2010-02-20 Team_B NA
30 2010-02-27 Team_B NA
31 2010-03-06 Team_B NA
32 2010-03-13 Team_B NA
33 2010-03-20 Team_B NA
34 2010-03-27 Team_B NA
35 2010-04-03 Team_B NA
36 2010-04-10 Team_B NA
37 2010-04-17 Team_B NA
38 2010-04-24 Team_B NA
39 2010-05-01 Team_B NA
40 2010-05-08 Team_B NA
41 2010-05-15 Team_B NA
42 2010-05-22 Team_B NA
43 2010-01-02 Team_C 3
44 2010-01-09 Team_C 1
45 2010-01-16 Team_C 4
46 2010-01-23 Team_C 7
47 2010-01-30 Team_C 9
48 2010-02-06 Team_C 12
49 2010-02-13 Team_C 16
50 2010-02-20 Team_C 20
51 2010-02-27 Team_C 15
52 2010-03-06 Team_C 20
53 2010-03-13 Team_C 24
54 2010-03-20 Team_C 24
55 2010-03-27 Team_C 21
56 2010-04-03 Team_C 27
57 2010-04-10 Team_C 24
58 2010-04-17 Team_C 25
59 2010-04-24 Team_C 35
60 2010-05-01 Team_C 40
61 2010-05-08 Team_C 32
62 2010-05-15 Team_C NA
63 2010-05-22 Team_C 39
64 2010-01-02 Team_D 4
65 2010-01-09 Team_D 5
66 2010-01-16 Team_D 11
67 2010-01-23 Team_D 18
68 2010-01-30 Team_D 29
69 2010-02-06 Team_D 34
70 2010-02-13 Team_D 40
71 2010-02-20 Team_D NA
72 2010-02-27 Team_D 28
73 2010-03-06 Team_D NA
74 2010-03-13 Team_D NA
75 2010-03-20 Team_D NA
76 2010-03-27 Team_D NA
77 2010-04-03 Team_D NA
78 2010-04-10 Team_D NA
79 2010-04-17 Team_D NA
80 2010-04-24 Team_D NA
81 2010-05-01 Team_D NA
82 2010-05-08 Team_D NA
83 2010-05-15 Team_D NA
84 2010-05-22 Team_D NA
有什么建议吗? :)
答案 0 :(得分:4)
您可以使用rle
来计算非NA值的运行长度。首先,您可以使用数据复制/粘贴数据。
dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02",
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06",
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13",
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17",
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L,
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L,
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA,
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting",
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA,
-21L))
现在我们定义一个函数,可以计算向量中最长的非NA值
consecnonNA <- function(x) {
rr<-rle(is.na(x))
max(rr$lengths[rr$values==FALSE])
}
我们可以为每个列计算此值,并返回至少连续8周的列的名称
atleast <- function(i) {function(x) x>=i}
hasatleast8 <- names(Filter(atleast(8), sapply(dd[,-1], consecnonNA)))
然后我们可以用
进行子集化dd[, c("Week_Starting", hasatleast8), drop=F]