Dropping variables from panel data in R based on a minimum number of consecutive observations

Time: 2014-07-06 22:01:53

Tags: r data-cleansing panel-data

I am very new to R, and my question is as follows:

I have a panel data set organised as time series, like this (only part of it is shown):

Week_Starting    Team A            Team B      Team C   Team D              
2010-01-02         1                   2           3        4
2010-01-09         2                  40           1        5
2010-01-16        15                <NA>           4       11
2010-01-23        25                <NA>           7       18
2010-01-30        38                <NA>           9       29
2010-02-06      <NA>                <NA>          12       34
2010-02-13      <NA>                <NA>          16       40
2010-02-20      <NA>                <NA>          20     <NA>
2010-02-27      <NA>                <NA>          15       28
2010-03-06      <NA>                <NA>          20     <NA>
2010-03-13      <NA>                <NA>          24     <NA>
2010-03-20      <NA>                <NA>          24     <NA>
2010-03-27      <NA>                <NA>          21     <NA>
2010-04-03      <NA>                <NA>          27     <NA>
2010-04-10      <NA>                <NA>          24     <NA>
2010-04-17      <NA>                <NA>          25     <NA>
2010-04-24      <NA>                <NA>          35     <NA>
2010-05-01      <NA>                <NA>          40     <NA>
2010-05-08      <NA>                <NA>          32     <NA>
2010-05-15      <NA>                <NA>        <NA>     <NA>
2010-05-22      <NA>                <NA>          39     <NA>

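For example, it makes no sense to use Team B because it has too many missing observations. The ranking system does not provide data for ranks below 40. I would therefore like to clean the data by dropping every column (variable) that does not have at least 8 consecutive weeks of observations (in this example Teams A, B and D). D does not qualify because of the gap at the week starting 2010-02-20, which leaves it with only 7 consecutive observations. Keep in mind that I have more than 1000 columns.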

I have already tried "Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R", but it did not give me what I wanted, and unfortunately I am not skilled enough to modify that code to fit my needs.

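Some possible solutions I can think of:

1) For each variable, subset the stretches that have 8 or more consecutive observations

2) Set an observation to NA if the run of 8 consecutive observations it belongs to contains an NA, then drop the columns that contain only NA, since columns that fail the 8-week minimum will have nothing but NA values (I hope you get what I mean)

Thanks in advance for any help, comments and other suggestions! :)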

Edit:

Just out of interest: would it be harder to do the same thing if the data were organised in long format?

# Using MrFlick's data frame
library(reshape2)  # assumed source of melt(); the post does not name the package

melt(dd, id="Week_Starting")

       Week_Starting variable value
    1     2010-01-02   Team_A     1
    2     2010-01-09   Team_A     2
    3     2010-01-16   Team_A    15
    4     2010-01-23   Team_A    25
    5     2010-01-30   Team_A    38
    6     2010-02-06   Team_A    NA
    7     2010-02-13   Team_A    NA
    8     2010-02-20   Team_A    NA
    9     2010-02-27   Team_A    NA
    10    2010-03-06   Team_A    NA
    11    2010-03-13   Team_A    NA
    12    2010-03-20   Team_A    NA
    13    2010-03-27   Team_A    NA
    14    2010-04-03   Team_A    NA
    15    2010-04-10   Team_A    NA
    16    2010-04-17   Team_A    NA
    17    2010-04-24   Team_A    NA
    18    2010-05-01   Team_A    NA
    19    2010-05-08   Team_A    NA
    20    2010-05-15   Team_A    NA
    21    2010-05-22   Team_A    NA
    22    2010-01-02   Team_B     2
    23    2010-01-09   Team_B    40
    24    2010-01-16   Team_B    NA
    25    2010-01-23   Team_B    NA
    26    2010-01-30   Team_B    NA
    27    2010-02-06   Team_B    NA
    28    2010-02-13   Team_B    NA
    29    2010-02-20   Team_B    NA
    30    2010-02-27   Team_B    NA
    31    2010-03-06   Team_B    NA
    32    2010-03-13   Team_B    NA
    33    2010-03-20   Team_B    NA
    34    2010-03-27   Team_B    NA
    35    2010-04-03   Team_B    NA
    36    2010-04-10   Team_B    NA
    37    2010-04-17   Team_B    NA
    38    2010-04-24   Team_B    NA
    39    2010-05-01   Team_B    NA
    40    2010-05-08   Team_B    NA
    41    2010-05-15   Team_B    NA
    42    2010-05-22   Team_B    NA
    43    2010-01-02   Team_C     3
    44    2010-01-09   Team_C     1
    45    2010-01-16   Team_C     4
    46    2010-01-23   Team_C     7
    47    2010-01-30   Team_C     9
    48    2010-02-06   Team_C    12
    49    2010-02-13   Team_C    16
    50    2010-02-20   Team_C    20
    51    2010-02-27   Team_C    15
    52    2010-03-06   Team_C    20
    53    2010-03-13   Team_C    24
    54    2010-03-20   Team_C    24
    55    2010-03-27   Team_C    21
    56    2010-04-03   Team_C    27
    57    2010-04-10   Team_C    24
    58    2010-04-17   Team_C    25
    59    2010-04-24   Team_C    35
    60    2010-05-01   Team_C    40
    61    2010-05-08   Team_C    32
    62    2010-05-15   Team_C    NA
    63    2010-05-22   Team_C    39
    64    2010-01-02   Team_D     4
    65    2010-01-09   Team_D     5
    66    2010-01-16   Team_D    11
    67    2010-01-23   Team_D    18
    68    2010-01-30   Team_D    29
    69    2010-02-06   Team_D    34
    70    2010-02-13   Team_D    40
    71    2010-02-20   Team_D    NA
    72    2010-02-27   Team_D    28
    73    2010-03-06   Team_D    NA
    74    2010-03-13   Team_D    NA
    75    2010-03-20   Team_D    NA
    76    2010-03-27   Team_D    NA
    77    2010-04-03   Team_D    NA
    78    2010-04-10   Team_D    NA
    79    2010-04-17   Team_D    NA
    80    2010-04-24   Team_D    NA
    81    2010-05-01   Team_D    NA
    82    2010-05-08   Team_D    NA
    83    2010-05-15   Team_D    NA
    84    2010-05-22   Team_D    NA

Any suggestions? :)

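1 Answer:

Answer 0 (score: 4)

You can use rle to compute the run lengths of the non-NA values. First, here is your data in a form that can be copied and pasted.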

dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02", 
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06", 
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13", 
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17", 
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L, 
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L, 
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA, 
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting", 
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA, 
-21L))

Now we define a function that computes the length of the longest run of non-NA values in a vector

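consecnonNA <- function(x) {
    # run-length encode the NA mask: TRUE runs are gaps, FALSE runs are observations
    rr <- rle(is.na(x))
    # longest run of observations; the leading 0 covers columns that are entirely NA
    max(0L, rr$lengths[!rr$values])
}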
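To make the role of rle more concrete, here is a small illustration on the Team_D values from the data above. The variable names team_d, rr and longest_d exist only for this example.

```r
# Team_D values copied from the sample data (21 weeks)
team_d <- c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, rep(NA_integer_, 12))

# Run-length encoding of the NA mask:
# FALSE runs are observed weeks, TRUE runs are gaps
rr <- rle(is.na(team_d))
rr$values   # FALSE  TRUE FALSE  TRUE
rr$lengths  # 7 1 1 12

# The longest FALSE run is the longest stretch of consecutive observations
longest_d <- max(rr$lengths[!rr$values])
longest_d   # 7, which is below the 8-week threshold
```

The single missing week at 2010-02-20 splits Team_D into runs of 7 and 1, which is why it is dropped even though it has 8 observations in total.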

We can compute this value for each column and return the names of the columns that have at least 8 consecutive weeks

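# predicate factory: returns a function that tests whether x is at least i
atleast <- function(i) {function(x) x>=i}
# longest non-NA run per team column (the date column is excluded), keeping the names that pass
hasatleast8 <- names(Filter(atleast(8), sapply(dd[,-1], consecnonNA)))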
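With more than 1000 columns, a type-checked variant of the same step may be worth considering. The sketch below is only one possible alternative, not part of the original approach: it uses vapply with a plain comparison instead of Filter, and it runs on a made-up toy data frame (toy, longest_run, runs and keep are hypothetical names) with a threshold of 3 so that it stays short.

```r
# Hypothetical toy data: three short series, threshold 3 instead of 8
toy <- data.frame(
  week = 1:6,
  a = c(1, 2, NA, 4, 5, NA),   # longest run of observations: 2
  b = c(1, 2, 3, 4, NA, NA),   # longest run of observations: 4
  c = rep(NA_real_, 6)         # no observations at all
)

# Longest run of non-NA values; returns 0 for an all-NA vector
longest_run <- function(x) {
  rr <- rle(!is.na(x))
  max(0L, rr$lengths[rr$values])
}

# vapply insists on exactly one integer per column, so a surprise fails early
runs <- vapply(toy[-1], longest_run, integer(1))
keep <- names(runs)[runs >= 3]

toy[, c("week", keep), drop = FALSE]
```

Here only column b survives. Keeping the run lengths in runs also makes it easy to inspect how close the dropped columns came to the threshold.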

Then we can subset with

dd[, c("Week_Starting", hasatleast8), drop=FALSE]
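As for the long-format question in the edit, the same idea appears to carry over without much extra effort: compute the longest run per team with tapply and keep the rows of the qualifying teams. The following is a rough sketch on made-up data, assuming the column names that melt produces (variable, value) and assuming that every team has one row per week, with NA for a missing rank. The names long, longest_run, runs, keep and long_kept are hypothetical.

```r
# Hypothetical long-format toy data with the column names melt() produces
long <- data.frame(
  Week_Starting = rep(1:10, times = 2),
  variable = rep(c("Team_X", "Team_Y"), each = 10),
  value = c(1:9, NA,                # Team_X: 9 consecutive observations
            1:4, NA, 6:8, NA, NA)   # Team_Y: longest run is 4
)

# Longest run of non-NA values; returns 0 for an all-NA vector
longest_run <- function(x) {
  rr <- rle(!is.na(x))
  max(0L, rr$lengths[rr$values])
}

# Rows must be in week order within each team,
# otherwise a "run" does not correspond to consecutive weeks
long <- long[order(long$variable, long$Week_Starting), ]

# One run length per team, then keep only the rows of qualifying teams
runs <- tapply(long$value, long$variable, longest_run)
keep <- names(runs)[runs >= 8]
long_kept <- long[long$variable %in% keep, ]
```

If weeks without a rank are simply absent from the long data instead of being stored as NA, the missing rows would have to be filled in first, because rle only sees the rows that exist.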