Filtering only complete years

Time: 2015-05-14 05:19:21

Tags: r filtering panel

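I have yield data organized by state and county. From this data, I want to keep only those counties that have complete data for every year between 1970 and 2000.

The code below removes some of the incomplete cases but fails to drop all of them, especially on larger data sets.

Some fake data: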

K <- 5 # number of rows set to NaN

df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)

df[sample(1:20, K), 3] <- NaN

Current code:

df1 <- read.csv("gly2.csv",header=TRUE)

df <- data.frame(df1)


droprows_1 <- function(df, v1, v2, v3, value = 'x'){
  idx <- df[, v3] == value
  todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missing
  todrop <- unique(todrop); todrop # but unique values could be less

  nrow <- dim(todrop)[1]
  for(i in 1:nrow){
    idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
    df <- df[!idx, ]
  }
  return(df)
}

qq <- droprows_1(df, 1, 2, 3)

Thanks.

2 answers:

Answer 0 (score: 2)

To drop every county that has even a single missing value, use:

library(dplyr)

# Group by state and county together, since county codes repeat across states
df %>% group_by(state, county) %>% filter(!any(is.nan(yield)))
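
The same filter can be written without dplyr. A minimal base R sketch, assuming the column names from the question's fake data; it groups on state and county together, since county codes there repeat across states, and uses is.na(), which is TRUE for both NA and NaN. The names has_missing and complete_counties are only illustrative:

```r
set.seed(1)  # assumption: fixed seed so the fake data is reproducible

K <- 5  # number of rows set to NaN
df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)
df[sample(1:20, K), 3] <- NaN

# For every row, flag whether its state/county group contains a missing yield
has_missing <- as.logical(ave(is.na(df$yield), df$state, df$county, FUN = any))

# Keep only rows that belong to groups without any missing yield
complete_counties <- df[!has_missing, ]
```

Every row of a state/county group is dropped as soon as one of its yields is missing, which is the behaviour the question asks for.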

Answer 1 (score: 2)

This is easy with data.table. I don't entirely follow your example, but this sample data matches what I think you're looking for:

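library(data.table)

dt <- data.table(state = letters[sample(26, size = 20000, replace = TRUE)],
                 county = sample(20, size = 20000, replace = TRUE),
                 year = rep(1981:2000, length.out = 20000),
                 var = rnorm(20000),
                 key = c("state", "county", "year"))

# The sampling duplicates a bunch of state/county/year combinations;
# keep one row per combination (recent data.table versions need `by` here,
# because unique() otherwise compares all columns, including var)
dt <- unique(dt, by = c("state", "county", "year"))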

Now, to answer your question. In case you're new to data.table, I'll go step by step; the last line is all you really need.

# This will count the number of years for each state/county combination:
dt[, .N, by = .(state, county)]

# To focus on only those combinations which appear for every year
# (in my example there are 20 years)
# (also simultaneously drop the N column since we know every N is 20)
dt[, .N, by = .(state, county)][N == 20, !"N", with = FALSE]

# The grand finale: reduce your data set to
# ONLY those combinations with full support
# (a keyed join: the inner table's columns are matched to dt's leading key columns)
full_data <- dt[dt[, .N, by = .(state, county)][N == 20, !"N", with = FALSE]]

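Note that the last step requires dt to be keyed by state and county, in that order, which can be done with setkey(dt, state, county). If you're not familiar with data.table notation, I recommend the data.table documentation, especially the introductory vignette.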
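
If setting a key is inconvenient, data.table's on= argument names the join columns explicitly. A minimal sketch, assuming a small hypothetical panel (two states, two counties each) in place of the sample data, and deriving the expected number of years from the data instead of hard-coding 20; n_years and keep are illustrative names:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for reproducibility

# Hypothetical unkeyed panel: 4 state/county combinations, years 1981-2000
dt <- as.data.table(expand.grid(state = c("a", "b"), county = 1:2,
                                year = 1981:2000, stringsAsFactors = FALSE))
dt[, var := rnorm(.N)]

# Remove one year for one county so that it is no longer complete
dt <- dt[!(state == "b" & county == 2 & year == 1990)]

# Number of distinct years a complete county must cover
n_years <- uniqueN(dt$year)

# State/county combinations that appear in every year
keep <- dt[, .N, by = .(state, county)][N == n_years, .(state, county)]

# Join on the named columns; no key is needed
full_data <- dt[keep, on = .(state, county)]
```

Counting rows per combination only works when each state/county/year appears at most once, so duplicates should be removed first, as in the sample data.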

Edit: I just saw that you may be storing missing values as NA, in which case you should adjust the code so that rows with NA are not counted.
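
One way to keep missing values out of the count is to tally only the rows whose value is not NA. A minimal sketch, not the answer's original code: it assumes a hypothetical toy panel with columns state, county, year and var, and the 1970 to 2000 window from the question; ok_counts and keep are illustrative names:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for reproducibility
years <- 1970:2000

# Hypothetical panel: one state, three counties, every year in the window
dt <- CJ(state = "a", county = 1:3, year = years)
dt[, var := rnorm(.N)]

dt[county == 2 & year == 1985, var := NA_real_]  # county 2: value stored as NA
dt <- dt[!(county == 3 & year == 1990)]          # county 3: row absent altogether

# Count, per county, the years in the window that have a non-missing value
ok_counts <- dt[year %in% years & !is.na(var),
                .(n_ok = uniqueN(year)), by = .(state, county)]

# Keep only counties that cover the whole window
keep <- ok_counts[n_ok == length(years), .(state, county)]
full_data <- dt[keep, on = .(state, county)]
```

This treats a year stored as NA and a year with no row at all the same way: either one makes the county incomplete.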