Filter only complete years

Question

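I have yield data organized by state and county. From these data, I want to keep only those counties that provide complete years between 1970 and 2000.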

The following code removes some of the incomplete cases, but fails to drop all of them, especially for larger data sets.

Some fake data:

K <- 5 # number of rows set to NaN

df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)

df[sample(1:20, K), 3] <- NaN

Current code:

df1 <- read.csv("gly2.csv",header=TRUE)

df <- data.frame(df1)


droprows_1 <- function(df, v1, v2, v3, value = 'x'){
  idx <- df[, v3] == value
  todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missing
  todrop <- unique(todrop); todrop # but unique values could be less

  nrow <- dim(todrop)[1]
  for(i in 1:nrow){
    idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
    df <- df[!idx, ]
  }
  return(df)
}

qq <- droprows_1(df, 1, 2, 3)

Thanks

Answer 1

To drop counties that have even a single missing value, use:

library(dplyr)
df %>% group_by(county) %>% filter( !any(is.nan(yield)))
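If county codes repeat across states, as they do in the fake data from the question, grouping by county alone would also drop the same county number in other states. A minimal base R sketch of the same filter, grouped by state and county together, might look like this. The two rows set to NaN are chosen by hand for illustration, and the names has_missing and complete_df are hypothetical.

```r
# Fake data in the shape used in the question: 2 states, 4 county codes.
df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)

# Hand-picked missing values: row 1 is state 1 / county 1,
# row 12 is state 2 / county 4.
df[c(1, 12), "yield"] <- NaN

# For every state/county group, flag whether any yield is missing.
# is.na() is TRUE for both NA and NaN.
has_missing <- ave(is.na(df$yield), df$state, df$county, FUN = any)

# Keep only the rows of groups without missing values.
complete_df <- df[!has_missing, ]
```

Here county 1 is dropped only in state 1 and county 4 only in state 2; the other state/county combinations are kept in full.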

Answer 2

This is easy with data.table. I don't completely follow your example, but this sample data fits what I think you're looking for:

library(data.table)

dt<-data.table(state=letters[sample(26,size=20000,replace=TRUE)],
               county=sample(20,size=20000,replace=TRUE),
               year=rep(1981:2000,length.out=20000),
               var=rnorm(20000),
               key=c("state","county","year"))

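# Sampling duplicates a bunch of state/county/year combinations;
# keep one row per combination (recent data.table versions compare
# all columns by default, so the key columns are named explicitly)
dt<-unique(dt,by=c("state","county","year"))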

Now, to answer your question. In case you are new to data.table, I'll go step by step. The last line is all you really need.

# This will count the number of years for each state/county combination:
dt[,.N,by=.(state,county)]

# To focus on only those combinations which appear for every year
# (in my example there are 20 years)
# (also simultaneously drop the N column since we know every N is 20)
dt[,.N,by=.(state,county)][N==20,!"N",with=F]

# The grande finale: reduce your data set to
# ONLY those combinations with full support:
full_data<-dt[.(dt[,.N,by=.(state,county)][N==20,!"N",with=F])]

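Note that the last step requires the key of dt to be set to state and county, in that order, which can be done with setkey(dt,state,county). If you are not familiar with data.table notation, I recommend the data.table documentation, in particular its introductory vignette.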
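For comparison, the same count-and-filter logic can be sketched in base R with the year window stated explicitly instead of a hard-coded count. This is only an illustration under assumptions: the toy data, the 1998 to 2000 window, and the names required_years, full_keys and full_df are all hypothetical, and the column is called yield as in the question.

```r
# Hypothetical toy data: three state/county combinations.
df <- data.frame(
  state  = c(1, 1, 1, 1, 1, 2, 2, 2),
  county = c(1, 1, 1, 2, 2, 1, 1, 1),
  year   = c(1998, 1999, 2000, 1998, 2000, 1998, 1999, 2000),
  yield  = c(100, 101, 102, 90, 91, 80, NaN, 82)
)
required_years <- 1998:2000  # assumed window; the question uses 1970:2000

# Keep only rows inside the window that have a usable yield.
ok <- df[df$year %in% required_years & !is.na(df$yield), ]

# Count the distinct years available per state/county combination.
n_years <- aggregate(year ~ state + county, data = ok,
                     FUN = function(y) length(unique(y)))

# Combinations that cover every required year.
full_keys <- n_years[n_years$year == length(required_years),
                     c("state", "county")]

# Keep the original rows that belong to those combinations.
full_df <- merge(df, full_keys, by = c("state", "county"))
```

Counting distinct years rather than rows means duplicated state/county/year rows cannot make an incomplete county look complete.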

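Edit: I just noticed that you may be storing NA values in your data, in which case you should adjust the code so that rows with NA are not counted as years.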
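One way such an adjustment could look is to filter out missing values before counting. This is a sketch under assumptions rather than the answer's original code: the toy data is made up, the value column is assumed to be called var as in the sample data above, and n_years, full_keys and full_data are illustrative names.

```r
library(data.table)

# Hypothetical toy data: county 1 has all 3 years,
# county 2 has an NA value in one of them.
dt <- data.table(state  = "a",
                 county = rep(1:2, each = 3),
                 year   = rep(1981:1983, times = 2),
                 var    = c(1.0, 2.0, 3.0, 4.0, NA, 6.0),
                 key    = c("state", "county", "year"))
n_years <- 3  # number of years required for full support

# Count only the rows with a non-missing value, then keep the
# state/county combinations that still reach the required count.
full_keys <- dt[!is.na(var), .N, by = .(state, county)][
  N == n_years, !"N", with = FALSE]

# Join back to the full table on state and county.
full_data <- dt[full_keys, on = c("state", "county")]
```

Passing on = makes the join columns explicit, so this version does not depend on the key order of dt.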
