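I have yield data organized by state and county. From these data, I want to keep only those counties that provide complete years between 1970 and 2000.
The following code removes some of the incomplete cases but fails to drop all of them, especially for larger data sets.
Some fake data: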
K <- 5 # number of rows set to NaN
df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
county = rep(1:4, 5), yield = 100)
df[sample(1:20, K), 3] <- NaN
Current code:
df1 <- read.csv("gly2.csv",header=TRUE)
df <- data.frame(df1)
droprows_1 <- function(df, v1, v2, v3, value = 'x'){
idx <- df[, v3] == value
todrop <- df[idx, c(v1, v2)]; todrop # should have K rows (the missing ones)
todrop <- unique(todrop); todrop # but there may be fewer unique state/county pairs
nrow <- dim(todrop)[1]
for(i in 1:nrow){
idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
df <- df[!idx, ]
}
return(df)
}
qq <- droprows_1(df, 1, 2, 3)
Thanks.
Answer 0 (score: 2)
To drop counties that have even a single missing value, use:
library(dplyr)
# group by state as well, since county codes repeat across states
df %>% group_by(state, county) %>% filter(!any(is.nan(yield)))
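For readers without dplyr, the same group-wise filter can be sketched in base R with ave(). This is only an illustration: the small data frame and the names has_gap and complete_df are made up for the example, and the grouping uses state and county together because county codes can repeat across states.

```r
# Hypothetical sample data: county codes repeat across states.
df <- data.frame(
  state  = c(1, 1, 1, 1, 2, 2, 2, 2),
  county = c(1, 1, 2, 2, 1, 1, 2, 2),
  year   = rep(c(1999, 2000), 4),
  yield  = c(100, NaN, 100, 100, 100, 100, 100, 100)
)

# TRUE for every row whose (state, county) group contains a missing yield.
# is.na() is TRUE for both NA and NaN.
has_gap <- ave(is.na(df$yield), df$state, df$county, FUN = any)

# Keep only the rows of groups without any missing yield.
complete_df <- df[!has_gap, ]
```

Here only county 1 of state 1 is dropped; county 1 of state 2 is kept because it belongs to a different group.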
Answer 1 (score: 2)
This is easy with data.table. I'm not following your example exactly, but this sample data matches what I think you're looking for:
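library(data.table)
dt<-data.table(state=letters[sample(26,size=20000,replace=TRUE)],
county=sample(20,size=20000,replace=TRUE),
year=rep(1981:2000,length.out=20000),
var=rnorm(20000),
key=c("state","county","year"))
# Sampling duplicates many state/county/year combinations; keep one row per key
# (by= is given explicitly so that only the key columns are compared)
dt<-unique(dt,by=key(dt))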
Now, to answer your question. In case you're new to data.table, I'll go step by step. The last line is the one you really need.
# This will count the number of years for each state/county combination:
dt[,.N,by=.(state,county)]
# To focus on only those combinations which appear for every year
# (in my example there are 20 years)
# (also simultaneously drop the N column since we know every N is 20)
dt[,.N,by=.(state,county)][N==20,!"N",with=F]
# The grande finale: reduce your data set to
# ONLY those combinations with full support:
full_data<-dt[.(dt[,.N,by=.(state,county)][N==20,!"N",with=F])]
Note that the last step requires dt to be keyed by state and county, in that order, which can be done with setkey(dt,state,county). If you're not familiar with data.table notation, I recommend the data.table documentation, and the vignettes in particular.
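If setting a key is inconvenient, the same reduction can also be sketched as a join with an explicit on= argument. The toy data and the names n_years and complete are assumptions made for this illustration, not part of the answer's own code.

```r
library(data.table)

# Hypothetical toy data: county 1 has all three years, county 2 misses 1999.
dt <- data.table(
  state  = c("a", "a", "a", "a", "a"),
  county = c(1, 1, 1, 2, 2),
  year   = c(1998, 1999, 2000, 1998, 2000),
  var    = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
n_years <- 3  # number of years a complete county must have (assumed)

# State/county pairs that appear in every year.
complete <- dt[, .N, by = .(state, county)][N == n_years, .(state, county)]

# Join back on the two grouping columns; no key has to be set for this.
full_data <- dt[complete, on = .(state, county)]
```

With on= the join columns are named at the call site, so the result does not depend on how dt happens to be keyed.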
Edit: I just saw that you may be storing NA values for a given year, in which case you should adjust the code so that those NAs are not counted.
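One way that adjustment might look, as a sketch only: filter out the missing rows before counting, so a county with an NA in any year falls short of the required count. The toy data and the names n_years and complete are hypothetical, and the value column is assumed to be called var as in the sample data above.

```r
library(data.table)

# Hypothetical toy data: every county has a row per year,
# but county 2 stores NA for 1999.
dt <- data.table(
  state  = "a",
  county = c(1, 1, 1, 2, 2, 2),
  year   = rep(1998:2000, 2),
  var    = c(0.1, 0.2, 0.3, 0.4, NA, 0.6)
)
n_years <- 3  # number of years a complete county must have (assumed)

# Count only the rows with a non-missing value (is.na() also catches NaN).
complete <- dt[!is.na(var), .N, by = .(state, county)][N == n_years, .(state, county)]

# Keep the rows of the counties that passed the count.
full_data <- dt[complete, on = .(state, county)]
```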