Question

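I have a dataset containing monthly observations of returns for US companies. I am trying to exclude from my sample all companies that have fewer than a certain number of non-NA observations.

I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example that shows how I accomplished it and, I hope, makes my goal clear: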

#load required packages
library(data.table)
library(foreach)

#example data
myseries <- data.table(
 X = sample(letters[1:6],30,replace=TRUE),
 Y = sample(c(NA,1,2,3),30,replace=TRUE))

setkey(myseries,"X") #so X is the company identifier

#here I create another data table with each company identifier and its number 
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]

# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]

#finally I exclude all companies which are in the list "comps", 
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one, 
#and this is what makes it slow.

# seq_len() avoids iterating over 1:0 when no company falls below the cutoff
for (i in seq_len(nrow(comps))) {
  myseries <- myseries[X != comps$X[i], ]
}

How can I do this more efficiently? Is there a data.table way to get the same result?

Answer 1

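If you wanted to consider NA values over more than one column, you could use complete.cases(.SD). Since you only want to test a single column, however, I would suggest something like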

naCases <- myseries[,list(totalNA  = sum(!is.na(Y))),by=X]

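You can then join against naCases, keeping only the companies whose count meets a given threshold (despite its name, totalNA holds the number of non-NA values). For example:

threshold <- 3
# keep companies with at least `threshold` non-NA observations
myseries[naCases[totalNA >= threshold]]

You can also use a not-join to select the cases that were excluded:

myseries[!naCases[totalNA >= threshold]]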
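To make the join and the not-join concrete, here is a minimal sketch on a small deterministic table. The names toy, counts, kept, excluded and kept2 are made up for illustration, and the cutoff of "at least 3 non-NA observations" is taken from the question.

```r
library(data.table)

# Hypothetical toy data: company "a" has 4 non-NA values, "b" has 1, "c" has 3
toy <- data.table(
  X = rep(c("a", "b", "c"), times = c(4, 3, 3)),
  Y = c(1, 2, 3, 1,   NA, 2, NA,   3, 1, 2)
)
setkey(toy, X)

# One row per company with its number of non-NA observations
counts <- toy[, list(totalNA = sum(!is.na(Y))), by = X]

threshold <- 3
kept     <- toy[counts[totalNA >= threshold]]   # join: rows of retained companies
excluded <- toy[!counts[totalNA >= threshold]]  # not-join: rows of dropped companies

# Same selection via := on a copy, so that `toy` itself is not modified
kept2 <- copy(toy)[, totalNA := sum(!is.na(Y)), by = X][totalNA >= threshold]
```

Here kept should contain the rows of companies "a" and "c", and excluded the rows of "b". The join result carries the extra totalNA column from counts, while the not-join result has only the original columns.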

As noted in the comments, something like

myseries[, totalNA := sum(!is.na(Y)), by = X][totalNA >= 3]

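will also work. In that case, however, you are performing a vector scan over the entire data.table, whereas the previous solution performs the vector scan on a data.table with only length(unique(myseries[['X']])) rows. Note also that := adds the totalNA column to myseries by reference.

Given that this is a single vector scan, it will be efficient either way (a binary join plus a small vector scan may even be slower than one larger vector scan), and I doubt there will be much difference between the two approaches.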
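For the multi-column case mentioned at the start of this answer, the following is a rough sketch of how complete.cases(.SD) could be used per group. The second column Z, the table toy2 and the cutoff min_complete are assumptions added for illustration; they are not part of the question.

```r
library(data.table)

# Hypothetical data with a second measurement column Z (not in the question)
toy2 <- data.table(
  X = rep(c("a", "b"), each = 4),
  Y = c(1, NA, 2, 3,   1, NA, NA, 2),
  Z = c(5, 6, NA, 7,   NA, 8, 9, 1)
)

# Within each group, .SD holds every column except the grouping column X,
# so complete.cases(.SD) is TRUE only for rows where both Y and Z are non-NA
completeCounts <- toy2[, list(nComplete = sum(complete.cases(.SD))), by = X]

min_complete <- 2  # assumed cutoff, chosen for this example
keep_ids <- completeCounts[nComplete >= min_complete, X]
result <- toy2[X %in% keep_ids]
```

In this toy table company "a" has two complete rows and company "b" has one, so only the rows of "a" remain in result.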

Answer 2

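How about aggregating the number of non-NA values of Y by X, and then subsetting?

# Count the non-NA observations per company.
# na.action = na.pass is needed because the formula method drops NA rows by default,
# which would silently remove companies that have only NA values.
num_nas <- as.data.table(aggregate(Y ~ X, data = myseries,
                                   FUN = function(x) sum(!is.na(x)),
                                   na.action = na.pass))

# Subset: keep companies with at least 3 non-NA observations.
# num_nas$Y is spelled out so that Y is not resolved to the Y column of myseries.
myseries[X %in% num_nas$X[num_nas$Y >= 3], ]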
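One detail of this approach is worth checking on a small example: the formula method of aggregate() uses na.action = na.omit by default, so rows with NA are removed before FUN is called. The sketch below uses a made-up data frame df and a helper count_non_na to show the difference that na.action = na.pass makes.

```r
# Hypothetical data: company "b" has only NA values
df <- data.frame(
  X = c("a", "a", "a", "b", "b"),
  Y = c(1, NA, 2, NA, NA)
)

count_non_na <- function(x) sum(!is.na(x))

# Default na.action = na.omit: NA rows are removed first, so "b" has no row in the result
default_counts <- aggregate(Y ~ X, data = df, FUN = count_non_na)

# na.action = na.pass: NAs reach FUN, so "b" is reported with a count of 0
passed_counts <- aggregate(Y ~ X, data = df, FUN = count_non_na,
                           na.action = na.pass)
```

With the default, a company that has no valid observations never shows up in the counts, so whether it ends up kept or dropped depends on how the subsetting step is written. Passing the NAs through keeps every company in the table.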
