Question

In R, I have a data frame df of this form:

a  b  year month id
1  2  2012 01    1234758
1  1  2012 02    1234758

NA 5  2011 04    1234759
5  5  2011 05    1234759
5  5  2011 06    1234759

2  2  2001 11    1234760
NA NA 2001 11    1234760

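Some values of a and b are NA. I want to split the data frame by id, sort each subset by year and month, and then drop the entire subset/id if the chronologically first observation of a or b is NA.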

For the example above, the result would be:

a  b  year month id
1  2  2012 01    1234758
1  1  2012 02    1234758

2  2  2001 11    1234760
NA NA 2001 11    1234760
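To make the example reproducible, the input shown above could be built as follows. This is only a sketch: storing month and id as integers is an assumption, since the printed table does not reveal the column types.

```r
# Example input from the question.
# Assumption: month and id are integers (the printed table does not show types).
df <- data.frame(
  a     = c(1, 1, NA, 5, 5, 2, NA),
  b     = c(2, 1, 5, 5, 5, 2, NA),
  year  = c(2012L, 2012L, 2011L, 2011L, 2011L, 2001L, 2001L),
  month = c(1L, 2L, 4L, 5L, 6L, 11L, 11L),
  id    = c(1234758L, 1234758L, 1234759L, 1234759L, 1234759L,
            1234760L, 1234760L)
)
```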

I did this in a non-vectorized way that takes forever to run, as follows:

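# table() yields two columns: Var1 (the id) and Freq (number of rows for that id)
df_summary <- as.data.frame(table(df$id), stringsAsFactors = FALSE)
df <- df[order(df$id, df$year, df$month), ]
remove <- ""

j <- 1  # row index of the first observation of the current id
l <- 0  # number of ids flagged for removal
for (i in 1:nrow(df_summary)) {

    m <- df_summary$Freq[i]  # number of rows belonging to this id
    if (is.na(df$a[j]) | is.na(df$b[j])) {
        l <- l + 1
        remove[l] <- df_summary$Var1[i]
    }
    j <- j + m  # jump to the first row of the next id
}

df <- df[!(df$id %in% remove), ]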

What is a faster, vectorized way to achieve the same result?

What I tried, also to double-check my code:

dt <- setDT(df)
remove_vectorized <- dt[,list(remove_first_na=(is.na(a[1]) | is.na(b[1]))),by=id]

This tells me to remove all observations, which is clearly wrong.

Answer 1

Here are a few possible approaches using data.table

First, fixing your attempt

library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
#         id  a  b year month
# 1: 1234758  1  2 2012     1
# 2: 1234758  1  1 2012     2
# 3: 1234760  2  2 2001    11
# 4: 1234760 NA NA 2001    11
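The snippet above relies on the rows already being in chronological order within each id. A minimal sketch of adding the sort the question asks for, assuming setorder() (which sorts by reference and keeps ties in their original order) and a shuffled toy copy of the example data; the names dt and res are illustrative only:

```r
library(data.table)

# Toy data mirroring the question, shuffled so that the sort matters.
# Assumption: year, month and id are integers.
dt <- data.table(
  a     = c(5, 1, NA, 1, 5, 2, NA),
  b     = c(5, 1, 5, 2, 5, 2, NA),
  year  = c(2011L, 2012L, 2011L, 2012L, 2011L, 2001L, 2001L),
  month = c(5L, 2L, 4L, 1L, 6L, 11L, 11L),
  id    = c(1234759L, 1234758L, 1234759L, 1234758L, 1234759L,
            1234760L, 1234760L)
)

# Sort by reference so the first row of each id is the earliest observation.
setorder(dt, id, year, month)

# Keep a group only if neither a nor b is NA in its first row;
# when the condition is FALSE the expression yields NULL and the group is dropped.
res <- dt[, if (!is.na(a[1L]) && !is.na(b[1L])) .SD, by = id]
```

Here 1234759 is dropped because its earliest row (2011, month 4) has a missing a, while 1234760 is kept because its first row is complete.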

Or we can generalize (possibly at the cost of speed)

setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD, by = id]`
## (in order to avoid the matrix conversion)
#         id  a  b year month
# 1: 1234758  1  2 2012     1
# 2: 1234758  1  1 2012     2
# 3: 1234760  2  2 2001    11
# 4: 1234760 NA NA 2001    11

Another approach is to combine the unique and na.omit methods

indx <- na.omit(unique(setDT(df), by = "id"), cols = c("a", "b"))

Then a simple subset will do

df[id %in% indx$id]
#         id  a  b year month
# 1: 1234758  1  2 2012     1
# 2: 1234758  1  1 2012     2
# 3: 1234760  2  2 2001    11
# 4: 1234760 NA NA 2001    11

Or maybe a binary join?

df[indx[, .(id)], on = "id"]
#         id  a  b year month
# 1: 1234758  1  2 2012     1
# 2: 1234758  1  1 2012     2
# 3: 1234760  2  2 2001    11
# 4: 1234760 NA NA 2001    11

Or

indx <- na.omit(unique(setDT(df, key = "id"), by = "id"), cols = c("a", "b"))
df[.(indx$id)]
#         id  a  b year month
# 1: 1234758  1  2 2012     1
# 2: 1234758  1  1 2012     2
# 3: 1234760  2  2 2001    11
# 4: 1234760 NA NA 2001    11

(The last two are mainly for illustration)
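For comparison, the same "first row per id, then filter" idea can be sketched in base R without data.table. This is an illustrative variant rather than part of the approaches above; the names first, keep and res are hypothetical, and the toy data assumes integer year, month and id columns.

```r
# Toy data mirroring the question (assumed integer year, month and id).
df <- data.frame(
  a     = c(1, 1, NA, 5, 5, 2, NA),
  b     = c(2, 1, 5, 5, 5, 2, NA),
  year  = c(2012L, 2012L, 2011L, 2011L, 2011L, 2001L, 2001L),
  month = c(1L, 2L, 4L, 5L, 6L, 11L, 11L),
  id    = c(1234758L, 1234758L, 1234759L, 1234759L, 1234759L,
            1234760L, 1234760L)
)

# Sort chronologically within id; order() leaves unresolved ties in place.
df <- df[order(df$id, df$year, df$month), ]

# First row of each id after sorting.
first <- df[!duplicated(df$id), ]

# ids whose first observation has neither a nor b missing.
keep <- first$id[!is.na(first$a) & !is.na(first$b)]

# Keep every row that belongs to one of those ids.
res <- df[df$id %in% keep, ]
```

Everything here is vectorized: duplicated() and %in% each make a single pass over the id column, so no explicit loop over ids is needed.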

For more information about data.table, see the Getting Started page on GitHub
