Question

Given this sample data:

df <- data.frame(a_1=c("Apple","Grapes","Melon","Peach"),a_2=c("Nuts","Kiwi","Lime","Honey"),a_3=c("Plum","Apple",NA,NA),a_4=c("Cucumber",NA,NA,NA))

     a_1   a_2   a_3      a_4
1  Apple  Nuts  Plum Cucumber
2 Grapes  Kiwi Apple     <NA>
3  Melon  Lime  <NA>     <NA>
4  Peach Honey  <NA>     <NA>

Basically, I want to run grep on the last non-NA column of each row. So the x in my grep("pattern", x) call should be:

Cucumber
Apple
Lime
Honey

I have an integer vector that tells me which a_N is the last one in each row:

numcol <- rowSums(!is.na(df[,grep("(^a_)\\d", colnames(df))]))

So far I have tried things like this, in combination with ave(), apply() and dplyr:

grepl("pattern",df[,sprintf("a_%i",numcol)])

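But I can't get it to work. Keep in mind that my dataset is very large, so I would prefer a vectorized solution or maybe dplyr. Any help is much appreciated.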

Edit: Thanks, that is a really nice solution. My approach was far too complicated. (The regex is there because of my more specific data.)
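A note on why the attempt above fails: df[, sprintf("a_%i", numcol)] selects whole columns (one per element of numcol), not one cell per row, so grepl() never sees a vector of last values. The following is only a minimal sketch of one way around that. It assumes the NAs always trail the values in each row (so numcol is also the index of the last filled column), and "App" is just a placeholder pattern.

```r
df <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"),
                 a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
                 a_3 = c("Plum", "Apple", NA, NA),
                 a_4 = c("Cucumber", NA, NA, NA),
                 stringsAsFactors = FALSE)

cols <- grep("^a_\\d", colnames(df))
m <- as.matrix(df[, cols])

# Count of non-NA cells per row. This equals the position of the last
# filled column only if NAs are trailing (an assumption of this sketch).
numcol <- rowSums(!is.na(m))

# A two-column (row, column) index matrix picks exactly one cell per row.
last_val <- m[cbind(seq_len(nrow(m)), numcol)]
# "Cucumber" "Apple" "Lime" "Honey"

# "App" is a placeholder pattern used for illustration only.
hits <- grepl("App", last_val)
# FALSE TRUE FALSE FALSE
```

If a row can contain NAs between values, numcol no longer points at the last filled column and a different way of locating it is needed.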

Answer 1

There is no need for a regex here. Just use apply + tail + na.omit:

apply(df, 1, function(x) tail(na.omit(x), 1))
# [1] "Cucumber" "Apple"    "Lime"     "Honey"
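One caveat with the apply + tail + na.omit approach: if a row is entirely NA, tail(na.omit(x), 1) returns a zero-length vector for that row, and apply() then returns a list rather than a character vector. A small sketch of a guarded variant follows. The helper name last_non_na and the fifth all-NA row are hypothetical, added here only to illustrate the edge case.

```r
# Sample data plus a hypothetical fifth row that is entirely NA.
df <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach", NA),
                 a_2 = c("Nuts", "Kiwi", "Lime", "Honey", NA),
                 a_3 = c("Plum", "Apple", NA, NA, NA),
                 a_4 = c("Cucumber", NA, NA, NA, NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: last non-NA element of a vector,
# or NA when nothing is left after dropping NAs.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[[length(x)]] else NA_character_
}

res <- apply(df, 1, last_non_na)
# "Cucumber" "Apple" "Lime" "Honey" NA
```

Because every call returns exactly one value, the result stays an atomic vector that can be passed straight to grepl().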

You can also use "data.table" and "reshape2", like this:

library(data.table)
library(reshape2)
na.omit(melt(as.data.table(df, keep.rownames = TRUE), 
             id.vars = "rn"))[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

Or, even better:

melt(as.data.table(df, keep.rownames = TRUE), 
     id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

This will be much faster. On an 800k-row dataset, apply took about 50 seconds, while the data.table approach took about 2.5 seconds.

Answer 2

Another alternative that is probably quite fast:

DF[cbind(seq_len(nrow(DF)), max.col(!is.na(DF), "last"))]
#[1] "Cucumber" "Apple"    "Lime"     "Honey"

Where "DF" is:

DF = structure(list(a_1 = structure(1:4, .Label = c("Apple", "Grapes", 
"Melon", "Peach"), class = "factor"), a_2 = structure(c(4L, 2L, 
3L, 1L), .Label = c("Honey", "Kiwi", "Lime", "Nuts"), class = "factor"), 
    a_3 = structure(c(2L, 1L, NA, NA), .Label = c("Apple", "Plum"
    ), class = "factor"), a_4 = structure(c(1L, NA, NA, NA), .Label = "Cucumber", class = "factor")), .Names = c("a_1", 
"a_2", "a_3", "a_4"), row.names = c(NA, -4L), class = "data.frame")
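To make the one-liner easier to follow, here is the same idea split into steps. This is just an illustrative breakdown; the intermediate names present, last_col and idx are not from the answer, and the data is rebuilt with plain character columns for brevity.

```r
DF <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"),
                 a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
                 a_3 = c("Plum", "Apple", NA, NA),
                 a_4 = c("Cucumber", NA, NA, NA),
                 stringsAsFactors = FALSE)

# Logical matrix: TRUE wherever a value is present.
present <- !is.na(DF)

# For each row, the column holding the maximum (TRUE counts as 1).
# With ties.method = "last", ties resolve to the right-most TRUE.
last_col <- max.col(present, ties.method = "last")
# 4 3 2 2

# Indexing with a (row, column) matrix extracts one cell per row.
idx <- cbind(seq_len(nrow(DF)), last_col)
last_val <- DF[idx]
# "Cucumber" "Apple" "Lime" "Honey"
```

Since this looks for the right-most non-NA cell rather than counting non-NA cells, it also copes with rows that have an NA between two values.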

Answer 3

You could also try (df1 is the dataset):

 indx <- which(!is.na(df1), arr.ind=TRUE)
 df1[cbind(1:nrow(df1),tapply(indx[,2], indx[,1], FUN=max))]
 #[1] "Cucumber" "Apple"    "Lime"     "Honey"
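A short walk-through of what the two lines above do, with one guard added. The guard is my own addition, not part of the answer: a row that is entirely NA never shows up in indx, so the tapply() result would be shorter than nrow(df1) and the row/column pairs would no longer line up.

```r
df1 <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"),
                  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
                  a_3 = c("Plum", "Apple", NA, NA),
                  a_4 = c("Cucumber", NA, NA, NA),
                  stringsAsFactors = FALSE)

# One (row, col) pair for every non-NA cell.
indx <- which(!is.na(df1), arr.ind = TRUE)

# Largest column index within each row, i.e. the last filled column.
last_col <- as.vector(tapply(indx[, "col"], indx[, "row"], FUN = max))
# 4 3 2 2

# Added guard (assumption of this sketch): every row has at least one value.
stopifnot(length(last_col) == nrow(df1))

last_val <- df1[cbind(seq_len(nrow(df1)), last_col)]
# "Cucumber" "Apple" "Lime" "Honey"
```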

Get the value of the last non-NA column for each row
