Question

I have a sparse data frame, example. It has five data columns, but each row has only two entries, randomly distributed among the columns:

id  a   b   c   d   e
1   NA  10  NA  NA  1
2   6   NA  10  NA  NA
3   3   NA  NA  2   NA
4   NA  NA  9   4   NA
5   NA  NA  1   NA  5

I would like to return a data frame with only two data columns, holding the two values from each row:

id  val1    val2
1   10      1
2   6       10
3   3       2
4   9       4
5   1       5

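This could be done with a for loop. But my real data is very large, so I would like to write an apply-like function. Everything I have seen assumes you know which columns you will be using. I tried writing my own one-line function and then using apply, but I keep getting the error "incorrect number of dimensions".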

Answer 1

Try

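# Drop the id column, keep the non-NA values of each row, and transpose
# apply()'s 2 x n result back to one row per id
d1 <- setNames(data.frame(example$id,
                          t(apply(example[-1], 1, function(x) x[!is.na(x)]))),
               c('id', 'val1', 'val2'))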
d1
#  id val1 val2
#1  1   10    1
#2  2    6   10
#3  3    3    2
#4  4    9    4
#5  5    1    5
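
The transpose in the call above matters because apply() over rows returns each row's result as a column. A minimal sketch of that behaviour on the question's toy data follows; the guard at the end (and the name n_per_row) is an illustrative addition, not part of the original answer, and it assumes every row holds exactly two non-NA values.

```r
# Toy data copied from the question
example <- data.frame(
  id = 1:5,
  a = c(NA, 6L, 3L, NA, NA),
  b = c(10L, NA, NA, NA, NA),
  c = c(NA, 10L, NA, 9L, 1L),
  d = c(NA, NA, 2L, 4L, NA),
  e = c(1L, NA, NA, NA, 5L)
)

# apply() over rows stores each row's result as a COLUMN,
# so the raw result is 2 x 5 and needs t() to become 5 x 2
vals <- apply(example[-1], 1, function(x) x[!is.na(x)])
dim(vals)     # 2 5
dim(t(vals))  # 5 2

# Hypothetical guard: stop early if any row does not have exactly two
# values, since apply() would then return a list instead of a matrix
n_per_row <- rowSums(!is.na(example[-1]))
stopifnot(all(n_per_row == 2))
```

If some rows had a different number of non-NA values, apply() would return a list and the t() step would no longer give a two-column matrix, which is why a check like this may be worth running first.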

Or you could convert to 'long' format and then reshape back to 'wide':

library(data.table)
dcast(melt(setDT(example), id.var='id', na.rm=TRUE)[,
           ind:=paste0('val', 1:.N) , id], id~ind, value.var='value')
#    id val1 val2
#1:  1   10    1
#2:  2    6   10
#3:  3    3    2
#4:  4    9    4
#5:  5    1    5
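
The chained call above can be unpacked into three steps. This is only a sketch of the same idea: the intermediate names long and wide are illustrative, and it uses as.data.table() (which copies) rather than setDT() (which converts example in place), so the original data frame is left untouched.

```r
library(data.table)

# Toy data copied from the question
example <- data.frame(
  id = 1:5,
  a = c(NA, 6L, 3L, NA, NA),
  b = c(10L, NA, NA, NA, NA),
  c = c(NA, 10L, NA, 9L, 1L),
  d = c(NA, NA, 2L, 4L, NA),
  e = c(1L, NA, NA, NA, 5L)
)

# Step 1: reshape to long format, dropping the NA cells
long <- melt(as.data.table(example), id.vars = "id", na.rm = TRUE)

# Step 2: number the surviving values within each id (val1, val2, ...)
long[, ind := paste0("val", seq_len(.N)), by = id]

# Step 3: reshape back to wide format, one column per value index
wide <- dcast(long, id ~ ind, value.var = "value")
wide
```

Because the values are numbered per id rather than assumed to be exactly two, this variant should also cope with rows that have fewer values; the missing positions would simply come back as NA.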

Data

example <- structure(list(id = 1:5, a = c(NA, 6L, 3L, NA, NA),
b = c(10L, 
NA, NA, NA, NA), c = c(NA, 10L, NA, 9L, 1L), d = c(NA, NA, 2L, 
4L, NA), e = c(1L, NA, NA, NA, 5L)), .Names = c("id", "a", "b", 
"c", "d", "e"), class = "data.frame", row.names = c(NA, -5L))

Answer 2

This should be a pretty fast approach:

temp <- t(example[-1])  # Matrix of all columns other than the first, transposed
cbind(example[1],       # Bind the first column with a two-column matrix
                        # created by using is.na and which
      matrix(temp[which(!is.na(temp), arr.ind = TRUE)], 
             ncol = 2, byrow = TRUE))
#   id  1  2
# 1  1 10  1
# 2  2  6 10
# 3  3  3  2
# 4  4  9  4
# 5  5  1  5
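
Why the transpose matters can be shown in a few lines. R stores matrices column by column, so after transposing, each column of temp is one original row, and extracting the non-NA cells walks the data row by row. The sketch below uses plain logical indexing, which picks the same cells as the which(..., arr.ind = TRUE) form above; the names vals and m are illustrative.

```r
# Toy data copied from the question
example <- data.frame(
  id = 1:5,
  a = c(NA, 6L, 3L, NA, NA),
  b = c(10L, NA, NA, NA, NA),
  c = c(NA, 10L, NA, 9L, 1L),
  d = c(NA, NA, 2L, 4L, NA),
  e = c(1L, NA, NA, NA, 5L)
)

temp <- t(example[-1])  # 5 x 5 matrix: one COLUMN per original row

# Column-major extraction of the transposed matrix yields the values
# in original row order: row 1's pair, then row 2's pair, and so on
vals <- temp[!is.na(temp)]
vals                                  # 10 1 6 10 3 2 9 4 1 5
matrix(vals, ncol = 2, byrow = TRUE)  # one pair per original row

# Without the transpose the values come out in column order instead,
# so pairs from different rows would be mixed together
m <- as.matrix(example[-1])
m[!is.na(m)]                          # 6 3 10 10 9 1 2 4 1 5
```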

In a quick test with a 5-million-row dataset, it ran faster than both the "data.table" and apply approaches.
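
A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the test that was actually run: the synthetic data, the row count n, and the helper names f_apply and f_matrix are all made up for illustration, and timings will vary by machine.

```r
set.seed(1)
n <- 1e4  # hypothetical size; raise it to approach the scale mentioned above

# Synthetic sparse data: exactly two non-NA values per row
m <- matrix(NA_integer_, nrow = n, ncol = 5,
            dimnames = list(NULL, letters[1:5]))
for (i in seq_len(n)) {
  m[i, sample(5, 2)] <- sample(10L, 2, replace = TRUE)
}
big <- data.frame(id = seq_len(n), m)

# The apply-based approach from the first answer
f_apply <- function(df) {
  t(apply(df[-1], 1, function(x) x[!is.na(x)]))
}

# The transpose-and-index approach from this answer
f_matrix <- function(df) {
  temp <- t(df[-1])
  matrix(temp[!is.na(temp)], ncol = 2, byrow = TRUE)
}

# Both approaches should agree before their speed is compared
stopifnot(all(f_apply(big) == f_matrix(big)))

system.time(f_apply(big))
system.time(f_matrix(big))
```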

R: apply-like function for data frame rows when the columns vary
