Question

I have a dataset of the following form:

id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA

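That is, each row is a unique customer, and each column describes a characteristic of that customer over the past 10 years (each month has its own column). I want to condense this 120-column data frame into a 10-column data frame, because I know that almost every row has either 1 or 0 observations per year (although the month itself can vary).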

I have already done this for one year, using a loop with a nested if clause:

for(i in 1:nrow(input_data)) {
    temp_row <- input_data[i,c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
    loc2011 <- which(!is.na(temp_row))
    if(length(loc2011 ) > 0) {
        temp_row_2011[i,] <- temp_row[loc2011[1]] #pick the first observation if there are several
    } else {
        temp_row_2011[i,] <- NA
    }
}

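Because my dataset is large and I need to run the above loop 10 times (once per year), this takes far too long. I know that using the apply family is much better in R, so I would greatly appreciate help with this task. How could I write the whole thing better (including the different years)?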

Answer 1

Are you after something like this?

    temp_row_2011 <- apply(input_data, 1, function(x){
        temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
        temp_row[!is.na(temp_row)][1]
    })

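If this gives you the right output, and if it runs faster than your loop, that is not necessarily just because it uses apply(), but also because it assigns less and avoids an if {} else {}. You might be able to make it faster still by compiling the anonymous function:

    reduceyear <- function(x){
        temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
        temp_row[!is.na(temp_row)][1]
    }
    # compile, just in case it runs faster:
    reduceyear_c <- compiler::cmpfun(reduceyear)
    # this ought to do the same as the above.
    temp_row_2011 <- apply(input_data, 1, reduceyear_c)

You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than a data.frame (though this is only valid if input_data holds data of a single class).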
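To illustrate the point about classes, here is a minimal sketch using a hypothetical two-row toy data frame (the names df, first_obs and m are made up for illustration and are not from the question). It relies on the fact that apply() converts a data.frame with as.matrix(), so a character id column turns every value into character:

```r
# Hypothetical toy data: a character id column plus numeric month columns.
df <- data.frame(id = c("id01", "id02"),
                 `2011_01` = c(NA, 188),
                 `2011_02` = c(123, NA),
                 check.names = FALSE,
                 stringsAsFactors = FALSE)

# Same idea as the anonymous function above: first non-NA value, else NA.
first_obs <- function(x) x[!is.na(x)][1]

# apply() coerces the data.frame to a character matrix here, and the
# first non-NA value in each row is the id itself.
mixed <- apply(df, 1, first_obs)

# Dropping the id column (and keeping the ids as row names) gives a
# numeric matrix, so the result stays numeric.
m <- as.matrix(df[, -1])
rownames(m) <- df$id
numeric_result <- apply(m, 1, first_obs)
```

In this sketch mixed is a character vector while numeric_result is a numeric vector named by id, which is why it may be worth separating the id column from the monthly columns before calling apply().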

[Edit: full example, prompted by DWin]

    input_data <- matrix(ncol=24,nrow=10)
    # years and months:
    colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
    # some ids
    rownames(input_data) <- 1:10
    # put in some values:
    input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
    # make an all-NA case:
    input_data[2,1:12] <- NA
    # and here's the full deal:
    sapply(2010:2011, function(x,input_data){
        input_data_yr <- input_data[, grep(x, colnames(input_data) )]
        apply(input_data_yr, 1, function(id){
            id[!is.na(id)][1]
        }
        )
    }, input_data)

This works for the all-NA case. The grep() column-selection idea comes from DWin. As in the example above, you could define the anonymous inner function separately and compile it, which may make things run faster.

Answer 2

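I constructed a small test case (on which timriffe's suggestion failed). You might attract more interest by writing code that creates a more complete test case, e.g. 4 quarters for 2 years, including pathological cases such as a row that is all NA within one year. Rather than writing out all the year columns by name, I think you should cycle through them with a grep() strategy: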

  # funyear <- function to work on one year's data and return a single vector
  # my efforts keep failing on the all(NA) row by year combos
  sapply(2011:2001, function(pat) funyear(input_data[grep(pat, names(input_data))]))
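The answer leaves funyear undefined. The following is one possible sketch, not DWin's implementation: funyear here is a hypothetical helper that takes the first non-NA value per row, and the toy data frame (2 years of 4 quarters, with row 2 all NA in 2010) is made up for illustration. Anchoring the pattern as "^2010_" is also an assumption, added so that a year cannot match part of another column name.

```r
# Hypothetical helper (an assumption, not from the original answer):
# reduce one year's columns to the first non-NA value per row, or NA.
funyear <- function(year_cols) {
    apply(as.matrix(year_cols), 1, function(x) x[!is.na(x)][1])
}

# Toy data frame: 2 years x 4 quarters, all NA to start with.
input_data <- as.data.frame(matrix(NA_real_, nrow = 3, ncol = 8))
names(input_data) <- c(paste(2010, 1:4, sep = "_"), paste(2011, 1:4, sep = "_"))
input_data[1, "2010_2"] <- 150
input_data[1, "2011_4"] <- 175
input_data[2, "2011_1"] <- 120   # row 2 has no observation in 2010
input_data[3, "2010_3"] <- 110
input_data[3, "2010_4"] <- 190   # a second value in the same year is ignored

# Cycle through the years, selecting each year's columns with grep().
years <- 2010:2011
result <- sapply(years, function(pat) {
    funyear(input_data[grep(paste0("^", pat, "_"), names(input_data))])
})
colnames(result) <- years
```

Indexing a numeric(0) vector with [1] returns NA, so the all-NA row needs no special case in this sketch. Selecting columns with single brackets and no comma keeps a data frame even when only one column matches, which as.matrix() then handles.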

R: Rewriting a loop using apply
