sapply() returns list instead of a vector

Date: 2017-08-06 07:30:34

Tags: r vector

I have the following data frame -

   name amarks bmarks cmarks
1   A    25      30     40   
2   B    45      78     50 
3   C    75      72     29 
4   D    18      16     70  
.   .    .       .      .

Here name is the name of the person, and amarks, bmarks and cmarks are the marks that person scored in different exams. I need to find the names of the people who scored the maximum in amarks, bmarks and cmarks, and I have to store the result as a vector. I solved it in the following way -

    max_name <- sapply(marks[, 2:4], function(x) {subset(marks, x == max(x, na.rm = TRUE), name)})

This gives me the correct answers, but when I check the data type of max_name, I see that it's a list, when I expected sapply to return a vector.

Following are my observations -

class(max_name)

> list

typeof(max_name)

> list

is.vector(max_name)

> TRUE

Can somebody please explain what is happening here? Am I missing something? Do I need to make any changes to my code so that it returns a vector?

1 Answer:

Answer 0 (score: 3)

There are a few problems with your code:

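  1. The subset method for data frames has , drop = FALSE set as the default, which means you will always get a data frame back (unless you explicitly specify , drop = TRUE). Hence you will always get a list, because that is the only structure in R that can hold several data frames together (also note the "Warning" section in the ?subset documentation about when, and whether, you should use it at all).
  2. x == max(... can return an unknown number of rows, because several values in each column may equal the maximum. So most of the time you will get vectors of different lengths and, again, only a list can hold vectors of different sizes. If you only want a single result per column, you can use which.max, for instance, which also ignores NAs automatically.
  3. Finally, it is not very clear what you want in return. If several rows in a column equal the maximum, do you want all of their names, or only the first one? Either way, several options are shown below.

    Let's add some NAs and some duplicated values that equal the column maxima, so that we can see the differences in the results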

    marks <- read.table(text = "name amarks bmarks cmarks
    1   A    NA      30     40   
    2   B    45      78     50 
    3   C    75      NA     70 
    4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)
    
    marks 
    #   name amarks bmarks cmarks
    # 1    A     NA     30     40
    # 2    B     45     78     50
    # 3    C     75     NA     70
    # 4    D     75     16     70
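
The first two points can be checked directly on this data. The following is only a minimal sketch; the names one_col and res are illustrative and do not come from the original code.

```r
marks <- read.table(text = "name amarks bmarks cmarks
1   A    NA      30     40
2   B    45      78     50
3   C    75      NA     70
4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)

# subset() on a data frame keeps the data frame structure (drop = FALSE)
one_col <- subset(marks, amarks == max(amarks, na.rm = TRUE), name)
class(one_col)    # "data.frame"

# The number of rows matching the maximum differs between columns,
# so sapply() cannot simplify the result and falls back to a list
res <- sapply(marks[, 2:4], function(x) marks$name[which(x == max(x, na.rm = TRUE))])
lengths(res)      # amarks: 2, bmarks: 1, cmarks: 2

# A list still counts as a vector for is.vector(); is.atomic() is the
# check that tells a list apart from a plain atomic vector
is.vector(res)    # TRUE
is.atomic(res)    # FALSE
```

This is also why is.vector(max_name) does not help with the question: it is TRUE for a list as well.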
    

    Basically, if you want all the names, we can just add unlist to your code

    unlist(sapply(marks[, 2:4], function(x) {subset(marks, x == max(x, na.rm = TRUE), name)}))
    # amarks.name1 amarks.name2  bmarks.name cmarks.name1 cmarks.name2 
    #         "C"          "D"          "B"          "C"          "D" 
    

    An alternative way to achieve the same result without using subset

    marks$name[unlist(sapply(marks[, 2:4], function(x) which(x == max(x, na.rm = TRUE))))]
    ## [1] "C" "D" "B" "C" "D"
    

    Or even (a vectorization/overcomplication trade-off)

    marks$name[which(sapply(marks[, 2:4], 
                            function(x) x == max(x, na.rm = TRUE)), arr.ind = TRUE)[, "row"]]
    ## [1] "C" "D" "B" "C" "D"
    

    Or a fully vectorized solution (in exchange for using an external package, a matrix conversion, and general overcomplication)

    marks$name[which(marks[, 2:4] == matrixStats::colMaxs(as.matrix(marks[, 2:4]), 
                                                          na.rm = TRUE)[col(marks[, 2:4])], 
                     arr.ind = TRUE)[, "row"]]
    
    ## [1] "C" "D" "B" "C" "D"
    

    However, if you only want the first maximum per column, we can simplify this to just (this also handles NAs)

    marks$name[sapply(marks[, 2:4], which.max)]
    # [1] "C" "B" "C"
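
If the goal is a result that is guaranteed to be a character vector, one possible variation (not part of the original answer) is to use vapply instead of sapply, since vapply checks the type and length of every element. A minimal sketch, where first_max is an illustrative name:

```r
marks <- read.table(text = "name amarks bmarks cmarks
1   A    NA      30     40
2   B    45      78     50
3   C    75      NA     70
4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)

# vapply() requires each result to be exactly one character value,
# so the return value is always an atomic character vector
first_max <- vapply(marks[, 2:4],
                    function(x) marks$name[which.max(x)],
                    character(1))
first_max
# amarks bmarks cmarks
#    "C"    "B"    "C"

is.character(first_max)   # TRUE
```

The trade-off is strictness: if a column were entirely NA, which.max would return integer(0) and vapply would raise an error instead of quietly returning a list.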