Question

I have the following data frame -

   name amarks bmarks cmarks
1   A    25      30     40   
2   B    45      78     50 
3   C    75      72     29 
4   D    18      16     70  
.   .    .       .      .

Here name is the name of the person, and amarks, bmarks and cmarks are the marks that person scored in different exams. I need to find the names of the people who scored the maximum in amarks, bmarks and cmarks, and store the result as a vector. I have solved it in the following way -

max_name <- sapply(marks[, 2:4], function(x) {subset(marks, x == max(x, na.rm = T), name)})

This gives me the correct answers, but when I check the data type of max_name, I see that it's a list, whereas I expected sapply to return a vector.

Following are my observations -

class(max_name)

> list

typeof(max_name)

> list

is.vector(max_name)

> TRUE

Can somebody please explain what is happening here? Am I missing something? Do I need to change my code so that it returns a vector?

Answer 1

There are several issues with your code:

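The subset method has drop = FALSE as its default, which means you will always get a data frame back (unless you explicitly specify drop = TRUE). Hence you will always end up with a list, because a list is the only structure in R that can hold several data frames together. (Also note the "Warning" section of the ?subset documentation about when, and whether, you should use it.)
x == max(...) can match an unknown number of rows, because several values in a column may equal that column's maximum. So in most cases you will get vectors of different lengths, and again only a list can hold vectors of different sizes. If you want just one result per column, you can use which.max instead, which also ignores NAs automatically.
Finally, it is not very clear what you want returned. If several rows in a column equal the maximum, do you want all of their names, or only the first one? Either way, below are several options.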
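The first point can be checked directly. The following is a minimal sketch that rebuilds the four rows shown in the question (the data.frame() call and the names one_col and one_vec are only for illustration) and inspects what subset() and sapply() hand back:

```r
# Toy data copied from the question
marks <- data.frame(
  name   = c("A", "B", "C", "D"),
  amarks = c(25, 45, 75, 18),
  bmarks = c(30, 78, 72, 16),
  cmarks = c(40, 50, 29, 70),
  stringsAsFactors = FALSE
)

# subset() keeps the data frame structure by default (drop = FALSE)
one_col <- subset(marks, amarks == max(amarks, na.rm = TRUE), name)
class(one_col)        # "data.frame"

# With drop = TRUE the single selected column collapses to a plain vector
one_vec <- subset(marks, amarks == max(amarks, na.rm = TRUE), name, drop = TRUE)
class(one_vec)        # "character"

# The call from the question: every element comes from a data frame,
# so sapply() cannot simplify the result to an atomic vector
max_name <- sapply(marks[, 2:4], function(x) subset(marks, x == max(x, na.rm = TRUE), name))
class(max_name)       # "list"
is.vector(max_name)   # TRUE, because a list is itself a (generic) vector in R
is.atomic(max_name)   # FALSE
```

This also explains the observations in the question: is.vector() is TRUE for a list, so it cannot be used to tell a list apart from an atomic vector; is.atomic() or is.character() is the more telling check.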

Let's add some NAs and some duplicated values that equal the column maxima, so that we can see the differences between the results

marks <- read.table(text = "name amarks bmarks cmarks
1   A    NA      30     40   
2   B    45      78     50 
3   C    75      NA     70 
4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)

marks 
#   name amarks bmarks cmarks
# 1    A     NA     30     40
# 2    B     45     78     50
# 3    C     75     NA     70
# 4    D     75     16     70

Basically, if you want all the names, we can just add unlist to your code

unlist(sapply(marks[, 2:4], function(x) {subset(marks, x == max(x, na.rm = TRUE), name)}))
# amarks.name1 amarks.name2  bmarks.name cmarks.name1 cmarks.name2 
#         "C"          "D"          "B"          "C"          "D"

Or, without using subset

marks$name[unlist(sapply(marks[, 2:4], function(x) which(x == max(x, na.rm = TRUE))))]
## [1] "C" "D" "B" "C" "D"
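To see why unlist() is needed in the line above, it can help to look at the intermediate result on its own. This is a small sketch using the same example data; idx is just an illustrative name:

```r
marks <- read.table(text = "name amarks bmarks cmarks
1   A    NA      30     40
2   B    45      78     50
3   C    75      NA     70
4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)

# Row indices of every value equal to the column maximum
idx <- sapply(marks[, 2:4], function(x) which(x == max(x, na.rm = TRUE)))

class(idx)     # "list": the per-column results differ in length
lengths(idx)
# amarks bmarks cmarks
#      2      1      2

# Flatten the indices, then use them to look up the names
unlist(idx)
# amarks1 amarks2  bmarks cmarks1 cmarks2
#       3       4       2       3       4
marks$name[unlist(idx)]
```

Because amarks and cmarks each have two tied maxima while bmarks has one, sapply() has no common length to simplify to and falls back to a list.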

Or even (a vectorization vs. over-complication trade-off)

marks$name[which(sapply(marks[, 2:4], 
                        function(x) x == max(x, na.rm = TRUE)), arr.ind = TRUE)[, "row"]]
## [1] "C" "D" "B" "C" "D"

Or a fully vectorized solution (at the cost of an external package, a matrix conversion and generally being over-complicated)

marks$name[which(marks[, 2:4] == matrixStats::colMaxs(as.matrix(marks[, 2:4]), 
                                                      na.rm = TRUE)[col(marks[, 2:4])], 
                 arr.ind = TRUE)[, "row"]]

## [1] "C" "D" "B" "C" "D"

However, if you only want the first maximum per column, this simplifies to just the following (which also handles NAs)

marks$name[sapply(marks[, 2:4], which.max)]
# [1] "C" "B" "C"
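If the goal is a character vector that also records which column each name belongs to, one possible variant is to do the name lookup inside the function passed to sapply(). This is a sketch rather than part of the original answer; the vapply() line is an optional, stricter alternative that declares the expected return type up front.

```r
marks <- read.table(text = "name amarks bmarks cmarks
1   A    NA      30     40
2   B    45      78     50
3   C    75      NA     70
4   D    75      16     70", header = TRUE, stringsAsFactors = FALSE)

# which.max() yields exactly one index per column (first maximum, NAs skipped),
# so every result has length one and sapply() can simplify to a vector
first_max <- sapply(marks[, 2:4], function(x) marks$name[which.max(x)])
first_max
# amarks bmarks cmarks
#    "C"    "B"    "C"
is.character(first_max)   # TRUE

# Stricter alternative: vapply() raises an error if any result
# is not a single character string
first_max_strict <- vapply(marks[, 2:4], function(x) marks$name[which.max(x)], character(1))
identical(first_max, first_max_strict)   # TRUE
```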

sapply() returns list instead of a vector
