I have the following data frame -
name amarks bmarks cmarks
1 A 25 30 40
2 B 45 78 50
3 C 75 72 29
4 D 18 16 70
. . . . .
Here name is the name of the person, and amarks, bmarks and cmarks are the marks that person scored in different exams. I have been asked to find the names of the people who scored the maximum in amarks, bmarks and cmarks, and to store the result as a vector. I solved it in the following way -
max_name <- sapply(marks[, 2:4], function(x) {
  subset(marks, x == max(x, na.rm = TRUE), name)
})
This gives me the correct answers, but when I check the data type of max_name I see that it is a list, whereas I expected sapply to return a vector.
Following are my observations -
class(max_name)
# [1] "list"
typeof(max_name)
# [1] "list"
is.vector(max_name)
# [1] TRUE
Can somebody please explain what is happening here? Am I missing something? Do I need to change my code so that it returns a vector?
Answer 0 (score: 3)
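There are a few problems with your code:
1. subset has drop = FALSE as its default, which means you will always get a data frame back (unless you explicitly specify drop = TRUE). As a result you will always end up with a list, because that is the only structure in R that can hold several data frames together. (Also note the "Warning" section in the ?subset documentation on when, and whether, you should use it.)
2. The condition x == max(...) can return an unknown number of rows, because several values in a column may equal the maximum. So most of the time you will get vectors of different lengths and, again, only a list can hold vectors of different sizes. If you want just one result per column you could use which.max, for instance, which also ignores NAs automatically.
Let's add some NAs and some duplicated rows that equal the column maxima, so that we can see the difference in the results: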
marks <- read.table(text = "name amarks bmarks cmarks
1 A NA 30 40
2 B 45 78 50
3 C 75 NA 70
4 D 75 16 70", header = TRUE, stringsAsFactors = FALSE)
marks
# name amarks bmarks cmarks
# 1 A NA 30 40
# 2 B 45 78 50
# 3 C 75 NA 70
# 4 D 75 16 70
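As a quick check of the two points above, here is a minimal sketch on the same example data (rebuilt with data.frame() so that it runs on its own). The name per_column is only illustrative, and lapply is used instead of sapply so that no simplification gets in the way of inspecting each result.

```r
# Same example data as above, rebuilt so this sketch is self-contained.
marks <- data.frame(
  name   = c("A", "B", "C", "D"),
  amarks = c(NA, 45, 75, 75),
  bmarks = c(30, 78, NA, 16),
  cmarks = c(40, 50, 70, 70),
  stringsAsFactors = FALSE
)

# per_column is an illustrative name: one subset() result per marks column.
per_column <- lapply(marks[, 2:4], function(x) {
  subset(marks, x == max(x, na.rm = TRUE), name)
})

# Point 1: every piece is still a data frame (drop = FALSE by default).
sapply(per_column, class)
#       amarks       bmarks       cmarks
# "data.frame" "data.frame" "data.frame"

# Point 2: the number of matching rows differs between columns.
sapply(per_column, nrow)
# amarks bmarks cmarks
#      2      1      2
```

Because the pieces are data frames with different numbers of rows, there is no single atomic vector or matrix they could be collapsed into, which is consistent with a list coming back.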
Basically, if you want all of the names, we only need to add unlist to your code
unlist(sapply(marks[, 2:4], function(x) {subset(marks, x == max(x, na.rm = TRUE), name)}))
# amarks.name1 amarks.name2 bmarks.name cmarks.name1 cmarks.name2
# "C" "D" "B" "C" "D"
Or, without using subset
marks$name[unlist(sapply(marks[, 2:4], function(x) which(x == max(x, na.rm = TRUE))))]
## [1] "C" "D" "B" "C" "D"
Or even (a vectorization vs. over-complication trade-off)
marks$name[which(sapply(marks[, 2:4],
function(x) x == max(x, na.rm = TRUE)), arr.ind = TRUE)[, "row"]]
## [1] "C" "D" "B" "C" "D"
Or a fully vectorized solution (at the cost of an external package, a matrix conversion, and general over-complication)
marks$name[which(marks[, 2:4] == matrixStats::colMaxs(as.matrix(marks[, 2:4]),
na.rm = TRUE)[col(marks[, 2:4])],
arr.ind = TRUE)[, "row"]]
## [1] "C" "D" "B" "C" "D"
However, if you only want the first maximum per column, this simplifies to just the following (which also handles NAs)
marks$name[sapply(marks[, 2:4], which.max)]
# [1] "C" "B" "C"
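If the goal is a result that is guaranteed to be a plain character vector, one possible variation (not part of the original answer) is to swap sapply for vapply and declare the expected return type. The sketch below assumes the same example data; first_max is an illustrative name.

```r
# Same example data as above, rebuilt so this sketch is self-contained.
marks <- data.frame(
  name   = c("A", "B", "C", "D"),
  amarks = c(NA, 45, 75, 75),
  bmarks = c(30, 78, NA, 16),
  cmarks = c(40, 50, 70, 70),
  stringsAsFactors = FALSE
)

# vapply() checks that each call returns exactly one character value,
# so the result cannot silently turn into a list.
first_max <- vapply(
  marks[, 2:4],
  function(x) marks$name[which.max(x)],
  character(1)
)

first_max
# amarks bmarks cmarks
#    "C"    "B"    "C"

is.list(first_max)
# [1] FALSE
```

One caveat with this approach: for a column that is entirely NA, which.max returns a zero-length result, so vapply would stop with an error instead of returning a shorter vector. Whether that is desirable depends on how such columns should be treated.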