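For each row of a data frame, I want to find the second most frequently occurring value and the least frequently occurring value. How can I do this?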
Df:
label v1 v2 v3 v4 v5 v6
5 3 3 3 6 6 8
5 7 1 1 1 7 0
5 3 5 6 6 6 5
I want to consider all columns except "label".
Expected output:
second largest occurring   least occurring
6                          8
7                          0
5                          3
Edit: after accepting an answer, I updated the example to reduce confusion.
Answer 0 (score: 4)
A dplyr solution:
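library(dplyr)
library(tidyr)   # gather()
library(tibble)  # rowid_to_column()

df %>%
  rowid_to_column() %>%                  # remember which row each cell came from
  gather(var, val, -label, -rowid) %>%   # long format: one line per cell
  group_by(rowid, val) %>%
  tally() %>%                            # n = how often each value occurs in its row
  # dense_rank() ranks ascending, so rank the counts in descending order
  summarise(second_largest_occuring = val[dense_rank(desc(n)) == 2],
            least_occuring = val[n == min(n)]) %>%
  ungroup() %>%
  select(-rowid)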
# A tibble: 3 x 2
  second_largest_occuring least_occuring
                    <int>          <int>
1                       2              1
2                       2              0
3                       5              3
Data:
df <- read.table(text = "label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5", header= TRUE)
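As a cross-check against the question's updated example, the same frequency logic can be sketched in base R. This is only an illustrative sketch: `df_q`, `vals`, and `values_by_frequency` are hypothetical names that do not appear in the answer, and the code assumes no two values in a row share the same frequency.

```r
# The question's updated example data
df_q <- read.table(text = "label v1 v2 v3 v4 v5 v6
5 3 3 3 6 6 8
5 7 1 1 1 7 0
5 3 5 6 6 6 5", header = TRUE)

# Hypothetical helper: distinct values of x, most frequent first.
# Assumes there are no ties in frequency.
values_by_frequency <- function(x) {
  as.numeric(names(sort(table(x), decreasing = TRUE)))
}

vals <- df_q[, names(df_q) != "label"]  # every column except "label"

second <- apply(vals, 1, function(x) values_by_frequency(x)[2])
least  <- apply(vals, 1, function(x) tail(values_by_frequency(x), 1))

result <- data.frame(second_largest_occuring = as.numeric(second),
                     least_occuring = as.numeric(least))
print(result)
# second_largest_occuring: 6 7 5
# least_occuring:          8 0 3
```

The values match the expected output given in the question.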
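Answer 1 (score: 1)
Another dplyr solution that is somewhat more readable, handles NAs, and avoids errors when the second-largest value occurs more than once. It also lets you select multiple columns using dplyr syntax.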
library(dplyr)
dat = read.table(text = 'label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5', header = TRUE)
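second_largest <- function(x, na.rm = TRUE) {
  if (na.rm) { x <- na.omit(x) }                 # omit NA values
  # dense_rank() ranks ascending, so rank on desc(x) to count down from the largest value
  second_largest <- x[dense_rank(desc(x)) == 2]  # all copies of the second-largest distinct value
  if (length(second_largest) == 0) return(NA)    # fewer than two distinct values
  second_largest <- max(second_largest)          # keep one value out of the identical copies
  return(second_largest)
}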
df <- dat %>%
  mutate(
    second_largest = select(., v1:v6) %>% apply(1, second_largest, na.rm = TRUE), # apply second_largest func to every row
    min = select(., v1:v6) %>% apply(1, min, na.rm = TRUE)                        # apply min to every row
  )
#   label v1 v2 v3 v4 v5 v6 second_largest min
# 1     5  3  3  3  2  2  1              2   1
# 2     5  2  1  1  1  2  0              1   0
# 3     5  3  5  6  6  6  5              5   3
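To illustrate the NA and repeated-value cases mentioned above, here is a minimal base R sketch of a "second-largest distinct value" helper. The name `second_largest_distinct` is hypothetical and is not the function defined in this answer. It relies on `sort()` dropping NA values by default and on out-of-range indexing returning NA.

```r
# Hypothetical helper: second-largest distinct value of x.
# sort() drops NA by default; [2] yields NA when fewer than
# two distinct values remain.
second_largest_distinct <- function(x) {
  sort(unique(x), decreasing = TRUE)[2]
}

a <- second_largest_distinct(c(3, 3, 3, 2, 2, 1))   # 2
b <- second_largest_distinct(c(6, NA, 6, 5, 5, 3))  # 5 (NA ignored, repeated 5s collapse)
c_ <- second_largest_distinct(c(4, 4, 4))           # NA (only one distinct value)
print(c(a, b, c_))
```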
A few notes.
The 1 in the apply call means the function is applied to each row.
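A small sketch of the difference between the two margins, using a made-up matrix `m` built from the first two rows of the sample data:

```r
# Two rows, six columns (values taken from the first two sample rows)
m <- matrix(c(3, 3, 3, 2, 2, 1,
              2, 1, 1, 1, 2, 0), nrow = 2, byrow = TRUE)

row_max <- apply(m, 1, max)  # MARGIN = 1: one result per row
col_max <- apply(m, 2, max)  # MARGIN = 2: one result per column

print(row_max)  # 3 2
print(col_max)  # 3 3 3 2 2 1
```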
Update
If you want the second most frequent number instead, just plug in a new function.
second_most_frequent <- function(x, is_numeric = TRUE) {
  out <- x %>%
    table() %>%                                  # Create a table of frequencies as characters
    as.data.frame(stringsAsFactors = FALSE) %>%
    arrange(desc(Freq)) %>%                      # Arrange with frequency descending
    .[, 1] %>%                                   # Select the first column
    .[2]                                         # Select the second most frequent (WARNING: doesn't check for ties)
  if (is_numeric) { out <- as.numeric(out) }
  return(out)
}

df <- df %>%
  mutate(
    second_most_freq = select(., v1:v6) %>% apply(1, second_most_frequent, is_numeric = TRUE)
  )
#   label v1 v2 v3 v4 v5 v6 second_largest min second_most_freq
# 1     5  3  3  3  2  2  1              2   1                2
# 2     5  2  1  1  1  2  0              1   0                2
# 3     5  3  5  6  6  6  5              5   3                5
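Because of the tie warning in the function above, one possible variant returns every value that shares the second-highest count. This is a sketch under one assumption about what a tie should mean: counts are ranked by distinct value, so two values tied for the top count both rank first. `second_most_frequent_all` is a hypothetical name, and since it can return more than one value per row it would need a list column rather than a plain `apply()` result.

```r
# Hypothetical tie-aware variant: all values whose count equals the
# second-highest distinct count. Returns numeric(0) if every value
# occurs equally often.
second_most_frequent_all <- function(x) {
  freq <- table(x)
  counts <- as.vector(freq)
  distinct_counts <- sort(unique(counts), decreasing = TRUE)
  if (length(distinct_counts) < 2) return(numeric(0))
  as.numeric(names(freq)[counts == distinct_counts[2]])
}

no_tie  <- second_most_frequent_all(c(3, 3, 3, 2, 2, 1))     # 2
tied    <- second_most_frequent_all(c(6, 6, 6, 5, 5, 3, 3))  # 3 5
no_rank <- second_most_frequent_all(c(1, 1, 1))              # numeric(0)
print(list(no_tie, tied, no_rank))
```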