Question

I want to compute the most frequent value of a categorical variable. I tried the mlv function from the modeest package, but got NA.

user <- c("A","B","A","A","B","A","B","B")
color <- c("blue","green","blue","blue","green","yellow","pink","blue")
df <- data.frame(user,color)
df$color <- as.factor(df$color)

library(plyr)
library(dplyr)
library(modeest)

summary <- ddply(df,.(user),summarise,mode=mlv(color,method="mlv")[['M']])

Warning messages:
1: In discrete(x, ...) : NAs introduced by coercion
2: In discrete(x, ...) : NAs introduced by coercion

summary
   user mode
1    A   NA
2    B   NA

However, what I need is this:

user  mode
A     blue
B     green

What am I doing wrong? I also tried other methods, as well as mlv(x=color). According to the {{3}} help page, it should work for factors.

I would rather not use table(), because I need a simple function for building a summary table, as in this question: document, but for a categorical column.

Answer 1

You should try table. For example, which.max(table(color)).
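Note that which.max(table(color)) returns a named integer (the position of the winning level), so wrapping it in names() is one way to get the value itself. A minimal sketch using the color vector from the question; the variable names counts, top and most_common are only illustrative:

```r
color <- c("blue","green","blue","blue","green","yellow","pink","blue")

counts <- table(color)       # frequency of each distinct value
top <- which.max(counts)     # named integer: position of the first maximum
most_common <- names(top)    # the value itself, as a character string
most_common
```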

Answer 2

The reason modeest::mlv.factor() does not work may actually be a bug in the package.

mlv.factor() calls the function modeest:::discrete(). Inside it, this happens:

f <- factor(color)
[1] blue   green  blue   blue   green  yellow pink   blue  
Levels: blue green pink yellow

tf <- tabulate(f)
[1] 4 2 1 1

as.numeric(levels(f)[tf == max(tf)])
[1] NA
Warning message:
NAs introduced by coercion

This is what mlv.factor() returns. But levels(f)[tf == max(tf)] evaluates to [1] "blue", which as.numeric() cannot convert to a number, hence the NA.
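To make the failure concrete, here is a small sketch that repeats the same steps on two toy factors. It is based only on the snippet above, not on the package source, and the names f_num, f_chr and so on are made up for illustration. Levels that look like numbers survive as.numeric(); text levels only survive if the conversion is dropped:

```r
# Levels that look like numbers can be converted back
f_num <- factor(c(3, 3, 5))
tf_num <- tabulate(f_num)
num_mode <- as.numeric(levels(f_num)[tf_num == max(tf_num)])

# Text levels cannot; without as.numeric() the label is kept
f_chr <- factor(c("blue", "green", "blue"))
tf_chr <- tabulate(f_chr)
chr_mode <- levels(f_chr)[tf_chr == max(tf_chr)]

num_mode
chr_mode
```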

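You can compute the mode by finding the unique values and counting how many times each one occurs in the vector. You can then subset the unique values by the one that occurs most often (i.e. the mode).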

Find the unique colors:

unique_colors <- unique(color)

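match(color, unique_colors) returns, for each element of color, the position of its first match in unique_colors. tabulate() then counts how many times each position (that is, each color) occurs. which.max() returns the index of the highest count. That index can then be used to subset the unique colors.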

unique_colors[which.max(tabulate(match(color, unique_colors)))]
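The one-liner is easier to follow with the intermediate results spelled out. This sketch uses the plain character vector from the question (with a factor the result would also carry the levels); the names positions, counts and winner are only illustrative:

```r
color <- c("blue","green","blue","blue","green","yellow","pink","blue")

unique_colors <- unique(color)             # "blue" "green" "yellow" "pink"
positions <- match(color, unique_colors)   # 1 2 1 1 2 3 4 1
counts <- tabulate(positions)              # 4 2 1 1
winner <- which.max(counts)                # 1
unique_colors[winner]                      # "blue"
```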

Using dplyr's pipe may be more readable:

library(dplyr)
unique(color)[color %>%
                match(unique(color)) %>% 
                tabulate() %>%
                which.max()]

Both options return:

[1] blue
Levels: blue green pink yellow

Edit:

The best approach is probably to write your own mode function:

calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

Then use it in dplyr::summarise():

library(dplyr)

df %>% 
  group_by(user) %>% 
  summarise(color = calculate_mode(color))

This returns:

# A tibble: 2 x 2
    user  color
  <fctr> <fctr>
1      A   blue
2      B  green
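Two edge cases are worth knowing about. which.max() returns the first maximum, so on a tie the value that appears first in x wins. And match() treats NA as a matchable value, so NA itself can come out as the mode. The sketch below adds an na.rm argument to handle the second case; that argument is my own hypothetical extension, not part of the function above:

```r
calculate_mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # hypothetical extension: drop NAs first
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

calculate_mode(c("a", "b", "b", "a"))             # tie: "a", because it appears first
calculate_mode(c(NA, NA, "blue"))                 # NA is the most frequent value
calculate_mode(c(NA, NA, "blue"), na.rm = TRUE)   # "blue"
```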

Answer 3

A solution with dplyr and purrr

You can use a more general version of @loudelouk's correct answer, like this:

df %>% 
  group_by(user) %>% 
  select_if(is.factor) %>% 
  summarise_all(function(x) { x %>% table %>% which.max %>% names })

Or shorter:

df %>% 
  group_by(user) %>% 
  summarise_if(is.factor, .funs = function(x) { x %>% table %>% which.max %>% names})
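In recent dplyr versions the scoped verbs (summarise_all, summarise_if) are superseded by across(). Assuming dplyr 1.0 or later, the same idea might be written as below; note that the result column is character rather than factor, because names() returns a string:

```r
library(dplyr)

user <- c("A","B","A","A","B","A","B","B")
color <- c("blue","green","blue","blue","green","yellow","pink","blue")
df <- data.frame(user, color)
df$color <- as.factor(df$color)

# Assumes dplyr >= 1.0; across() skips the grouping column automatically
result <- df %>%
  group_by(user) %>%
  summarise(across(where(is.factor), ~ names(which.max(table(.x)))))
result
```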

Statistical mode of a categorical variable in R (using mlv)
