Grouped medians from a data frame using dplyr

Time: 2015-05-31 06:40:41

Tags: r dataframe dplyr median summary

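Computing medians seems to be a bit of an Achilles heel for R (i.e. there is no data.frame method). What is the least amount of typing needed to get group medians from a data frame using dplyr?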

my_data <- structure(list(group = c("Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2"), value = c("5", "3", "6", "8", "10", "13", 
"1", "4", "18", "4", "7", "9", "14", "15", "17", "7", "3", "9", 
"10", "33", "15", "18", "6", "20", "30", NA, NA, NA, NA, NA)), .Names = c("group", 
"value"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-30L))

library(dplyr)  

# groups 1 & 2
my_data_groups_1_and_2 <- my_data[my_data$group %in% c("Group 1", "Group 2"), ]

# compute medians per group
medians <- my_data_groups_1_and_2 %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE)) 

This gives:

Error in summarise_impl(.data, dots) : 
  STRING_ELT() can only be applied to a 'character vector', not a 'double'
In addition: Warning message:
In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA

What is the least-effort fix here?

1 answer:

Answer 0 (score: 1)

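As ivyleavedtoadflax commented, the error occurs because median is given an argument that is neither numeric nor logical: the value column is of type character (the quotes around the values are an easy sign that they are not numeric). Here are two simple ways to solve the problem: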

my_data %>% 
  filter(group %in% c("Group 1", "Group 2")) %>%
  group_by(group) %>%
  summarize(the_medians = median(as.numeric(value), na.rm = TRUE)) 

my_data %>% 
  filter(group %in% c("Group 1", "Group 2")) %>%
  mutate(value = as.numeric(value))  %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE)) 
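
To see why the conversion matters, here is a minimal base R sketch. The vectors `odd` and `even` are made up for illustration and are not part of the question's data. It assumes the behaviour of `median.default` in current R: for an odd number of values it returns the middle element as-is, and for an even number it calls `mean()` on the two middle elements. In the question, Group 1 has 15 values and Group 2 has 10 non-missing values, so the two groups plausibly return results of different types, which would match the character/double complaint in the error message.

```r
# Made-up character vectors; `odd` and `even` are illustrative names only
odd  <- c("5", "3", "6")        # odd number of values
even <- c("5", "3", "6", "8")   # even number of values

# Odd length: the middle element is returned without averaging,
# so the result keeps the type of the input
class(median(odd))

# Even length: the two middle elements are passed to mean(), which
# warns on character input (warning suppressed here to keep output clean)
class(suppressWarnings(median(even)))

# Converting first gives ordinary numeric medians in both cases
m_odd  <- median(as.numeric(odd))   # 5
m_even <- median(as.numeric(even))  # 5.5
m_odd
m_even
```

Either fix above avoids the problem in the same way: `median` only ever sees numeric input.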

To inspect the structure of the data, including the type of each column, you can conveniently use

str(my_data)
#Classes ‘tbl_df’ and 'data.frame': 30 obs. of  2 variables:
# $ group: chr  "Group 1" "Group 1" "Group 1" "Group 1" ...
# $ value: chr  "5" "3" "6" "8" ...
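
If you convert with `as.numeric`, it may be worth checking that nothing was lost along the way, since strings that cannot be parsed become NA (with a warning) and `na.rm = TRUE` would then silently drop them. A small sketch in base R, using a hypothetical vector `value` that only mimics the column in the question:

```r
# Hypothetical character column with one genuine missing value
value <- c("5", "3", "6", NA, "8")

# Quick type check for a single column (complements str())
class(value)

# Convert, then count entries that were present before the conversion
# but are NA afterwards; a non-zero count means some strings did not parse
converted <- as.numeric(value)
n_failed <- sum(is.na(converted) & !is.na(value))
n_failed

# Median of the converted values, ignoring the genuine NA
med <- median(converted, na.rm = TRUE)  # 5.5
med
```

Here `n_failed` is 0 because every non-missing string is a valid number; the same check can be applied to `my_data$value` before summarizing.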