Calculating category means using filtered data

Asked: 2017-06-09 05:05:38

Tags: r

I have the following dataset:

A -1
A 10
B  8
D -1
A  0
A  4
B  2
C  6

I would like to add a column like this:

A -1 4,6
A 10 4,6
B  8 5,0
D -1 5,0
A  0 4,6
A  4 4,6
B  2 5,0
C  6 6,0

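What is happening here?

I have calculated the mean of the value for each category letter, ignoring negative values, and used it as the value of the new column.

If a category has only negative values, it is given the overall mean instead (again ignoring negative values).

In SQL this would be done with a filtered GROUP BY followed by a join. In Excel it would be a conditional VLOOKUP. How do I do it in R?

Edit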

# Create dataset
category <- c("A","A","B","D","A","A","B","C")
value <- c(-1,10,8,-1,0,4,2,6)
dataset <- data.frame(category, value)

# Calculated means

fdata <- dataset[dataset[,'value']>-1,]
aggregate(fdata[,2], list(fdata$category), mean)

3 answers:

Answer 0: (score: 4)

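We can use ave from base R, grouping by category. For each group we check whether all of its value entries are less than 0. If they are, we take the mean of the non-negative values in the entire dataset; if not, we take the mean of the non-negative values in that group only.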
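The code that accompanied this answer did not survive in this copy. The following is a minimal sketch of the approach described above, not the answerer's original code; the column name group_mean and the helper variable overall_mean are made up for illustration.

```r
# Recreate the dataset from the question
category <- c("A", "A", "B", "D", "A", "A", "B", "C")
value <- c(-1, 10, 8, -1, 0, 4, 2, 6)
dataset <- data.frame(category, value)

# Fallback: overall mean of the non-negative values (hypothetical helper name)
overall_mean <- mean(dataset$value[dataset$value >= 0])

# ave() applies FUN to the values of each category and returns a vector
# aligned with the original rows; group_mean is a hypothetical column name
dataset$group_mean <- ave(dataset$value, dataset$category, FUN = function(x) {
  if (all(x < 0)) overall_mean else mean(x[x >= 0])
})

dataset
```

Because ave() returns a vector in the original row order, the result can be assigned straight to a new column without a separate join step.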

Answer 1: (score: 3)

Using dplyr:

library(dplyr)

dataset %>% 
  mutate(
    # treat negative values as missing so that mean() can skip them
    x = ifelse(value < 0, NA_real_, value),
    meanAll = mean(x, na.rm = TRUE)) %>% 
  group_by(category) %>% 
  mutate(meanGroup = mean(x, na.rm = TRUE),
         meanGroup = ifelse(is.nan(meanGroup), meanAll, meanGroup))

# Source: local data frame [8 x 5]
# Groups: category [4]
# 
# # A tibble: 8 x 5
#   category value     x meanAll meanGroup
#     <fctr> <dbl> <dbl>   <dbl>     <dbl>
# 1        A    -1    NA       5  4.666667
# 2        A    10    10       5  4.666667
# 3        B     8     8       5  5.000000
# 4        D    -1    NA       5  5.000000
# 5        A     0     0       5  4.666667
# 6        A     4     4       5  4.666667
# 7        B     2     2       5  5.000000
# 8        C     6     6       5  6.000000

Answer 2: (score: 2)

The OP wrote that in SQL this would be done with a filtered GROUP BY followed by a join. This approach can be implemented with data.table:

library(data.table)
# filter data and compute group means 
setDT(dataset)[value >= 0, .(grp.mean = mean(value)), category][
  # now join with dataset
  dataset, on = "category"][
    # fill empty group means with overall mean of filtered values
    is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]

which returns

   category grp.mean value
1:        A 4.666667    -1
2:        A 4.666667    10
3:        B 5.000000     8
4:        D 5.000000    -1
5:        A 4.666667     0
6:        A 4.666667     4
7:        B 5.000000     2
8:        C 6.000000     6

Here is a more concise variant that uses assignment by reference and avoids the join (I am not sure which one is faster):

library(data.table)
# assign by reference of computed group means of filtered values
setDT(dataset)[, grp.mean := mean(value[value >=0]), category][
    # fill empty group means with overall mean of filtered values
    is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]