I have the following dataset:
A -1
A 10
B 8
D -1
A 0
A 4
B 2
C 6
I want to add a column like this:
A -1 4,6
A 10 4,6
B 8 5,0
D -1 5,0
A 0 4,6
A 4 4,6
B 2 5,0
C 6 6,0
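What is happening here?
I computed the mean for each category letter, ignoring negative values, and used it as the value of the new column.
If a category has only negative values, it gets the overall mean instead (again ignoring negative values).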
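To make the expected numbers concrete, here is the arithmetic written out by hand in R. It is only a sketch of the calculation for this particular dataset, not a general solution; the variable names mean_a to mean_d are introduced here for illustration.

```r
# Non-negative values per category, copied by hand from the dataset above
mean_a <- mean(c(10, 0, 4))  # category A: 14 / 3, shown truncated as 4,6 in the table
mean_b <- mean(c(8, 2))      # category B: 5
mean_c <- mean(6)            # category C: 6

# Category D has only a negative value, so it falls back to the
# overall mean of all non-negative values: 30 / 6 = 5
mean_d <- mean(c(10, 8, 0, 4, 2, 6))
```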
In SQL this would be done with a filtered group-by followed by a join. In Excel it would be a conditional VLOOKUP. How do I do this in R?
Edit
# Create dataset
category <- c("A","A","B","D","A","A","B","C")
value <- c(-1,10,8,-1,0,4,2,6)
dataset <- data.frame(category, value)
# Compute group means over the non-negative values only
fdata <- dataset[dataset$value >= 0, ]
aggregate(fdata$value, list(fdata$category), mean)
Answer 0 (score: 4)
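We can use ave from base R, grouped by category. For each group we check whether all of its value entries are less than 0: if so, we take the mean of the whole dataset, and if not, we take the mean of that group only. In both cases negative values are ignored, as the question requires.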
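A minimal sketch of the ave approach described above, not necessarily the answer's original code. The column name new_col and the helper overall_mean are assumptions introduced here for illustration.

```r
# Rebuild the example data from the question
category <- c("A", "A", "B", "D", "A", "A", "B", "C")
value <- c(-1, 10, 8, -1, 0, 4, 2, 6)
dataset <- data.frame(category, value)

# Overall mean of the non-negative values, used as the fallback (assumed helper)
overall_mean <- mean(dataset$value[dataset$value >= 0])

# ave() applies FUN to the values of each category and returns a vector
# aligned with the original rows
dataset$new_col <- ave(dataset$value, dataset$category, FUN = function(x) {
  if (all(x < 0)) overall_mean else mean(x[x >= 0])
})

dataset
```

Computing the fallback once outside FUN keeps the per-group function short and avoids recomputing the overall mean for every group.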
Answer 1 (score: 3)
Using dplyr:
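library(dplyr)
dataset %>%
  mutate(
    # treat negative values as missing so they drop out of the means
    x = ifelse(value < 0, NA_real_, value),
    # overall mean of the non-negative values
    meanAll = mean(x, na.rm = TRUE)) %>%
  group_by(category) %>%
  mutate(meanGroup = mean(x, na.rm = TRUE),
         # a group without non-negative values yields NaN; use the overall mean
         meanGroup = ifelse(is.nan(meanGroup), meanAll, meanGroup))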
# Source: local data frame [8 x 5]
# Groups: category [4]
#
# # A tibble: 8 x 5
#   category value     x meanAll meanGroup
#     <fctr> <dbl> <dbl>   <dbl>     <dbl>
# 1        A    -1    NA       5  4.666667
# 2        A    10    10       5  4.666667
# 3        B     8     8       5  5.000000
# 4        D    -1    NA       5  5.000000
# 5        A     0     0       5  4.666667
# 6        A     4     4       5  4.666667
# 7        B     2     2       5  5.000000
# 8        C     6     6       5  6.000000
Answer 2 (score: 2)
The OP wrote that in SQL this would be done with a filtered group-by followed by a join. That approach can be implemented with data.table:
library(data.table)
# filter data and compute group means
setDT(dataset)[value >= 0, .(grp.mean = mean(value)), category][
# now join with dataset
dataset, on = "category"][
# fill empty group means with overall mean of filtered values
is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]
which returns
   category grp.mean value
1:        A 4.666667    -1
2:        A 4.666667    10
3:        B 5.000000     8
4:        D 5.000000    -1
5:        A 4.666667     0
6:        A 4.666667     4
7:        B 5.000000     2
8:        C 6.000000     6
Here is a more concise variant that uses assignment by reference and avoids the join (I am not sure which one is faster):
library(data.table)
# assign group means of the non-negative values by reference;
# a group with no such values gets NaN, which is.na() below also catches
setDT(dataset)[, grp.mean := mean(value[value >= 0]), category][
# fill empty group means with overall mean of filtered values
is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]