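I have been struggling for a while with what feels like it should be a very simple operation. I have tried several approaches, but none of them has worked.
I have a dataset that looks like this: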
df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"),
year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))
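Now I need to create a new variable, call it new_station, whose value depends on the name within each given year. For example, for john it should be the mean of station1 and station3 in that year.
I have tried several combinations of filter, select and mutate, such as: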
df %>%
filter(name == "john") %>%
select(station1, station3) %>%
mutate(new_station = rowMeans(c(station1, station3)))
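But it will not let me assign a value computed from a single row only. When I need the mean for a specific year only, other attempts make every row of the new column the mean of all 6 cells (2 stations x 3 years). Other approaches I tried could not handle the fact that some values are missing, which I need to omit.
I need something scalable, such as a loop where only the condition for each name changes, because my real dataset has 21 names and 30 stations.
Any ideas?
Note: in case it helps illustrate what I am trying to do, I know how to do this in Stata. There, for the name john, it would be something like: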
egen new_station = rowmean(station1 station3) if name == "john"
I just need to do something similar in R.
Thanks!
Answer 0: (score: 3)
I would convert the data to long format and then use case_when. You can convert it back to wide format if needed.
df$id = 1:nrow(df)
library(dplyr)  # provides %>%, group_by(), mutate() and case_when()
library(tidyr)  # provides pivot_longer()
df %>% pivot_longer(
cols = starts_with("station"),
names_to = "station", names_prefix = "station",
values_to = "value"
) %>%
group_by(name, year) %>%
mutate(result = case_when(
name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
name %in% c("paul", "george") ~ value[station == 4],
name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
))
# # A tibble: 48 x 6
# # Groups: name, year [12]
# name year id station value result
# <fct> <dbl> <int> <chr> <dbl> <dbl>
# 1 john 2018 1 1 1 1
# 2 john 2018 1 2 NA 1
# 3 john 2018 1 3 NA 1
# 4 john 2018 1 4 9 1
# 5 paul 2018 2 1 2 8
# 6 paul 2018 2 2 6 8
# 7 paul 2018 2 3 2 8
# 8 paul 2018 2 4 8 8
# 9 ringo 2018 3 1 3 4.67
# 10 ringo 2018 3 2 8 4.67
# # ... with 38 more rows
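The step of converting back to wide format could look roughly like the sketch below. It assumes the same df as in the question (rebuilt here with rep() so the block runs on its own), names the result column new_station as the question asks, and uses pivot_wider() from tidyr to restore the station columns. The object name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt so this block is self-contained.
df <- data.frame(
  name = rep(c("john", "paul", "ringo", "george"), times = 3),
  year = rep(c(2018, 2017, 2016), each = 4),
  station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
  station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
  station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
  station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA)
)

wide <- df %>%
  pivot_longer(
    cols = starts_with("station"),
    names_to = "station", names_prefix = "station",
    values_to = "value"
  ) %>%
  group_by(name, year) %>%
  # One rule per name; station is a character column after pivot_longer().
  mutate(new_station = case_when(
    name == "john" ~ mean(value[station %in% c("1", "3")], na.rm = TRUE),
    name %in% c("paul", "george") ~ value[station == "4"],
    name == "ringo" ~ mean(value[station %in% c("1", "2", "3")], na.rm = TRUE)
  )) %>%
  ungroup() %>%
  # Spread the stations back into columns: one row per name and year.
  pivot_wider(
    names_from = station, names_prefix = "station",
    values_from = value
  )
```

Because new_station is constant within each name and year, pivot_wider() treats it as an identifying column and the result has one row per name and year again, with station1 to station4 alongside new_station.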
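Answer 1: (score: 0)
Here is a data.table solution. It relies on creating a lookup table and taking rowMeans() over a subset of columns within each subset of the data: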
library(data.table)
dt <- as.data.table(df)
dt[, name := as.character(name)]
lookup <- list(john = c('station1', 'station3'),
paul = 'station4',
ringo = c('station1','station2','station3'),
george = 'station4')
dt[,
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
dt
Based on the OP's comment, it is safer to subset dt to the names present in the lookup table:
dt <- as.data.table(df)
dt[, name := as.character(name)]
lookup[[4]] <- NULL
setdiff(dt[, name], names(lookup))
# error
dt[,
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
# OK
dt[name %in% names(lookup),
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
dt
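The question also asks for missing values to be omitted, and rowMeans() uses na.rm = FALSE by default, so any row with an NA in one of the selected stations would yield NA. A possible variant that passes na.rm = TRUE is sketched below. It assumes the same data and lookup table as above (rebuilt here so the block runs on its own) and selects the columns with .SD[, cols, with = FALSE]. The helper variable cols is only illustrative.

```r
library(data.table)

# Same data as in the question, rebuilt so this block is self-contained.
dt <- data.table(
  name = rep(c("john", "paul", "ringo", "george"), times = 3),
  year = rep(c(2018, 2017, 2016), each = 4),
  station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
  station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
  station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
  station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA)
)

# Which stations to average for each name.
lookup <- list(john = c("station1", "station3"),
               paul = "station4",
               ringo = c("station1", "station2", "station3"),
               george = "station4")

dt[name %in% names(lookup),
   new_station := {
     cols <- lookup[[.BY$name]]  # station columns for the current group
     # Row-wise mean over those columns, ignoring missing values.
     rowMeans(.SD[, cols, with = FALSE], na.rm = TRUE)
   },
   by = name]
```

With na.rm = TRUE, a row whose selected stations are all missing (george in 2016 here) gets NaN rather than NA, which may or may not be what is wanted.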
To better understand what is going on, I suggest running the following lines:
dt <- as.data.table(df)
# what is .SD?
dt[, print(.SD), by = name]
dt[, .SD[,print(.SD) , .SDcols = lookup[[unlist(.BY)]]], by = name]
#what is .BY?
dt[, print(.BY), by = name]
dt[, print(unlist(.BY)), by = name]
dt[, name := as.character(name)]
dt[, print(unlist(.BY)), by = name]
Reference:
A good explanation of .SD, the Subset of Data for each group: What does .SD stand for in data.table in R