Question

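I have been working on this for a while. It feels like it should be a very simple operation, and I have tried several approaches, but none of them has worked.

I have a dataset that looks like this: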

df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"), 
                 year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
                 station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
                 station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
                 station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
                 station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))

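What I need is to create a new variable, call it new_station, whose value depends on the name and is computed separately for each year. For example:

For john, I need the mean of station1 and station3.
For paul, I only need station4.
For ringo, I need the mean of station1, station2, and station3; and
For george, I only need station4.

I have tried several combinations of filter, select, and mutate, such as: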

df %>%
  filter(name == "john") %>%
  select(station1, station3) %>%
  mutate(new_station = rowMeans(c(station1, station3)))

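But that does not let me assign the value row by row. Other attempts made every row of the new column the mean of all 6 cells (2 stations x 3 years), when I only need the mean for the specific year. Still other approaches could not handle the missing values, which I need to omit.

I need something scalable, where I only change the condition for each name, because my real dataset has 21 names and 30 stations.

Any ideas?

Note: in case it helps illustrate what I am trying to do, I know how to do this in Stata. There, the line for john would be something like: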

egen new_station = rowmean(station1 station3) if name == "john"

I just need to do something similar in R.

Thanks!

Answer 1

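I would convert the data to long format and then use case_when. You can convert it back to wide format if needed.

library(dplyr)  # %>%, group_by, mutate, case_when
library(tidyr)  # pivot_longer

# row identifier, carried through the reshape
df$id = 1:nrow(df)
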
df %>% pivot_longer(
    cols = starts_with("station"), 
    names_to = "station", names_prefix = "station",
    values_to = "value"
  ) %>%
  group_by(name, year) %>%
  mutate(result = case_when(
    name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
    name %in% c("paul", "george") ~ value[station == 4],
    name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
  ))
# # A tibble: 48 x 6
# # Groups:   name, year [12]
#    name   year    id station value result
#    <fct> <dbl> <int> <chr>   <dbl>  <dbl>
#  1 john   2018     1 1           1   1   
#  2 john   2018     1 2          NA   1   
#  3 john   2018     1 3          NA   1   
#  4 john   2018     1 4           9   1   
#  5 paul   2018     2 1           2   8   
#  6 paul   2018     2 2           6   8   
#  7 paul   2018     2 3           2   8   
#  8 paul   2018     2 4           8   8   
#  9 ringo  2018     3 1           3   4.67
# 10 ringo  2018     3 2           8   4.67
# # ... with 38 more rows
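
For the "convert it back to wide" step, a minimal sketch follows. It assumes the same example data as in the question, and the names wide and new_station are only illustrative. The idea is to compute the per-group value in long format and then spread the station columns back out with pivot_wider, so that each name/year pair ends up on one row again.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  name = rep(c("john", "paul", "ringo", "george"), times = 3),
  year = rep(c(2018, 2017, 2016), each = 4),
  station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
  station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
  station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
  station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA)
)

wide <- df %>%
  pivot_longer(
    cols = starts_with("station"),
    names_to = "station", names_prefix = "station",
    values_to = "value"
  ) %>%
  group_by(name, year) %>%
  # one value per name/year group, repeated on each of its four long rows
  mutate(new_station = case_when(
    name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
    name %in% c("paul", "george") ~ value[station == 4],
    name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
  )) %>%
  ungroup() %>%
  # back to one row per name/year, with station1..station4 as columns
  pivot_wider(
    names_from = station, names_prefix = "station",
    values_from = value
  )

wide
```

Because new_station is constant within each name/year group, pivot_wider treats it as an identifying column and the result has 12 rows, like the original df.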

Answer 2

Here is a data.table solution. It relies on creating a lookup table and taking rowMeans() of the matching subset of columns for each name:

library(data.table)

dt <- as.data.table(df)
# name must be character so that .BY can index lookup by name
dt[, name := as.character(name)]

lookup <- list(john = c('station1', 'station3'),
               paul = 'station4',
               ringo = c('station1','station2','station3'),
               george = 'station4')

# na.rm = TRUE omits missing values, as the question requires
dt[,
   new_station := .SD[, rowMeans(.SD, na.rm = TRUE), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
dt
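
The same lookup idea can also be sketched in base R, which is close in spirit to the Stata egen ... if line from the question. This is only an illustration under the assumption that the example data and lookup list above are used as-is; it is not part of the data.table approach. Scaling up then means adding entries to lookup.

```r
# Example data from the question
df <- data.frame(
  name = rep(c("john", "paul", "ringo", "george"), times = 3),
  year = rep(c(2018, 2017, 2016), each = 4),
  station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
  station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
  station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
  station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA)
)

# Which station columns to average for each name
lookup <- list(john = c("station1", "station3"),
               paul = "station4",
               ringo = c("station1", "station2", "station3"),
               george = "station4")

df$new_station <- NA_real_
for (nm in names(lookup)) {
  rows <- df$name == nm
  # drop = FALSE keeps a one-column selection as a data frame for rowMeans()
  df$new_station[rows] <- rowMeans(df[rows, lookup[[nm]], drop = FALSE],
                                   na.rm = TRUE)
}

# rowMeans() returns NaN when every selected value in a row is NA
df$new_station[is.nan(df$new_station)] <- NA

df
```

Names that have no entry in lookup simply keep NA in new_station.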

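Based on the OP's comment, it is safer to subset dt to the names that appear in lookup:

dt <- as.data.table(df)
dt[, name := as.character(name)]

# drop george from the lookup to simulate a name with no rule
lookup[[4]] <- NULL
# names present in dt but missing from lookup
setdiff(dt[, name], names(lookup))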

# error
dt[,
   new_station := .SD[, rowMeans(.SD, na.rm = TRUE), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
# OK
dt[name %in% names(lookup),
   new_station := .SD[, rowMeans(.SD, na.rm = TRUE), .SDcols = lookup[[unlist(.BY)]]],
   by = name]

dt

To get a better idea of what is happening, I suggest running the following lines:

dt <- as.data.table(df)
# what is .SD?
dt[, print(.SD), by = name]
dt[, .SD[,print(.SD) , .SDcols = lookup[[unlist(.BY)]]], by = name]

#what is .BY?
dt[, print(.BY), by = name]
dt[, print(unlist(.BY)), by = name]
dt[, name := as.character(name)]
dt[, print(unlist(.BY)), by = name]
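
As a compact complement to the print calls above, the following sketch summarizes .SD and .BY for each group in a toy table. The table toy and the column names by_class, by_value, sd_cols, and sd_rows are made up for illustration only.

```r
library(data.table)

# Toy table, unrelated to the station data
toy <- data.table(name = c("a", "a", "b"), x = 1:3)

res <- toy[, .(
  by_class = class(.BY),                        # .BY is a list with one element per by column
  by_value = unlist(.BY),                       # unlist() turns it into a plain vector
  sd_cols  = paste(names(.SD), collapse = ","), # .SD leaves out the grouping column
  sd_rows  = nrow(.SD)                          # .SD holds only the current group's rows
), by = name]

res
```

This is why the solution calls unlist(.BY) before indexing lookup: the list has to be reduced to a single character value that matches one of the names in lookup.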

References:

A good explanation of .SD (the Subset of Data for each group): What does .SD stand for in data.table in R
