Question

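I got a very messy dataset from a survey in which every checkbox is stored as an indicator variable. So instead of having gender (or race) as a single variable with M/F entries, there is a gender_m column and a gender_f column, each holding a 0/1 indicator.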

Simplified example:

df <- tribble(
  ~id, ~gender_m, ~gender_f,
  #--|----------|---------
  1L , 0        , 1,
  2L , 1        , 0,
  3L , 0        , 0,
  4L , 1        , 1
  )

The output I want is:

result <- tribble(
  ~id, ~gender,
  #--|----------
  1L , 'f',
  2L , 'm',
  3L , 'Missing',
  4L , 'More than 1 selected'
)

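For something like gender, with only 2 columns, this is easy to hard-code, but I am trying to make it as generic as possible, because things like race (or the programming languages you use) have many possible values.

I have nearly a thousand columns but fewer than 20 actual variables. All columns are named in the form <variable_name>_<potential_value>.

I'm sure I'm missing some tidy function, but my google-fu seems weak today.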

Answer 1

Many tidy functions work better on columns than on rows, so this gets a bit easier if you convert to long format first:

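library(dplyr)
library(tidyr)
library(stringr)

# One row per (id, indicator column) pair
df_long = df %>%
    gather(Item, Response, starts_with("gender"))

cleaned = df_long %>%
    # Keep only the part after the underscore, e.g. "gender_m" -> "m"
    mutate(Item = str_match(Item, "(.*)_(.*)")[, 3]) %>%
    group_by(id) %>%
    # The number of boxes ticked per id decides the cleaned value
    summarize(RespCleaned = case_when(
        sum(Response) == 0 ~ "Missing",
        sum(Response) == 1 ~ Item[Response == 1][1],
        sum(Response) > 1 ~ "More than 1 selected"
    ))

# Attach the cleaned column back onto the original wide data
df = df %>% left_join(cleaned, by = "id")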

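If you have lots of items that use this kind of 0/1 indicator for responses, working from the sum of the responses should generalize to items with more than 2 options. You just need to replace starts_with("gender") with another selector that picks the relevant columns.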
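As a rough sketch of how this could be extended to every variable at once (not part of the answer above), the column names can be split into a variable part and a value part while reshaping. This assumes tidyr 1.0 or later for pivot_longer() and pivot_wider(), that every column except id follows the <variable_name>_<potential_value> pattern, and that the indicators contain no NA. The lang_* columns and the names variable, value, response, answer and collapsed are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical example data: two variables spread over indicator columns
df <- tribble(
  ~id, ~gender_m, ~gender_f, ~lang_r, ~lang_python, ~lang_sql,
  1L,  0,         1,         1,       0,            0,
  2L,  1,         0,         0,       0,            0,
  3L,  0,         0,         1,       1,            0,
  4L,  1,         1,         0,       0,            1
)

collapsed <- df %>%
  # Split each column name at its last underscore into variable and value
  pivot_longer(
    -id,
    names_to = c("variable", "value"),
    names_pattern = "(.*)_(.*)",
    values_to = "response"
  ) %>%
  group_by(id, variable) %>%
  # Same rule as above, applied once per id and variable
  summarise(
    answer = case_when(
      sum(response) == 0 ~ "Missing",
      sum(response) == 1 ~ value[response == 1][1],
      sum(response) > 1  ~ "More than 1 selected"
    ),
    .groups = "drop"
  ) %>%
  # Back to one row per id, one column per variable
  pivot_wider(names_from = variable, values_from = answer)
```

Because the first capture group is greedy, a name such as race_pacific_islander would be split into variable race_pacific and value islander, so variable names with extra underscores would need a more specific pattern.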

Answer 2

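Here is a base R approach (apart from stringr). It should generalize nicely to similar cases and is easy to wrap in a function. As written, it can be run on the whole data frame, with its 20 variables spread over 1000 columns.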

library(stringr)
sep = "_"
# Variable names: everything before the last separator (e.g. "gender")
vars = unique(na.omit(str_extract(names(df), paste0(".*(?=", sep, ")"))))

for (i in seq_along(vars)) {
    # Indicator columns that belong to this variable
    these_vars = names(df)[str_detect(names(df), paste0("^", vars[i], sep))]
    result = character(nrow(df))
    rs = rowSums(df[these_vars])
    result[rs == 0] = "missing"
    result[rs > 1] = "more than 1 selected"
    # Exactly one box ticked: use the name of the column that holds the 1
    result[rs == 1] = these_vars[apply(df[rs == 1, these_vars] == 1, 1, which)]
    # Add the result as a new column named after the variable
    df[[vars[i]]] = result
}

df
# # A tibble: 4 x 4
#      id gender_m gender_f               gender
#   <int>    <dbl>    <dbl>                <chr>
# 1     1        0        1             gender_f
# 2     2        1        0             gender_m
# 3     3        0        0              missing
# 4     4        1        1 more than 1 selected
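
Answer 2 notes that the loop is easy to wrap in a function. A minimal sketch of what that might look like in base R only is given below; collapse_indicators is a hypothetical name, not something from the answer. It assumes the indicators are strictly 0/1 with no NA, and it differs from the loop above in that it returns the value part of the column name ("f") rather than the full name ("gender_f").

```r
# Hypothetical helper: collapse each group of <variable><sep><value> 0/1
# columns in `df` into one character column named <variable>.
collapse_indicators <- function(df, sep = "_") {
  # Only columns that contain the separator are treated as indicators
  cols <- grep(sep, names(df), fixed = TRUE, value = TRUE)
  # Position of the last separator in each column name
  pos <- vapply(gregexpr(sep, cols, fixed = TRUE), function(m) max(m), integer(1))
  prefix <- substr(cols, 1, pos - 1)
  value <- substr(cols, pos + nchar(sep), nchar(cols))

  for (v in unique(prefix)) {
    idx <- prefix == v
    m <- as.matrix(df[cols[idx]])
    rs <- rowSums(m)
    out <- rep("missing", nrow(df))
    out[rs > 1] <- "more than 1 selected"
    one <- which(rs == 1)
    if (length(one) > 0) {
      # In a row with a single 1, max.col() gives the position of that 1
      hit <- max.col(m[one, , drop = FALSE], ties.method = "first")
      out[one] <- value[idx][hit]
    }
    df[[v]] <- out
  }
  df
}

# Usage on the example data from the question
df <- data.frame(id = 1:4, gender_m = c(0, 1, 0, 1), gender_f = c(1, 0, 0, 1))
collapse_indicators(df)$gender
# "f" "m" "missing" "more than 1 selected"
```

Matching the separator with fixed = TRUE avoids regex surprises if the separator were ever a character such as ".".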
