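How to create dummy variables based on two columns in R

Time: 2021-07-17 13:35:34

Tags: r dummy-variable

Suppose I have a data frame in which Gender takes F for female or M for male, and Race takes A for Asian, W for White, B for Black, and H for Hispanic: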

| id | Gender | Race |
| --- | ----- | ---- |
| 1   | F    | W |
| 2   | F    | B |
| 3   | M    | A |
| 4   | F    | B |
| 5   | M    | W |
| 6   | M    | B |
| 7   | F    | H |

I would like a set of dummy columns based on the combination of Gender and Race, so that the data frame looks like this:

| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | F    | W   |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
| 2   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 3   | M    | A   |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |
| 4   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 5   | M    | W   |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |
| 6   | M    | B   |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |
| 7   | F    | H   |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0  |

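My actual data contains many more categories than this example, so I would appreciate a concise way to do this. The language is R. Thanks for your help.

5 answers:

Answer 0: (score: 4)

Apart from the column names, you can get this with the model.matrix function and a formula that contains only the interaction term, minus the intercept: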

> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
   id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1   1      F    H             0             0             0             0
2   2      M    H             0             0             0             0
3   3      M    W             0             0             0             0
4   4      F    H             0             0             0             0
5   5      M    H             0             0             0             0
[etc]
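
The printed output above appears to come from a different sample than the one in the question. As a minimal sketch, here is the same call applied to the question's seven rows. The data frame is rebuilt by hand, and the assumption that both columns are character vectors is mine, not the answer's.

```r
# Sample data from the question (column types are an assumption)
d <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# Interaction-only formula without an intercept: one indicator column
# per Gender x Race combination of the levels found in the data
mm <- model.matrix(~ Gender:Race - 1, data = d)

colnames(mm)  # names of the form "GenderF:RaceA", "GenderM:RaceA", ...
ncol(mm)      # 2 genders x 4 races
rowSums(mm)   # each row has exactly one indicator set
```

Because the columns are generated from the factor levels rather than from the observed pairs, combinations that never occur together (here F with A and M with H) still get an all-zero column, which matches the layout requested in the question.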

If you care about the exact names, they are easy to sort out with a little string processing.

> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
   id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1   1      F    H   0   0   0   0   1   0   0   0
2   2      M    H   0   0   0   0   0   1   0   0
3   3      M    W   0   0   0   0   0   0   0   1
4   4      F    H   0   0   0   0   1   0   0   0
5   5      M    H   0   0   0   0   0   1   0   0
6   6      F    H   0   0   0   0   1   0   0   0
7   7      F    H   0   0   0   0   1   0   0   0
8   8      M    A   0   1   0   0   0   0   0   0
9   9      M    W   0   0   0   0   0   0   0   1
10 10      F    B   0   0   1   0   0   0   0   0

If you care about the column order....
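
The answer leaves the column order open. One possible way to handle it, sketched here under the assumption that the desired order is the one shown in the question (all races for F, then all races for M), is to build the vector of wanted names and index the data frame with it. The names `genders`, `races` and `wanted` are illustrative only.

```r
# Sample data from the question
d <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# Same steps as in the answer: indicator columns, then cleaned-up names
dm <- cbind(d, model.matrix(~ Gender:Race - 1, data = d))
names(dm)[-(1:3)] <- sub("Gender", "", sub("Race", "", sub(":", "_", names(dm)[-(1:3)])))

# Desired order (assumed from the question's table): F_W, F_B, F_A, F_H, M_W, ...
genders <- c("F", "M")
races   <- c("W", "B", "A", "H")
wanted  <- paste(rep(genders, each = length(races)), races, sep = "_")

# Reorder by selecting columns by name
dm <- dm[c("id", "Gender", "Race", wanted)]
names(dm)
```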

Answer 1: (score: 4)

A base R option with table:

 cbind(df1, as.data.frame.matrix(table(transform(df1, 
    GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0

Answer 2: (score: 4)

Another base R option with xtabs:

cbind(
    df,
    as.data.frame.matrix(
        xtabs(
            ~ id + q,
            transform(
                df,
                q = paste0(Gender, "_", Race)
            )
        )
    )
)

which gives

  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0
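
The output above contains only the six combinations that occur in the data, whereas the question's table also lists F_A and M_H. A minimal sketch of one way to keep those columns in base R is to turn the pasted key into a factor with all combinations as levels before tabulating. The level vectors below are typed in by hand for illustration; with real data they could be derived from `unique()` of each column.

```r
# Sample data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# All Gender x Race combinations, in the order used by the question
genders    <- c("F", "M")
races      <- c("W", "B", "A", "H")
all_levels <- paste(rep(genders, each = length(races)), races, sep = "_")

# A factor keeps unobserved levels, so table() still emits a column for them
combo <- factor(paste(df$Gender, df$Race, sep = "_"), levels = all_levels)

# Tabulate by row position so the result lines up with df row by row
dummies <- as.data.frame.matrix(table(seq_len(nrow(df)), combo))
res <- cbind(df, dummies)
res
```

Tabulating against the row position rather than `id` is a deliberate choice in this sketch: it avoids any reliance on `id` being unique and sorted.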

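Answer 3: (score: 3)

I think you could use the following solution. It actually produces 2 fewer variables than your desired output, but those would be all zero anyway, because pivot_wider only spreads the combinations that can be found in the data set.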

library(dplyr)
library(tidyr)

df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp, 
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)

# A tibble: 7 x 9
     id Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 F      W         1     0     0     0     0     0
2     2 F      B         0     1     0     0     0     0
3     3 M      A         0     0     1     0     0     0
4     4 F      B         0     1     0     0     0     0
5     5 M      W         0     0     0     1     0     0
6     6 M      B         0     0     0     0     1     0
7     7 F      H         0     0     0     0     0     1

Answer 4: (score: 3)

In addition to Anoushiravan R's tidyverse solution, here is another option with unite, pivot_wider, across and case_when:

library(tidyverse)
  df %>% 
    unite(comb, Gender:Race, remove = FALSE) %>% 
    pivot_wider(
      names_from = comb,
      values_from = comb
    ) %>% 
    mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H), 
                  ~ case_when(is.na(.) ~ 0, 
                              TRUE ~ 1)))

Output:

  id    Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1     F      W         1     0     0     0     0     0
2 2     F      B         0     1     0     0     0     0
3 3     M      A         0     0     1     0     0     0
4 4     F      B         0     1     0     0     0     0
5 5     M      W         0     0     0     1     0     0
6 6     M      B         0     0     0     0     1     0
7 7     F      H         0     0     0     0     0     1
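
The code above lists the indicator columns by name inside across, which may become unwieldy with many categories. As a hedged variation (not the answerer's code), the same pipeline can select every column except the three originals, so no combination has to be typed out. It assumes dplyr and tidyr are installed and that the original columns are named id, Gender and Race.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

res <- df %>%
  # build the combined key, e.g. "F_W", while keeping Gender and Race
  unite(comb, Gender, Race, remove = FALSE) %>%
  # one column per observed key; cells hold the key itself or NA
  pivot_wider(names_from = comb, values_from = comb) %>%
  # every column other than the originals becomes a 0/1 indicator
  mutate(across(-c(id, Gender, Race), ~ as.integer(!is.na(.x))))

res
```

Like the original, this only creates columns for combinations that appear in the data.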