Suppose I have a data frame where Gender is F for female or M for male, and Race is A for Asian, W for White, B for Black, or H for Hispanic:
| id | Gender | Race |
| --- | ----- | ---- |
| 1 | F | W |
| 2 | F | B |
| 3 | M | A |
| 4 | F | B |
| 5 | M | W |
| 6 | M | B |
| 7 | F | H |
I want a set of dummy columns based on the combination of Gender and Race, so the data frame should look like this:
| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | F | W | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | M | W | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | M | B | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | F | H | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
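My actual data contains many more categories than this example, so I would appreciate a more concise way to build this. The language is R. Thanks for your help.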
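Answer 0 (score: 4)
Apart from the column names, you can get this with the `model.matrix` function and a formula that contains only the interaction term and drops the intercept: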
> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1 1 F H 0 0 0 0
2 2 M H 0 0 0 0
3 3 M W 0 0 0 0
4 4 F H 0 0 0 0
5 5 M H 0 0 0 0
[etc]
If you care about the exact names, a little string processing easily sorts them out.
> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1 1 F H 0 0 0 0 1 0 0 0
2 2 M H 0 0 0 0 0 1 0 0
3 3 M W 0 0 0 0 0 0 0 1
4 4 F H 0 0 0 0 1 0 0 0
5 5 M H 0 0 0 0 0 1 0 0
6 6 F H 0 0 0 0 1 0 0 0
7 7 F H 0 0 0 0 1 0 0 0
8 8 M A 0 1 0 0 0 0 0 0
9 9 M W 0 0 0 0 0 0 0 1
10 10 F B 0 0 1 0 0 0 0 0
If you care about the column order....
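One possible way to deal with the column order, offered as a minimal sketch rather than the answerer's own method: fix the factor levels up front so that every Gender/Race combination gets a column, then index the dummy matrix by the names in the order you want. The data frame `d` below is rebuilt from the question's table, and `wanted` is a hypothetical helper name.

```r
# Example data rebuilt from the question's table (an assumption; the
# answer above uses its own 10-row data frame).
d <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

# Explicit levels fix the order and keep combinations that never occur.
d$Gender <- factor(d$Gender, levels = c("F", "M"))
d$Race <- factor(d$Race, levels = c("W", "B", "A", "H"))

# Interaction-only formula without intercept, as in the answer above.
mm <- model.matrix(~ Gender:Race - 1, data = d)
colnames(mm) <- sub("Gender", "", sub("Race", "", sub(":", "_", colnames(mm))))

# Hypothetical helper: column names in the order F_W, F_B, ..., M_A, M_H.
wanted <- as.vector(t(outer(levels(d$Gender), levels(d$Race), paste, sep = "_")))

dm <- cbind(d, mm[, wanted])
```

Because the order comes from the factor levels, changing `levels = ...` is enough to change the column layout.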
Answer 1 (score: 4)
A base R option with `table`:
cbind(df1, as.data.frame.matrix(table(transform(df1,
GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
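Note that the `table` output above only has columns for combinations that occur in the data (no F_A or M_H). If the all-zero columns are needed too, one option is to turn the pasted key into a factor with explicit levels before tabulating. This is a sketch under that assumption; `combos` and `key` are hypothetical names, and `df1` is rebuilt from the question's table.

```r
# Example data rebuilt from the question's table.
df1 <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

# All Gender_Race combinations, in the order requested in the question.
combos <- as.vector(t(outer(c("F", "M"), c("W", "B", "A", "H"), paste, sep = "_")))

# A factor keeps unobserved levels, so table() emits zero columns for them.
key <- factor(paste(df1$Gender, df1$Race, sep = "_"), levels = combos)

out <- cbind(df1, as.data.frame.matrix(table(df1$id, key)))
```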
Answer 2 (score: 4)
Another base R option, with `xtabs`:
cbind(
  df,
  as.data.frame.matrix(
    xtabs(
      ~ id + q,
      transform(
        df,
        q = paste0(Gender, "_", Race)
      )
    )
  )
)
which gives
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
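Answer 3 (score: 3)
I think you can use the following solution. It actually produces 2 fewer variables than your desired output, but those columns would be all zeros anyway, since `pivot_wider` only spreads the combinations that can be found in the data set.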
library(dplyr)
library(tidyr)
df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp,
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)
# A tibble: 7 x 9
id Gender Race F_W F_B M_A M_W M_B F_H
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
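If the two missing all-zero columns are wanted as well, a possible variation (a sketch, not part of the original answer) is to convert Gender and Race to factors with explicit levels and pass `names_expand = TRUE` to `pivot_wider`. This assumes a tidyr version that supports `names_expand` (1.2.0 or later); `df` is rebuilt from the question's table and `wide` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question's table.
df <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

wide <- df %>%
  mutate(
    # Explicit levels let names_expand create columns for unseen combinations.
    Gender = factor(Gender, levels = c("F", "M")),
    Race = factor(Race, levels = c("W", "B", "A", "H")),
    grp = 1
  ) %>%
  pivot_wider(
    names_from = c(Gender, Race), values_from = grp,
    values_fill = 0, names_glue = "{Gender}_{Race}",
    names_expand = TRUE
  ) %>%
  # Bring back the original Gender and Race columns, as in the answer above.
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)
```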
Answer 4 (score: 3)
In addition to Anoushiravan R's tidyverse solution, here is one with `unite`, `pivot_wider`, `across` and `case_when`:
library(tidyverse)
df %>%
  unite(comb, Gender:Race, remove = FALSE) %>%
  pivot_wider(
    names_from = comb,
    values_from = comb
  ) %>%
  mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H),
                ~ case_when(is.na(.) ~ 0,
                            TRUE ~ 1)))
Output:
id Gender Race F_W F_B M_A M_W M_B F_H
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
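Since the question mentions many more categories in the real data, listing every dummy column by name inside `across` may become tedious. One possible variation, sketched here as an assumption rather than the answerer's code, selects every column except the key columns and converts "not missing" to 1. `df` is rebuilt from the question's table and `out` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question's table.
df <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

out <- df %>%
  unite(comb, Gender:Race, remove = FALSE) %>%
  pivot_wider(names_from = comb, values_from = comb) %>%
  # Every column other than the keys is a dummy: NA becomes 0, anything else 1.
  mutate(across(-c(id, Gender, Race), ~ as.integer(!is.na(.x))))
```

Like the answer above, this only creates columns for combinations that are present in the data.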