I have a data frame like this:
# ID Gender
1 01 () (Male) (Female)
2 02 (Male)
3 03 (Female)
4 04 (Female) (Female)
5 05 (Male) (Male) (Male)
For each row, I want to add three new columns:
# ID Gender Gender-Male Gender-Female Gender-Null
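Each column counts how many times the (), (Male) and (Female) substrings occur in that row. In essence, this tells me that, for example, 3 men attended the session, or 2 women and 1 empty entity, and so on.
What is the best way to achieve this? A "for" loop with regular expressions? Or is there a better library I should use?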
Answer 0 (score: 2)
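1) Replace () with Null in Gender and remove the parentheses from Gender. Then split Gender into separate rows and count the rows for each combination of ID and Gender. Finally, spread the counts into wide form.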
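library(dplyr)
library(tidyr)
counts <- DF %>%
  # turn the empty "()" into the token "Null", then drop all parentheses
  mutate(Gender = gsub("()", "Null", Gender, fixed = TRUE),
         Gender = gsub("[()]", "", Gender)) %>%
  # one row per token, then count tokens per ID
  separate_rows(Gender) %>%
  count(ID, Gender) %>%
  # one column per token; missing combinations become 0
  spread(Gender, n, fill = 0)
left_join(DF, counts, by = "ID")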
giving:
# ID Gender Female Male Null
1 1 1 () (Male) (Female) 1 1 1
2 2 2 (Male) 0 1 0
3 3 3 (Female) 1 0 0
4 4 4 (Female) (Female) 2 0 0
5 5 5 (Male) (Male) (Male) 0 3 0
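As a variation on 1), here is a minimal sketch that assumes tidyr 1.0.0 or later, where spread() is superseded by pivot_wider(). The names_prefix argument is used to produce the column names asked for in the question. The DF defined here is a stand-in for the real input, and the explicit sep = " " is a choice made for this sketch.

```r
library(dplyr)
library(tidyr)

# Stand-in for the input data (assumption: Gender tokens are space-separated)
DF <- data.frame(
  ID = 1:5,
  Gender = c("() (Male) (Female)", "(Male)", "(Female)",
             "(Female) (Female)", "(Male) (Male) (Male)"),
  stringsAsFactors = FALSE
)

counts <- DF %>%
  # "()" becomes the token "Null"; the remaining parentheses are dropped
  mutate(Gender = gsub("()", "Null", Gender, fixed = TRUE),
         Gender = gsub("[()]", "", Gender)) %>%
  separate_rows(Gender, sep = " ") %>%
  count(ID, Gender) %>%
  # wide form with prefixed column names; absent combinations are filled with 0
  pivot_wider(names_from = Gender, values_from = n,
              values_fill = list(n = 0L), names_prefix = "Gender-")

result <- left_join(DF, counts, by = "ID")
```

Here result should carry the columns Gender-Female, Gender-Male and Gender-Null next to the original ones.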
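2) Or, using only base R, split the Gender strings into a list of separate strings, spl, and then stack them into a long data frame, long. Finally, tabulate it with xtabs.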
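# list of token vectors, named by ID
spl <- setNames(strsplit(as.character(DF$Gender), " "), DF$ID)
# long form: one row per (token, ID) pair
long <- setNames(stack(spl), c("Gender", "ID"))
# contingency table of ID by token, merged back onto DF
counttab <- xtabs(~ ID + Gender, long)
merge(DF, cbind(ID = rownames(counttab), as.data.frame.matrix(counttab)))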
giving:
ID # Gender () (Female) (Male)
1 1 1 () (Male) (Female) 1 1 1
2 2 2 (Male) 0 0 1
3 3 3 (Female) 0 1 0
4 4 4 (Female) (Female) 0 2 0
5 5 5 (Male) (Male) (Male) 0 0 3
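One thing to keep in mind with tabulation is that a category which never occurs in the data gets no column. A possible base R sketch that always yields all three columns fixes the factor levels before counting. The DF below is a stand-in for the real input, and the lev vector and output column names are assumptions taken from the question.

```r
# Stand-in for the input data
DF <- data.frame(
  ID = 1:5,
  Gender = c("() (Male) (Female)", "(Male)", "(Female)",
             "(Female) (Female)", "(Male) (Male) (Male)"),
  stringsAsFactors = FALSE
)

# The three tokens to count, in the order of the desired output columns
lev <- c("(Male)", "(Female)", "()")

# Split each string on spaces and count tokens against the fixed levels,
# so absent tokens are reported as 0 rather than dropped
tokens <- strsplit(DF$Gender, " ", fixed = TRUE)
counts <- t(vapply(tokens,
                   function(tok) as.integer(table(factor(tok, levels = lev))),
                   integer(3)))
colnames(counts) <- c("Gender-Male", "Gender-Female", "Gender-Null")

result <- cbind(DF, counts)
```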
Note: we used this as the input:
Lines <- "#,ID,Gender
1,01,() (Male) (Female)
2,02,(Male)
3,03,(Female)
4,04,(Female) (Female)
5,05,(Male) (Male) (Male)"
DF <- read.csv(text = Lines, check.names = FALSE)
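On the question of whether a loop over regular expressions is needed: a direct substring count is also possible in base R without an explicit loop. The following is only a sketch. count_fixed is a hypothetical helper introduced here, not part of any library, and DF is rebuilt inline so the block runs on its own.

```r
# Stand-in for the input data
DF <- data.frame(
  ID = 1:5,
  Gender = c("() (Male) (Female)", "(Male)", "(Female)",
             "(Female) (Female)", "(Male) (Male) (Male)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of literal (non-regex) occurrences of
# `pattern` in each element of `x`. gregexpr() returns -1 when there is
# no match, so only positive positions are counted.
count_fixed <- function(x, pattern) {
  m <- gregexpr(pattern, x, fixed = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

DF[["Gender-Male"]]   <- count_fixed(DF$Gender, "(Male)")
DF[["Gender-Female"]] <- count_fixed(DF$Gender, "(Female)")
DF[["Gender-Null"]]   <- count_fixed(DF$Gender, "()")
```

Because fixed = TRUE treats the patterns literally, the parentheses need no escaping, and "(Male)" cannot match inside "(Female)".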