Counting instances of substrings in R

Time: 2018-09-29 15:06:11

Tags: r

I have a data frame like this:

# ID  Gender
1 01  () (Male) (Female)
2 02  (Male)
3 03  (Female)
4 04  (Female) (Female)
5 05  (Male) (Male) (Male)

For each instance, I would like to add three new columns:

# ID Gender Gender-Male Gender-Female Gender-Null

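Each column counts how many times the (), (Male) and (Female) substrings occur in that instance. In practice this tells me that, for example, 3 men attended the session, or 2 women and 1 null entity, and so on.

What is the best way to achieve this? A "for" loop with regular expressions? Or is there a better library I should use?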

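1 Answer:

Answer 0: (score: 2)

1) Replace () with Null in Gender, then remove the remaining parentheses. Next, separate Gender into rows and count the rows for each combination of ID and Gender. Finally, spread the counts into wide form.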

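library(dplyr)
library(tidyr)

counts <- DF %>%
  # recode the empty "()" as "Null", then strip the remaining parentheses
  mutate(Gender = gsub("()", "Null", Gender, fixed = TRUE),
         Gender = gsub("[()]", "", Gender)) %>%
  separate_rows(Gender) %>%      # one row per ID and token
  count(ID, Gender) %>%          # occurrences per ID and token
  spread(Gender, n, fill = 0)    # wide form, 0 where a token is absent

# name the join key explicitly to avoid the "Joining, by" message
left_join(DF, counts, by = "ID")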

giving:

  # ID               Gender Female Male Null
1 1  1   () (Male) (Female)      1    1    1
2 2  2               (Male)      0    1    0
3 3  3             (Female)      1    0    0
4 4  4    (Female) (Female)      2    0    0
5 5  5 (Male) (Male) (Male)      0    3    0
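
As a side note, recent tidyr releases mark spread() as superseded in favor of pivot_wider(). The following is only a sketch of the same pipeline with that substitution, assuming a tidyr version where values_fill accepts a single scalar; the input is copied from the Note at the end so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same input as in the Note at the end of this answer
Lines <- "#,ID,Gender
1,01,() (Male) (Female)
2,02,(Male)
3,03,(Female)
4,04,(Female) (Female)
5,05,(Male) (Male) (Male)"
DF <- read.csv(text = Lines, check.names = FALSE)

counts <- DF %>%
  # recode the empty "()" as "Null", then strip the remaining parentheses
  mutate(Gender = gsub("()", "Null", Gender, fixed = TRUE),
         Gender = gsub("[()]", "", Gender)) %>%
  separate_rows(Gender) %>%
  count(ID, Gender) %>%
  # pivot_wider() takes the place of spread(); missing cells become 0
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0)

left_join(DF, counts, by = "ID")
```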

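2) Or, using only base R, split the Gender strings into a list of separate strings, spl, and then stack them into a long data frame, long. Finally, tabulate it with xtabs.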

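# split each Gender string on spaces; name the list elements by ID
spl <- setNames(strsplit(as.character(DF$Gender), " "), DF$ID)
# stack the list into a two-column long data frame
long <- setNames(stack(spl), c("Gender", "ID"))
# cross-tabulate ID against the tokens
counttab <- xtabs(~ ID + Gender, long)

merge(DF, cbind(ID = rownames(counttab), as.data.frame.matrix(counttab)))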

giving:

  ID #               Gender () (Female) (Male)
1  1 1   () (Male) (Female)  1        1      1
2  2 2               (Male)  0        0      1
3  3 3             (Female)  0        1      0
4  4 4    (Female) (Female)  0        2      0
5  5 5 (Male) (Male) (Male)  0        0      3

Note

We used this as the input:

Lines <- "#,ID,Gender
1,01,() (Male) (Female)
2,02,(Male)
3,03,(Female)
4,04,(Female) (Female)
5,05,(Male) (Male) (Male)"
DF <- read.csv(text = Lines, check.names = FALSE)
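
Since the question asks about a regular-expression loop, here is a minimal base R sketch that counts each literal substring directly, with no reshaping. count_fixed is a hypothetical helper name introduced only for this illustration, and the new column names follow the ones requested in the question. It uses fixed = TRUE so the parentheses are matched literally and not treated as regex groups.

```r
# Same input as above, repeated so the block runs on its own
Lines <- "#,ID,Gender
1,01,() (Male) (Female)
2,02,(Male)
3,03,(Female)
4,04,(Female) (Female)
5,05,(Male) (Male) (Male)"
DF <- read.csv(text = Lines, check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: number of literal occurrences of `pattern`
# in each element of the character vector `x`
count_fixed <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

# [[ ]] assignment is used because the names contain a hyphen
DF[["Gender-Male"]]   <- count_fixed(DF$Gender, "(Male)")
DF[["Gender-Female"]] <- count_fixed(DF$Gender, "(Female)")
DF[["Gender-Null"]]   <- count_fixed(DF$Gender, "()")

DF
```

This is vectorized over the rows, so no explicit for loop is needed. It does require listing the substrings in advance, whereas 1) and 2) pick up whatever tokens occur in the data.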