Question

I am working with a dataset that looks like this:

Year   Column1     
2000   yes no    
2001   yes yes    
2002   yes       
2003   N/A yes   
2004   N/A N/A   
2005   no no

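As you can see, a single cell can contain several different strings. I want to create two new columns holding numeric values that give me information about Column1. My final product might look like this: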

Year   Column1   any_yes   yes_count   
2000   yes no    1          1
2001   yes yes   1          2
2002   yes       1          1
2003   N/A yes   1          1
2004   N/A N/A   0          0
2005   no no     0          0

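Here "any_yes" checks whether the cell in Column1 contains "yes" and returns 1/0, and "yes_count" counts the occurrences of "yes" in the Column1 cell and returns the count. If I were dealing with numbers, my best guess for any_yes would be something like this: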

mydata1 <- mydata %>%
  mutate(any_yes = ifelse(Column1 = "yes", 1, 0)

Since I am not dealing with numbers, I am not sure how this would work. I also don't know how to produce yes_count.

Answer 1

We can do this with str_count (from stringr) and grepl.

library(stringr)
library(dplyr)
df %>% 
     mutate(any_yes = +(grepl("yes", Column1)),
             yes_count = str_count(Column1, "yes"))
#    Year Column1 any_yes yes_count
#1 2000  yes no       1         1
#2 2001 yes yes       1         2
#3 2002     yes       1         1
#4 2003 N/A yes       1         1
#5 2004 N/A N/A       0         0
#6 2005   no no       0         0
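
The unary `+` wrapped around `grepl()` is shorthand for turning the logical result into 1/0; `as.integer()` does the same thing more explicitly. A minimal base R sketch, where `x` is a hypothetical vector standing in for Column1:

```r
# Hypothetical stand-in for Column1 from the question
x <- c("yes no", "yes yes", "yes", "N/A yes", "N/A N/A", "no no")

# grepl() returns TRUE/FALSE for each element
has_yes <- grepl("yes", x)

# Unary plus and as.integer() both turn TRUE/FALSE into 1/0
any_yes_plus <- +has_yes
any_yes_int  <- as.integer(has_yes)

print(any_yes_plus)
print(all(any_yes_plus == any_yes_int))
```

Either form can be used inside `mutate()`; the explicit `as.integer()` is arguably easier to read for someone who has not seen the `+` idiom.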

We can also get the same output without dplyr (str_count still comes from stringr):

transform(df, any_yes = +(grepl("yes", Column1)),
              yes_count = str_count(Column1, "yes"))

Or without using any packages:

within(df, {any_yes <- +(grepl("yes", Column1))
              yes_count <-  lengths(gregexpr("yes", Column1))* any_yes})
#   Year Column1 yes_count any_yes
#1 2000  yes no         1       1
#2 2001 yes yes         2       1
#3 2002     yes         1       1
#4 2003 N/A yes         1       1
#5 2004 N/A N/A         0       0
#6 2005   no no         0       0
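
The multiplication by any_yes in the base R version matters because `gregexpr()` reports a non-matching element as a single -1, so `lengths()` alone would count 1 there. The sketch below illustrates this on a hypothetical two-element vector `x` and shows two alternatives that should not need the correction:

```r
# Hypothetical input: one element with two matches, one with none
x <- c("yes yes", "N/A N/A")

# gregexpr() returns one integer vector of match positions per element;
# an element with no match is reported as a single -1
m <- gregexpr("yes", x)
print(m[[2]][1])

# lengths() alone therefore counts 1 for the non-matching element
raw_len <- lengths(m)

# Alternative 1: count only the positive match positions
yes_count <- vapply(m, function(pos) sum(pos > 0), integer(1))

# Alternative 2: regmatches() yields character(0) when nothing matched
yes_count2 <- lengths(regmatches(x, m))

print(raw_len)
print(yes_count)
print(yes_count2)
```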

Answer 2

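Another option with dplyr.

Split Column1 on spaces and use lapply to count the occurrences of yes in each list element. If yes_count is greater than 0, any_yes should be 1, otherwise 0.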

library(dplyr)
df %>% 
  # Refer to Column1 directly inside mutate() rather than df$Column1
  mutate(yes_count = unlist(lapply(strsplit(Column1, " "), function(x) sum(grepl("yes", x)))),
         any_yes = as.numeric(yes_count > 0))


#Year   Column1 yes_count any_yes
#1 2000  yes no         1       1
#2 2001 yes yes         2       1
#3 2002     yes         1       1
#4 2003 N/A yes         1       1
#5 2004 N/A N/A         0       0
#6 2005   no no         0       0
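
One caveat with pattern matching: `grepl("yes", x)` also matches tokens that merely contain "yes", such as "yesterday". The sample data has no such values, so this is only an assumption about what real data might hold. A base R sketch that splits on spaces and compares each token exactly, with `df` rebuilt here from the question's table:

```r
# Rebuild the question's sample data (stringsAsFactors = FALSE keeps
# Column1 as character, which strsplit() requires)
df <- data.frame(
  Year = 2000:2005,
  Column1 = c("yes no", "yes yes", "yes", "N/A yes", "N/A N/A", "no no"),
  stringsAsFactors = FALSE
)

# Split each cell into tokens on a literal space
tokens <- strsplit(df$Column1, " ", fixed = TRUE)

# Count tokens that are exactly "yes"; vapply() guarantees an integer vector
df$yes_count <- vapply(tokens, function(x) sum(x == "yes"), integer(1))
df$any_yes <- as.integer(df$yes_count > 0)

print(df)
```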

Return values conditional on characters
