Count occurrences of a set of words in R

Time: 2018-04-16 01:05:19

Tags: r regex

Let's say I have a dataset:

Col1
Mon,Tues,Wed,Thurs,Fri
Mon,Tues,Wed,Thurs
Mon,Tues,Wed
Mon,Tues
Thurs

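I want to score each row by counting how many words from a given set appear in it. Say the set is: Mon, Tues, Wed

How can I create a column with the corresponding scores? The result would be: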

Scores
3
3
3
2
0

Thanks in advance!

2 answers:

Answer 0: (score: 3)

Here is a base R solution:

words <- c("Mon", "Tues", "Wed")
# Split each row on commas, then count the tokens that belong to the word set
sapply(strsplit(as.character(df$Col1), ","), function(x) sum(x %in% words))
#[1] 3 3 3 2 0

Or store the result in a Scores column:

df$Scores <- sapply(strsplit(as.character(df$Col1), ","), function(x) sum(x %in% words))
df
#                    Col1 Scores
#1 Mon,Tues,Wed,Thurs,Fri      3
#2     Mon,Tues,Wed,Thurs      3
#3           Mon,Tues,Wed      3
#4               Mon,Tues      2
#5                  Thurs      0
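The one-liner above can be unpacked into its two steps, which may make it easier to adapt. This is only a sketch under the assumption that the column holds comma-separated tokens with no surrounding spaces; `tokens` and `scores` are illustrative names, and `vapply` is swapped in for `sapply` so that the result type is fixed as integer.

```r
# Sample data mirroring the question (assumed here to be a character column)
df <- data.frame(
  Col1 = c("Mon,Tues,Wed,Thurs,Fri", "Mon,Tues,Wed,Thurs",
           "Mon,Tues,Wed", "Mon,Tues", "Thurs"),
  stringsAsFactors = FALSE
)
words <- c("Mon", "Tues", "Wed")

# Step 1: split each row on literal commas; the result is a list of character vectors
tokens <- strsplit(df$Col1, ",", fixed = TRUE)

# Step 2: for each row, flag the tokens found in the word set and count the TRUEs
scores <- vapply(tokens, function(x) sum(x %in% words), integer(1))
scores
```

Because `%in%` compares whole tokens, a value such as "Monday" would not be counted as "Mon" with this approach.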

Or using transform and purrr::map_int:

library(purrr)
# Convert to character first so the code also works when Col1 is a factor
transform(df, Scores = map_int(as.character(Col1), function(x)
    sum(unlist(strsplit(x, ",")) %in% words)))
#                    Col1 Scores
#1 Mon,Tues,Wed,Thurs,Fri      3
#2     Mon,Tues,Wed,Thurs      3
#3           Mon,Tues,Wed      3
#4               Mon,Tues      2
#5                  Thurs      0

Sample data

df <- read.table(text =
    "Col1
Mon,Tues,Wed,Thurs,Fri
Mon,Tues,Wed,Thurs
Mon,Tues,Wed
Mon,Tues
Thurs", header = TRUE)

Answer 1: (score: 2)

We can use str_count after pasting the words vector into a single regex pattern:

library(stringr)
df1$Scores <- str_count(df1$Col1, paste(words, collapse="|"))
df1$Scores
#[1] 3 3 3 2 0

Another option is gregexpr from base R:

res <- gregexpr(paste0(words, collapse="|"), df1$Col1)
df1$Scores <-  lengths(res) * !sapply(res, function(x) -1 %in% x)
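One caveat with both regex approaches: a plain alternation such as `Mon|Tues|Wed` matches substrings, so it would also count "Mon" inside a longer token. The sketch below shows one possible way to restrict matches to whole tokens with word boundaries (`\\b`). The value "Monday" and the helper `count_matches` are hypothetical and added only for illustration; `regmatches` is used here as an alternative to multiplying by the no-match test.

```r
words <- c("Mon", "Tues", "Wed")
# "Monday" is a made-up value used only to show the difference
x <- c("Mon,Tues,Wed", "Monday,Wed", "Thurs")

# Plain alternation: also matches "Mon" inside "Monday"
loose <- paste(words, collapse = "|")
# Word boundaries restrict matches to whole tokens
strict <- paste0("\\b(", loose, ")\\b")

# Hypothetical helper: number of pattern matches per element of text
count_matches <- function(pattern, text) {
  m <- gregexpr(pattern, text, perl = TRUE)
  # regmatches() yields character(0) for elements without a match,
  # so lengths() returns 0 for them
  lengths(regmatches(text, m))
}

count_matches(loose, x)
#[1] 3 2 0
count_matches(strict, x)
#[1] 3 1 0
```

If the data only ever contains the exact abbreviations from the question, the plain pattern gives the same result and the extra boundaries are not needed.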

Data

words <- c("Mon", "Tues", "Wed")
df1 <- structure(list(Col1 = c("Mon,Tues,Wed,Thurs,Fri", "Mon,Tues,Wed,Thurs", 
"Mon,Tues,Wed", "Mon,Tues", "Thurs")), .Names = "Col1",
  class = "data.frame", row.names = c(NA, 
 -5L))