Count only the first instance of each keyword from a list in R, without repeat counts

Time: 2017-07-13 21:44:21

Tags: r web-scraping stringr

I have a list of keywords:

library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))

I want to match these keywords against the text in a data frame column (df$text) and store the number of times each keyword occurs in a separate data.frame (matchdf):

matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match

However, I noticed that this method counts every occurrence of each keyword within the column. For example, take this text:

"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"

This returns a count of 2 for "decomposed". However, I only want to count the first instance of "decomposed" within each field.

I thought there would be a way to count only the first instance using str_count, but there doesn't seem to be one.
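To make the over-counting concrete, here is a minimal reproduction of the behavior described above. It uses only the example sentence from the question; the variable names `txt` and `n` are introduced here for illustration.

```r
library(stringr)

# The example sentence from the question (a single field).
txt <- "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"

# str_count() counts every occurrence of the pattern in each string,
# so a keyword that appears twice in one field contributes 2.
n <- str_count(tolower(txt), "decomposed")
n
```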

1 Answer:

Answer 0: (score: 1)

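stringr isn't strictly necessary for this example; grepl from base R is enough. That said, use str_detect instead of grepl if you prefer the package function (as @Chi-Pak pointed out in the comments). Both return TRUE or FALSE for each field, so a keyword is counted at most once per field no matter how often it appears there.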

library(stringr)

words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots", 
           "poor body", "poor","not suitable", "not possible")

df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")

matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)

# Base R: grepl() gives TRUE/FALSE per row, so sum() counts each row at most once
matchdf$matches1 <- sapply(seq_along(words), function(x) sum(grepl(words[x], tolower(df$text))))

# stringr equivalent: str_detect() also gives TRUE/FALSE per row
matchdf$matches2 <- sapply(seq_along(words), function(x) sum(str_detect(tolower(df$text), words[[x]])))

matchdf
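
The example above has a single row, so it may help to see the same idea on a column with several rows. The sketch below is base R only and is not part of the original answer. The second row of `df`, the shortened `words` vector, and the name `row_hits` are invented for illustration. It assumes the keywords should be matched as literal strings, hence `fixed = TRUE`.

```r
# Hypothetical two-row data frame; the second row is made up for this sketch.
df <- data.frame(
  text = c(
    "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time",
    "Carcass autolysed and decomposed, no diagnosis"
  ),
  stringsAsFactors = FALSE
)

# Shortened keyword list, for brevity.
words <- c("decomposed", "no diagnosis", "autolysed", "maggots")

# grepl() returns one TRUE/FALSE per row, so sum() gives the number of rows
# that contain the keyword, not the total number of occurrences.
row_hits <- vapply(
  words,
  function(w) sum(grepl(w, tolower(df$text), fixed = TRUE)),
  integer(1)
)

matchdf <- data.frame(Keywords = words, matches = unname(row_hits),
                      stringsAsFactors = FALSE)
matchdf
```

Here "decomposed" gets a count of 2 because it appears in both rows, even though the first row mentions it twice. Note that this is plain substring matching, so a short keyword such as "poor" would also match inside "poor body".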