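How can I efficiently count the number of occurrences of one string within another string?
Below is my code so far. It successfully detects whether any instance of a string occurs in another string. However, I don't know how to extend it from a TRUE/FALSE result to a count.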
x <- ("Hello my name is Christopher. Some people call me Chris")
y <- ("Chris is an interesting person to be around")
z <- ("Because he plays sports and likes statistics")
lll <- tolower(list(x,y,z))
dict <- tolower(c("Chris", "Hell"))
mmm <- matrix(nrow=length(lll), ncol=length(dict), NA)
for (i in 1:length(lll)) {
  for (j in 1:length(dict)) {
    mmm[i,j] <- sum(grepl(dict[j],lll[i]))
  }
}
mmm
It produces:
     [,1] [,2]
[1,]    1    1
[2,]    1    0
[3,]    0    0
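Since the lowercase string "chris" appears twice in lll[1], I would like mmm[1,1] to be 2 rather than 1.
The real example has much higher dimensions, so I would love it if the code could be vectorized instead of relying on my brute-force loops.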
Answer 0 (score: 7)
Quick tip: use the stringr package.
library(stringr)
dict <- setNames(nm=dict) # simply for neatness
lapply(dict, str_count, string=lll)
# $chris
# [1] 2 1 0
#
# $hell
# [1] 1 0 0
# or, as a matrix:
sapply(dict, str_count, string=lll)
#      chris hell
# [1,]     2    1
# [2,]     1    0
# [3,]     0    0
Answer 1 (score: 2)
Instead of sum(grepl(dict[j],lll[i])), try sum(gregexpr(dict[j],lll[i])[[1]] > 0)
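The suggestion above works because gregexpr() reports the start position of every match, and returns -1 when there is none, so counting the positive positions gives the number of occurrences. A minimal base R sketch, reusing the first example string from the question (the names hits, no_hits, n_hits, and n_no_hits, and the pattern "zzz", are illustrative and not from the original post):

```r
txt <- "hello my name is christopher. some people call me chris"

# grepl() only reports whether at least one match exists.
has_match <- grepl("chris", txt)

# gregexpr() returns the start position of every match, or -1 if there is none.
hits    <- gregexpr("chris", txt)[[1]]
no_hits <- gregexpr("zzz", txt)[[1]]

n_hits    <- sum(hits > 0)     # counts the positive start positions
n_no_hits <- sum(no_hits > 0)  # -1 is not > 0, so the count is 0
```

If the dictionary terms may contain regular-expression metacharacters such as "." or "(", passing fixed = TRUE to gregexpr() makes it treat the pattern as a literal string.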
Answer 2 (score: 2)
You can also do this:
count.matches <- function(pat, vec) sapply(regmatches(vec, gregexpr(pat, vec)), length)
mapply(count.matches, c('chris', 'hell'), list(lll))
#      chris hell
# [1,]     2    1
# [2,]     1    0
# [3,]     0    0
Answer 3 (score: 1)
llll<-rep(lll,length(dict))
dict1<-rep(dict,each=length(lll))
do.call(rbind,Map(function(x,y)list(y,sum(gregexpr(y,x)[[1]] > 0)), llll,dict1))
                                                        [,1]    [,2]
hello my name is christopher. some people call me chris "chris" 2
chris is an interesting person to be around             "chris" 1
because he plays sports and likes statistics            "chris" 0
hello my name is christopher. some people call me chris "hell"  1
chris is an interesting person to be around             "hell"  0
because he plays sports and likes statistics            "hell"  0
You can then use reshape to get the layout you want.
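One way the reshaping step might look in base R is sketched below. It rebuilds the long-format counts as a data frame and then calls stats::reshape(); the names long, wide, text_id, term, and count are assumptions made for illustration, not part of the original answer.

```r
lll  <- tolower(c("Hello my name is Christopher. Some people call me Chris",
                  "Chris is an interesting person to be around",
                  "Because he plays sports and likes statistics"))
dict <- c("chris", "hell")

# Long format: one row per (text, term) pair, in the same order as the Map() call.
long <- expand.grid(text_id = seq_along(lll), term = dict,
                    stringsAsFactors = FALSE)
long$count <- mapply(function(i, term) sum(gregexpr(term, lll[i])[[1]] > 0),
                     long$text_id, long$term)

# Wide format: one row per text, one count column per dictionary term.
wide <- reshape(long, idvar = "text_id", timevar = "term", direction = "wide")
```

Here wide has the columns text_id, count.chris, and count.hell, one row per input string.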
Answer 4 (score: 1)
This uses the qdap package. The CRAN version should work fine, but you may need the dev version.
library(qdap)
termco(c(x, y, z), 1:3, c('chris', 'hell'))
##   3 word.count     chris      hell
## 1 1         10 2(20.00%) 1(10.00%)
## 2 2          8 1(12.50%)         0
## 3 3          7         0         0
termco(c(x, y, z), 1:3, c('chris', 'hell'))$raw
##   3 word.count chris hell
## 1 1         10     2    1
## 2 2          8     1    0
## 3 3          7     0    0