I have (1) a set of sentences, (2) a set of keywords, and (3) a score (a real number) for each keyword. I need to assign a score to each sentence, where sentence score = sum_over_keywords(number of occurrences of the keyword in the sentence * keyword score).
A reproducible example:
library(stringi)
# generate 200 synthetic sentences containing 15 5-character words each
set.seed(7122016)
sentences_splitted = lapply(1:200, function(x) stri_rand_strings(15, 5))
# randomly select some words from the sentences as our keywords
set.seed(7122016)
keywords = unlist(lapply(sentences_splitted, function(x) if(sample(c(TRUE,FALSE),size=1,prob=c(0.2,0.8))) x[1]))
len_keywords = length(keywords)
# assign scores to keywords
set.seed(7122016)
my_scores = round(runif(len_keywords),4)
Now, score the sentences:
res = system.time(replicate(100,
  unlist(lapply(sentences_splitted, function(x)
    sum(unlist(lapply(1:len_keywords, function(y)
      length(grep(paste0("\\<", keywords[y], "\\>"), x)) * my_scores[y]
    )))))))
I have tried to optimize the code as much as possible, but it is still very slow:
user system elapsed
11.81 0.01 11.89
I need to repeat this operation more than 200,000 times... Is there anything faster than length(grep(paste0("\\<",keywords[y],"\\>"),x))? Should I use something other than nested lapply?
Answer (score: 3):
We can name the my_scores vector with the keywords. Recall that R allows subsetting by name, so once we have the matching words, we also have their scores:
names(my_scores) <- keywords
res <- sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
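As a side note, this works per occurrence because subsetting a named vector by a character vector returns one entry per element, repeats included. A minimal sketch (the names and values here are illustrative):

```r
# Named numeric vector: the names act as a lookup key
scores <- c(abc = 10, xyz = 20)

# Subsetting by a character vector returns one entry per element,
# so a word that appears twice contributes its score twice
words <- c("abc", "abc", "xyz")
scores[words]       # abc abc xyz -> 10 10 20
sum(scores[words])  # 40
```

This is why no explicit counting step is needed: summing the looked-up scores already weights each keyword by its number of occurrences.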
That is all that is needed. We can test it with a smaller, verifiable example:
#Create sentences
sentences_splitted <- list(c("abc", "def", "ghi", "abc"), c("xyz", "abc", "mno", "xyz"))
keywords <- c("abc", "xyz")
my_scores <- c(10,20)
#We should expect
# first sentence:  10 * 2          = 20
# second sentence: 10 * 1 + 20 * 2 = 50
#Expected result
[1] 20 50
#Check that function works as expected
names(my_scores) <- keywords
sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
[1] 20 50
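To gauge the speedup on the question's own synthetic data, the named-vector version can be dropped into the same replicate(100, ...) timing harness from the question (absolute timings will of course vary by machine):

```r
library(stringi)

# Rebuild the question's synthetic data
set.seed(7122016)
sentences_splitted <- lapply(1:200, function(x) stri_rand_strings(15, 5))
set.seed(7122016)
keywords <- unlist(lapply(sentences_splitted,
  function(x) if (sample(c(TRUE, FALSE), size = 1, prob = c(0.2, 0.8))) x[1]))
set.seed(7122016)
my_scores <- round(runif(length(keywords)), 4)

# Named-vector scoring, timed with the same harness as the question
names(my_scores) <- keywords
system.time(replicate(100,
  sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
))
```

One caveat: this relies on the keywords being unique, since duplicated names in my_scores would only ever match their first entry, whereas the original grep loop would sum every duplicate's score.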