I have a vector of text strings, for example:
Sentences <- c("Lorem ipsum dolor sit amet, WORD consetetur LOOK sadipscing elitr, sed diam nonumy.",
"Eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
"At vero eos LOOK et accusam et justo duo WORD dolores et ea rebum." ,
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem WORD ipsum dolor sit amet.",
"Lorem ipsum dolor sit amet, consetetur sadipscing LOOK elitr, sed diam nonumy eirmod tempor.",
"Invidunt ut labore et WORD dolore magna aliquyam erat, sed LOOK diam voluptua." ,
"Duis autem vel eum iriure dolor in hendrerit in LOOK vulputate velit esse LOOK molestie consequat.",
"El illum dolore eu feugiat nulla LOOK WORD",
"Facilisis at LOOK vero eros et accumsan et WORD iusto LOOK odio dignissim quit.",
"Blandit LOOK praesent WORD LOOK luptatum zzril delenit augue duis dolore te feugait nulla facilisi.")
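I want to COUNT, for each sentence, how often the word LOOK occurs within a maximum distance of the word WORD.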
The result should look like this (maximum distance: three):
Result <- c(1,0,0,0,0,0,0,1,1,2)
Thanks in advance.
Answer 0 (score: 2)
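Here is a possible solution. We write a function that takes the sentence, the two words to compare, and the maximum distance as input, with the distance defaulting to 3. We split the string to get a vector of words and find the positions of both words in that vector. Using expand.grid, we create a data.frame containing all combinations of the word positions and count how many of them are at most the maximum distance apart. That count is then returned.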
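To make the steps concrete, here is a small sketch that walks through them by hand for a shortened copy of the last sentence from the question. The variable names (s, words, pos_look, pos_word, pairs, dists) are illustrative only and do not appear in the answer's function.

```r
# Shortened copy of the last sentence from the question.
s <- "Blandit LOOK praesent WORD LOOK luptatum zzril"

# Split on single spaces to get a vector of words.
words <- strsplit(s, " ")[[1]]

# Positions of each target word in that vector.
pos_look <- which(words == "LOOK")   # 2 5
pos_word <- which(words == "WORD")   # 4

# One row per (LOOK position, WORD position) pair.
pairs <- expand.grid(pos_look, pos_word)
dists <- abs(pairs$Var1 - pairs$Var2)   # 2 1

# Number of pairs at most three words apart.
sum(dists <= 3)   # 2
```

Both LOOK occurrences lie within three words of WORD, which matches the last entry of the expected Result vector.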
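word1 <- 'LOOK'
word2 <- 'WORD'
count_word_dist <- function(x, word1, word2, max_dist = 3)
{
  # Split the sentence into words on single spaces
  x <- strsplit(x, " ")[[1]]
  # Positions of each target word
  w1 <- which(x == word1)
  w2 <- which(x == word2)
  # If both words occur, count all position pairs at most max_dist apart
  # (&& is the scalar logical AND intended for use inside if())
  if (length(w1) > 0 && length(w2) > 0)
    return(sum(with(expand.grid(w1, w2), abs(Var1 - Var2)) <= max_dist))
  else
    return(0)
}
result <- unname(sapply(Sentences, function(y) {count_word_dist(y, word1, word2)}))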
Output:
> result
[1] 1 0 0 0 0 0 0 1 1 2
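One caveat: because the string is split on single spaces, punctuation stays attached to the words, so a sentence ending in "WORD." would not match 'WORD'. The following is a minimal sketch of a variant that strips punctuation first and uses outer instead of expand.grid. The name count_word_dist2 is hypothetical, and dropping all punctuation is an assumption that may not suit every text.

```r
# Hypothetical variant; count_word_dist2 is not part of the answer above.
count_word_dist2 <- function(x, word1, word2, max_dist = 3) {
  # Assumption: punctuation can simply be dropped before splitting.
  tokens <- strsplit(gsub("[[:punct:]]", "", x), "\\s+")[[1]]
  w1 <- which(tokens == word1)
  w2 <- which(tokens == word2)
  # outer() builds the matrix of all pairwise position differences;
  # if either word is absent the matrix is empty and sum() gives 0.
  sum(abs(outer(w1, w2, "-")) <= max_dist)
}

count_word_dist2("Nulla LOOK WORD.", "LOOK", "WORD")      # 1
count_word_dist2("No match here at all", "LOOK", "WORD")  # 0
```

Since an empty position vector yields an empty matrix, the explicit length check from the original function is not needed here.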
Hope this helps!