performance - How to get the words from a word list that match a given sentence in R

Time: 2016-05-06 11:15:45

Tags: regex r

I am trying to get only those words from a list that occur in a given sentence. The list entries can include bigrams. For example,

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

myresult should be:

"really good" "better"

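I have 1,000 sentences like this that I need to compare against the word list, and the real word list is also larger. I tried a brute-force approach with grep, but it took a long time (as expected). I am looking for a better-performing way to get the matching words.

3 Answers:

Answer 0: (score: 2)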

library(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

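# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence
# (seq_len() is used because 1:length(x)-1 parses as (1:length(x)) - 1 and starts at 0)
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])} ))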

# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)

# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE), 
                      grams,
                      by=c('wordList'='grams'))

matches
     wordList
1 really good
2      better

Answer 1: (score: 0)

I would modify @epi99's answer slightly.

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence
# (seq_len() is used because 1:length(x)-1 parses as (1:length(x)) - 1 and starts at 0)
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])} ))

# .. and combine into a single vector
grams <- c(unigrams, bigrams)

# use match() to get, for each n-gram, its position in wordList (NA if absent)
matches <- match(grams, wordList)
matches <- na.omit(matches)
matchingwords <- wordList[matches]
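
Since the question involves many sentences, the n-gram idea above could be wrapped in a function and applied to each sentence in turn. The following is only a minimal sketch in base R: the function name `match_words` and the second example sentence are hypothetical, and it assumes words are separated by single spaces with no punctuation attached.

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentences <- c(
  "This is a really good program but it can be made better by making it more efficient",
  "I am happy"  # hypothetical second sentence, for illustration only
)

# match_words is a hypothetical helper, not part of any package
match_words <- function(sentences, word_list) {
  lapply(sentences, function(s) {
    unigrams <- unlist(strsplit(s, " ", fixed = TRUE))
    n <- length(unigrams)
    # adjacent pairs, built with one vectorized paste() call
    bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
    # keep the word-list entries that occur among the n-grams
    intersect(word_list, c(unigrams, bigrams))
  })
}

result <- match_words(sentences, wordList)
```

Here `result` is a list with one character vector per sentence. Because `intersect()` follows the order of its first argument, matches come back in word-list order and without duplicates.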

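Answer 2: (score: 0)

How about this?
# grep() finds each word-list entry that occurs in the sentence; the names of the result are the matching words
names(unlist(sapply(wordList, function(x) grep(x, sentence, fixed = TRUE))))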
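
A possible variation on the grep idea is to combine the whole word list into one regular expression, so that each sentence is scanned once rather than once per word. This is a sketch, not a benchmarked solution. It assumes the word-list entries contain no regex metacharacters and that no entry is a prefix of another (with an alternation, the entry listed first would win). The names `pattern` and `found` are illustrative.

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# one alternation pattern, anchored on word boundaries to avoid partial-word hits
pattern <- paste0("\\b(", paste(wordList, collapse = "|"), ")\\b")

# gregexpr() locates every match; regmatches() extracts the matched text
found <- regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))
```

`found` is a list with one element per sentence, so `sentence` can just as well be a vector of many sentences. Unlike the n-gram approaches, matches are returned in the order in which they appear in the sentence.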