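Match exact words against sentences and build a data frame in R

Time: 2016-09-26 12:23:46

Tags: r

I want to match a list of words against a list of sentences and build a data frame with the matched words (comma-separated) in one column and the corresponding sentence in the other. The words must match whole words in the sentence exactly, for example:

Example sentences and words: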

sentences <- c("This is crap","You are awesome","A great app",
               "My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")

Expected result:

sentences                                                     words
This is crap                                                 "crap"  
You are awesome                                              "awesome"
A great app                                                  "great" 
My advice would be to improve the look and feel of the app   "advice","improve"

I have thousands of such sentences (28k) to match against thousands of words (65k). I used the approach below, but it does not give me exact word matches.

library(stringi)
df <- data.frame(sentences)
# stri_detect_fixed() tests for substrings, so "vice" also hits "advice"
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))

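I have tried several other approaches, but none seems faster than this one. I cannot use it, however, because it does not match exact words; it matches any string that contains the word. Can someone suggest a solution that matches exact words without losing too much performance?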

3 answers:

Answer 0 (score: 1)

You can use stri_extract_all_regex from the stringi package:

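library(stringi)
# Join all words into one alternation pattern and extract every hit per sentence.
# Note: the pattern has no word boundaries, so a word can still match inside
# a longer one (for example "vice" inside "device").
data.frame(sentences = sentences, 
           words = sapply(stri_extract_all_regex(sentences, paste(words, collapse = '|')), paste, collapse = ','), 
           stringsAsFactors = FALSE)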

#                                                   sentences          words
#1                                               This is crap           crap
#2                                            You are awesome        awesome
#3                                                A great app          great
#4 My advice would be to improve the look and feel of the app advice,improve
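
The alternation above gives the expected output here only because the longer words happen to match at an earlier position or come earlier in the word list than the shorter ones they contain ("advice" is found before "vice" can start, and "awesome" is tried before "awe"). A minimal sketch of a stricter variant wraps the alternation in \b word-boundary anchors. It assumes the words contain no regex metacharacters, and the fifth sentence ("The device works") is a made-up case added only to show a non-match:

```r
library(stringi)

sentences <- c("This is crap", "You are awesome", "A great app",
               "My advice would be to improve the look and feel of the app",
               "The device works")  # hypothetical extra sentence, not from the question
words <- c("crap", "awesome", "great", "vice", "advice", "awe", "prove", "improve")

# Assumption: words contain no regex metacharacters; escape them first otherwise.
pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")

# omit_no_match = TRUE returns character(0) instead of NA when nothing matches
hits <- stri_extract_all_regex(sentences, pattern, omit_no_match = TRUE)

result <- data.frame(sentences = sentences,
                     words = vapply(hits, paste, character(1), collapse = ","),
                     stringsAsFactors = FALSE)
result
```

A sentence without any whole-word match gets an empty string in the words column. A word that occurs twice in a sentence is listed twice; wrapping each element of hits in unique() would avoid that.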

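Answer 1 (score: 1)

This is just a small tweak to prateek1592's concise approach, whose drawback is that it reports a match for the string "awe" in the second sentence, even though that sentence only contains the pattern inside a longer word and not the word itself. Changing the input to str_count() from x to paste0("\\b", x, "\\b") gives: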

sentences <- c("This is crap","You are awesome","A great app",
               "My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")

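library(stringr)
# One row per word, one column per sentence; \\b restricts matches to whole words
mat <- do.call(rbind,lapply(words, function(x) str_count(sentences, paste0("\\b", x, "\\b"))))
# Label the rows with the words; columns stay unnamed (X1..X4 in the data frame)
dimnames(mat) <- list(words,c())
data.frame(mat)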

Output:

        X1 X2 X3 X4
crap     1  0  0  0
awesome  0  1  0  0
great    0  0  1  0
vice     0  0  0  0
advice   0  0  0  1
awe      0  0  0  0
prove    0  0  0  0
improve  0  0  0  1
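
The question asks for one row per sentence with the matched words comma-separated, so the count matrix needs one more step. The following is a sketch of that step, not part of the original answer: it rebuilds the same matrix and, for each column, collapses the words with a non-zero count. The names matched and result are illustrative.

```r
library(stringr)

sentences <- c("This is crap", "You are awesome", "A great app",
               "My advice would be to improve the look and feel of the app")
words <- c("crap", "awesome", "great", "vice", "advice", "awe", "prove", "improve")

# Same count matrix as above: one row per word, one column per sentence
mat <- do.call(rbind, lapply(words, function(x) {
  str_count(sentences, paste0("\\b", x, "\\b"))
}))

# For each sentence (column), keep the words whose count is non-zero
matched <- apply(mat > 0, 2, function(hit) paste(words[hit], collapse = ","))

result <- data.frame(sentences = sentences, words = matched,
                     stringsAsFactors = FALSE)
result
```

Because this runs one regex pass over all sentences per word, it may be slow for the 65k words mentioned in the question; that cost has not been measured here.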

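Answer 2 (score: 0)

I think @Sotos has answered the question well. I would still like to add another representation, which may be more helpful for further analysis.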

sentences <- c("This is crap","You are awesome","A great app",
               "My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")

library(stringr)
# One row per word, one column per sentence.
# str_count() with a plain pattern counts substring occurrences, not whole words.
mat <- do.call(rbind,lapply(words, function(x) str_count(sentences, x)))
dimnames(mat) <- list(words,c())
data.frame(mat)

Output:

        X1 X2 X3 X4
crap     1  0  0  0
awesome  0  1  0  0
great    0  0  1  0
vice     0  0  0  1
advice   0  0  0  1
awe      0  1  0  0
prove    0  0  0  1
improve  0  0  0  1
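
As the rows for "vice", "awe" and "prove" show, plain patterns also count hits inside longer words. For exact-word matching at the scale mentioned in the question (28k sentences, 65k words), one alternative to a regex per word is to split the sentences into tokens and look the tokens up in the word list. This is a different technique from the answers above, sketched in base R under two assumptions: a word is a run of alphanumeric characters, and matching is case-sensitive. Its speed on the full data has not been measured here.

```r
sentences <- c("This is crap", "You are awesome", "A great app",
               "My advice would be to improve the look and feel of the app")
words <- c("crap", "awesome", "great", "vice", "advice", "awe", "prove", "improve")

# Assumption: a "word" is a run of alphanumeric characters
tokens <- strsplit(sentences, "[^[:alnum:]]+")

# Flatten the tokens and remember which sentence each one came from
flat <- unlist(tokens)
sentence_id <- rep(seq_along(tokens), lengths(tokens))

# A single vectorised lookup of all tokens against the word list
keep <- flat %in% words

# Regroup the hits per sentence; sentences without a hit get character(0)
groups <- split(flat[keep],
                factor(sentence_id[keep], levels = seq_along(sentences)))

matched <- vapply(groups, function(tok) paste(unique(tok), collapse = ","),
                  character(1), USE.NAMES = FALSE)

result <- data.frame(sentences = sentences, words = matched,
                     stringsAsFactors = FALSE)
result
```

Matched words appear in the order they occur in the sentence, and a sentence with no match gets an empty string.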