Why does the ngrams() function give distinct bigrams?

Asked: 2015-09-29 17:25:04

Tags: r nlp n-gram

I am writing an R script and am using library(ngram).

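Suppose I have a string,

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"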

and want to find its bi-grams.

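The ngram library gives me the following bi-grams:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"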

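Since the sentence contains "dog food" twice, I want that bi-gram twice. But I only get it once!

Is there an option in the ngram library, or in any other library, that gives all the bi-grams of my sentence in R, duplicates included?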

5 answers:

Answer 0 (score: 6)

The development version of ngram has a get.phrasetable method:

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text)
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154

Alternatively, you can use the print() method and specify output = "full". That is:

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...
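
For readers who cannot install the development version, the same kind of frequency table can be approximated in base R. This is only a sketch: it splits on single spaces, and the `bigram_counts` data frame and its column layout are an illustrative imitation of the phrase table above, not output of the ngram package.

```r
# Base-R sketch: count how often each adjacent word pair occurs.
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

words   <- strsplit(txt, " ", fixed = TRUE)[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))  # every adjacent pair, duplicates kept

# table() tallies the duplicates; sort so the most frequent pairs come first
freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical layout, loosely modelled on the phrase table shown above
bigram_counts <- data.frame(
  ngrams = names(freq),
  freq   = as.integer(freq),
  prop   = as.numeric(freq) / length(bigrams),
  stringsAsFactors = FALSE
)
head(bigram_counts, 2)
```

The two pairs that occur twice each get a proportion of 2/26, which agrees with the 0.0769 reported in the phrase table.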

Answer 1 (score: 5)

You can use the stylo package. It keeps the duplicates:

library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)

Result:

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"        "can dog"          "dog food"        
[10] "food product"     "product found"    "found good"       "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  

Answer 2 (score: 3)

You can use RWeka. In the result you can see that "dog food" and "good qualiti" each appear twice.

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"


library(RWeka)
RWEKABigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
}

RWEKABigramTokenizer(txt)

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"       
 [8] "can dog"          "dog food"         "food product"     "product found"    "found good"       "good qualiti"     "qualiti product" 
[15] "product look"     "look like"        "like stew"        "stew process"     "process meat"     "meat smell"       "smell better"    
[22] "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  

Or use the tm package in combination with RWeka:

library(tm)
library(RWeka)
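# VCorpus rather than Corpus: in recent tm versions Corpus() can return a
# SimpleCorpus, for which custom tokenizers are ignored
my_corp <- VCorpus(VectorSource(txt))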
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))

#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)

[1] "dog food"     "good qualiti"

#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)

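Answer 3 (score: 3)

You don't need any special package to generate bigrams like these. Basically, split the text into words and paste each word back together with the one that follows it.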

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(txt, " ")[[1]]
# pair each word with its successor; dropping the last and the first word keeps
# both vectors the same length, so paste() does not recycle
bg <- paste(ug[-length(ug)], ug[-1])

The result bg will be:

 [1] "good qualiti"     "qualiti dog"      "dog food"
 [4] "food bought"      "bought sever"     "sever vital"
 [7] "vital can"        "can dog"          "dog food"
[10] "food product"     "product found"    "found good"
[13] "good qualiti"     "qualiti product"  "product look"
[16] "look like"        "like stew"        "stew process"
[19] "process meat"     "meat smell"       "smell better"
[22] "better labrador"  "labrador finicki" "finicki appreci"
[25] "appreci product"  "product better"
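
The split-and-paste idea is not limited to pairs. As a rough sketch, a small helper can build n-grams of any size from a word vector; `make_ngrams` is a hypothetical name introduced here for illustration and is not part of any package mentioned in this thread.

```r
# Hypothetical helper: all n-grams of a word vector, duplicates kept.
make_ngrams <- function(words, n = 2) {
  stopifnot(n >= 1)
  count <- length(words) - n + 1           # how many n-grams fit
  if (count < 1) return(character(0))      # input shorter than n
  vapply(seq_len(count),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit("can dog food product", " ", fixed = TRUE)[[1]]
make_ngrams(words, 2)  # "can dog" "dog food" "food product"
make_ngrams(words, 3)  # "can dog food" "dog food product"
```

Because each window is built by position, repeated phrases such as "dog food" appear as many times as they occur in the text.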

Answer 4 (score: 1)

Try the quanteda package:

> quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
[[1]]
 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"     
 [7] "vital can"        "can dog"          "dog food"         "food product"     "product found"    "found good"      
[13] "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci" 
[25] "appreci product"  "product better"  

ngrams offers many other arguments, including getting different combinations of n sizes, or skip-grams.
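
To make the term concrete: a skip-gram pairs words that are not directly adjacent. The base-R sketch below illustrates the idea only and does not use quanteda. The helper name `skip_bigrams` and its `skip` argument (the number of words jumped over) are assumptions made for this example.

```r
# Hypothetical helper: bigrams whose two words are separated by `skip` words.
# skip = 0 gives ordinary adjacent bigrams.
skip_bigrams <- function(words, skip = 1) {
  stopifnot(skip >= 0)
  gap <- skip + 1                               # distance between the paired words
  if (length(words) <= gap) return(character(0))
  paste(head(words, -gap), tail(words, -gap))
}

words <- c("can", "dog", "food", "product")
skip_bigrams(words, skip = 0)  # "can dog" "dog food" "food product"
skip_bigrams(words, skip = 1)  # "can food" "dog product"
```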