"Bag of characters" n-grams in R

Asked: 2016-01-03 20:09:59

Tags: r machine-learning nlp tokenize n-gram

I want to create a term-document matrix that contains character n-grams. For example, take the following sentence:

"In this paper we focus on a different but simple text representation."

The character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, and so on.
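For reference, a minimal base-R sketch (not from the original post) that produces exactly these 4-grams, assuming whitespace is first replaced with underscores:

txt <- "In this paper we focus on a different but simple text representation."
s <- gsub(" ", "_", txt)   # make spaces visible as "_"
n <- 4
# all substrings of length n, sliding one character at a time
ngrams <- substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
head(ngrams, 10)
# [1] "In_t" "n_th" "_thi" "this" "his_" "is_p" "s_pa" "_pap" "pape" "aper"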

I have used the RWeka package for "bag of words" n-grams, but I am having trouble adapting the tokenizer below to work on characters:

library(tm)     # TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(tokenize = BigramTokenizer,
                                                wordLengths = c(2, Inf)))

Any ideas on how to create character n-grams using RWeka or another package?

2 answers:

Answer 0 (score: 4):

I found quanteda very useful:

library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
# replace whitespace with "_" so word boundaries survive character tokenization
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams = 4L, conc = "")
dfm <- dfm(tokens)
# transpose the document-feature matrix into a tm TermDocumentMatrix
tdm <- as.TermDocumentMatrix(t(dfm), weighting = weightTf)
as.matrix(tdm)
#       Docs
# Terms  text1 text2
#   In_t     1     1
#   n_th     1     1
#   _thi     1     2
#   this     1     2
#   his_     1     1
#   is_p     1     0
#   s_pa     1     0
#   _pap     1     0
#   pape     1     0
#   aper     1     0
#   per.     1     0
#   is_l     0     1
#   s_li     0     1
#   _lin     0     1
#   line     0     1
#   ines     0     1
#   nes_     0     1
#   es_t     0     1
#   s_th     0     1
#   his.     0     1
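Note that tokenize() above is an older quanteda API. In current quanteda releases the equivalent is roughly the following sketch (assumed mapping onto the newer tokens() / tokens_ngrams() functions, untested against your installed version):

library(quanteda)
txts <- c("In this paper.", "In this lines this.")
toks <- tokens(gsub("\\s", "_", txts), what = "character")  # character tokens
toks <- tokens_ngrams(toks, n = 4, concatenator = "")       # join into 4-grams
dfm(toks)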

Answer 1 (score: 1):

You need to use CharacterNGramTokenizer instead. NGramTokenizer splits on delimiter characters such as whitespace.

##########
### The following lines are mostly a one-to-one copy from RWeka.
### Only the hardcoded CharacterNGramTokenizer is new.
library(rJava)


CharacterNGramTokenizer <- structure(function (x, control = NULL) 
{
  tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
  x <- Filter(nzchar, as.character(x))
  if (!length(x)) 
    return(character())
  .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, 
                                                     "weka/core/tokenizers/Tokenizer"), .jarray(as.character(control)), 
         .jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"
), meta = structure(list(name = "weka/core/tokenizers/CharacterNGramTokenizer", 
                         kind = "R_Weka_tokenizer_interface", class = "character", 
                         init = NULL), .Names = c("name", "kind", "class", "init")))
### copy till here
###################

BigramTokenizer <- function(x){
    CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))}
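A hedged usage sketch of how this could plug into the question's setup (assumes corpus is an existing tm corpus, RWeka is loaded for Weka_control, and the min/max options pass through to Weka's CharacterNGramTokenizer as with NGramTokenizer):

library(tm)
library(RWeka)  # for Weka_control

# hypothetical 4-gram variant, mirroring the question's 4-character example
FourgramTokenizer <- function(x) {
  CharacterNGramTokenizer(x, Weka_control(min = 4, max = 4))
}

tdm_char4 <- TermDocumentMatrix(corpus,
                                control = list(tokenize = FourgramTokenizer))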

Sadly, it is not included in RWeka by default. But if you want to stick with Weka, this seems like a workable approach overall.