I want to create a term-document matrix containing character n-grams. For example, take the following sentence:
"In this paper we focus on a different, yet simple, text representation."
The character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc.
I have used the R/Weka package to handle "bag of words" n-grams, but I am having trouble adapting the tokenizer below to work on characters:
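As a sanity check, the 4-grams above can be generated directly in base R by replacing spaces with underscores and sliding a fixed-width window over the string (a minimal sketch; `char_ngrams` is a hypothetical helper name, not from any package):

```r
# Minimal base-R sketch: replace spaces with "_" and slide a window of width n.
char_ngrams <- function(txt, n = 4) {
  s <- gsub(" ", "_", txt)
  starts <- seq_len(nchar(s) - n + 1)
  vapply(starts, function(i) substr(s, i, i + n - 1), character(1))
}

char_ngrams("In this paper.")
# "In_t" "n_th" "_thi" "this" "his_" "is_p" "s_pa" "_pap" "pape" "aper" "per."
```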
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
tdm_bigram <- TermDocumentMatrix(corpus,
  control = list(tokenize = BigramTokenizer,
                 wordLengths = c(2, Inf)))
Any ideas on how to create character n-grams with R/Weka or another package?
Answer 0 (score: 4)
I found quanteda very useful:
library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams=4L, conc="")
dfm <- dfm(tokens)
tdm <- as.TermDocumentMatrix(t(dfm), weighting=weightTf)
as.matrix(tdm)
# Docs
# Terms text1 text2
# In_t 1 1
# n_th 1 1
# _thi 1 2
# this 1 2
# his_ 1 1
# is_p 1 0
# s_pa 1 0
# _pap 1 0
# pape 1 0
# aper 1 0
# per. 1 0
# is_l 0 1
# s_li 0 1
# _lin 0 1
# line 0 1
# ines 0 1
# nes_ 0 1
# es_t 0 1
# s_th 0 1
# his. 0 1
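Note that quanteda's API has changed since this answer was written: `tokenize()` was replaced by `tokens()`, and n-gram construction moved into `tokens_ngrams()`. A rough equivalent for current quanteda versions might be (an untested sketch, assuming a reasonably recent quanteda release):

```r
library(quanteda)

txts <- c("In this paper.", "In this lines this.")

# Tokenize into single characters (after mapping spaces to "_"),
# then join runs of 4 characters with an empty concatenator.
toks <- tokens(gsub("\\s", "_", txts), what = "character")
toks <- tokens_ngrams(toks, n = 4, concatenator = "")

m <- dfm(toks)
convert(m, to = "tm")  # yields a tm DocumentTermMatrix; transpose for a TDM
```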
Answer 1 (score: 1)
You need to use CharacterNGramTokenizer instead. NGramTokenizer splits on delimiter characters such as whitespace.
##########
### The following lines are mostly a one-to-one copy from RWeka;
### only the hardcoded CharacterNGramTokenizer class is new.
library(rJava)
CharacterNGramTokenizer <- structure(function (x, control = NULL)
{
tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
x <- Filter(nzchar, as.character(x))
if (!length(x))
return(character())
.jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
"weka/core/tokenizers/Tokenizer"), .jarray(as.character(control)),
.jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"
), meta = structure(list(name = "weka/core/tokenizers/CharacterNGramTokenizer",
kind = "R_Weka_tokenizer_interface", class = "character",
init = NULL), .Names = c("name", "kind", "class", "init")))
### copy till here
###################
BigramTokenizer <- function(x) {
  CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))
}
Sadly, it is not included in RWeka by default. However, if you want to stay with Weka, this seems to be a workable overall approach.
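With the tokenizer defined above, it can be plugged into tm the same way as the word-based version in the question (an untested sketch; it assumes `corpus` is an existing tm corpus and that the `BigramTokenizer` from the snippet above has been defined):

```r
library(tm)

# Assumes BigramTokenizer (character bigrams via CharacterNGramTokenizer)
# from the snippet above is already defined, and corpus is a tm corpus.
tdm_char <- TermDocumentMatrix(corpus,
  control = list(tokenize = BigramTokenizer,
                 wordLengths = c(2, Inf)))
inspect(tdm_char)
```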