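I am using NGramTokenizer() to produce 1- to 3-grams, but it ignores punctuation and strips it out, so n-grams are built across punctuation boundaries.
The resulting terms are therefore not what I want
(for example: "oxidant amino acid", "oxidant amino", "pellet oxidant", and so on).
Is there a way to tokenize that keeps the punctuation? (I think I could then use POS tagging to filter out the strings that contain punctuation after tokenization.)
Or is there some other way to take punctuation into account when building the n-grams? That would be even better for me.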
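library(tm)     # VCorpus, VectorSource, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control
text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."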
corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)
[1] "acid" "acid and" "acid and water"
[4] "amino" "amino acid" "amino acid and"
[7] "and" "and water" "attrition"
[10] "attrition pellet" "attrition pellet oxidant" "includes"
[13] "includes attrition" "includes attrition pellet" "oxidant"
[16] "oxidant amino" "oxidant amino acid" "pellet"
[19] "pellet oxidant" "pellet oxidant amino" "slurry"
[22] "slurry includes" "slurry includes attrition" "the"
[25] "the slurry" "the slurry includes" "water"
Answer 0 (score: 2)
You can use the tokens function from the quanteda package, as follows:
library(quanteda)
text <- "some text, with commas, and semicolons; and even fullstop. to be toekinzed"
tokens(text, what = "word", remove_punct = FALSE, ngrams = 1:3)
Output:
tokens from 1 document.
text1 :
[1] "some" "text" "," "with"
[5] "commas" "," "and" "semicolons"
[9] ";" "and" "even" "fullstop"
[13] "." "to" "be" "toekinzed"
[17] "some text" "text ," ", with" "with commas"
[21] "commas ," ", and" "and semicolons" "semicolons ;"
[25] "; and" "and even" "even fullstop" "fullstop ."
[29] ". to" "to be" "be toekinzed" "some text ,"
[33] "text , with" ", with commas" "with commas ," "commas , and"
[37] ", and semicolons" "and semicolons ;" "semicolons ; and" "; and even"
[41] "and even fullstop" "even fullstop ." "fullstop . to" ". to be"
[45] "to be toekinzed"
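For more information on each argument of the function, see the documentation. Note that the ngrams argument belongs to older quanteda releases; in newer versions, as far as I know, n-grams are built in a separate step with tokens_ngrams().
UPDATE: For document term frequencies, have a look at Constructing a document-frequency matrix.
As an example, try the following.
For bigrams (note that you do not need to tokenize first):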
dfm(text, remove_punct = FALSE, ngrams = 2, concatenator = " ")
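Once punctuation is kept as its own token, the filtering step the question mentions (dropping every n-gram that contains punctuation) can be done with a plain regular expression. The following is only a minimal base R sketch: the ngrams vector is a hand-copied subset of the output above, not something produced by quanteda here.

```r
# A few n-grams copied from the tokens() output above (illustrative subset)
ngrams <- c("some text", "text ,", ", with", "with commas",
            "fullstop . to", "to be")

# TRUE for every n-gram that contains at least one punctuation character
has_punct <- grepl("[[:punct:]]", ngrams)

# Keep only the n-grams that do not span a punctuation mark
clean_ngrams <- ngrams[!has_punct]
print(clean_ngrams)
```

On a quanteda tokens object, tokens_remove() with valuetype = "regex" should allow the same kind of filter, but I have not shown that here.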
Answer 1 (score: 1)
You can pass the corpus through tm_map before building the DTM, for example:
library(tm)     # VCorpus, tm_map, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control
text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."
corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation) # strip punctuation characters
corpus <- tm_map(corpus, stripWhitespace) # collapse runs of whitespace
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "and")) # drop stopwords ("and" listed explicitly)
return(corpus)
}
corpus_text <- clean_corpus(corpus_text)
content(corpus_text[[1]]) # corpus_text was already cleaned on the previous line
#" slurry includes attrition pellet oxidant amino acid water"
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)
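Note that removing punctuation before tokenizing still lets n-grams run across the places where the punctuation used to be. If the goal is for n-grams never to span a punctuation mark, one option is to split the text at punctuation first and build the n-grams inside each segment. Below is a rough base R sketch of that idea; punct_aware_ngrams is a hypothetical helper written for this illustration, not part of tm, RWeka or quanteda.

```r
# Hypothetical helper: build min_n..max_n word n-grams that never cross punctuation.
punct_aware_ngrams <- function(text, min_n = 1, max_n = 3) {
  # Split the lower-cased text into segments at runs of punctuation
  segments <- unlist(strsplit(tolower(text), "[[:punct:]]+"))
  out <- character(0)
  for (seg in segments) {
    # Split each segment into words and drop empty strings
    words <- unlist(strsplit(trimws(seg), "[[:space:]]+"))
    words <- words[nzchar(words)]
    for (n in seq(min_n, max_n)) {
      if (length(words) < n) next  # segment too short for this n
      for (i in seq_len(length(words) - n + 1)) {
        out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
      }
    }
  }
  unique(out)
}

text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."
grams <- punct_aware_ngrams(text)
print(grams)
```

Because "[[:punct:]]" also matches hyphens and apostrophes, words such as "non-ionic" would be split as well; narrow the pattern if that matters for your text.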