Question

I'm new to R, and this is my first time using the tm package for text mining. After creating a corpus and a frequency matrix from my text vector, I noticed that some words had disappeared. Here is my code:

setwd("C:/Users/M/Dropbox/Research")
library(tm)
data=read.table("abstract.tex.clean",stringsAsFactors=FALSE)
data=unlist(data,use.names=FALSE)
stopwords=read.table("stopwords.txt",stringsAsFactors=FALSE)
stopwords=unlist(stopwords,use.names=FALSE)
i=1
while(i<=length(stopwords)){data=data[data != stopwords[i]];i=i+1}

x = VCorpus(VectorSource(data))
dtm = DocumentTermMatrix(x)
dtm2 = as.matrix(dtm)
frequency = colSums(dtm2)
frequency = sort(frequency, decreasing=TRUE)

After running this, the commands

frequency["tas"]

and

length(which(data=="tas"))

give the same frequency (35).

However,

frequency["ta"]

returns NA, whereas

length(which(data=="ta"))

is 77.

Any help on why these terms disappeared would be appreciated!

Answer 1

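By default, when you call DocumentTermMatrix(), it only keeps track of words that are at least three characters long. You can change the minimum and maximum word length via the control= parameter.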

words<-c("tas","ta","pas","pa")
Terms(DocumentTermMatrix(VCorpus(VectorSource(words))))
# [1] "pas" "tas"
Terms(DocumentTermMatrix(VCorpus(VectorSource(words)), control=list(wordLengths=c(1,Inf))))
# [1] "pa"  "pas" "ta"  "tas"

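For more details, I recommend reading the ?DocumentTermMatrix help page; the individual control options, including wordLengths, are described under ?termFreq.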
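As a minimal sketch of how this might be applied to the pipeline in the question (assuming the tm package is installed): the `data` vector below is a small made-up stand-in for the contents of the asker's file, not their real data, and the only change to the original code is the `control=` argument.

```r
library(tm)

# Hypothetical stand-in for the question's `data` vector (one token per element).
data <- c(rep("tas", 3), rep("ta", 5), "pas")

x <- VCorpus(VectorSource(data))

# Lower the minimum word length from the default of 3 to 1 so that
# two-letter terms such as "ta" are kept in the matrix.
dtm <- DocumentTermMatrix(x, control = list(wordLengths = c(1, Inf)))

frequency <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

frequency["ta"]              # no longer NA
length(which(data == "ta"))  # direct count on the vector, for comparison
```

With the lower bound set to 1, the count reported for "ta" should agree with the direct count on the vector, just as it already did for "tas".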

tm package R - Terms removed when creating corpus for text mining