保持ID与语料库和词干

时间:2017-08-25 10:57:09

标签: r text-mining corpus

大家下午好。

当我尝试执行文本挖掘操作时遇到了问题。问题是我有一个3000 obs的数据集。有几列主要是分类变量,一列是文本。例如:

| id    | header | cate1 | cate2 | 
| 75641 | <text> |   1   |   0   |
| 71245 | <text> |   0   |   0   |

当我执行文本挖掘技术同时保持语料库中的原始数据的id时,词干完全不起作用(将结果保留为类似的词)。虽然,其他功能工作正常。我从其他提出的问题中尝试了很多技巧,但它仍然不起作用。

这里我附上了部分代码:

dung<-read.csv("dung.csv") 
library(RTextTools)
library(fpc)   
library(cluster)
library(tm)
library(stringi)
library(stringr)
library(proxy)
library(wordcloud)
library(SnowballC) 

library(ggplot2)
library(slam)


##################################
#######  PREPROCESS HEADER #######
##################################

#Create new dataset
datah <- dung[,1:2] #Here I take only 2 columns of the original data where only id and text columns
remove(dung)
library(tm)
myReader <- readTabular(mapping=list(id="id", 
                                     content="header"))
mycorpus <- VCorpus(DataframeSource(datah), readerControl=list(reader=myReader))

##### Preprocessing #####
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " " , x))}) # Defying additional function

a <- tm_map(mycorpus, toSpace, "-")
a <- tm_map(mycorpus, toSpace, "/")
a <- tm_map(mycorpus, PlainTextDocument)
a <- tm_map(mycorpus, stemDocument, language = "russian")
skipWords <- function(x) removeWords(x, stopwords("russian"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)

mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10),
                                              weighting = function(x) weightTfIdf(x, normalize = FALSE)))
inspect(mydtm)

0 个答案:

没有答案