Good afternoon, everyone.
I ran into a problem while trying to do some text mining. I have a dataset of 3,000 observations: several columns are categorical variables and one column is free text. For example:
| id | header | cate1 | cate2 |
|-------|--------|-------|-------|
| 75641 | &lt;text&gt; | 1 | 0 |
| 71245 | &lt;text&gt; | 0 | 0 |
When I run the text-mining steps while keeping the ids of the original data in the corpus, stemming does not work at all (related word forms are left unmerged in the result), although the other transformations work fine. I have tried many tricks from other questions asked here, but it still does not work.
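For reference, this is what I expect stemming to do. Calling the Snowball stemmer directly on a few inflected forms (the words below are only an illustration, not from my data) should collapse them to one stem, and I expect stemDocument to do the same inside the corpus:

library(SnowballC)
# direct call to the Snowball Russian stemmer on related word forms
wordStem(c("книга", "книги", "книгой"), language = "russian")
# I expect all three to reduce to the same stem ("книг"),
# but in my corpus the full forms survive into the term matrix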
Here is the relevant part of my code:
dung<-read.csv("dung.csv")
library(RTextTools)
library(fpc)
library(cluster)
library(tm)
library(stringi)
library(stringr)
library(proxy)
library(wordcloud)
library(SnowballC)
library(ggplot2)
library(slam)
##################################
####### PREPROCESS HEADER #######
##################################
#Create new dataset
datah <- dung[,1:2] # keep only the first two columns: id and the text (header) column
remove(dung)
myReader <- readTabular(mapping=list(id="id",
content="header"))
mycorpus <- VCorpus(DataframeSource(datah), readerControl=list(reader=myReader))
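# (check I added, assuming tm's meta() accessor) the ids from the csv are attached
# to the documents as metadata, e.g. the first document carries id "75641":
meta(mycorpus[[1]], "id")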
##### Preprocessing #####
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))}) # define an extra transformer: replace a matched pattern with a space
a <- tm_map(mycorpus, toSpace, "-")
a <- tm_map(mycorpus, toSpace, "/")
a <- tm_map(mycorpus, PlainTextDocument)
a <- tm_map(mycorpus, stemDocument, language = "russian") # stem with the Snowball Russian stemmer
skipWords <- function(x) removeWords(x, stopwords("russian")) # helper: drop Russian stopwords
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs) # lowercase, strip punctuation/numbers/whitespace, drop stopwords in one pass
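# looking at one processed document: lowercasing, punctuation/number removal and
# stopword removal are applied as expected, but the word endings are still there
# (this is what I mean by stemming not working)
content(a[[1]])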
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10), # keep terms of 3-10 characters
                                              weighting = function(x) weightTfIdf(x, normalize = FALSE))) # unnormalized tf-idf
inspect(mydtm)
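This is how I check the resulting vocabulary; different grammatical forms of the same word show up as separate terms instead of one stemmed term (the word pattern below is only an illustration, not from my real data):

# first few terms of the matrix: full word forms, no stems
head(Terms(mydtm), 20)
# all surviving forms matching one illustrative word
grep("^книг", Terms(mydtm), value = TRUE)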