library(tm)
library(e1071)
library(plyr)
sample = c(
"An Inductive Inference Machine",
"Computing Machinery and Intelligence",
"On the translation of languages from left to right",
"First Draft of a Report on the EDVAC",
"The Rendering Equation")
corpus <- Corpus(VectorSource(sample))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
Category=c("Machine learning","Artificial intelligence","Compilers","Computer architecture","Computer graphics")
my.data=data.frame(as.matrix(dtm),Category)
my.data
sample = c(
"gprof: A Call Graph Execution Profiler",
"Architecture of the IBM System/360",
"A Case for Redundant Arrays of Inexpensive Disks (RAID)",
"Determining Optical Flow",
"A relational model for large shared data banks",
"some complementarity problems of z and lyoponov like transformations on edclidean jordan algebra")
corpus <- Corpus(VectorSource(sample))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
dtm1 <- DocumentTermMatrix(corpus)
inspect(dtm1)
Answer 0 (score: 0)
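Well, your sample data has absolutely no overlapping terms, so there is not much you can do. The tm library does not assign meaning to words; it only measures how related they are. So you need to supply enough overlapping data that new input can plausibly be matched against the existing corpus.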
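To see the lack of overlap concretely, here is a minimal base-R sketch that compares the vocabularies of a few of the titles from the question. The `tokenize` helper is hypothetical and only lower-cases and splits on non-letters; it does no stemming or stopword removal. With tm you would probably compare `Terms(dtm)` and `Terms(dtm1)` instead.

```r
# Hypothetical helper: lower-case the titles and split them into word tokens.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nzchar(words)]
}

# A few titles taken from the two sets in the question.
train_titles <- c("An Inductive Inference Machine",
                  "Computing Machinery and Intelligence",
                  "The Rendering Equation")
new_titles <- c("Determining Optical Flow",
                "A relational model for large shared data banks")

train_vocab <- unique(tokenize(train_titles))
new_vocab   <- unique(tokenize(new_titles))

# Terms that occur in both sets; an empty result means a term-based
# model has nothing it can use to relate the new titles to the old ones.
shared <- intersect(train_vocab, new_vocab)
print(shared)
print(length(shared))
```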
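Once you actually have real data, there are several ways to build a model. You could use a kNN classifier from the class package, a decision tree from the rpart package, or a neural network from the nnet package. There are examples of each in this presentation. But it is up to you to decide what is appropriate for your data. That part is not a programming question.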
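As a rough illustration of the kNN option, the sketch below classifies new titles by counting their terms over the training vocabulary and calling `knn()` from the class package. The titles and labels are made up for illustration (they are not the data from the question), and `tokenize` and `count_terms` are hypothetical helpers written in base R rather than tm. With tm, the equivalent step would presumably be building the second matrix with `control = list(dictionary = Terms(dtm))` so both matrices share the same columns.

```r
library(class)  # provides knn()

# Made-up training titles with overlapping words, for illustration only.
train_titles <- c("machine learning for text",
                  "learning with neural networks",
                  "compiler design and parsing",
                  "parsing techniques for compiler writers")
train_labels <- factor(c("Machine learning", "Machine learning",
                         "Compilers", "Compilers"))
new_titles <- c("deep learning networks", "a fast parsing compiler")

# Hypothetical helper: lower-case text and split it into word tokens.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nzchar(words)]
}

# The vocabulary comes from the training titles only.
vocab <- sort(unique(tokenize(train_titles)))

# Hypothetical helper: one row per title, one column per vocabulary term.
# Words outside the vocabulary are dropped, so new titles are projected
# into the same term space as the training titles.
count_terms <- function(titles, vocab) {
  m <- t(vapply(titles, function(title) {
    as.numeric(table(factor(tokenize(title), levels = vocab)))
  }, numeric(length(vocab))))
  dimnames(m) <- list(NULL, vocab)
  m
}

train_x <- count_terms(train_titles, vocab)
new_x   <- count_terms(new_titles, vocab)

# 1-nearest-neighbour classification on raw term counts.
pred <- knn(train = train_x, test = new_x, cl = train_labels, k = 1)
print(data.frame(title = new_titles, predicted = pred))
```

This only works because the new titles share words such as "learning" and "parsing" with the training titles. A title with no shared terms maps to an all-zero row, and its predicted label is then essentially arbitrary.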