Testing dtm1 against dtm so that I can predict the categories of dtm1

Asked: 2014-05-27 14:56:39

Tags: r tm

Libraries

     library(tm)
     library(e1071)
     library(plyr)

Journal titles used to build the classifier

     # Training titles (named train_titles so base::sample is not masked)
     train_titles <- c(
         "An Inductive Inference Machine",
         "Computing Machinery and Intelligence",
         "On the translation of languages from left to right",
         "First Draft of a Report on the EDVAC",
         "The Rendering Equation")
     corpus <- Corpus(VectorSource(train_titles))
     corpus <- tm_map(corpus, removeNumbers)
     corpus <- tm_map(corpus, removePunctuation)
     # tm 0.6 and later expect base functions to be wrapped in content_transformer()
     corpus <- tm_map(corpus, content_transformer(tolower))
     corpus <- tm_map(corpus, removeWords, stopwords("english"))
     corpus <- tm_map(corpus, stemDocument, language = "english")
     corpus <- tm_map(corpus, stripWhitespace)
     dtm <- DocumentTermMatrix(corpus)

Document-term matrix used as the training set

     inspect(dtm)
     # One label per training title, in the same order as the titles
     Category <- c("Machine learning", "Artificial intelligence", "Compilers",
                   "Computer architecture", "Computer graphics")

Category declaration

     # Training data: one row per title, one column per term, plus the label
     my.data <- data.frame(as.matrix(dtm), Category)
     my.data
     # Titles whose categories should be predicted
     test_titles <- c(
         "gprof: A Call Graph Execution Profiler",
         "Architecture of the IBM System/360",
         "A Case for Redundant Arrays of Inexpensive Disks (RAID)",
         "Determining Optical Flow",
         "A relational model for large shared data banks",
         "some complementarity problems of z and lyoponov like transformations on edclidean jordan algebra")
     corpus <- Corpus(VectorSource(test_titles))
     corpus <- tm_map(corpus, removeNumbers)
     corpus <- tm_map(corpus, removePunctuation)
     # tm 0.6 and later expect base functions to be wrapped in content_transformer()
     corpus <- tm_map(corpus, content_transformer(tolower))
     corpus <- tm_map(corpus, removeWords, stopwords("english"))
     corpus <- tm_map(corpus, stemDocument, language = "english")
     corpus <- tm_map(corpus, stripWhitespace)
     dtm1 <- DocumentTermMatrix(corpus)

Document-term matrix used as the test set

     inspect(dtm1)

1 answer:

Answer 0 (score: 0)

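Well, your sample data has absolutely no overlapping terms: after stop-word removal and stemming, not a single term in the test titles also appears in the training titles, so there isn't much you can do. The tm library does not assign meaning to words; it only measures their co-occurrence. You therefore need to supply enough overlapping data that there is a chance of matching new input against the existing corpus.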
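A quick way to see the lack of overlap is to compare the two vocabularies directly. The following is a minimal sketch, not the asker's exact pipeline: `build_dtm` is a hypothetical helper, only four of the question's titles are used, and stemming is left out so the snippet does not depend on SnowballC. It also shows tm's `dictionary` control option, which rebuilds the test matrix over the training vocabulary so that both matrices have the same columns.

```r
library(tm)

# Hypothetical helper: build a document-term matrix from raw titles.
# If `dictionary` is given, only those terms are used as columns.
build_dtm <- function(titles, dictionary = NULL) {
  corpus <- VCorpus(VectorSource(titles))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  control <- if (is.null(dictionary)) list() else list(dictionary = dictionary)
  DocumentTermMatrix(corpus, control = control)
}

# A subset of the titles from the question
train_titles <- c("An Inductive Inference Machine",
                  "The Rendering Equation")
test_titles  <- c("Determining Optical Flow",
                  "Architecture of the IBM System/360")

dtm  <- build_dtm(train_titles)
dtm1 <- build_dtm(test_titles)

# Terms that occur in both the training and the test vocabulary
shared <- intersect(Terms(dtm), Terms(dtm1))
print(shared)

# Test matrix restricted to the training vocabulary, so columns line up
dtm1_aligned <- build_dtm(test_titles, dictionary = Terms(dtm))
```

If `shared` comes back empty, every row of `dtm1_aligned` is all zeros and no classifier trained on `dtm` can tell the test titles apart.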

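Once you actually have real data, you have several options for building a model. You could use the kNN classifier from the class package, a decision tree from the rpart package, or a neural network from the nnet package. There are examples of each in this presentation. But it is up to you to decide what suits your data. That part is not a programming question.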
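As an illustration of the kNN option, here is a minimal sketch using `knn` from the class package. The titles, the labels, the helper `to_dtm` and the choice `k = 1` are all assumptions made up for this example; they were picked so that the training and test vocabularies overlap, which the data in the question does not.

```r
library(tm)
library(class)

# Hypothetical helper: titles -> document-term matrix,
# optionally restricted to a given vocabulary.
to_dtm <- function(titles, dictionary = NULL) {
  corpus <- VCorpus(VectorSource(titles))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  control <- if (is.null(dictionary)) list() else list(dictionary = dictionary)
  DocumentTermMatrix(corpus, control = control)
}

# Made-up training data with overlapping vocabulary within each class
train_titles <- c("machine learning for text classification",
                  "learning machine models from data",
                  "rendering equation for computer graphics",
                  "graphics rendering with ray tracing")
train_labels <- factor(c("Machine learning", "Machine learning",
                         "Computer graphics", "Computer graphics"))
test_titles  <- c("a machine learning model",
                  "fast graphics rendering")

dtm  <- to_dtm(train_titles)
# Build the test matrix over the training vocabulary so columns match
dtm1 <- to_dtm(test_titles, dictionary = Terms(dtm))

train_x <- as.matrix(dtm)
test_x  <- as.matrix(dtm1)[, colnames(train_x), drop = FALSE]

# Nearest-neighbour prediction on raw term counts (k = 1 is an arbitrary choice)
pred <- knn(train = train_x, test = test_x, cl = train_labels, k = 1)
print(data.frame(title = test_titles, predicted = pred))
```

The same `train_x` and `train_labels` could be passed to rpart or nnet instead; which model is appropriate depends on the real data.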