R: Natural language processing with support vector machines - TermDocumentMatrix

Time: 2016-06-15 14:51:19

Tags: r nlp svm tm term-document-matrix

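I have started working on a project that requires natural language processing and building a model with a support vector machine (SVM) in R.

I want to generate a Term Document Matrix that contains all the tokens.

Example: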

library(NLP)      # annotate(), AnnotatedPlainTextDocument(), sents()
library(openNLP)  # Maxent_*_Token_Annotator()
library(tm)       # TermDocumentMatrix(), as.VCorpus()
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",  "M6 is 13 days out of the visit window")
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
# Sentence annotation has to run before word annotation
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)

[[1]]
 [1] "From"       "month"      "2"          "the"        "AST"        "and"        "total"     
 [8] "bilirubine" "were"       "not"        "measured"   "."         

[[2]]
 [1] "16:OTHER"                         "-"                               
 [3] "COMMENT"                          "REQUIRED"                        
 [5] "IN"                               "COMMENT"                         
 [7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"                      
 [9] "consent"                          "not"                             
[11] "offered"                          "until"                           
[13] "T4"                               "."                               

[[3]]
[1] "M6"     "is"     "13"     "days"   "out"    "of"     "the"    "visit"  "window" 

Then I generated a TDM:

tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity           : 0%
Maximal term length: 32
Weighting          : term frequency (tf)

                                  Docs
Terms                              NULL
  16:other                            1
  and                                 1
  ast                                 1
  bilirubine                          1
  column;07/02/2004/genotyping;sf-    1
  comment                             2
  consent                             1
  days                                1
  from                                1
  genotyping                          1
  measured                            1
  month                               1
  not                                 2
  offered                             1
  out                                 1
  required                            1
  the                                 2
  total                               1
  until                               1
  visit                               1
  were                                1
  window                              1

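I actually have three documents in the data set: "From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", and "M6 is 13 days out of the visit window", so the TDM should show three document columns. But only one column is shown here.

Can anyone give me some advice on this?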

sessionInfo()
    R version 3.3.0 (2016-05-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-2       openxlsx_3.0.0 magrittr_1.5   RWeka_0.4-28   openNLP_0.2-6  NLP_0.1-9     
[7] rJava_0.9-8   

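2 answers:

Answer 0 (score: 0)

I think what you are doing is taking a list of 3 strings and then trying to turn it into a corpus. I am not sure that 3 different strings in one list count as 3 different documents.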
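A minimal sketch of why the three strings may end up as one document. It assumes that annotate() from the NLP package coerces its input with as.String(), and that as.String() joins a character vector into a single string; both are my reading of the NLP package, not something stated in the question.

```r
library(NLP)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# Assumption: this mirrors the coercion annotate() applies to its input.
s <- as.String(testset)

length(testset)  # number of separate strings going in
length(s)        # number of String objects coming out
```

If the second length is smaller than the first, everything built from that string (the annotations, the AnnotatedPlainTextDocument, and the TDM) describes a single document, which would match the single column in the question.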

I put your data into 3 txt files and ran it.

text_name <- file.path("C:", "texts")  # "C:\" would escape the closing quote
dir(text_name)

[1] "text1.txt" "text2.txt" "text3.txt"
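The answer does not show how the three files were created. One way to reproduce the setup without a fixed C: path is sketched below; the temporary directory and the name text_dir are my own choices for illustration, not part of the original answer.

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# Hypothetical location: a fresh folder under the session's temp directory
text_dir <- tempfile("texts")
dir.create(text_dir)

# Write one file per string: text1.txt, text2.txt, text3.txt
for (i in seq_along(testset)) {
  writeLines(testset[i], file.path(text_dir, sprintf("text%d.txt", i)))
}

# Each file in the directory becomes one document in the corpus
docs <- Corpus(DirSource(text_dir))
length(docs)
```

Using a temporary directory keeps the example independent of the drive layout; any folder that contains only these three files should behave the same way.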

If you do not need any cleaning, you can convert the files directly into a corpus:
docs <- Corpus(DirSource(text_name)) 
summary(docs)
          Length Class             Mode
text1.txt 2      PlainTextDocument list
text2.txt 2      PlainTextDocument list
text3.txt 2      PlainTextDocument list

dtm <- DocumentTermMatrix(docs)   
dtm

<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

tdm <- TermDocumentMatrix(docs) 
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

inspect(tdm)


<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

                              Docs
Terms                              text1.txt text2.txt text3.txt
16:other                                 0         1         0
and                                      1         0         0
ast                                      1         0         0
bilirubine                               1         0         0
column;07/02/2004/genotyping;sf-         0         1         0
comment                                  0         2         0
consent                                  0         1         0
days                                     0         0         1
from                                     1         0         0
genotyping                               0         1         0
measured.                                1         0         0
month                                    1         0         0
not                                      1         1         0
offered                                  0         1         0
out                                      0         0         1
required                                 0         1         0
the                                      1         0         1
total                                    1         0         0
until                                    0         1         0
visit                                    0         0         1
were                                     1         0         0
window                                   0         0         1

I think you may want to create 3 separate lists (one per document) and then convert them into a corpus. Let me know if this helps.
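If the openNLP tokens from the question should be kept, one option is to annotate each string on its own so that every string gets its own annotated document. This is only a sketch: annotate_one is a hypothetical helper name, and the code needs a working Java setup for openNLP.

```r
library(NLP)
library(openNLP)
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()

# Hypothetical helper: annotate a single string and wrap it as one document
annotate_one <- function(txt) {
  s <- as.String(txt)
  a <- NLP::annotate(s, list(sent_ann, word_ann))
  AnnotatedPlainTextDocument(s, a)
}

# One annotated document per input string
annotated_docs <- lapply(testset, annotate_one)

# Same conversion call as in the question, now on a list of three documents
corpus <- as.VCorpus(annotated_docs)
length(corpus)
```

Passing this corpus to TermDocumentMatrix(), as in the question, would then be expected to give one column per document. I have not checked how the columns are labelled in that case, since the annotated documents carry no file names.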

Answer 1 (score: 0)

So, assuming you want each row of the text column to be treated as a document, convert the list to a data frame:

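# install.packages("tm")  # run once if tm is not installed yet
library(tm)
# Keep the text as character instead of converting it to a factor
df <- data.frame(testset, stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(df$testset))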
summary(docs)
  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list

After this, follow the steps from the previous answer to get your TDM. This should solve your problem.
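Putting both answers together, a minimal end-to-end sketch could look like the following. It uses tm's default tokenization rather than the openNLP annotators from the question, so the terms may differ slightly (for example in how trailing punctuation is handled).

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# Each element of the character vector becomes its own document
docs <- VCorpus(VectorSource(testset))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(docs)
dim(tdm)
```

The second value of dim(tdm) is the number of documents; inspect(tdm) then prints the per-document counts in the same layout as shown in the first answer.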