I have started working on a project that requires natural language processing and building a model on a support vector machine (SVM) in R.
I want to generate a Term Document Matrix that contains all the tokens.
Example:
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
library(NLP)
library(openNLP)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)
[[1]]
[1] "From" "month" "2" "the" "AST" "and" "total"
[8] "bilirubine" "were" "not" "measured" "."
[[2]]
[1] "16:OTHER" "-"
[3] "COMMENT" "REQUIRED"
[5] "IN" "COMMENT"
[7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"
[9] "consent" "not"
[11] "offered" "until"
[13] "T4" "."
[[3]]
[1] "M6" "is" "13" "days" "out" "of" "the" "visit" "window"
Then I generated a TDM:
tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity : 0%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms NULL
16:other 1
and 1
ast 1
bilirubine 1
column;07/02/2004/genotyping;sf- 1
comment 2
consent 1
days 1
from 1
genotyping 1
measured 1
month 1
not 2
offered 1
out 1
required 1
the 2
total 1
until 1
visit 1
were 1
window 1
I actually have three documents in my dataset:
"From month 2 the AST and total bilirubine were not measured.",
"16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
"M6 is 13 days out of the visit window"
so the matrix should show 3 document columns, but only one column is shown here.
Can someone give me some advice on this?
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-2 openxlsx_3.0.0 magrittr_1.5 RWeka_0.4-28 openNLP_0.2-6 NLP_0.1-9
[7] rJava_0.9-8
Answer 0 (score: 0)
I think what you are doing is taking a list of 3 strings and then trying to turn it into a corpus. I am not sure whether having 3 different strings inside one list counts as 3 different documents.
I put your data into 3 txt files and ran it.
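In case it helps, one possible (untested) sketch of creating those files straight from R, using the same C:/texts folder as below:
# hypothetical helper: write each element of testset to its own txt file
dir.create("C:/texts", showWarnings = FALSE)
for (i in seq_along(testset)) {
  writeLines(testset[i], file.path("C:/texts", paste0("text", i, ".txt")))
}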
text_name <- file.path("C:", "texts")
dir(text_name)
[1] "text1.txt" "text2.txt" "text3.txt"
If you do not want to do any cleaning, you can turn it into a corpus directly:
docs <- Corpus(DirSource(text_name))
summary(docs)
Length Class Mode
text1.txt 2 PlainTextDocument list
text2.txt 2 PlainTextDocument list
text3.txt 2 PlainTextDocument list
dtm <- DocumentTermMatrix(docs)
dtm
<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms text1.txt text2.txt text3.txt
16:other 0 1 0
and 1 0 0
ast 1 0 0
bilirubine 1 0 0
column;07/02/2004/genotyping;sf- 0 1 0
comment 0 2 0
consent 0 1 0
days 0 0 1
from 1 0 0
genotyping 0 1 0
measured. 1 0 0
month 1 0 0
not 1 1 0
offered 0 1 0
out 0 0 1
required 0 1 0
the 1 0 1
total 1 0 0
until 0 1 0
visit 0 0 1
were 1 0 0
window 0 0 1
I think you may want to create 3 different lists and then convert those into a corpus. Let me know if this helps.
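A rough, untested sketch of that idea (assuming the testset vector from the question is still in the workspace): build one PlainTextDocument per string, then coerce the list to a corpus the same way the question does.
library(tm)
# one document per string, with an explicit id so the columns get readable names
doc_list <- lapply(seq_along(testset),
                   function(i) PlainTextDocument(testset[i], id = paste0("doc", i)))
docs2 <- as.VCorpus(doc_list)
tdm2 <- TermDocumentMatrix(docs2)
dim(tdm2)   # should report 3 document columns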
Answer 1 (score: 0)
So, given that you want each row of the text column to be treated as a document, first convert the list into a data frame:
df <- data.frame(testset, stringsAsFactors = FALSE)  # keep the text as character, not factor
install.packages("tm")
library(tm)
docs <- Corpus(VectorSource(df$testset))
summary(docs)
Length Class Mode
1 2 PlainTextDocument list
2 2 PlainTextDocument list
3 2 PlainTextDocument list
After this, follow the steps mentioned in the previous answer to get your tdm. This should solve your problem.
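For completeness, a minimal sketch of that final step, continuing from the docs corpus built just above:
tdm <- TermDocumentMatrix(docs)
inspect(tdm)   # should now report documents: 3, one column per row of testset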