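Whenever I run the inspect() function from the tm R package, I get a character count instead of the content of the documents. This happens no matter which data source I use.
Here is my code: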
library(tm)
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))
inspect(corp)
An example of my output:
inspect(corp)
VCorpus
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
[[1]]
PlainTextDocument
Metadata: 7
Content: chars: 13
[[2]]
PlainTextDocument
Metadata: 7
Content: chars: 14
[[3]]
PlainTextDocument
Metadata: 7
Content: chars: 15
But what I want is:
[[1]]
PlainTextDocument
Metadata: 7
one two three
[[2]]
PlainTextDocument
Metadata: 7
two three four
[[3]]
PlainTextDocument
Metadata: 7
three four five
Here is another example, using the Ovid text files that ship with the tm package and are referenced in Ingo Feinerer's "Introduction to the tm Package" vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Code:
txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))
inspect(ovid[1:2])
What I want, and what it should output:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument (metadata: 7)>>
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis,
Tiphys in Haemonia puppe magister erat:
me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet:
sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem,
atque animos placida contudit arte feros.
qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
[[2]]
<<PlainTextDocument (metadata: 7)>>
quas Hector sensurus erat, poscente magistro
verberibus iussas praebuit ille manus.
Aeacidae Chiron, ego sum praeceptor Amoris:
saevus uterque puer, natus uterque dea.
sed tamen et tauri cervix oneratur aratro,
frenaque magnanimi dente teruntur equi;
et mihi cedet Amor, quamvis mea vulneret arcu
pectora, iactatas excutiatque faces.
quo me fixit Amor, quo me violentius ussit,
What it outputs for me:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
Content: chars: 48
Content: chars: 46
Content: chars: 47
Content: chars: 0
Content: chars: 52
Content: chars: 48
Content: chars: 46
Content: chars: 46
Content: chars: 53
Content: chars: 0
Content: chars: 49
Content: chars: 49
Content: chars: 50
Content: chars: 49
Content: chars: 44
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 48
Content: chars: 47
Content: chars: 47
Content: chars: 48
Content: chars: 46
Content: chars: 0
Content: chars: 48
Content: chars: 49
Content: chars: 45
Content: chars: 47
Content: chars: 45
Content: chars: 0
Content: chars: 51
Content: chars: 42
Content: chars: 45
Content: chars: 48
Content: chars: 44
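Answer 0 (score: 3)
Version 0.6-1 of the tm package changed the way documents are printed to the screen. It now outputs a compact representation of each document rather than the document text itself.
To get the document text, you need to apply the as.character() function to the documents in the corpus.
For example, using the ovid example (shown here with tm version 0.6-2):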
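If you are not sure which printing behavior your installation has, checking the installed version is a quick way to find out. This is only a minimal sketch: the variable names are my own, and the 0.6-1 cutoff is taken from the description above rather than from the package changelog.

```r
# packageVersion() comes from the utils package, which is attached by default.
tm_version <- utils::packageVersion("tm")

# Assumption: versions from 0.6-1 onward print the compact summary.
prints_compact <- tm_version >= "0.6-1"

message("Installed tm version: ", as.character(tm_version))
message("Compact printing expected: ", prints_compact)
```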
> txt <- system.file("texts", "txt", package = "tm")
> ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
readerControl = list(language = "lat"))
The new inspect() function outputs a compact representation of each document:
> inspect(ovid[1:2])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 676
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 700
To get the full text of a document, apply the as.character() function to the document you want to inspect (note that the output has been truncated):
> as.character(ovid[[1]])
[1] " Si quis in hoc artem populo non novit amandi,"
[2] " hoc legat et lecto carmine doctus amet."
[3] " arte citae veloque rates remoque moventur,"
[4] " arte leves currus: arte regendus amor."
To clean up the displayed output, combine the above with the writeLines() function:
> writeLines(as.character(ovid[[1]]))
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
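The same approach should work on the small vector-based corpus from the question. The sketch below rebuilds that corpus and prints the first document; first_text is just an illustrative name. It also shows that the "chars: 13" line in the compact summary is the character count of the text.

```r
library(tm)

# Rebuild the corpus from the question.
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# as.character() on a single document returns its text as a character vector.
first_text <- as.character(corp[[1]])

# writeLines() prints the text without quotes or index prefixes.
writeLines(first_text)

# The count shown by the compact summary is the number of characters in the text.
print(nchar(first_text))
```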
To do this for several documents in the corpus, combine the above with the lapply() function (output truncated):
> lapply(ovid[1:2], as.character)
$ovid_1.txt
[1] " Si quis in hoc artem populo non novit amandi,"
[2] " hoc legat et lecto carmine doctus amet."
[3] " arte citae veloque rates remoque moventur,"
[4] " arte leves currus: arte regendus amor."
$ovid_2.txt
[1] " quas Hector sensurus erat, poscente magistro"
[2] " verberibus iussas praebuit ille manus."
[3] " Aeacidae Chiron, ego sum praeceptor Amoris:"
[4] " saevus uterque puer, natus uterque dea."
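Finally, to clean up this output and roughly replicate the earlier inspect() behavior, try the l_ply() function from the plyr package, as follows (output truncated):
> library(plyr)
> l_ply(ovid[1:2], function(doc) {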
print(doc) # output summary of document
writeLines("") # output blank line between results
writeLines(as.character(doc)) # output clean document text
writeLines("") # output blank line between results
})
<<PlainTextDocument>>
Metadata: 7
Content: chars: 676
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
<<PlainTextDocument>>
Metadata: 7
Content: chars: 700
quas Hector sensurus erat, poscente magistro
verberibus iussas praebuit ille manus.
Aeacidae Chiron, ego sum praeceptor Amoris:
saevus uterque puer, natus uterque dea.
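If you would rather not depend on plyr, a plain loop can do roughly the same thing. The helper below, print_corpus_text(), is a hypothetical name of my own and not part of tm; it is a sketch shown on the small corpus from the question.

```r
library(tm)

# Hypothetical helper (not part of tm): print each document's compact
# summary followed by its text, similar to the older inspect() output.
print_corpus_text <- function(corpus) {
  for (i in seq_len(length(corpus))) {
    doc <- corpus[[i]]
    print(doc)                     # compact summary of the document
    writeLines("")                 # blank line between summary and text
    writeLines(as.character(doc))  # clean document text
    writeLines("")                 # blank line between documents
  }
  invisible(corpus)
}

data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))
print_corpus_text(corp)
```

Returning the corpus invisibly keeps the helper usable in a pipeline without printing the corpus summary a second time.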
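Hope this helps!
Answer 1 (score: 0)
This happens because something was changed in the latest version of the 'tm' package, 0.6-1, released on May 7. I simply went back to version 0.6 and it works.
Download the archive tm_0.6.tar.gz from this link: http://cran.r-project.org/src/contrib/Archive/tm/
Install it through RStudio: Tools > Install Packages > Package Archive File > select tm_0.6.tar.gz and install.
That's all :)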
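The same installation can probably be done from the R console instead of the RStudio menu. The sketch below assumes the archive was saved to a Downloads folder; archive_path is a hypothetical location, so adjust it to wherever you put the file. Installing from a source archive also assumes that the package's dependencies and any required build tools are already present.

```r
# Hypothetical local path to the downloaded archive; adjust as needed.
archive_path <- file.path("~", "Downloads", "tm_0.6.tar.gz")

if (file.exists(archive_path)) {
  # repos = NULL installs from the local file rather than a repository;
  # type = "source" is used because the archive is a source tarball.
  install.packages(archive_path, repos = NULL, type = "source")
} else {
  message("Archive not found at: ", archive_path)
}
```

After installing, packageVersion("tm") can be used to confirm which version is active.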