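tm package: inspect() returns a character count instead of the content

Asked: 2015-05-11 21:05:10

Tags: r tm

Whenever I run the inspect() function from the tm R package, I get a character count instead of the content of the documents. This happens no matter which data source I use.

Here is my code: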

library(tm)

data <- c("one two three", "two three four", "three four five")

corp <- VCorpus(VectorSource(data))

inspect(corp)

Sample of my output:

inspect(corp)

VCorpus

Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
PlainTextDocument

Metadata:  7

Content:  chars: 13

[[2]]
PlainTextDocument

Metadata:  7

Content:  chars: 14

[[3]]
PlainTextDocument
Metadata:  7

Content:  chars: 15

But what I want is:

[[1]]
PlainTextDocument

Metadata:  7

one two three

[[2]]
PlainTextDocument

Metadata:  7

two three four

[[3]]
PlainTextDocument
Metadata:  7

three four five

Here is another example, using the Ovid text files that ship with the tm package and are referenced in "Introduction to the tm Package" by Ingo Feinerer: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Code:

txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))
inspect(ovid[1:2])

What I want, and what it should output:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

 [[1]]
<<PlainTextDocument (metadata: 7)>>
  Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis,
Tiphys in Haemonia puppe magister erat:
me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet:
sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem,
atque animos placida contudit arte feros.
qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
[[2]]
<<PlainTextDocument (metadata: 7)>>
quas Hector sensurus erat, poscente magistro
verberibus iussas praebuit ille manus.
Aeacidae Chiron, ego sum praeceptor Amoris:
saevus uterque puer, natus uterque dea.
sed tamen et tauri cervix oneratur aratro,
frenaque magnanimi dente teruntur equi;
et mihi cedet Amor, quamvis mea vulneret arcu
pectora, iactatas excutiatque faces.
quo me fixit Amor, quo me violentius ussit,

What it outputs for me:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49
Content:  chars: 48
Content:  chars: 46
Content:  chars: 47
Content:  chars: 0
Content:  chars: 52
Content:  chars: 48
Content:  chars: 46
Content:  chars: 46
Content:  chars: 53
Content:  chars: 0
Content:  chars: 49
Content:  chars: 49
Content:  chars: 50
Content:  chars: 49
Content:  chars: 44

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 48
Content:  chars: 47
Content:  chars: 47
Content:  chars: 48
Content:  chars: 46
Content:  chars: 0
Content:  chars: 48
Content:  chars: 49
Content:  chars: 45
Content:  chars: 47
Content:  chars: 45
Content:  chars: 0
Content:  chars: 51
Content:  chars: 42
Content:  chars: 45
Content:  chars: 48
Content:  chars: 44

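2 Answers:

Answer 0 (score: 3)

Version 0.6-1 of the tm package changed the way documents are printed to the screen. It now prints a compact representation of each document rather than the document text itself.

To get the document text, apply the as.character() function to the documents in the corpus.

For example, using the ovid example (shown here with tm version 0.6-2):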
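
If you are unsure which behavior to expect on your machine, a quick check of the installed version can help. This is only a sketch: the 0.6-1 threshold is taken from the statement above, and the variable names tm_version and prints_compact are made up for illustration.

```r
library(tm)

# Version of tm that is currently installed
tm_version <- packageVersion("tm")

# Assumption (from the answer text): the compact printing started in 0.6-1
prints_compact <- tm_version >= "0.6-1"

message("tm ", as.character(tm_version),
        if (prints_compact) ": expect compact summaries from inspect()"
        else ": expect full document text from inspect()")
```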

> txt <- system.file("texts", "txt", package = "tm")
> ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
    readerControl = list(language = "lat"))

The new inspect() function prints a compact representation of each document:

> inspect(ovid[1:2])
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 676

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 700

To get the full text of a document, apply the as.character() function to the document you want to inspect (note that the output has been truncated):

> as.character(ovid[[1]])
 [1] "    Si quis in hoc artem populo non novit amandi,"    
 [2] "         hoc legat et lecto carmine doctus amet."     
 [3] "    arte citae veloque rates remoque moventur,"       
 [4] "         arte leves currus: arte regendus amor." 

For cleaner display output, combine the above with the writeLines() function:

> writeLines(as.character(ovid[[1]]))
    Si quis in hoc artem populo non novit amandi,
         hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
         arte leves currus: arte regendus amor.
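
The same idea applied to the three-string corpus from the question might look like the sketch below. It collects the text of every document into a plain character vector. The name texts is only illustrative, and indexing with [[i]] is one of several ways to walk the corpus.

```r
library(tm)

data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# Pull the text out of each document; paste() joins multi-line
# documents into a single string so vapply() always gets one value
texts <- vapply(
  seq_len(length(corp)),
  function(i) paste(as.character(corp[[i]]), collapse = "\n"),
  character(1)
)

writeLines(texts)
```

Each document here holds a single line, so texts should simply match the original data vector.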

To do this for several documents in a corpus, combine the above with the lapply() function (output truncated):

> lapply(ovid[1:2], as.character)
$ovid_1.txt
 [1] "    Si quis in hoc artem populo non novit amandi,"    
 [2] "         hoc legat et lecto carmine doctus amet."     
 [3] "    arte citae veloque rates remoque moventur,"       
 [4] "         arte leves currus: arte regendus amor." 

$ovid_2.txt
 [1] "    quas Hector sensurus erat, poscente magistro"   
 [2] "         verberibus iussas praebuit ille manus."    
 [3] "    Aeacidae Chiron, ego sum praeceptor Amoris:"    
 [4] "         saevus uterque puer, natus uterque dea."

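Finally, to clean up this output and roughly reproduce the earlier inspect() behavior, try the l_ply() function from the plyr package, as follows (output truncated):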

> library(plyr)
> l_ply(ovid[1:2], function(doc) { 
    print(doc) # output summary of document
    writeLines("") # output blank line between results
    writeLines(as.character(doc)) # output clean document text
    writeLines("") # output blank line between results
  })

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 676

    Si quis in hoc artem populo non novit amandi,
         hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
         arte leves currus: arte regendus amor.

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 700

    quas Hector sensurus erat, poscente magistro
         verberibus iussas praebuit ille manus.
    Aeacidae Chiron, ego sum praeceptor Amoris:
         saevus uterque puer, natus uterque dea.
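
If you would rather not depend on plyr, a small wrapper in base R can do roughly the same thing. This is a sketch, not something tm provides: inspect_text is a hypothetical helper name, and it is demonstrated on a tiny made-up corpus rather than the ovid files.

```r
library(tm)

# Hypothetical helper (not part of tm): print each document's compact
# summary followed by its full text, similar to the old inspect() output
inspect_text <- function(corpus) {
  for (i in seq_len(length(corpus))) {
    doc <- corpus[[i]]
    print(doc)                     # compact summary of the document
    writeLines(as.character(doc))  # the document text itself
    writeLines("")                 # blank line between documents
  }
  invisible(corpus)
}

corp <- VCorpus(VectorSource(c("one two three", "two three four")))
inspect_text(corp)
```

Returning the corpus invisibly keeps the helper usable in the middle of a longer expression without printing anything extra.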

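Hope this helps!

Answer 1 (score: 0)

This happens because something was changed in the latest version of the 'tm' package, 0.6-1, released on May 7. I simply went back to version 0.6 and it worked.

  1. Download the archive tm_0.6.tar.gz from this link: http://cran.r-project.org/src/contrib/Archive/tm/

  2. Install it through RStudio: Tools > Install Packages > Package Archive File > select tm_0.6.tar.gz and install.

  3. That's all :)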
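
The same installation can probably be done from the R console instead of the RStudio menus. The sketch below makes a few assumptions: the full URL is just the archive directory above joined with the file name, install_old_tm is a hypothetical wrapper name, and building from source requires the package's dependencies and build tools to already be present. The install call is wrapped in a function so that nothing is installed until you call it yourself.

```r
# Assumption: archive directory from step 1 plus the file name
archive_url <- "http://cran.r-project.org/src/contrib/Archive/tm/tm_0.6.tar.gz"

# Hypothetical wrapper: install the archived source package
install_old_tm <- function(url = archive_url) {
  install.packages(url, repos = NULL, type = "source")
}

# install_old_tm()       # uncomment to actually install tm 0.6
# packageVersion("tm")   # afterwards, confirm which version is loaded
```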