R

Time: 2017-07-02 03:12:37

Tags: r tm

I am working on text mining in R. After removing punctuation, numbers, URLs, and stop words, I have a few documents in my corpus.

 myStopwords <- setdiff(myStopwords, c("r", "big"))
 myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
 myCorpus <- tm_map(myCorpus, stripWhitespace)
 myCorpusCopy <- myCorpus
 for (i in c(1:2, 320))
 {
   cat(paste0("[", i, "] "))
   writeLines(strwrap(as.character(myCorpus[[i]]), 60))
 }

[1] examples calling java code r
[2] simulating mapreduce r big data analysis using flights data
rbloggers
[320] r reference card data mining now cran lists many useful r
functions packages data mining applications

After that, I try stemming and stem completion as follows:

myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

When I run the for loop again, it prints NA for every document, as shown below:

 for (i in c(1:2, 320))
 {
 cat(paste0("[", i, "] "))
 writeLines(strwrap(as.character(myCorpus[[i]]), 60))
 }

[1] NA
[2] NA
[320] NA

Any idea where I am going wrong?

1 answer:

Answer 0 (score: 0)

I reproduced your problem using a built-in dataset:

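library(tm)   # stemDocument() also needs the SnowballC package installed
data("crude")

myCorpus     <- as.VCorpus(crude)
myCorpusCopy <- myCorpus  # unstemmed copy, used as the completion dictionary
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)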

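I found that after the last line, the elements of the myCorpus object no longer have the usual document structure with fields such as meta and content. In my case, each element is now a named character vector.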

You can still access the elements:

myCorpus[[1]]

Diamond Shamrock Corp said that\neffect today it had cut it contract price for crude oil by\n1.50 dlrs a barrel.\n    The reduct bring it post price for West Texas\nIntermedi to 16.00 dlrs a barrel, the copani said.\n    "The price reduct today was made in the light of falling\noil product price and a weak crude oil market," a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil compani that\nhav cut it contract, or posted, price over the last two days\ncit weak oil markets.\n Reuter 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "content" 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           <NA> 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         "meta"

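However, as.character() picks up a different part of the element's new structure (inspect it with str()) than the one you want. The document text is now actually stored as the names of the vector.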
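The difference can be seen in a minimal base R sketch. The vector below is a made-up stand-in that only mimics the shape of the printed element above (text in the names, leftovers in the values); it is not taken from the real corpus.

```r
# Hypothetical stand-in for one corpus element after the stemCompletion step:
# the document text sits in the names, not in the values.
element <- c("content", "meta")
names(element) <- c("Diamond Shamrock Corp said that effect today ...", NA)

values <- as.character(element)  # drops the names, keeps only the values
text   <- names(element)[1]      # this is where the document text lives

str(element)   # shows a named character vector
print(values)
print(text)
```

If the values of a real element are NA rather than strings, as.character() returns NA, which would be consistent with the output shown in the question.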

I was able to fix the loop like this:

for (i in c(1:2, length(myCorpus)))
{
  cat(paste0("[", i, "] "))
  writeLines(strwrap(as.character(names(myCorpus[[i]])), 60))
}
[1] Diamond Shamrock Corp said that effect today it had cut it
contract price for crude oil by 1.50 dlrs a barrel.  The
reduct bring it post price for West Texas Intermedi to
16.00 dlrs a barrel, the copani said.  "The price reduct
today was made in the light of falling oil product price
and a weak crude oil market," a company spokeswoman said.
Diamond is the latest in a line of U.S. oil compani that
hav cut it contract, or posted, price over the last two
days cit weak oil markets.  Reuter

[2] OPEC may be forc to meet befor a schedul June session to
readdress it product cutting agr if the organ want to halt
the current slide in oil prices, oil industri analyst said.
"The movement to higher oil price was never to be as easy a
OPEC thought. They may need an emerg meet to sort out th
problems," said Daniel Yergin, director of Cambridg Energy
Research Associates, CERA.  Analyst and oil industri sourc
said the problem OPEC face is excess oil suppli in world
oil markets.  "OPEC problem is not a price problem but a
production issu and must be address in that way," said Paul
Mlotok, oil analyst with Salomon Brother Inc.  He said the
market earlier optim about OPEC and its abl to keep product
under control have given way to a pessimist outlook that
the organ must address soon if it wish to regain the initi
in oil prices.  But some other analyst were uncertain that
even an emerg meet would address the problem of OPEC
production abov the 15.8 mln bpd quota set last December.
"OPEC has to learn that in a buyer market you cannot have
deem quotas, fix price and set differentials," said the
region manag for one of the major oil compani who spoke on
condit that he not be named. "The market is now tri to
teach them that lesson again," he added.  David T. Mizrahi,
editor of Mideast reports, expect OPEC to meet befor June,
although not immediately. However, he is not optimist that
OPEC can address it princip problems.  "They will not meet
now as they tri to take advantag of the wint demand to sell
their oil, but in late March and April when demand
slackens," Mizrahi said.  But Mizrahi said that OPEC is
unlik to do anyth more than reiter it agreement to keep
output at 15.8 mln bpd."  Analyst said that the next two
month will be critic for OPEC abil to hold togeth price and
output.  "OPEC must hold to it pact for the next six to
eight weeks sinc buyer will come back into the market
then," said Dillard Sprigg of Petroleum Analysi Ltd in New
York.  But Bijan Moussavar-Rahmani of Harvard Univers
Energy and Environ Polici Center said that the demand for
OPEC oil ha been rise through the first quarter and this
may have prompt excess in it production.  "Demand for their
(OPEC) oil is clear abov 15.8 mln bpd and is probabl closer
to 17 mln bpd or higher now so what we ar see character as
cheat is OPEC meet this demand through current production,"
he told Reuter in a telephon interview.  Reuter
[20] Argentin crude oil product was down 10.8 pct in Januari
1987 to 12.32 mln barrels, from 13.81 mln barrel in Januari
1986, Yacimiento Petrolifero Fiscales said.  Januari 1987
natur gas output total 1.15 billion cubic metrers, 3.6 pct
higher than 1.11 billion cubic metr produced in Januari
1986, Yacimiento Petrolifero Fiscal added.  Reuter
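
Note that the output above still contains stems such as "reduct" and "compani". The tm documentation describes the first argument of stemCompletion() as a character vector of stems, so an alternative that is not part of the original answer is to complete each word separately and keep every element a regular document. The following is only a sketch under assumptions: the two input strings are made up, completeStems is a hypothetical helper name, and the text is assumed to be already free of punctuation, as in the question.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Tiny illustrative corpus (made-up input modelled on the question's output).
docs <- c("examples calling java code r",
          "simulating mapreduce r big data analysis")
corpus     <- VCorpus(VectorSource(docs))
corpusCopy <- corpus                       # unstemmed copy used as dictionary
corpus     <- tm_map(corpus, stemDocument)

# Hypothetical helper: complete each stem separately, then rebuild the text.
completeStems <- function(text, dictionary) {
  stems <- unlist(strsplit(paste(text, collapse = " "), "[[:space:]]+"))
  stems <- stems[nzchar(stems)]
  completed <- unname(stemCompletion(stems, dictionary = dictionary))
  # Fall back to the stem itself when no completion was found.
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  paste(completed, collapse = " ")
}

# content_transformer() rewrites only the content and keeps the document class,
# so as.character() continues to return the text.
corpus <- tm_map(corpus, content_transformer(completeStems),
                 dictionary = corpusCopy)

for (i in seq_along(corpus)) {
  cat(paste0("[", i, "] "))
  writeLines(strwrap(as.character(corpus[[i]]), 60))
}
```

With this approach the original loop from the question should work unchanged, because the elements stay text documents instead of turning into named character vectors.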