Text mining with the tm.plugin.webmining package using the GoogleFinanceSource function

Time: 2017-12-13 09:59:05

Tags: r text-mining tm

I am studying text mining with the online book http://tidytextmining.com/. In chapter 5: http://tidytextmining.com/dtm.html#financial

the following code:

library(tm.plugin.webmining)
library(purrr)
library(dplyr)  # provides data_frame(), mutate() and %>% used below

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

download_articles <- function(symbol) {
    WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}
stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
    mutate(corpus = map(symbol, download_articles))

gives me the error:

StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

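Any hints? Some people have suggested removing the company and symbol for "Twitter", but it still does not work and returns the same error. Many thanks in advance.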

3 Answers:

Answer 0: (score: 5)

I have the same problem, but have narrowed it down slightly. This code results in the same error.

GoogleFinanceSource("NASDAQ:MSFT")

StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

I have also seen others suggest removing Twitter. Since Twitter is not listed on NASDAQ, I understand why it would fail. I tried the suggested "NYSE:TWTR" instead, but got the same result.

I tried GoogleNewsSource to see whether I would hit the same problem and got a different error, which this post on GitHub suggests is caused by the parser. I wonder whether the two problems are related. github.com/mannau/tm.plugin.webmining/issues/14.

GoogleNewsSource("Microsoft")

Unknown IO error failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
Error: 1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"

All that said, I found a workaround using modified stock symbols and YahooFinanceSource.

Answer 1: (score: 0)

The problem is that the package tm.plugin.webmining is out of date.

As of this reply, only YahooFinanceSource and YahooNewsSource still work.


Here is a quick reference and test.

According to the Vignette page written by the author, there should be 8 possible source sites:

  1. GoogleBlogSearchSource
  2. GoogleFinanceSource
  3. GoogleNewsSource
  4. NYTimesSource
  5. ReutersNewsSource
  6. YahooFinanceSource
  7. YahooInplaySource
  8. YahooNewsSource

However, according to the Github page, the first one, "GoogleBlogSearchSource", has been discontinued. For the remaining 7 sources, I ran a simple test to see whether they work:

library(tm)
library(tm.plugin.webmining)

googlefinance <- WebCorpus(GoogleFinanceSource("A"))
googlenews <- WebCorpus(GoogleNewsSource("A"))
nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
reutersnews <- WebCorpus(ReutersNewsSource("A"))
yahoofinance <- WebCorpus(YahooFinanceSource("A"))
yahooinplay <- WebCorpus(YahooInplaySource())
yahoonews <- WebCorpus(YahooNewsSource("M"))

The results show that, technically, all of the Yahoo options still run, but YahooInplaySource returns 0 documents no matter which arguments I chose.

> googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlefinance <- WebCorpus(GoogleFinanceSource("A"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlenews <- WebCorpus(GoogleNewsSource("A"))
Unknown IO errorfailed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
Error in inherits(x, "WebSource") : 
  1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
> nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
Error in inherits(x, "WebSource") : object 'nytimes_appid' not found
> reutersnews <- WebCorpus(ReutersNewsSource("A"))
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error in inherits(x, "WebSource") : 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1
> yahoofinance <- WebCorpus(YahooFinanceSource("A"))
> yahoofinance
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 16
> yahooinplay <- WebCorpus(YahooInplaySource())
> yahooinplay
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("A"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("M"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 10
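
Because each failing source stops with an error, a check like the one above has to be run line by line. A small wrapper around tryCatch can collect the outcome of every source in one pass instead. This is only a minimal sketch in base R: `probe_source` and `probe_sources` are hypothetical helper names, not part of tm.plugin.webmining, and the document count assumes that `length()` on a corpus returns the number of documents, as it does for tm corpora.

```r
# Run one source constructor and report whether it worked.
# `build` is any zero-argument function that returns a corpus-like object.
probe_source <- function(name, build) {
  tryCatch({
    corpus <- build()
    data.frame(source = name, ok = TRUE, documents = length(corpus),
               message = NA_character_, stringsAsFactors = FALSE)
  }, error = function(e) {
    # Keep the error text instead of aborting the whole run.
    data.frame(source = name, ok = FALSE, documents = NA_integer_,
               message = conditionMessage(e), stringsAsFactors = FALSE)
  })
}

# Probe a named list of constructors and stack the results into one table.
probe_sources <- function(builders) {
  do.call(rbind, Map(probe_source, names(builders), builders))
}

# Example call (needs tm.plugin.webmining and network access):
# probe_sources(list(
#   googlefinance = function() WebCorpus(GoogleFinanceSource("A")),
#   yahoofinance  = function() WebCorpus(YahooFinanceSource("A"))
# ))
```

Wrapping each call in a function delays the download until the probe runs, so a source that throws only marks its own row as failed.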

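It is also worth mentioning that even though YahooFinanceSource works, it does not return the same content as GoogleFinanceSource. If you want to run the example from the question, I think you can use YahooNewsSource with a custom query list instead.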
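
That last suggestion could look roughly like the sketch below, which maps a list of queries to corpora. `download_corpora` and its `fetch` argument are hypothetical names introduced here for illustration. The default `fetch` assumes that `WebCorpus(YahooNewsSource(query))` still works as it did in the test above, and a query that fails is stored as NULL rather than stopping the loop.

```r
# Download one corpus per query. `fetch` turns a single query string into a
# corpus; the default uses YahooNewsSource and is only evaluated when no other
# `fetch` is supplied, so another source (or a stub) can be swapped in.
download_corpora <- function(queries,
                             fetch = function(q) {
                               tm.plugin.webmining::WebCorpus(
                                 tm.plugin.webmining::YahooNewsSource(q))
                             }) {
  corpora <- lapply(queries, function(q) {
    tryCatch(fetch(q), error = function(e) NULL)  # NULL marks a failed query
  })
  names(corpora) <- queries
  corpora
}

# Example call (needs tm.plugin.webmining and network access):
# company <- c("Microsoft", "Apple", "Google")
# stock_articles <- download_corpora(company)
```

Company names are used as queries here rather than "NASDAQ:" ticker symbols, since YahooNewsSource takes a free-text search term.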

Answer 2: (score: -1)

In the line of code below, try changing the default ie = "utf-8" to ie = "ansi". Apply it to your script and it should work.

WebCorpus(GoogleFinanceSource("NASDAQ:MSFT", params = list(hl = "en", q = "NASDAQ:MSFT", ie = "ansi", start = 0, num = 20, output = "rss")))