Cannot scrape Google news BY DATE in R using R package 'tm.plugin.webmining'

Time: 2016-12-09 12:48:25

Tags: r time-series tm

I'm on 64-bit Windows. I would like to scrape Google News and build a time series of news articles (headlines) per month or per day from 2004 to 2016, in order to conduct some analysis. I tried several approaches in R: GoogleNewsSource(), getURIAsynchronous() and read_html(). First,

library("tm")
library("tm.plugin.webmining")
googlenews <- GoogleNewsSource("yen", since="1-2-2015", until="28-2-2015")

(An answer to a similar question suggested adding the option as_drrb=b, but that did not work.)

Second,

library(RCurl)
url <- "https://www.google.co.kr/search?q=yen&num=100&hl=en&tbm=nws&tbs=cdr:1,cd_min:4/20/2014,cd_max:1/14/2015"
uris = c(url)
txt = getURIAsynchronous(uris)

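When I run this code, the returned news items are the most recent ones (e.g. 'Dec 9, 2016'), not items from the requested 2014-2015 range. In the results, the URL has changed from the first one below to the second (note the added gbv=1):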

https://www.google.co.kr/search?q=yen&num=100&hl=en&tbm=nws&tbs=cdr:1,cd_min:4/20/2014,cd_max:1/14/2015

> https://www.google.co.kr/search?q=yen&num=100&hl=en&tbm=nws&gbv=1&tbs=cdr:1,cd_min:4/20/2014,cd_max:1/14/2015

I suspect that gbv=1 causes the search period to be ignored, but I can't work out why the link gets changed.
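
For the monthly series, the per-month search URLs could be generated along the following lines. This is only a sketch: month_urls is a hypothetical helper name, the URL pattern (including the tbs=cdr:1,cd_min:...,cd_max:... part) is copied from the URL above, and `from` is assumed to be the first day of a month. It does nothing about the gbv=1 redirect, so the date range may still be ignored when the URLs are fetched.

```r
# Hypothetical helper: one date-restricted Google News search URL per month.
# The URL pattern is copied from the question; whether Google honours the
# cd_min/cd_max range for scripted requests is not guaranteed.
month_urls <- function(query, from, to) {
  # First day of every month in the range (assumes `from` is a month start).
  starts <- seq(as.Date(from), as.Date(to), by = "month")
  # Last day of each month = first day of the following month minus one day.
  ends <- seq(starts[1], by = "month", length.out = length(starts) + 1)[-1] - 1
  # Format as m/d/yyyy without leading zeros, as in the URL above.
  fmt <- function(d) {
    sprintf("%d/%d/%d",
            as.integer(format(d, "%m")),
            as.integer(format(d, "%d")),
            as.integer(format(d, "%Y")))
  }
  url <- paste0("https://www.google.co.kr/search?q=",
                utils::URLencode(query, reserved = TRUE),
                "&num=100&hl=en&tbm=nws&tbs=cdr:1,cd_min:", fmt(starts),
                ",cd_max:", fmt(ends))
  data.frame(start = starts, end = ends, url = url, stringsAsFactors = FALSE)
}

# One row per month from January 2004 to December 2016.
urls <- month_urls("yen", "2004-01-01", "2016-12-01")
head(urls$url, 3)
```

Each row could then be passed to getURIAsynchronous() or read_html() in turn, with a pause between requests.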

Third,

library(rvest)
headlines = read_html("https://www.google.co.kr/search?q=yen&num=100&hl=en&tbm=nws&output=rss&tbs=cdr:1,cd_min:4/20/2014,cd_max:1/14/2015") %>%
  html_nodes(".r") %>%
  html_text()

This has the same gbv=1 problem.

From what I found, gbv=1 means without JavaScript and gbv=2 means with JavaScript.

Any method that solves this would be welcome.

0 answers:

No answers