Select by date or sort by date with GoogleNewsSource in R

Time: 2015-02-04 19:00:58

Tags: r google-api text-mining tm google-news

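I am using the R package tm.plugin.webmining. With the function GoogleNewsSource() I would like to query news sorted by date, and also news from a specific date. Is there a parameter for querying news from a specific date?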

library(tm)
library(tm.plugin.webmining)

searchTerm <- "Data Mining" 
corpusGoog <- WebCorpus(GoogleNewsSource(params=list(hl="en", q=searchTerm, 
                                         ie="utf-8", num=10, output="rss"  )))
headers <- meta(corpusGoog,tag="datetimestamp")

2 answers:

Answer 0: (score: 0)

If you are looking for a data-frame-like structure, this is how you could create it (note: not all fields are extracted from the corpus):

library(dplyr)

make_row <- function(elem) {
  data.frame(timestamp=elem[[2]]$datetimestamp,
             heading=elem[[2]]$heading,
             description=elem[[2]]$description,
             content=elem$content, 
             stringsAsFactors=FALSE)
}

dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)

## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of  4 variables:
##  $ timestamp  : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
##  $ heading    : chr  "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
##  $ description: chr  "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
##  $ content    : chr  "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam                             Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...

Then you can do whatever you want with it. For example, to sort by timestamp:

dat %>%
  arrange(timestamp) %>%
  select(heading) %>%
  head

## Source: local data frame [6 x 1]
## 
##                                                                                      heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2                              Barack Obama to seek limits on student data mining - Politico
## 3                    Why an obscure British data-mining company is worth $3 billion - Quartz
## 4              Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5           Fraud and data mining in Vancouver…just Outside the Lines - Vancouver Sun (blog)
## 6     'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
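Once the timestamps are in a data frame, selecting a specific date can also be done locally. The following is a minimal sketch in base R, assuming a data frame shaped like `dat` above; the rows below are a made-up stand-in so that the snippet runs without a network call, and the names `target_day`, `on_day` and `in_range` are only illustrative.

```r
# Hypothetical stand-in for `dat`; these timestamps and headings are invented.
dat <- data.frame(
  timestamp = as.POSIXct(c("2015-02-03 13:08:16",
                           "2015-01-11 23:37:45",
                           "2015-01-12 09:00:00"), tz = "UTC"),
  heading = c("Article A", "Article B", "Article C"),
  stringsAsFactors = FALSE
)

# Calendar day of each article, computed in UTC.
pub_day <- as.Date(dat$timestamp, tz = "UTC")

# Keep only the rows published on one specific day.
target_day <- as.Date("2015-01-11")
on_day <- dat[pub_day == target_day, ]

# Keep the rows inside an inclusive date range, newest first.
in_range <- dat[pub_day >= as.Date("2015-01-11") &
                pub_day <= as.Date("2015-01-12"), ]
in_range <- in_range[order(in_range$timestamp, decreasing = TRUE), ]

print(on_day)
print(in_range)
```

This only filters what the feed has already returned, so it cannot retrieve articles that the query did not deliver in the first place.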

If you want or need something else, you will need to make your question clearer.

Answer 1: (score: 0)

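I was looking at the Google query string and noticed that if you click on the dates on the right side of the page, startdate and enddate tags are passed in the query.

You can use the same tag names, and the results will be restricted to the period between the start date and the end date.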

GoogleFinanceSource(query, params = list(hl = "en", q = query, ie = "utf-8",
               start = 0, num = 25, output = "rss", 
               startdate='2015-10-26', enddate = '2015-10-28'))
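To avoid hand-typing the date strings, the parameter list can be built from Date values. This is a sketch only: `make_date_params` is a hypothetical helper, not part of tm.plugin.webmining, and it assumes the feed honors `startdate` and `enddate` as described above, which is not verified here.

```r
# Hypothetical helper: builds the params list with a date window.
# Assumes the RSS endpoint accepts startdate/enddate in YYYY-MM-DD form.
make_date_params <- function(query, from, to, num = 25) {
  from <- as.Date(from)
  to <- as.Date(to)
  stopifnot(from <= to)  # reject an inverted window early
  list(hl = "en", q = query, ie = "utf-8",
       start = 0, num = num, output = "rss",
       startdate = format(from, "%Y-%m-%d"),
       enddate = format(to, "%Y-%m-%d"))
}

params <- make_date_params("Data Mining", "2015-10-26", "2015-10-28")
str(params)

# The list would then be passed on, e.g.:
# GoogleFinanceSource("Data Mining", params = params)
```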