Get the number of articles on a topic per year from NYT and WSJ?

Asked: 2014-03-12 18:44:52

Tags: r web-scraping

I want to build a data frame by scraping the NYT and WSJ that holds the number of articles on a given topic for each year. That is:

      NYT   WSJ
2011   2     3
2012   10    7

I found this tutorial for the NYT, but it doesn't work for me :_(. When I get to line 30, I receive this error:

> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) : 
  length of 'dimnames' [1] not equal to array extent

Any help is much appreciated.

Thanks!

PS: Here is my non-working code (it requires an NYT API key: http://developer.nytimes.com/apps/register)

# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)

### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!

q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above

# calculate parameter for offset
os <- 0:(records/10-1)

# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector

# read in the rest via loop
for (i in 2:length(os)) {
  # concatenate URL for each offset
  uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
  raw.data <- readLines(uri, warn="F")
  res <- fromJSON(raw.data)
  dat <- append(dat, unlist(res$results)) # append
}

# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))

# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days

# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)

plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)

1 Answer:

Answer 0 (score: 2)

  

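Update: the repo is now at https://github.com/rOpenGov/rtimes

Duncan Temple-Lang created an RNYTimes package (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is now on v2. I have been working on one for the political endpoints only, but that isn't relevant for you.

I'm rewiring RNYTimes right now. Install it from GitHub; you need to install devtools first to get install_github: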

install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")

Then try a search, e.g.:

library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")

This gives you the number of articles found:

moocs$response$meta$hits

[1] 121
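
To turn that single hit count into one count per year, one option is to run the same search once per calendar year. The sketch below only builds the request URLs and does not call the API. It assumes the v2 Article Search endpoint and its `q`, `begin_date`, `end_date` (YYYYMMDD) and `api-key` query parameters; `build_year_url` is a hypothetical helper, not part of RNYTimes, and whether `searchArticles` accepts date arguments is not confirmed here.

```r
# Hypothetical helper: build a v2 Article Search URL limited to one calendar year.
# Assumes the endpoint and parameter names described in the lead-in.
build_year_url <- function(query, year, key) {
  sprintf(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&begin_date=%d0101&end_date=%d1231&api-key=%s",
    utils::URLencode(query, reserved = TRUE),  # escape spaces and reserved characters
    year, year, key
  )
}

# One URL per year of interest; "<yourkey>" is a placeholder for a real API key.
years <- 2011:2012
urls <- vapply(years, function(y) build_year_url("MOOCs", y, "<yourkey>"), character(1))
print(urls)
```

Fetching each URL and reading `response$meta$hits` from the parsed JSON, as above, would then give the NYT count for that year.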

You can get the word count for each article with

as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))

[1]  157  362 1316  312 2936 2973  355 1364   16  880
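
For the year-by-source table asked for in the question, a minimal base R sketch follows. It assumes you have already collected one publication date string per article for each source, with the year in the first four characters (for the NYT, a `pub_date` field on each element of `moocs$response$docs` is an assumption about the v2 response). The dates below are made up for illustration, and `count_by_year` is a hypothetical helper.

```r
# Made-up publication dates, for illustration only (not real search results).
nyt_dates <- c("2011-03-02T00:00:00Z", "2011-11-20T00:00:00Z", "2012-01-15T00:00:00Z")
wsj_dates <- c("2011-06-30T00:00:00Z", "2012-02-01T00:00:00Z", "2012-09-09T00:00:00Z")

# Hypothetical helper: count articles per year for one source.
# Using a factor with fixed levels keeps years that have zero articles.
count_by_year <- function(dates, years) {
  yr <- factor(substr(dates, 1, 4), levels = as.character(years))
  as.vector(table(yr))
}

years <- 2011:2012
counts <- data.frame(
  NYT = count_by_year(nyt_dates, years),
  WSJ = count_by_year(wsj_dates, years),
  row.names = as.character(years)
)
print(counts)
```

Fixing the factor levels up front also means an empty date vector yields a row of zeros instead of an empty table.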