我在R中使用以下代码从Google趋势中下载数据,我主要从这里开始http://christophriedl.net/2013/08/22/google-trends-with-r/
############################################
## Query GoogleTrends from R
##
## by Christoph Riedl, Northeastern University
## Additional help and bug-fixing re cookies by
## Philippe Massicotte Université du Québec à Trois-Rivières (UQTR)
############################################
# Load required libraries
library(RCurl) # For getURL() and curl handler / cookie / google login
library(stringr) # For str_trim() to trip whitespace from strings
# Google account settings
username <- "USERNAME"
password <- "PASSWORD"
# URLs
loginURL <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL <- "http://www.google.com/trends/TrendsRepport?"
############################################
## This gets the GALX cookie which we need to pass back with the login form
############################################
getGALX <- function(curl) {
txt = basicTextGatherer()
curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )
tmp <- txt$value()
val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], val = TRUE)
strsplit(val, "[:=;]")[[1]][3]
return( strsplit( val, "[:=;]")[[1]][3])
}
############################################
## Function to perform Google login and get cookies ready
############################################
gLogin <- function(username, password) {
ch <- getCurlHandle()
ans <- (curlSetOpt(curl = ch,
ssl.verifypeer = FALSE,
useragent = getOption('HTTPUserAgent', "R"),
timeout = 60,
followlocation = TRUE,
cookiejar = "./cookies",
cookiefile = ""))
galx <- getGALX(ch)
authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)
authenticatePage2 <- getURL("http://www.google.com", curl=ch)
if(getCurlInfo(ch)$response.code == 200) {
print("Google login successful!")
} else {
print("Google login failed!")
}
return(ch)
}
##
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
get_interest_over_time <- function(res, clean.col.names = TRUE) {
# remove all text before "Interest over time" data block begins
data <- gsub(".*Interest over time", "", res)
# remove all text after "Interest over time" data block ends
data <- gsub("\n\n.*", "", data)
# convert "interest over time" data block into data.frame
data.df <- read.table(text = data, sep =",", header=TRUE)
# Split data range into to only end of week date
data.df$Week <- gsub(".*\\s-\\s", "", data.df$Week)
data.df$Week <- as.Date(data.df$Week)
# clean column names
if(clean.col.names == TRUE) colnames(data.df) <- gsub("\\.\\..*", "", colnames(data.df))
# return "interest over time" data.frame
return(data.df)
}
############################################
## Read data for a query
############################################
ch <- gLogin( username, password )
authenticatePage2 <- getURL("http://www.google.com", curl=ch)
res <- getForm(trendsURL, q="sugar", geo="US", content=1, export=1, graph="all_csv", curl=ch)
# Check if quota limit reached
if( grepl( "You have reached your quota limit", res ) ) {
stop( "Quota limit reached; You should wait a while and try again lateer" )
}
df <- get_interest_over_time(res)
head(df)
write.csv(df,"sugar.csv")
当我搜索美国或任何一个国家时,一切正常,但我需要更多分散数据,在大都会区。但是,我无法使用此脚本获取这些查询。每当我这样做时,通过在地理字段中输入“US-IL”,我会收到错误:
Error in read.table(text = data, sep = ",", header = TRUE) :
more columns than column names
如果我试图采取大都会区的趋势(例如,使用类似“US-IL-602”的芝加哥),也会发生同样的情况。有谁知道我怎么能修改这个脚本才能使它工作?
非常感谢,
布赖恩。