Scraping Google Trends by metro area with R

Asked: 2015-07-02 23:27:33

Tags: r web-scraping google-trends

I am using the following R code to download data from Google Trends. I took most of it from here: http://christophriedl.net/2013/08/22/google-trends-with-r/

############################################
##    Query GoogleTrends from R
##
## by Christoph Riedl, Northeastern University
## Additional help and bug-fixing re cookies by
## Philippe Massicotte Université du Québec à Trois-Rivières (UQTR)
############################################

# Load required libraries
library(RCurl)      # For getURL() and curl handler / cookie / google login
library(stringr)    # For str_trim() to trim whitespace from strings
# Google account settings
username <- "USERNAME"
password <- "PASSWORD"

# URLs
loginURL        <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL       <- "http://www.google.com/trends/TrendsRepport?"

############################################
## This gets the GALX cookie which we need to pass back with the login form
############################################
getGALX <- function(curl) {
  txt = basicTextGatherer()
  curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )

  tmp <- txt$value()

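  # Find the response header line that sets the GALX cookie
  val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], value = TRUE)

  # Split that line on ":", "=" and ";" and return the third piece (the cookie value)
  return( strsplit( val, "[:=;]")[[1]][3] )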
}


############################################
## Function to perform Google login and get cookies ready
############################################
gLogin <- function(username, password) {
  ch <- getCurlHandle()

  ans <- (curlSetOpt(curl = ch,
                     ssl.verifypeer = FALSE,
                     useragent = getOption('HTTPUserAgent', "R"),
                     timeout = 60,         
                     followlocation = TRUE,
                     cookiejar = "./cookies",
                     cookiefile = ""))

  galx <- getGALX(ch)
  authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)

  authenticatePage2 <- getURL("http://www.google.com", curl=ch)

  if(getCurlInfo(ch)$response.code == 200) {
    print("Google login successful!")
  } else {
    print("Google login failed!")
  }
  return(ch)
}

##

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

get_interest_over_time <- function(res, clean.col.names = TRUE) {
  # remove all text before "Interest over time" data block begins
  data <- gsub(".*Interest over time", "", res)

  # remove all text after "Interest over time" data block ends
  data <- gsub("\n\n.*", "", data)

  # convert "interest over time" data block into data.frame
  data.df <- read.table(text = data, sep =",", header=TRUE)

  # Reduce each date range to its end-of-week date only
  data.df$Week <- gsub(".*\\s-\\s", "", data.df$Week)
  data.df$Week <- as.Date(data.df$Week)

  # clean column names
  if(clean.col.names == TRUE) colnames(data.df) <- gsub("\\.\\..*", "", colnames(data.df))


  # return "interest over time" data.frame
  return(data.df)
}

############################################
## Read data for a query
############################################
ch <- gLogin( username, password )
authenticatePage2 <- getURL("http://www.google.com", curl=ch)

res <- getForm(trendsURL, q="sugar", geo="US", content=1, export=1, graph="all_csv", curl=ch)
# Check if quota limit reached
if( grepl( "You have reached your quota limit", res ) ) {
  stop( "Quota limit reached; you should wait a while and try again later" )
}
df <- get_interest_over_time(res)
head(df)

write.csv(df,"sugar.csv")

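Everything works when I query the US or any other country, but I need more disaggregated data, at the metro-area level. I can't get those queries to work with this script. Whenever I try, for example by setting the geo field to "US-IL", I get this error: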

Error in read.table(text = data, sep = ",", header = TRUE) : 
more columns than column names 

The same thing happens when I request trends for a metro area (for example, "US-IL-602" for Chicago). Does anyone know how I could modify this script to make it work?
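For what it's worth, `read.table` raises "more columns than column names" when some data lines contain more separator-delimited fields than the header line, so one way to narrow this down might be to count the fields on each line of the extracted block before parsing it. The sketch below is only an illustration: the `block` string is a made-up stand-in (I don't know what the real response for "US-IL" looks like), and the names `block`, `fields`, `bad`, and `raw` are hypothetical.

```r
# Hypothetical stand-in for the text that reaches read.table().
# The real "US-IL" response may look quite different; this only
# mimics a block whose header has fewer fields than a data line.
block <- paste(
  "Week,sugar",
  "2015-06-07 - 2015-06-13,71",
  "2015-06-14 - 2015-06-20,69,5,1",
  sep = "\n"
)

# Count the comma-separated fields on every line of the block
con <- textConnection(block)
fields <- count.fields(con, sep = ",")
close(con)

# Lines whose field count differs from the header line
bad <- which(fields != fields[1])
writeLines(strsplit(block, "\n")[[1]][bad])

# Tolerant read for inspection only: no header, short rows padded,
# one column per field of the widest line
raw <- read.table(text = block, sep = ",", header = FALSE, fill = TRUE,
                  col.names = paste0("V", seq_len(max(fields))),
                  stringsAsFactors = FALSE)
print(raw)
```

In the script above, the same check could be run on `data` inside `get_interest_over_time()` just before the `read.table` call, to see whether the regional response contains an unexpected extra block or is not the expected CSV at all.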

Many thanks,

Brian.

0 answers:

No answers