R 3.4.1 - Smart use of a while loop for queued RSiteCatalyst reports

Asked: 2017-09-18 10:11:16

Tags: r error-handling while-loop adobe-analytics

Current situation

I have been using the RSiteCatalyst package for a while now. For those who don't know it, it makes pulling data from Adobe Analytics through the API much easier.

Up to now, the workflow has been as follows:

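  1. Make a request, for example:

         key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                                      metrics = c("pageviews"), date.granularity = "month",
                                      max.attempts = 500, interval.seconds = 20)

  2. Wait for the response, which is saved as a data.frame (example structure):

         > View(head(key_metrics,1))
             datetime      name         year   month   day    pageviews
           1 2015-07-01    July 2015    2015   7       1      45825

  3. Do some data transformation, for example:

         key_metrics$datetime <- as.Date(key_metrics$datetime)

The problem with this workflow is that sometimes, depending on the complexity of the request, we can wait a long time until the response finally arrives. If the R script contains 40-50 equally complex API requests, we wait 40-50 times, each time until the data arrives and the next request can be made. This clearly creates a bottleneck in my ETL process.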

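Goal

Most functions in the package have an enqueueOnly argument. It tells Adobe to process the request while the function returns only the report ID as its response: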

      key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                     metrics = c("pageviews"), date.granularity = "month",
                     max.attempts = 500, interval.seconds = 20,
                     enqueueOnly = TRUE)
      
      > key_metrics
      [1] 1154642436 
      

I can get the "real" response (the one with the data) at any time by using the following function:

      key_metrics <- GetReport(key_metrics)
      

With every request I add the argument enqueueOnly = TRUE while building a vector of report IDs and a vector of report names:

      queueFromIds <- c(queueFromIds, key_metrics)
      queueFromNames <- c(queueFromNames, "key_metrics")
      

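The most important difference with this approach is that Adobe processes all my requests at the same time, so the waiting time drops considerably.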

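Problem

However, I'm having trouble retrieving the data efficiently. I am trying to use a while loop that removes the report ID and the report name from the vectors above once the data has been obtained: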

      while (length(queueFromNames)>0)
      {
        assign(queueFromNames[1], GetReport(queueFromIds[1],
                                            max.attempts = 3,
                                            interval.seconds = 5))
        queueFromNames <- queueFromNames[-1]
        queueFromIds <- queueFromIds[-1]
      }
      

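However, this only works when the requests are simple enough to be processed within a few seconds. When a request is too complex to be processed in 3 attempts at 5-second intervals, the loop stops with the following error: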

        

      Error in ApiRequest(body = toJSON(request.body), func.name = "Report.Get", : ERROR: max attempts exceeded for https://api3.omniture.com/admin/1.4/rest/?method=Report.Get

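Which functions could help me make sure that all API requests are processed correctly and, ideally, that the API requests needing extra time (the ones that raise the error) are skipped until the end of the loop and then requested again?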

1 answer:

Answer 0 (score: 2)

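I use a couple of functions to generate and retrieve report IDs independently. That way it doesn't matter how long the reports take to process. I usually come back for them 12 hours after generating the report IDs. I think they expire after roughly 48 hours. These functions of course depend on RSiteCatalyst. Here they are: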

#' Generate report IDs to be retrieved later
#'
#' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
#' @param suite Report suite ID.
#' @param dateBegin Start date in the following format: YYYY-MM-DD.
#' @param dateFinish End date in the following format: YYYY-MM-DD.
#' @param metrics Vector containing up to 30 required metrics IDs.
#' @param elements Vector containing element IDs.
#' @param classification Vector containing classification IDs.
#' @param valueStart Integer value pointing to row to start report with.
#' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
#' @examples
#' \dontrun{
#' ReportsIDs <- reportsGenerator(suite,dateBegin,dateFinish,metrics, elements,classification)
#'}
#' @export
    reportsGenerator <- function(suite,
                                 dateBegin,
                                 dateFinish,
                                 metrics,
                                 elements,
                                 classification,
                                 valueStart) {

      #Convert dates to date format.
      #Deduct one from dateBegin to
      #neutralize the initial +1 in the loop.

      dateBegin <-  as.Date(dateBegin, "%Y-%m-%d") - 1
      dateFinish <-  as.Date(dateFinish, "%Y-%m-%d")
      #Number of days as a plain integer rather than a difftime object
      timeRange <- as.integer(dateFinish - dateBegin)

      #Create data frame to store dates and report IDs
      VisitorActivityReports <-
        data.frame(matrix(NA, nrow = timeRange, ncol = 2))
      names(VisitorActivityReports) <- c("Date", "ReportID")

      #Run a loop to retrieve one ReportID for each day in the time period.
      for (i in seq_len(as.integer(timeRange))) {
        dailyDate <- as.character(dateBegin + i)
        print(i) #Visibility to end user
        print(dailyDate) #Visibility to end user
        VisitorActivityReports[i, 1] <- dailyDate


        VisitorActivityReports[i, 2] <-
          RSiteCatalyst::QueueTrended(
            reportsuite.id = suite,
            date.from = dailyDate,
            date.to = dailyDate,
            metrics = metrics,
            elements = elements,
            classification = classification,
            top = 50000,
            max.attempts = 500,
            start = valueStart,
            enqueueOnly = TRUE
          )
      }
      return(VisitorActivityReports)
    }

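Assign the output of the previous function to a variable, then use that variable as the input of the following function. Assign the result of reportsRetriever to a variable as well. The output will be a data frame. The function rbinds all the reports together as long as they share the same structure. Don't try to combine reports with different structures.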
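Since the report IDs are collected hours later, it can help to write them to disk between the two steps. The following is only a minimal sketch of that idea using base R's saveRDS and readRDS. The data frame stands in for the output of reportsGenerator, the second report ID is made up, and the idFile name and location are assumptions for illustration.

```r
# Stand-in for the data frame returned by reportsGenerator().
# The second ReportID is invented for this example.
reportIds <- data.frame(
  Date = c("2017-09-01", "2017-09-02"),
  ReportID = c(1154642436, 1154642437),
  stringsAsFactors = FALSE
)

# Hypothetical file location. tempdir() keeps the sketch runnable anywhere,
# but it is wiped when the session ends, so use a permanent path in practice.
idFile <- file.path(tempdir(), "report_ids.rds")
saveRDS(reportIds, idFile)

# Later, possibly in a new R session, reload the IDs and retrieve the reports.
reportIds <- readRDS(idFile)
# visitorActivity <- reportsRetriever(reportIds)
```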

#' Retrieve all reports stored as output of reportsGenerator function and consolidate them.
#'
#' @param dataFrameReports This is the output from reportsGenerator function. It MUST contain a column titled: ReportID
#' @details It is recommended to break the input data frame in chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or local computer might run out of memory.
#' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
#' @examples
#' \dontrun{
#' visitorActivity <- reportsRetriever(dataFrameReports)
#'}
#'
#' @export    

reportsRetriever <- function(dataFrameReports) {

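      #Wrap each call so that one failed report (e.g. still processing) does not
      #abort the whole run. Failed reports become NULL and are dropped by rbind.
      visitor.activity.list <- lapply(dataFrameReports$ReportID, function(id) {
        tryCatch(RSiteCatalyst::GetReport(id), error = function(e) NULL)
      })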
      visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))

      #Validate report integrity

      if (identical(as.character(unique(visitor.activity.df$datetime)), dataFrameReports$Date)) {
        print("Ok. All reports available")
        return(visitor.activity.df)
      } else {
        print("Some reports may have been missed.")
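        #Flag the requested dates that are absent from the retrieved data
        missingReportsIndex <- !(dataFrameReports$Date %in% as.character(unique(visitor.activity.df$datetime)))
        print(dataFrameReports$Date[missingReportsIndex])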

        return(visitor.activity.df)
      }

    }
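For the while loop in the question, the same tryCatch idea can be used to skip reports that are not ready yet and ask for them again in a later round. What follows is a rough sketch under assumptions, not something taken from RSiteCatalyst: retrieveQueued, its fetch argument, max.rounds and wait.seconds are all hypothetical names, and the default fetch simply wraps GetReport with the arguments used in the question.

```r
# Hypothetical helper: fetch queued reports, deferring the ones that fail.
# `fetch` is any function that takes one report ID and returns a data frame
# or raises an error. The default wraps RSiteCatalyst::GetReport.
retrieveQueued <- function(ids, report.names,
                           fetch = function(id) {
                             RSiteCatalyst::GetReport(id, max.attempts = 3,
                                                      interval.seconds = 5)
                           },
                           max.rounds = 10, wait.seconds = 60) {
  stopifnot(length(ids) == length(report.names))
  results <- list()
  round <- 0
  while (length(ids) > 0 && round < max.rounds) {
    round <- round + 1
    failed <- logical(length(ids))
    for (k in seq_along(ids)) {
      # An error (such as "max attempts exceeded") is turned into NULL
      report <- tryCatch(fetch(ids[k]), error = function(e) NULL)
      if (is.null(report)) {
        failed[k] <- TRUE
      } else {
        results[[report.names[k]]] <- report
      }
    }
    # Keep only the reports that failed and try them again next round
    ids <- ids[failed]
    report.names <- report.names[failed]
    if (length(ids) > 0 && round < max.rounds) Sys.sleep(wait.seconds)
  }
  list(reports = results, pending = stats::setNames(ids, report.names))
}
```

With the vectors from the question, a call could look like `out <- retrieveQueued(queueFromIds, queueFromNames)`. The finished data frames are then in `out$reports`, indexed by name instead of being created with assign, and `out$pending` holds the IDs that were still failing after the last round.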