How to log output from foreachPartition?

Asked: 2019-07-16 14:15:35

Tags: apache-spark pyspark

In PySpark, I am using foreachPartition(makeHTTPRequests) to post requests that send the data partition by partition. Since foreachPartition runs on the worker nodes, how do I collect the responses? (I know print only shows up in the worker node logs.)

My code is structured as follows:

import json
import requests

def add_scores(spark, XXXXXX):
    headers = login()
    results = ResultsModels(spark)  # to get the Spark SQL model
    scores = results.get_scores(execution_id)
    scores = scores.repartition("id")
    url = "XXXXXXX"
    scores.foreachPartition(make_score_api_call(url, headers))

def make_score_api_call(url, headers):
    def make_call_function(rows):
        payload = []
        for row in rows:
            rowdict = row.asDict()
            rowdict['rules_aggregation'] = json.loads(rowdict['rules_aggregation'])
            payload.append(rowdict)
        response = requests.post(url, json=payload, headers=headers)
        print(response.status_code)  # only visible in the executor logs
        print(response.text)

    return make_call_function

1 Answer:

Answer 0 (score: 0)

You should use log4j, and since a log4j logger is not serializable (and you have to log from the executors), you should wrap it like this:

import org.apache.log4j.{Level, LogManager}

object LogHolder extends Serializable { // serializable wrapper so it can be used inside the executors
    @transient lazy val log = LogManager.getRootLogger // @transient lazy ensures the logger is only initialized when first used on each machine
    log.setLevel(Level.INFO)
}

All of this output will be gathered when your Spark job's logs are collected. So instead of using print:

LogHolder.log.info("response status_code=" + response.status_code)
LogHolder.log.info("response text=" + response.text) 

This only helps with logging the responses; it does not collect the data itself.
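If you do want to collect the responses back on the driver rather than just log them, a minimal sketch (reusing the url and headers from the question) is to switch from foreachPartition to mapPartitions and collect the per-partition results:

import json
import requests

def make_score_api_call(url, headers):
    def make_call_function(rows):
        payload = []
        for row in rows:
            rowdict = row.asDict()
            rowdict['rules_aggregation'] = json.loads(rowdict['rules_aggregation'])
            payload.append(rowdict)
        if not payload:
            return  # nothing to post for an empty partition
        response = requests.post(url, json=payload, headers=headers)
        # Yield instead of print: mapPartitions sends these back to the driver.
        yield (response.status_code, response.text)

    return make_call_function

# collect() triggers the HTTP calls and brings one (status_code, text)
# tuple per non-empty partition back to the driver.
responses = scores.rdd.mapPartitions(make_score_api_call(url, headers)).collect()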