In PySpark, I am posting data partition-by-partition with foreachPartition(makeHTTPRequests). Since foreachPartition runs on the worker nodes, how can I collect the responses? (I know that print only ends up in the worker logs.)
My code is structured as follows:
import json
import requests

def add_scores(spark, XXXXXX):
    headers = login()
    results = ResultsModels(spark)  # to get the Spark SQL model
    scores = results.get_scores(execution_id)
    scores = scores.repartition("id")
    url = "XXXXXXX"
    scores.foreachPartition(make_score_api_call(url, headers))

def make_score_api_call(url, headers):
    def make_call_function(rows):
        payload = []
        for row in rows:
            rowdict = row.asDict()
            rowdict['rules_aggregation'] = json.loads(rowdict['rules_aggregation'])
            payload.append(rowdict)
        response = requests.post(url, json=payload, headers=headers)
        print(response.status_code)
        print(response.text)
    return make_call_function
Answer (score: 0)
You should use log4j, and since log4j loggers are not serializable (and you have to log from the executors), you should use it like this:
import org.apache.log4j.{Level, LogManager}

object LogHolder extends Serializable { // object used to log from within the executors
  // @transient lazy makes the logger initialize only when first used on each machine.
  @transient lazy val log = LogManager.getRootLogger
  log.setLevel(Level.INFO)
}
All of this data will then be gathered together when your Spark job's logs are collected. So instead of print, use:
LogHolder.log.info("response status_code=" + response.status_code)
LogHolder.log.info("response text=" + response.text)
Note that this only helps with logging the responses; it does not bring the data itself back to the driver.
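If the goal is actually to collect the responses on the driver, one option (my sketch, not from the original answer; it reuses the scores, url and headers variables from the question) is to switch from foreachPartition to mapPartitions on the underlying RDD and yield the per-partition results:

import json
import requests

def make_score_api_call(url, headers):
    def make_call_function(rows):
        payload = []
        for row in rows:
            rowdict = row.asDict()
            rowdict['rules_aggregation'] = json.loads(rowdict['rules_aggregation'])
            payload.append(rowdict)
        response = requests.post(url, json=payload, headers=headers)
        # Yield instead of print, so the result travels back to the driver.
        yield (response.status_code, response.text)
    return make_call_function

# mapPartitions is lazy; collect() triggers the HTTP calls and returns
# one (status_code, body) tuple per partition to the driver.
responses = scores.rdd.mapPartitions(make_score_api_call(url, headers)).collect()
for status_code, body in responses:
    print(status_code, body)

collect() is fine here because there is only one small tuple per partition; avoid it for anything large.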