I am trying to consume from Kafka with Spark, more specifically with PySpark and Structured Streaming.
import os
import time
from ast import literal_eval
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col, struct, explode
from pyspark.sql import SparkSession
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell'
spark = SparkSession \
.builder \
.appName("Structured Streaming") \
.getOrCreate()
requests = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "ip-ec2:9092") \
.option("subscribe", "ssp.requests") \
.option("startingOffsets", "earliest") \
.load()
requests.printSchema()
# root
#  |-- key: binary (nullable = true)
#  |-- value: binary (nullable = true)
#  |-- topic: string (nullable = true)
#  |-- partition: integer (nullable = true)
#  |-- offset: long (nullable = true)
#  |-- timestamp: timestamp (nullable = true)
#  |-- timestampType: integer (nullable = true)
When I run the following code,
rawQuery = requests \
.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream.trigger(processingTime="5 seconds") \
.format("parquet") \
.option("checkpointLocation", "/home/user/folder/applicationHistory") \
.option("path", "/home/user/folder") \
.start()
rawQuery.awaitTermination()
Py4JJavaError                             Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62     try:
---> 63         return f(*a, **kw)
     64     except py4j.protocol.Py4JJavaError as e:

/opt/conda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    319                 "An error occurred while calling {0}{1}{2}.\n".
--> 320                 format(target_id, ".", name), value)
    321             else:

Py4JJavaError: An error occurred while calling o70.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = c2b48840-5ba4-416e-a192-dcae94007856, runId = 4afcca20-00cd-4187-a70b-1b742f1f5c0d]
Current Committed Offsets: {}
Current Available Offsets: {KafkaSource[Subscribe[ssp.requests]]:
I cannot understand the cause of this error:
Py4JJavaError: An error occurred while calling o70.awaitTermination
Answer 0 (score: 0)
I just replaced rawQuery.awaitTermination() with
print(rawQuery.status)
time.sleep(60)
print(rawQuery.status)
rawQuery.stop()
and it works.
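For context, `awaitTermination()` re-raises whatever exception terminated the streaming query, whereas the replacement above just observes `rawQuery.status` for a while and then stops the query. The wait-and-check pattern can be sketched without Spark at all; the `poll_status` helper below is hypothetical and stands in for repeatedly reading `rawQuery.status`:

```python
import time


def poll_status(get_status, interval=1.0, timeout=5.0):
    """Poll a zero-argument status callable until the timeout elapses.

    A generic stand-in for checking rawQuery.status before calling
    rawQuery.stop(); returns every status reading observed.
    """
    readings = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        readings.append(get_status())
        time.sleep(interval)
    return readings


# Example with a fake status source instead of a real StreamingQuery
states = iter(["initializing", "processing", "processing"])
result = poll_status(lambda: next(states, "idle"), interval=0.01, timeout=0.05)
print(result[0])  # the first status observed
```

Unlike `awaitTermination()`, this pattern never surfaces the underlying failure; to see why a query died you would still need to inspect the query's exception or the driver logs.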