I am trying to implement a Spark Streaming application, but I am getting an exception: "py4j.Py4JException: Method __getnewargs__([]) does not exist".
I do not understand where this exception comes from. I read here that you cannot use a SparkSession instance outside of the driver, but I do not know whether I am doing that. I do not understand how to tell whether some code executes on the driver or on an executor - I understand the difference between transformations and actions (I think), but when it comes to streams and foreachRDD, I get lost.
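To illustrate where my mental model might be going wrong, here is a minimal sketch (all names are made up, purely for illustration) of what I believe runs where with foreachRDD:

# My (possibly wrong) understanding of driver vs. executor code:
def per_batch(rdd):
    # The function passed to foreachRDD runs on the DRIVER, once per batch,
    # so (as I understand it) SparkSession/SparkContext are usable here.
    if rdd.isEmpty():
        return
    # Anything passed into rdd.map/filter/foreachPartition is pickled and
    # shipped to the EXECUTORS, so it must not capture driver-only objects.
    rdd.foreachPartition(lambda partition: list(partition))

# some_dstream.foreachRDD(per_batch)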
The application is a Spark Streaming application running on AWS EMR, reading data from AWS Kinesis. We submit the Spark application via spark-submit with --deploy-mode cluster. Each record in the stream is a JSON object of the form:
{"type":"some string","state":"an escaped JSON string"}
E.g.:
{"type":"type1","state":"{\"some_property\":\"some value\"}"}
Here is the current state of my application:
import json
import logging

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Each handler subclasses from BaseHandler and
# has the method
# def process(self, df, df_writer, base_writer_path)
# Each handler's process method performs additional transformations.
# df_writer is a function which writes a Dataframe to some S3 location.
HANDLER_MAP = {
    'type1': Type1Handler(),
    'type2': Type2Handler(),
    'type3': Type3Handler()
}

FORMAT = 'MyProject %(asctime)s %(levelname)s %(name)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

# Use a closure and lambda to create streaming context
# (spark_app_name, kinesis_*, checkpoint_* and data_s3_path come from
# configuration - omitted here)
create = lambda: create_streaming_context(
    spark_app_name=spark_app_name,
    kinesis_stream_name=kinesis_stream_name,
    kinesis_endpoint=kinesis_endpoint,
    kinesis_region=kinesis_region,
    initial_position=InitialPositionInStream.LATEST,
    checkpoint_interval=checkpoint_interval,
    checkpoint_s3_path=checkpoint_s3_path,
    data_s3_path=data_s3_path)

streaming_context = StreamingContext.getOrCreate(checkpoint_s3_path, create)
streaming_context.start()
streaming_context.awaitTermination()
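For context, a handler looks roughly like this (a minimal sketch with a made-up schema, just to show the shape of the interface):

from pyspark.sql.types import StructType, StructField, StringType

class Type1Handler(BaseHandler):
    def get_schema(self):
        # A pyspark.sql.types.StructType describing the escaped 'state' JSON
        return StructType([StructField('some_property', StringType())])

    def process(self, df, df_writer, base_writer_path):
        # Additional transformations, then write the Dataframe to S3
        df_writer(df, base_writer_path)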
The function that creates the streaming context:
def create_streaming_context(
        spark_app_name, kinesis_stream_name, kinesis_endpoint,
        kinesis_region, initial_position, checkpoint_interval,
        data_s3_path, checkpoint_s3_path):
    """Create a new streaming context or reuse a checkpointed one."""
    # Spark configuration
    spark_conf = SparkConf()
    spark_conf.set('spark.streaming.blockInterval', 37500)
    spark_conf.setAppName(spark_app_name)

    # Spark context
    spark_context = SparkContext(conf=spark_conf)

    # Spark streaming context
    streaming_context = StreamingContext(spark_context, batchDuration=300)
    streaming_context.checkpoint(checkpoint_s3_path)

    # Spark session
    spark_session = get_spark_session_instance(spark_conf)

    # Set up stream processing
    stream = KinesisUtils.createStream(
        streaming_context, spark_app_name, kinesis_stream_name,
        kinesis_endpoint, kinesis_region, initial_position,
        checkpoint_interval)

    # Each record in the stream is a JSON object in the form:
    # {"type": "some string", "state": "an escaped JSON string"}
    json_stream = stream.map(json.loads)

    for state_type in HANDLER_MAP:
        filter_stream(json_stream, spark_session, state_type, data_s3_path)

    return streaming_context
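My understanding of StreamingContext.getOrCreate (which may be part of my confusion) is that the setup function only runs when no checkpoint exists yet - a minimal sketch with a hypothetical local path:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def setup():
    sc = SparkContext(appName='demo')
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint('/tmp/demo-checkpoint')
    # The DStream graph defined here is serialized into the checkpoint.
    return ssc

# First run: no checkpoint at the path, so setup() is called.
# On restart: the graph (and whatever it captured) is deserialized from the
# checkpoint and setup() is NOT called again.
ssc = StreamingContext.getOrCreate('/tmp/demo-checkpoint', setup)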
The function get_spark_session_instance returns a global SparkSession instance (copied from here):
def get_spark_session_instance(spark_conf):
    """Lazily instantiated global instance of SparkSession"""
    logger.info('Obtaining global SparkSession instance...')
    if 'sparkSessionSingletonInstance' not in globals():
        logger.info('Global SparkSession instance does not exist, creating it...')
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=spark_conf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']
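For reference, the Spark Streaming programming guide invokes this singleton from inside the foreachRDD function, using the RDD's own context rather than a session captured from the enclosing scope - I am not sure whether that difference matters here:

from pyspark.sql import Row

# The pattern from the guide, adapted to my function names:
def process(time, rdd):
    spark = get_spark_session_instance(rdd.context.getConf())
    row_rdd = rdd.map(lambda w: Row(word=w))
    words_df = spark.createDataFrame(row_rdd)
    words_df.show()

# words.foreachRDD(process)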
The filter_stream function is intended to filter the stream by the type property in the JSON. The intent is to transform the stream into one where each record is the escaped JSON string from the 'state' property of the original JSON:
def filter_stream(json_stream, spark_session, state_type, data_s3_path):
    """Filter stream by type and process the stream."""
    state_type_stream = json_stream\
        .filter(lambda jsonObj: jsonObj['type'] == state_type)\
        .map(lambda jsonObj: jsonObj['state'])

    state_type_stream.foreachRDD(lambda rdd: process_rdd(
        spark_session, rdd, state_type, df_writer, data_s3_path))
The process_rdd function is intended to load the JSON into a Dataframe with the correct schema, depending on the type in the original JSON object. The handler instance returns a valid Spark schema and has a process method which performs further transformations on the Dataframe (after which df_writer is called and the Dataframe is written to S3):
def process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path):
    """Process an RDD by state type."""
    if rdd.isEmpty():
        logger.info('RDD is empty, returning early.')
        return
    handler = HANDLER_MAP[state_type]
    df = spark_session.read.json(rdd, handler.get_schema())
    handler.process(df, df_writer, data_s3_path)
Basically, I am confused about the source of the exception. Is it related to how I am using spark_session.read.json? If so, how? If not, is there something else incorrect in my code?
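In isolation (outside of streaming) the equivalent batch usage seems fine, which is why I am unsure that read.json is the culprit - a minimal sketch, assuming a trivial schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('json-schema-demo').getOrCreate()
schema = StructType([StructField('some_property', StringType())])
rdd = spark.sparkContext.parallelize(['{"some_property":"some value"}'])
spark.read.json(rdd, schema).show()
# +-------------+
# |some_property|
# +-------------+
# |   some value|
# +-------------+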
Everything seemed to work when I simply replaced the call to StreamingContext.getOrCreate with the contents of the create_streaming_context method. Edit: I was wrong about this - I get the same exception either way. I think the checkpointing is a red herring; I am obviously doing something else wrong.
I would really appreciate any help with this problem, and I am happy to clarify anything or add more information!