Spark Streaming and py4j.Py4JException: Method __getnewargs__([]) does not exist

Asked: 2017-03-24 23:18:51

Tags: python apache-spark pyspark pyspark-sql

I am trying to implement a Spark Streaming application, but I am getting the exception: "py4j.Py4JException: Method __getnewargs__([]) does not exist"

I do not understand where this exception comes from. I read here that I cannot use a SparkSession instance outside of the driver. But I do not know whether I am doing that, because I do not understand how to tell whether a given piece of code runs on the driver or on an executor. I understand the difference between transformations and actions (I think), but when it comes to streams and foreachRDD I get lost.

The application is a Spark Streaming application running on AWS EMR, reading data from AWS Kinesis. We submit the Spark application via spark-submit with --deploy-mode cluster. Each record in the stream is a JSON object of the form:

{"type":"some string","state":"an escaped JSON string"}

E.g.:

{"type":"type1","state":"{\"some_property\":\"some value\"}"}

Here is the current state of my application:

# Each handler subclasses from BaseHandler and
# has the method
# def process(self, df, df_writer, base_writer_path)
# Each handler's process method performs additional transformations.
# df_writer is a function which writes a Dataframe to some S3 location.

HANDLER_MAP = {
    'type1': Type1Handler(),
    'type2': Type2Handler(),
    'type3': Type3Handler()
}

FORMAT = 'MyProject %(asctime)s %(levelname)s %(name)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

# Use a closure and lambda to create streaming context
create = lambda: create_streaming_context(
    spark_app_name=spark_app_name,
    kinesis_stream_name=kinesis_stream_name,
    kinesis_endpoint=kinesis_endpoint,
    kinesis_region=kinesis_region,
    initial_position=InitialPositionInStream.LATEST,
    checkpoint_interval=checkpoint_interval,
    checkpoint_s3_path=checkpoint_s3_path,
    data_s3_path=data_s3_path)

streaming_context = StreamingContext.getOrCreate(checkpoint_s3_path, create)

streaming_context.start()
streaming_context.awaitTermination()
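
The df_writer mentioned in the comments above is not shown here; roughly speaking it is just a function that writes a Dataframe to a path under the data S3 location. A simplified sketch of its shape (not the exact code, and the output format is only illustrative):

def df_writer(df, path):
    # Simplified sketch only: the real function writes the Dataframe to some S3 location.
    df.write.mode('append').json(path)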

The function that creates the streaming context:

def create_streaming_context(
    spark_app_name, kinesis_stream_name, kinesis_endpoint,
    kinesis_region, initial_position, checkpoint_interval,
    data_s3_path, checkpoint_s3_path):
    """Create a new streaming context or reuse a checkpointed one."""

    # Spark configuration
    spark_conf = SparkConf()
    spark_conf.set('spark.streaming.blockInterval', 37500)
    spark_conf.setAppName(spark_app_name)

    # Spark context
    spark_context = SparkContext(conf=spark_conf)

    # Spark streaming context
    streaming_context = StreamingContext(spark_context, batchDuration=300)
    streaming_context.checkpoint(checkpoint_s3_path)

    # Spark session
    spark_session = get_spark_session_instance(spark_conf)

    # Set up stream processing
    stream = KinesisUtils.createStream(
        streaming_context, spark_app_name, kinesis_stream_name,
        kinesis_endpoint, kinesis_region, initial_position,
        checkpoint_interval)

    # Each record in the stream is a JSON object in the form:
    # {"type": "some string", "state": "an escaped JSON string"}
    json_stream = stream.map(json.loads)

    for state_type in HANDLER_MAP.iterkeys():
        filter_stream(json_stream, spark_session, state_type, data_s3_path)

    return streaming_context

The get_spark_session_instance function returns a global SparkSession instance (copied from here):

def get_spark_session_instance(spark_conf):
    """Lazily instantiated global instance of SparkSession"""

    logger.info('Obtaining global SparkSession instance...')
    if 'sparkSessionSingletonInstance' not in globals():
        logger.info('Global SparkSession instance does not exist, creating it...')

        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=spark_conf)\
            .getOrCreate()

    return globals()['sparkSessionSingletonInstance']
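
For reference, the example I copied this from calls the helper from inside the function passed to foreachRDD, building the session from the RDD's SparkConf instead of capturing a driver-side session in the closure. Roughly (paraphrased, using my function name; words is the example DStream there):

from pyspark.sql import Row

def process(time, rdd):
    # Pattern from the guide example: obtain the session inside the function
    # that foreachRDD calls, using the RDD's SparkConf.
    spark = get_spark_session_instance(rdd.context.getConf())
    row_rdd = rdd.map(lambda w: Row(word=w))
    words_df = spark.createDataFrame(row_rdd)
    words_df.show()

words.foreachRDD(process)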

The filter_stream function is intended to filter the stream by the type property in the JSON. The intention is to transform the stream into a stream in which each record is the escaped JSON string from the "state" property of the original JSON:

def filter_stream(json_stream, spark_session, state_type, data_s3_path):
    """Filter stream by type and process the stream."""

    state_type_stream = json_stream\
        .filter(lambda jsonObj: jsonObj['type'] == state_type)\
        .map(lambda jsonObj: jsonObj['state'])

    state_type_stream.foreachRDD(lambda rdd: process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path))

The process_rdd function is intended to load the JSON into a Dataframe with the correct schema, depending on the type in the original JSON object. The handler instance returns a valid Spark schema and has a process method that performs further transformations on the Dataframe (after which df_writer is called and the Dataframe is written to S3):

def process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path):
    """Process an RDD by state type."""

    if rdd.isEmpty():
        logger.info('RDD is empty, returning early.')
        return

    handler = HANDLER_MAP[state_type]
    df = spark_session.read.json(rdd, handler.get_schema())
    handler.process(df, df_writer, data_s3_path)
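
To give an idea of the handler interface (this is not the exact code, just the shape, with a made-up schema and output path):

from pyspark.sql.types import StructType, StructField, StringType

class Type1Handler(BaseHandler):
    def get_schema(self):
        # Illustrative only: the real schema matches the "state" JSON for this type.
        return StructType([StructField('some_property', StringType())])

    def process(self, df, df_writer, base_writer_path):
        # Further transformations happen here, then the result is written out
        # via df_writer to a location under base_writer_path.
        df_writer(df, base_writer_path + '/type1')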

Basically I am confused about the source of the exception. Does it have to do with how I am using spark_session.read.json? If so, how? If not, is there something else wrong in my code?

If I just replace the call to StreamingContext.getOrCreate with the contents of the create_streaming_context method, everything seems to work. I was wrong about this: I get the same exception. I think the checkpointing stuff is a red herring... I am obviously doing something else wrong.

I would greatly appreciate any help with this issue, and I am happy to clarify anything or add more information!

0 Answers:

No answers yet.