Spark Streaming: maintaining stream data across batches

Time: 2018-04-24 18:18:51

Tags: apache-spark pyspark apache-kafka spark-streaming

I am trying to build an ETL pipeline with pyspark and Kafka, and I need to retain the streamed data so I can operate on it in later batches. I tried stateful streaming with updateStateByKey. It works for a short while, until Spark unpersists the RDDs (recovered from the checkpoint), tries to access them again, and the application crashes with a FileNotFoundException. Here is my code:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    def updateState(new_state, old_state):
        # Keep the values from this batch if any arrived; otherwise carry the old state forward
        if len(new_state) > 0:
            return new_state
        return old_state

    sc = SparkContext(conf=spark_conf)
    sc.setLogLevel("INFO")
    spark = SparkSession(sparkContext=sc)

    ssc = StreamingContext(sc, ssc_config['batchDuration'])

    config = getConfig()['kafka']
    kafka_stream = KafkaStream.create_source(ssc, config, "mytopic")
    new_msg = get_new_msg_stream(kafka_stream)
    transformed_msg = transform_msg(new_msg).updateStateByKey(updateState)

    ssc.checkpoint("./")
    ssc.start()
    ssc.awaitTermination()
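
For reference, updateStateByKey invokes updateState once per key per batch, passing the list of values that arrived for that key in the batch plus the previous state, so with the function above a key keeps its most recent non-empty batch of values. A pure-Python illustration of those semantics (no Spark needed):

    # Illustrative only: updateStateByKey supplies these arguments per key
    assert updateState(["a", "b"], ["old"]) == ["a", "b"]  # new values replace state
    assert updateState([], ["old"]) == ["old"]             # old state carried forward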

I don't understand why this happens, nor whether stateful streaming is a good idea for my use case at all, since I need to keep the state forever. Here is the driver stack trace:

    JobScheduler:54 - Finished job streaming job 1524592150000 ms.0 from job set of time 1524592150000 ms
    PythonRDD:54 - Removing RDD 60 from persistence list
    BlockManager:54 - Removing RDD 60
    CheckpointWriter:54 - Submitted checkpoint of time 1524592150000 ms to writer queue
    CheckpointWriter:54 - Saving checkpoint for time 1524592150000 ms to file 'file:/home/jovyan/ampath/checkpoint-1524592195000'
    SparkContext:54 - Starting job: runJob at PythonRDD.scala:141
    DAGScheduler:54 - Got job 38 (runJob at PythonRDD.scala:141) with 1 output partitions
    DAGScheduler:54 - Submitting ResultStage 46 (PythonRDD[138] at RDD at PythonRDD.scala:48), which has no missing parents
    TaskSchedulerImpl:54 - Cancelling stage 46
    DAGScheduler:54 - ResultStage 46 (runJob at PythonRDD.scala:141) failed in Unknown s due to Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File file:/home/jovyan/ampath/c1949961-34bc-4f6b-b846-247fb8f73ea4/rdd-60/part-00000 does not exist
    java.io.FileNotFoundException: File file:/home/jovyan/ampath/c1949961-34bc-4f6b-b846-247fb8f73ea4/rdd-60/part-00000 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
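
The missing file lives under my local working directory, which makes me suspect my checkpoint setup. From the docs, the checkpoint directory should sit on a fault-tolerant store, and the context should be rebuilt through StreamingContext.getOrCreate. A minimal sketch of that pattern as I understand it (the hdfs:// path is a placeholder, and the other names come from my code above):

    from pyspark.streaming import StreamingContext

    CHECKPOINT_DIR = "hdfs:///user/jovyan/ampath/checkpoint"  # placeholder path

    def create_context():
        # All DStream wiring must happen inside this factory so it can be
        # replayed when the context is recovered from the checkpoint.
        ssc = StreamingContext(sc, ssc_config['batchDuration'])
        kafka_stream = KafkaStream.create_source(ssc, getConfig()['kafka'], "mytopic")
        state = transform_msg(get_new_msg_stream(kafka_stream)).updateStateByKey(updateState)
        state.pprint()  # placeholder output operation
        ssc.checkpoint(CHECKPOINT_DIR)
        return ssc

    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()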

Here is the definition of transform_msg:

    def transform_msg(msg_stream):
        def transform(rdd):
            if rdd.isEmpty():
                # Nothing arrived this batch; pass the empty RDD through unchanged
                return rdd
            msg_df = rdd.toDF()
            return group_msgs(msg_df).rdd

        return msg_stream.transform(lambda rdd: transform(rdd))

    from pyspark.sql import functions as f

    def group_msgs(obs_df):
        # Replace the string "null" with real nulls in every column except obs_group_id
        cols = [f.when(~f.col(x).isin("null"), f.col(x)).alias(x)
                for x in obs_df.columns if x != "obs_group_id"]

        obs = obs_df.select(*cols, "obs_group_id")

        grouped_by_obsgroup = obs\
            .withColumn("strObs", f.struct(f.col("obs_id"), f.col("obs_voided"),
                        f.col("concept_id"), f.col("value"), f.col("value_type"),
                        f.col("obs_date").alias("obs_datetime")))\
            .groupBy("obs_group_id", "encounter_id")\
            .agg(f.struct(f.col("obs_group_id"), f.collect_list("strObs").alias("obs")).alias("obs"))

        grouped_by_encounter = grouped_by_obsgroup\
            .groupBy("encounter_id")\
            .agg(f.to_json(f.collect_list(f.col("obs"))).alias("obs"))

        return grouped_by_encounter
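
To sanity-check group_msgs on its own, I run it against a tiny local DataFrame like this (the sample rows are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    sample = spark.createDataFrame(
        [("1", "0", "100", "36.6", "numeric", "2018-04-24", "g1", "e1"),
         ("2", "0", "101", "null", "text",    "2018-04-24", "g1", "e1")],
        ["obs_id", "obs_voided", "concept_id", "value", "value_type",
         "obs_date", "obs_group_id", "encounter_id"])

    # Expect one row per encounter_id carrying a JSON array of grouped obs structs
    group_msgs(sample).show(truncate=False)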

My goal is to keep a lookup that I can use to enrich my streams. Is there a better way to do this? Any help is much appreciated.

0 Answers:

No answers yet.