I am trying to build an ETL pipeline using PySpark and Kafka. I need to retain the streams so that I can operate on them later. I tried stateful streaming with updateStateByKey, and it works for a short while, until Spark unpersists the RDDs (from the checkpoint), tries to access them again, and the application crashes with a FileNotFoundException. Here is my code:
sc = SparkContext(conf=spark_conf)
sc.setLogLevel("INFO")
spark = SparkSession(sparkContext=sc)
ssc = StreamingContext(sc, ssc_config['batchDuration'])
config = getConfig()['kafka']
kafka_stream = KafkaStream.create_source(ssc, config, "mytopic")
new_msg = get_new_msg_stream(kafka_stream)
transformed_msg = transform_msg(new_msg).updateStateByKey(updateState)
ssc.checkpoint("./")
ssc.start()
ssc.awaitTermination()
def updateState(new_state, old_state):
    if len(new_state) > 0:
        return new_state
    return old_state
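For reference, this is the minimal updateStateByKey pattern I modelled my code on (a sketch only: the checkpoint path, batch interval, and socket source below are placeholders, not my real Kafka setup):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-sketch")
ssc = StreamingContext(sc, 5)  # 5-second batches
ssc.checkpoint("file:///tmp/streaming-checkpoint")  # updateStateByKey requires a checkpoint directory

def update(new_values, old_total):
    # keep a running count per key, preserving the old total when a batch has no new values
    return sum(new_values) + (old_total or 0)

counts = ssc.socketTextStream("localhost", 9999) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(update)
counts.pprint()

ssc.start()
ssc.awaitTermination()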
I don't understand why this happens, or whether stateful streaming is even a good idea for my use case, since I need to keep the state forever. Here is the driver stack trace:
JobScheduler:54 - Finished job streaming job 1524592150000 ms.0 from job set of time 1524592150000 ms
PythonRDD:54 - Removing RDD 60 from persistence list
BlockManager:54 - Removing RDD 60
CheckpointWriter:54 - Submitted checkpoint of time 1524592150000 ms to writer queue
CheckpointWriter:54 - Saving checkpoint for time 1524592150000 ms to file 'file:/home/jovyan/ampath/checkpoint-1524592195000'
SparkContext:54 - Starting job: runJob at PythonRDD.scala:141
DAGScheduler:54 - Got job 38 (runJob at PythonRDD.scala:141) with 1 output partitions
DAGScheduler:54 - Submitting ResultStage 46 (PythonRDD[138] at RDD at PythonRDD.scala:48), which has no missing parents
TaskSchedulerImpl:54 - Cancelling stage 46
DAGScheduler:54 - ResultStage 46 (runJob at PythonRDD.scala:141) failed in Unknown s due to Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File file:/home/jovyan/ampath/c1949961-34bc-4f6b-b846-247fb8f73ea4/rdd-60/part-00000 does not exist
java.io.FileNotFoundException: File file:/home/jovyan/ampath/c1949961-34bc-4f6b-b846-247fb8f73ea4/rdd-60/part-00000 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
Here is the definition of the transform_msg method:
def transform_msg(msg_stream):
    transformed_stream = msg_stream.transform(lambda rdd: transform(rdd))

    def transform(rdd):
        if not rdd.isEmpty():
            msg_df = rdd.toDF()
            transformed_msg = group_msgs(msg_df).rdd
            return transformed_msg

    return transformed_stream
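Since updateStateByKey operates on a DStream of (key, value) pairs, I rely on each output row of group_msgs below behaving like an (encounter_id, json) pair; if I wanted to make that explicit I could map the rows first (a sketch, assuming the column order group_msgs produces):

def to_pairs(rdd):
    # each row carries the grouping key (encounter_id) first, then the aggregated JSON string
    return rdd.map(lambda row: (row[0], row[1]))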
def group_msgs(obs_df):
    cols = [f.when(~f.col(x).isin("null"), f.col(x)).alias(x) for x in obs_df.columns if x != "obs_group_id"]
    obs = obs_df.select(*cols, "obs_group_id")
    grouped_by_obsgroup = obs\
        .withColumn("strObs", f.struct(f.col("obs_id"), f.col("obs_voided"),
                                       f.col("concept_id"), f.col("value"), f.col("value_type"),
                                       f.col("obs_date").alias("obs_datetime")))\
        .groupBy("obs_group_id", "encounter_id") \
        .agg(f.struct(f.col("obs_group_id"), f.collect_list("strObs").alias("obs")).alias("obs"))
    grouped_by_encounter = grouped_by_obsgroup \
        .groupBy("encounter_id")\
        .agg(f.to_json(f.collect_list(f.col("obs")).alias("obs")))
    return grouped_by_encounter
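To sanity-check group_msgs outside the stream, I run it on a tiny static DataFrame like this (the rows and values are made up purely to mirror the columns the function references; on a recent Spark version it produces one JSON string per encounter_id):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

sample = spark.createDataFrame(
    [("1", "false", "100", "120",  "numeric", "2018-04-24 10:00:00", "g1", "e1"),
     ("2", "false", "101", "null", "text",    "2018-04-24 10:00:00", "g1", "e1")],
    ["obs_id", "obs_voided", "concept_id", "value", "value_type",
     "obs_date", "obs_group_id", "encounter_id"])

group_msgs(sample).show(truncate=False)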
My goal is to keep a view of this data that I can use to enrich my streams. Is there a better way to do this? Any help is greatly appreciated.