Converting a pipelined RDD to a DataFrame

Date: 2018-08-07 15:02:27

Tags: dataframe pyspark rdd

I am loading data from the Datalake and then selecting fields that are stored in a CSV file. When I then try to display the result, I get this error:

AttributeError: 'PipelinedRDD' object has no attribute 'show'

When I try to convert the PipelinedRDD to a DataFrame with the toDF() function, I get this error:

Py4JError: An error occurred while calling o2871.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can anyone help me?

jump_change_raw = sqlContext.read.format("com.databricks.spark.avro")\
    .load("JUMP_CHANGES/SV1/HISTORY/*.avro")


ddlk = sqlContext.read.format("com.databricks.spark.csv").load("/user/ddlk.csv")

label_fields = ddlk.select(split(ddlk.C0, ";").alias("fields"))

dlk_fields = label_fields.select(
    label_fields.fields[1].alias("jump_dlk"), 
    label_fields.fields[3].alias("impulse_dlk"), 
    label_fields.fields[4].alias("dlk")
).filter(col("dlk") != "")


jump = jump_change_raw.select(
    jump_change_raw.columns
).map(lambda x: dlk_fields.jump_dlk.contains(x)).toDF()

jump.show()

0 Answers:

There are no answers yet.