I'm trying to read date-partitioned Avro files stored on Google Cloud Storage with PySpark. Below is my folder structure; the hierarchy comes from a third party and cannot be changed.
email.bounce
  - event_type=users.messages.email.Bounce
    - date=2019-01-14-04/
      - 422/
        - prod03
          - data.avro
    - date=2019-01-15-04/
      - 422/
        - prod03
          - data.avro
    - date=2019-01-16-04/
      - 422/
        - prod03
          - data.avro
Here is what a data.avro file looks like:
Objavro.schemaì{"type":"record","name":"Bounce","namespace":"some value","doc":"when an email bounces","fields":[{"name":"id","type":"string","doc":"globally unique id for this event"},{"name":"user_id","type":"string","doc":"BSON id of the user that this email was sent to"},{"name":"external_user_id","type":["null","string"],"doc":"external user id of the user","default":null},{"name":"time","type":"int","doc":"unix timestamp at which the email bounced"},{"name":"timezone","type":["null","string"],"doc":"timezone of the user","default":null}
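As a sanity check that these are valid Avro container files, the embedded schema can be dumped from a locally downloaded copy with the fastavro library (a sketch; fastavro is a separate Python package, not part of Spark):

from fastavro import reader

# Open a locally downloaded copy of one data.avro file.
with open("data.avro", "rb") as fh:
    avro_reader = reader(fh)
    # writer_schema is the schema embedded in the file header, i.e. the
    # JSON blob visible in the raw dump above.
    print(avro_reader.writer_schema)
    # Print the first few records to confirm the file is readable.
    for i, record in enumerate(avro_reader):
        print(record)
        if i >= 2:
            break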
Here is what I'm trying to do with PySpark:
file_path = "gs://bucket/email.bounce/event_type=users.messages.email.Bounce/*"
df = spark.read.format("com.databricks.spark.avro").load(file_path)
df.show()
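Note that a single * in a Hadoop-style glob matches only one directory level, so a fully spelled-out variant of the same read looks like the following sketch (the bucket name is a placeholder, and it assumes the same spark session as above):

# Expand each directory level (date=*/422/prod03/data.avro)
# explicitly down to the avro files.
deep_path = ("gs://bucket/email.bounce/"
             "event_type=users.messages.email.Bounce/"
             "date=*/*/*/*.avro")
df = spark.read.format("com.databricks.spark.avro").load(deep_path)
df.show()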
I'm hitting two errors:
1 - The Avro files are not found, and I'm not sure where this option needs to be set. I tried setting avro.mapred.ignore.inputs.without.extension to false via:

conf = SparkConf().set("avro.mapred.ignore.inputs.without.extension", "false")
sc = pyspark.SparkContext(conf=conf)
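For completeness, here is the same attempt with imports, plus the key set on the Hadoop configuration as well, since avro.mapred.ignore.inputs.without.extension is a Hadoop-level setting (a sketch; I'm not sure which of the two placements the Avro reader actually consults):

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Set the flag on the Spark configuration when creating the context.
conf = SparkConf().set("avro.mapred.ignore.inputs.without.extension", "false")
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

# Also set it on the underlying Hadoop configuration, in case the
# reader only looks there.
sc._jsc.hadoopConfiguration().set(
    "avro.mapred.ignore.inputs.without.extension", "false"
)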
2 - If I give the exact path to a single Avro file, all the way down to data.avro (the read is shown after the stack trace below), I get the following error:
An error occurred while calling o782.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, temp-spark-w-1.c.namshi-analytics.internal, executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
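The exact-path read that triggers this exception looks like the following (the bucket name is a placeholder):

# Reading a single file by its full path instead of a wildcard.
exact_path = ("gs://bucket/email.bounce/"
              "event_type=users.messages.email.Bounce/"
              "date=2019-01-14-04/422/prod03/data.avro")
df = spark.read.format("com.databricks.spark.avro").load(exact_path)
df.show()  # fails with the ClassCastException above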
Any suggestions would be appreciated.