I'm trying to read date-partitioned Avro files stored on Google Cloud Storage with PySpark. Below is my folder structure; the hierarchy comes from a third party and cannot be changed.
email.bounce
  - event_type=users.messages.email.Bounce
    - date=2019-01-14-04/
      - 422/
        - prod03
          - data.avro
    - date=2019-01-15-04/
      - 422/
        - prod03
          - data.avro
    - date=2019-01-16-04/
      - 422/
        - prod03
          - data.avro
Here is what a data.avro file looks like:
Objavro.schemaì{"type":"record","name":"Bounce","namespace":"some value","doc":"when an email bounces","fields":[{"name":"id","type":"string","doc":"globally unique id for this event"},{"name":"user_id","type":"string","doc":"BSON id of the user that this email was sent to"},{"name":"external_user_id","type":["null","string"],"doc":"external user id of the user","default":null},{"name":"time","type":"int","doc":"unix timestamp at which the email bounced"},{"name":"timezone","type":["null","string"],"doc":"timezone of the user","default":null}
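As a sanity check that these are valid Avro container files, the embedded schema can be dumped from a locally downloaded copy with the fastavro library (a sketch; fastavro is a separate Python package, not part of Spark):

from fastavro import reader

# Open a locally downloaded copy of one data.avro file.
with open("data.avro", "rb") as fh:
    avro_reader = reader(fh)
    # writer_schema is the schema embedded in the file header, i.e. the
    # JSON blob visible in the raw dump above.
    print(avro_reader.writer_schema)
    # Print the first few records to confirm the file is readable.
    for i, record in enumerate(avro_reader):
        print(record)
        if i >= 2:
            break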
Here is what I'm trying to do with PySpark:
file_path = "gs://bucket/email.bounce/event_type=users.messages.email.Bounce/*"
df = spark.read.format("com.databricks.spark.avro").load(file_path)
df.show()
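Note that a single * in a Hadoop-style glob matches only one directory level, so a fully spelled-out variant of the same read looks like the following sketch (the bucket name is a placeholder, and it assumes the same spark session as above):

# Expand each directory level (date=*/422/prod03/data.avro)
# explicitly down to the avro files.
deep_path = ("gs://bucket/email.bounce/"
             "event_type=users.messages.email.Bounce/"
             "date=*/*/*/*.avro")
df = spark.read.format("com.databricks.spark.avro").load(deep_path)
df.show()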
I'm hitting two errors:
1 - The Avro files are not found, and I'm not sure where this option needs to be set. I tried setting avro.mapred.ignore.inputs.without.extension to false via:

conf = SparkConf().set("avro.mapred.ignore.inputs.without.extension", "false")
sc = pyspark.SparkContext(conf=conf)
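For completeness, here is the same attempt with imports, plus the key set on the Hadoop configuration as well, since avro.mapred.ignore.inputs.without.extension is a Hadoop-level setting (a sketch; I'm not sure which of the two placements the Avro reader actually consults):

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Set the flag on the Spark configuration when creating the context.
conf = SparkConf().set("avro.mapred.ignore.inputs.without.extension", "false")
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

# Also set it on the underlying Hadoop configuration, in case the
# reader only looks there.
sc._jsc.hadoopConfiguration().set(
    "avro.mapred.ignore.inputs.without.extension", "false"
)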
2 - If I give the exact path to a single Avro file, all the way down to data.avro (the read is shown after the stack trace below), I get the following error:
An error occurred while calling o782.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, temp-spark-w-1.c.namshi-analytics.internal, executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
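The exact-path read that triggers this exception looks like the following (the bucket name is a placeholder):

# Reading a single file by its full path instead of a wildcard.
exact_path = ("gs://bucket/email.bounce/"
              "event_type=users.messages.email.Bounce/"
              "date=2019-01-14-04/422/prod03/data.avro")
df = spark.read.format("com.databricks.spark.avro").load(exact_path)
df.show()  # fails with the ClassCastException above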
Any suggestions would be appreciated.