How to read different partition formats of Avro from S3 into Spark?

Asked: 2018-11-12 06:41:50

Tags: apache-spark amazon-s3 apache-spark-sql avro

I have an S3 bucket with two partition layouts:

  1. s3://bucketname/tablename/year/month/day
  2. s3://bucketname/tablename/device/year/month/day

The file format is Avro.

I tried reading it with val df = spark.read.format("com.databricks.spark.avro").load("s3://bucketname/tablename").

The error message is:

java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

    Partition column name list #0: xx, yy
    Partition column name list #1: xx

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

1 answer:

Answer 0 (score: 1)

You cannot read both of them at once. As the error itself states,


directories at the same level should have the same partition column names.

Read them separately (using two S3 paths that go all the way down to the leaf directories), and then, if the schemas match, you can union the resulting DataFrames.
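A minimal sketch of that approach, assuming Spark 2.x with the com.databricks.spark.avro package from the question; the concrete leaf paths, the device value, and the date components are placeholders you would replace with your own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-mixed-avro-partitions")
  .getOrCreate()

// Layout 1: .../tablename/year/month/day
// Load from a leaf (or any prefix whose subtree uses only this layout),
// so partition discovery never sees both layouts at once.
val dfByDate = spark.read
  .format("com.databricks.spark.avro")
  .load("s3://bucketname/tablename/2018/11/12")  // placeholder leaf path

// Layout 2: .../tablename/device/year/month/day
val dfByDevice = spark.read
  .format("com.databricks.spark.avro")
  .load("s3://bucketname/tablename/deviceA/2018/11/12")  // placeholder leaf path

// If the schemas match, combine them. unionByName (Spark 2.3+)
// aligns columns by name instead of by position, which is safer
// when the two reads may order columns differently.
val df = dfByDate.unionByName(dfByDevice)
```

If the two subtrees have many leaves, you can also pass several paths to a single load call (load accepts varargs), as long as every path you pass belongs to the same partition layout.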