Duplicate column exception when reading Parquet files from S3A with Spark

Time: 2016-10-03 19:03:53

Tags: apache-spark amazon-s3 parquet

I have a schema with several Int8 and String columns. I wrote it out in Parquet format and stored it in an S3A bucket for later use.

When I try to read this Parquet file with the Spark read call shown below, I get a duplicate column exception.

I also tried to inspect the file with parquet-tools (using its schema and meta options), but I got an unknown command error.

sqlContext.read.option("mergeSchema", "false").parquet("s3a://....")

How can I make sure the Parquet file is written correctly? Does anyone know how to resolve this duplicate column error?
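For illustration, one way to rule out duplicate column names before writing is sketched below in plain Python. The helper name find_duplicate_columns is hypothetical, and the case-insensitive default is an assumption: it only matters if the reader resolves column names without regard to case, which depends on the Spark version and configuration.

```python
from collections import Counter


def find_duplicate_columns(names, case_sensitive=False):
    """Return the column names that occur more than once, sorted.

    Hypothetical helper, not part of Spark. With case_sensitive=False,
    names that differ only in case (e.g. "Name" and "name") are treated
    as the same column and reported in lowercase.
    """
    keys = [n if case_sensitive else n.lower() for n in names]
    counts = Counter(keys)
    return sorted(k for k, c in counts.items() if c > 1)
```

It could be called on the list of column names of the DataFrame just before the write (for example, the list returned by df.columns in PySpark); an empty result means no duplicates were found under the chosen comparison.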

2 Answers:

Answer 0: (score: 1)

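The problem was that the Parquet file was corrupted. Once I had used parquet-tools to confirm that the file was well-formed Parquet, I was able to read it back into Spark.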
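If parquet-tools is not at hand, a very rough first check for corruption can be sketched in plain Python. It relies on the Parquet file layout, in which an unencrypted file starts and ends with the 4-byte magic "PAR1". The helper name looks_like_parquet is hypothetical, and the sketch assumes the file has first been copied from S3 to a local path. Passing this check does not prove the file is valid; failing it does indicate a truncated or damaged file.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Cheap sanity check on a local file: magic bytes at both ends.

    Hypothetical helper. It does not parse the footer metadata, so it
    cannot detect schema problems such as duplicate columns.
    """
    # Smallest plausible file: header magic (4) + footer length (4) + trailing magic (4).
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that was only partially uploaded typically fails this check because the trailing magic is missing.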

Answer 1: (score: 0)

Try this:

sqlContext.read.option("mergeSchema", "true").parquet("s3a://....")

Here is the documentation.