Writing struct types to Hive from PySpark

Time: 2018-01-03 16:53:12

Tags: apache-spark hive pyspark parquet

I am trying to write a DataFrame to Hive:

df_block_identity.printSchema()

root
 |-- HUB_ID: long (nullable = false)
 |-- ClientId: string (nullable = true)
 |-- publicID: string (nullable = true)
 |-- CreationAppSource: string (nullable = true)
 |-- LastUpdateAppSource: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- publicID_address: string (nullable = true)
 |-- CreationAppSource_address: string (nullable = true)
 |-- LastUpdateAppSource_address: string (nullable = true)
 |-- AddressNameDesc: string (nullable = true)
 |-- AddressObjective: string (nullable = true)
 |-- AddressQuality: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- ExtraData: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Street1: string (nullable = true)
 |-- Street2: string (nullable = true)
 |-- Street3: string (nullable = true)
 |-- Street4: string (nullable = true)
 |-- ZipCode: string (nullable = true)
 |-- IsPrimaryAddress: string (nullable = true)
 |-- ExternalAddressID: string (nullable = true)
 |-- publicID_MOBILE: string (nullable = true)
 |-- CreationAppSource_MOBILE: string (nullable = true)
 |-- LastUpdateAppSource_MOBILE: string (nullable = true)
 |-- MOBILE: string (nullable = true)
 |-- publicID_FIXE: string (nullable = true)
 |-- CreationAppSource_FIXE: string (nullable = true)
 |-- LastUpdateAppSource_FIXE: string (nullable = true)
 |-- FIXE: string (nullable = true)
 |-- service: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- publicID_Services: string (nullable = true)
 |    |    |-- CreationAppSource_Services: string (nullable = true)
 |    |    |-- LastUpdateAppSource_Services: string (nullable = true)
 |    |    |-- ServiceTypeId: string (nullable = true)
 |    |    |-- ServiceId: string (nullable = true)
 |    |    |-- ServiceStatus: boolean (nullable = true)
 |    |    |-- ActivationDate: timestamp (nullable = true)
 |    |    |-- DeactivationDate: timestamp (nullable = true)
 |-- publicID_Title: string (nullable = true)
 |-- CreationAppSource_Title: string (nullable = true)
 |-- LastUpdateAppSource_Title: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- publicID_Civility: string (nullable = true)
 |-- CreationAppSource_Civility: string (nullable = true)
 |-- LastUpdateAppSource_Civility: string (nullable = true)
 |-- Civility: string (nullable = true)
 |-- publicID_Gender: string (nullable = true)
 |-- CreationAppSource_Gender: string (nullable = true)
 |-- LastUpdateAppSource_Gender: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- publicID_MaritalStatus: string (nullable = true)
 |-- CreationAppSource_MaritalStatus: string (nullable = true)
 |-- LastUpdateAppSource_MaritalStatus: string (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- publicID_BirthDate: string (nullable = true)
 |-- CreationAppSource_BirthDate: string (nullable = true)
 |-- LastUpdateAppSource_BirthDate: string (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- publicID_CSP: string (nullable = true)
 |-- CreationAppSource_CSP: string (nullable = true)
 |-- LastUpdateAppSource_CSP: string (nullable = true)
 |-- CSP: string (nullable = true)
 |-- publicID_NbChildren: string (nullable = true)
 |-- CreationAppSource_NbChildren: string (nullable = true)
 |-- LastUpdateAppSource_NbChildren: string (nullable = true)
 |-- NbChildren: string (nullable = true)
 |-- publicID_PMR: string (nullable = true)
 |-- CreationAppSource_PMR: string (nullable = true)
 |-- LastUpdateAppSource_PMR: string (nullable = true)
 |-- PMR: string (nullable = true)
 |-- publicID_DegreeDisability: string (nullable = true)
 |-- CreationAppSource_DegreeDisability: string (nullable = true)
 |-- LastUpdateAppSource_DegreeDisability: string (nullable = true)
 |-- DegreeDisability: string (nullable = true)
 |-- publicID_CompanyName: string (nullable = true)
 |-- CreationAppSource_CompanyName: string (nullable = true)
 |-- LastUpdateAppSource_CompanyName: string (nullable = true)
 |-- CompanyName: string (nullable = true)
 |-- publicID_LanguageId: string (nullable = true)
 |-- CreationAppSource_LanguageId: string (nullable = true)
 |-- LastUpdateAppSource_LanguageId: string (nullable = true)
 |-- LanguageId: string (nullable = true)
 |-- publicID_NationalityId: string (nullable = true)
 |-- CreationAppSource_NationalityId: string (nullable = true)
 |-- LastUpdateAppSource_NationalityId: string (nullable = true)
 |-- NationalityId: string (nullable = true)

Sample data following this schema:

AHA d4cd8d01-6a4f-446c-838e-ded98c1e8d53    TOTO    TOTO    NULL    .   xxx@gmail.com   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    [{"publicID_Services":"d4cd8d01-6a4f-446c-838e-ded98c1e8d53","CreationAppSource_Services":"TOTO","LastUpdateAppSource_Services":"TOTO","ServiceTypeId":"OPTINS","ServiceId":"PARTENAIRES","ServiceStatus":true,"ActivationDate":"2015-09-18 00:00:00","DeactivationDate":"9999-12-31 23:59:59.999"},{"publicID_Services":"d4cd8d01-6a4f-446c-838e-ded98c1e8d53","CreationAppSource_Services":"TOTO","LastUpdateAppSource_Services":"TOTO","ServiceTypeId":"OPTINS","ServiceId":"NEWSLETTER","ServiceStatus":true,"ActivationDate":"2015-09-18 00:00:00","DeactivationDate":"9999-12-31 23:59:59.999"}]  NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

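I use the command: df_block_identity.write.saveAsTable('sb_party_hub_dev.golden', mode='overwrite', format="parquet"). The command completes without errors, and I can see the table in the Hive Metastore.

But when I try to query it from Hive with select * from sb_party_hub_dev.golden, I get this error: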

  

java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ADL://home/hive/warehouse/sb_party_hub_dev.db/golden/part-r-00000-e3dcac27-021e-43e8-8687-01ae305d5b5d.snappy.parquet

When I remove the field service, which is of array type, the select retrieves the contents of the table.
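The check described above, writing the same table without the array column, can be sketched as follows. This is a minimal sketch and not the code from the original job: the helper name save_without_column and its parameters are hypothetical, and it relies only on DataFrame.drop and the same saveAsTable call used in this question. It is a diagnostic step, since the service data is discarded.

```python
def save_without_column(df, table_name, column="service", fmt="parquet"):
    """Write df to a metastore table after dropping one column.

    Hypothetical helper for illustration; `df` is assumed to be a
    PySpark DataFrame such as df_block_identity.
    """
    # DataFrame.drop returns a new DataFrame without the named column
    trimmed = df.drop(column)
    # Same write call as in the question, applied to the trimmed DataFrame
    trimmed.write.saveAsTable(table_name, mode="overwrite", format=fmt)
    return trimmed
```

Usage would look like save_without_column(df_block_identity, 'sb_party_hub_dev.golden'), followed by the same select from Hive.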

What should I change in my PySpark code so that the table is written to Hive and can be queried without errors?

Edit:

I tried another format: df_block_identity.write.saveAsTable('sb_party_hub_dev.golden', mode='overwrite', format="orc")

With this format, I can access my data through Hive. So why is there a problem with Parquet?
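One setting that may be worth testing here, offered as an unverified assumption and not a confirmed fix for this cluster: Spark has a spark.sql.parquet.writeLegacyFormat option that makes it write Parquet the way Spark 1.4 and earlier did, which is the layout some Hive versions expect. The sketch below assumes a Spark 2.x SparkSession is available as spark; the helper name save_with_legacy_parquet is hypothetical.

```python
def save_with_legacy_parquet(spark, df, table_name):
    """Write df as a Parquet table using Spark's legacy Parquet layout.

    Hypothetical helper for illustration; `spark` is assumed to be a
    SparkSession and `df` a PySpark DataFrame. Whether this resolves the
    ParquetDecodingException in Hive has not been verified.
    """
    # Ask Spark to write Parquet in its legacy (Spark 1.4 and earlier) format
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    # Same write call as in the question
    df.write.saveAsTable(table_name, mode="overwrite", format="parquet")
```

Usage would look like save_with_legacy_parquet(spark, df_block_identity, 'sb_party_hub_dev.golden'). The table has to be rewritten after the setting is changed, because files that were already written keep their original layout.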

0 answers:

No answers