DF insertInto does not preserve all columns for mixed structured data (json, string)

Asked: 2018-03-07 22:16:25

Tags: scala apache-spark apache-spark-sql spark-dataframe

DataFrame saveAsTable saves all column values correctly, but the insertInto function does not store all of the columns: in particular, the JSON data is truncated and the columns after it are not stored in the Hive table.

Our environment:

  • Spark 2.2.0
  • EMR 5.10.0
  • Scala 2.11.8

Sample data:

 a8f11f90-20c9-11e8-b93e-2fc569d27605   efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated       {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]}

DF SaveAsTable

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lit, unix_timestamp}

// Hive-enabled session with dynamic partitioning switched on
val spark = SparkSession.builder().appName("Spark SQL Test").
  config("hive.exec.dynamic.partition", "true").
  config("hive.exec.dynamic.partition.mode", "nonstrict").
  enableHiveSupport().getOrCreate()

val zoneStatus = spark.table("zone_status")

// Project the source columns plus a literal flag and a load timestamp, then write them out as a new table
zoneStatus.
  select(col("site-id"), col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
  write.mode(SaveMode.Overwrite).saveAsTable("dwh_zone_status")

The data is stored correctly in the resulting table:

a8f11f90-20c9-11e8-b93e-2fc569d27605    efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated   0   {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]} 1520453589

DF insertInto

// Same projection, but written into the pre-existing table via insertInto
zoneStatus.
  select(col("site-id"), col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
  write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")

However, insertInto does not persist everything: the JSON string is only partially stored, and the columns after it are not stored at all.

a8f11f90-20c9-11e8-b93e-2fc569d27605    efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated   0   {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13"  NULL

We use the insertInto function throughout our project and only recently hit this while parsing the JSON data for other metrics: the config content is not stored completely. We plan to switch to saveAsTable, but we could avoid the code change if there is a workaround that can be applied through the Spark configuration.
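One way to see why the two write paths behave differently is to compare the storage descriptions of the table created by saveAsTable with the pre-existing target of insertInto. A diagnostic sketch, assuming the table names used above and a Hive-enabled session:

// Inspect the serde, input/output format and field delimiter of both tables
spark.sql("DESCRIBE FORMATTED dwh_zone_status").show(100, truncate = false)
spark.sql("DESCRIBE FORMATTED zone_status_insert").show(100, truncate = false)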

2 Answers:

Answer 0 (score: 0)

You can use either of the following alternatives to insert the data into the table.

val zoneStatusDF = zoneStatus.
  select(col("site-id"), col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts"))

zoneStatusDF.registerTempTable("zone_status_insert")

Or:

zoneStatus.sqlContext.sql("create table zone_status_insert as select * from zone_status")  
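
To spell out the first alternative end to end, here is a minimal sketch (the temp-view name zone_status_tmp is an assumption); createOrReplaceTempView is the Spark 2.x replacement for the deprecated registerTempTable, and the insert is issued through Spark SQL rather than DataFrameWriter.insertInto. Whether this avoids the truncation still depends on the target table's storage format (see the next answer):

// Register the projected DataFrame under a temporary name, then insert via SQL
zoneStatusDF.createOrReplaceTempView("zone_status_tmp")
spark.sql("INSERT OVERWRITE TABLE zone_status_insert SELECT * FROM zone_status_tmp")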

Answer 1 (score: 0)

The reason is that the target table was created with

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

Removing ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lets insertInto save the entire content: with Hive's default field delimiter ('\001') instead of a comma, the commas inside the JSON column no longer split the row, so the config value is not truncated and the following columns are kept.
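
For illustration, a hedged sketch of recreating the target table without the comma delimiter (column names and types are assumptions inferred from the sample row, with underscores instead of hyphens for simplicity); any storage whose field separator cannot occur inside the JSON avoids the truncation:

// Hypothetical replacement DDL: no FIELDS TERMINATED BY ',' clause, so the
// embedded commas in the config JSON can no longer split the serialized row
spark.sql("""
  CREATE TABLE zone_status_insert (
    site_id STRING, org_id STRING, groupid STRING, zid STRING,
    `type` STRING, flag INT, config STRING, ts BIGINT
  ) STORED AS TEXTFILE
""")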