Partition column disappears from the resulting DataFrame in Spark

Date: 2019-05-30 18:21:25

Tags: scala apache-spark apache-spark-sql avro

I am trying to split a Spark DataFrame by the timestamp column update_database_time and write it to HDFS with a defined Avro schema. I suppose that the column used for partitioning disappears from the result. How can I redefine the operation so that this does not happen?

After calling the repartition method I get the following exception:

Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.

1 Answer:

Answer 0 (score: 1):

Judging by the exception provided, the error seems to stem from a schema incompatibility between the Avro schema that was fetched and the schema Spark derived. At a quick glance, the most troublesome parts are probably these:

  1. (Probably Catalyst does not know how to convert a string into enumType)

Spark schema:

StructField(random_enum,StringType,true)

Avro schema:

{
      "name": "random_enum",
      "type": [
        "null",
        {
          "type": "enum",
          "name": "enumType",
          "symbols": [
            "VAL_1",
            "VAL_2",
            "VAL_3"
          ]
        }
      ]
    }
  2. (update_database_time_tz appears only once in the DataFrame's schema, while it effectively appears twice in the Avro schema)

Spark schema:

StructField(update_database_time_tz,TimestampType,true)

Avro schema:

{
      "name": "update_database_time",
      "type": [
        "null",
        {
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    },
    {
      "name": "update_database_time_tz",
      "type": [
        "null",
        {
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    }

Before digging into other possible partitioning issues, I would suggest reconciling the schemas first and getting rid of this exception.
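The field-name mismatch in item 2 can be checked mechanically before touching any Spark code. A minimal plain-Scala sketch (the two name lists are copied from the Spark StructType and the Avro record in the exception above) that diffs the schemas' field names:

```scala
// Field names taken from the Spark StructType in the exception above.
val sparkFields = Set(
  "random_pk", "random_string", "code", "random_bool", "random_int",
  "random_float", "random_double", "random_enum", "random_date",
  "random_decimal", "update_database_time_tz", "random_money")

// Field names taken from the Avro record schema in the exception above.
val avroFields = Set(
  "random_pk", "random_string", "code", "random_bool", "random_int",
  "random_float", "random_double", "random_enum", "random_date",
  "random_decimal", "update_database_time", "update_database_time_tz",
  "random_money")

// Fields the Avro schema expects but the DataFrame no longer carries.
val missingInDataFrame = avroFields -- sparkFields
println(missingInDataFrame) // Set(update_database_time)
```

Running the same diff in the other direction (sparkFields -- avroFields) would reveal columns Spark carries that the Avro schema lacks; here that set is empty.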

EDIT: Regarding number 2, I missed the fact that the Avro schema contains differing names, which leads to the problem of the column update_database_time missing from the DataFrame.
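Regarding item 1, if enforcing the enum on the Avro side is not essential, one option (an assumption on my part, not something the answer prescribes) is to relax random_enum to a plain string in the Avro schema, since Spark's StringType maps directly to the Avro string type:

```json
{
  "name": "random_enum",
  "type": ["string", "null"]
}
```

With matching types on both sides for this field, it should no longer contribute to the IncompatibleSchemaException.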