Question

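I want to nullify some values in a nested DataFrame and write it to Amazon Redshift, but I get a java.lang.NullPointerException.

Here is more background on my use case and what I have done so far.

I am using spark-redshift (unfortunately, Databricks decided to make it private) to write my DataFrame to Redshift: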

var writer = df
      .coalesce(numPartitions)
      .write
      .format("com.databricks.spark.redshift")
      .option("forward_spark_s3_credentials", true)
      .option("url", url)
      .option("dbtable", destTableName)
      .option("tempdir", s3tempDir)
      .option("postactions", s"grant select on table ${destTableName} to readonly")
      .mode(SaveMode.Append)

I am using selectExpr to nullify the values I need to clear inside the nested schema:

val sourceDFNull = sourceDF.selectExpr(
      """
      named_struct(
          'event_id', event_id,              
          'user', named_struct(
            'country', user.country,
            'id', user.id,
            'state', named_struct('level', null, 'xp', user.state.xp)
          )
      ) as named_struct
    """).select("named_struct.*")

The table schema is:

create table mySchema.myTable
(      
  event_id                                       varchar(256),
  country                                        varchar(256),
  user_id                                        varchar(256),
  level                                          double precision,
  xp                                             bigint
);

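So my Spark code basically applies some logic to the nested schema, produces a DataFrame, and writes it to Redshift.

This is the final DataFrame before it is written to Redshift: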

+---------------------+----------------+----------+--------+-------+
|       event_id      |     country    |  user_id |  level |  xp   |
+---------------------+----------------+----------+--------+-------+
| 54d69802-c414-4ab4  |      GB        |   123    |  null  |  12   |
+---------------------+----------------+----------+--------+-------+

But as soon as I try to write this DataFrame, I get the NullPointerException.

I tried inserting this record into Redshift manually, and it works fine:

INSERT into mySchema.myTable(event_id,country,user_id,level,xp) values ('54d69802-c414-4ab4', 'GB', 123, null, 12);

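As shown, the level column of my table accepts null, and if I print the DataFrame schema just before writing to Redshift I get the following, which shows that it is nullable as well: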

 root
 |-- event_id: string (nullable = true)  
 |-- country: string (nullable = true)
 |-- user_id: string (nullable = true) 
 |-- level: null (nullable = true)
 |-- xp: null (nullable = true)

My only suspicion is the way I am setting null inside the named_struct in selectExpr!

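If I drop the column produced by

named_struct('level', null, 'xp', user.state.xp)

and then write the DataFrame to Redshift, it is stored successfully:

val result = resulttmp.drop("level")

But I don't want to drop the column.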
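The printed schema lists level with type null rather than double, which fits the suspicion about the untyped null literal. Below is a minimal sketch of one thing to try: giving the null an explicit type with cast(null as double) so the column matches the table's double precision column. It assumes a local SparkSession and a hypothetical one-row stand-in for sourceDF (the object name and sample row are illustrative only). Whether this removes the NullPointerException in spark-redshift has not been confirmed here.

```scala
import org.apache.spark.sql.SparkSession

object TypedNullSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this illustration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("typed-null-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for sourceDF with the same nested shape as in the question.
    val sourceDF = Seq(("54d69802-c414-4ab4", "GB", "123", 12L))
      .toDF("event_id", "country", "id", "xp")
      .selectExpr(
        "event_id",
        "named_struct('country', country, 'id', id, 'state', named_struct('xp', xp)) as user")

    // Bare null literal: Spark infers NullType for 'level'.
    val untyped = sourceDF
      .selectExpr("named_struct('level', null, 'xp', user.state.xp) as state")
      .select("state.*")

    // Typed null: 'level' becomes a nullable double.
    val typed = sourceDF
      .selectExpr("named_struct('level', cast(null as double), 'xp', user.state.xp) as state")
      .select("state.*")

    untyped.printSchema()
    typed.printSchema()

    spark.stop()
  }
}
```

Comparing the two printed schemas should show whether the cast changes the type of level. The same cast can then be placed inside the original selectExpr without dropping the column.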

Any suggestions on how to fix this?

Spark SQL named_struct with a NULL value

0 answers: