Create a DataFrame with null values for a few columns

Time: 2016-09-13 07:24:37

Tags: scala apache-spark spark-dataframe apache-spark-dataset

I am trying to create a DataFrame from an RDD.

First, I create the RDD with the following code:

val account = sc.parallelize(Seq(
                                 (1, null, 2,"F"), 
                                 (2, 2, 4, "F"),
                                 (3, 3, 6, "N"),
                                 (4,null,8,"F")))

This works fine:

account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27

But when I try to create a DataFrame from the RDD with the following code:

account.toDF("ACCT_ID", "M_CD", "C_CD","IND")

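I get the following error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

I found that the error occurs only when I put null values in the Seq.

Is there any way to add null values?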

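2 Answers:

Answer 0: (score: 12)

The problem is that Any is too general a type, and Spark has no idea how to serialize it. You should explicitly provide a specific type, in your case Integer. Since null cannot be assigned to primitive types in Scala, you can use java.lang.Integer instead. So try this: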

// Named rdd to match the output and the toDF call that follow.
val rdd = sc.parallelize(Seq(
                                 (1, null.asInstanceOf[Integer], 2, "F"),
                                 (2, Integer.valueOf(2), 4, "F"),
                                 (3, Integer.valueOf(3), 6, "N"),
                                 (4, null.asInstanceOf[Integer], 8, "F")))

Here is the output:

rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

And the corresponding DataFrame:

scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND")

df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]

scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+

You can also consider a cleaner way to declare a null integer value, such as:

object Constants {
  val NullInteger: java.lang.Integer = null
}
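
To make the reasoning concrete, here is a minimal sketch in plain Scala (no Spark needed, so it can be pasted into any Scala REPL) that shows how a bare null widens the tuple slot to Any, while a typed null keeps it as java.lang.Integer. The names untyped and typed are only illustrative, and Constants is repeated so that the snippet stands alone.

```scala
// Holder for a typed null, repeated here so the snippet is self-contained.
object Constants {
  val NullInteger: java.lang.Integer = null
}

// Bare null next to Int values: Null and Int have no common supertype
// narrower than Any, so this is inferred as Seq[(Int, Any, Int, String)].
val untyped = Seq((1, null, 2, "F"), (2, 2, 4, "F"))

// Giving the null a static type keeps the second slot as java.lang.Integer.
// The annotation is a compile-time check of the element type.
val typed: Seq[(Int, java.lang.Integer, Int, String)] = Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, Integer.valueOf(2), 4, "F"))
```

Passing a sequence shaped like typed to sc.parallelize should then give an RDD whose element type Spark can map to a schema, as in the example above.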

Answer 1: (score: 9)

An alternative approach that does not use RDDs:

import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1, None,    2, "F"),
  (2, Some(2), 4, "F"),
  (3, Some(3), 6, "N"),
  (4, None,    8, "F")
)).toDF("ACCT_ID", "M_CD", "C_CD","IND")

df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+

df.printSchema
root
 |-- ACCT_ID: integer (nullable = false)
 |-- M_CD: integer (nullable = true)
 |-- C_CD: integer (nullable = false)
 |-- IND: string (nullable = true)
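
If the rows are built dynamically and neither a tuple nor an Option fits well, another possibility is to supply the schema explicitly, so that Spark never has to infer a type from a null. The following is a sketch only: the local master setting and the app name are assumptions for running it outside spark-shell, and the nullable flags simply mirror the schema printed above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumed local session; inside spark-shell getOrCreate returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-columns-sketch")
  .getOrCreate()

// The column types are declared up front, so a plain null is acceptable in M_CD.
val schema = StructType(Seq(
  StructField("ACCT_ID", IntegerType, nullable = false),
  StructField("M_CD", IntegerType, nullable = true),
  StructField("C_CD", IntegerType, nullable = false),
  StructField("IND", StringType, nullable = true)))

val rows = Seq(
  Row(1, null, 2, "F"),
  Row(2, 2, 4, "F"),
  Row(3, 3, 6, "N"),
  Row(4, null, 8, "F"))

// createDataFrame(RDD[Row], StructType) applies the declared schema to the rows.
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()
```

The trade-off is that Row values are not checked at compile time, so a value that does not match the declared column type only fails when the data is evaluated.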