Question

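I would like to know how to handle nullable Dataset columns (Option[T]). My goal is to use the Spark Dataset API (for example "map") and benefit from compile-time type checking. (I do not want to use DataFrame API methods such as "select".)

For example, I would like to apply a function to a column. This only works when the column is not nullable.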

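import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = List(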
    StructField("name", StringType, false)
  , StructField("age", IntegerType, true)
  , StructField("children", IntegerType, false)
)

val data = Seq(
  Row("miguel", null, 0),
  Row("luisa", 21, 1)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

case class Person(name: String, age: Option[Int], children: Int)
//                                    ^
//                                    |
//                                 age is nullable
df.as[Person].map(x => x.children * 12).show
//+-----+
//|value|
//+-----+
//|    0|
//|   12|
//+-----+
df.as[Person].map(x => x.age * 12).show
//<console>:36: error: value * is not a member of Option[Int]
//       df.as[Person].map(x => x.age * 12).show

Can anyone point me to a simple way to multiply this nullable age column by 12?

Thanks

Answer 1

Since it is an Option, you can transform it directly. Instead of applying * to the Option itself, use map:

df.as[Person].map(x => x.age.map(_ * 12)).show

// +-----+
// |value|
// +-----+
// | null|
// |  252|
// +-----+
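
To illustrate why this works, here is a minimal sketch in plain Scala collections, with no Spark involved. The object name OptionMapSketch and the fallback value 0 are assumptions made for illustration and are not part of the original question. Option.map applies the function only when a value is present, so None passes through unchanged, which is what shows up as null in the Dataset output above.

```scala
// Same shape as the Person case class from the question.
case class Person(name: String, age: Option[Int], children: Int)

object OptionMapSketch {
  val people: Seq[Person] = Seq(
    Person("miguel", None, 0),
    Person("luisa", Some(21), 1)
  )

  // Option.map runs the function only on Some; None stays None.
  val scaled: Seq[Option[Int]] = people.map(p => p.age.map(_ * 12))

  // If a plain Int is needed, pick an explicit fallback (0 here is an arbitrary choice).
  val withDefault: Seq[Int] = people.map(p => p.age.map(_ * 12).getOrElse(0))
}
```

The same lambda, p => p.age.map(_ * 12), is what gets passed to Dataset.map in the answer above.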

In practice, I would just use select:

df.select(($"age" * 12).as[Int]).show
// +----------+
// |(age * 12)|
// +----------+
// |      null|
// |       252|
// +----------+

It will perform better, and by the time you call as[Person] you have already lost most of the benefits of static type checking.
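
If you prefer to stay with the typed API despite that trade-off, one possible variation is to keep the whole record and rewrite only the age field, so the result is still a Dataset[Person]. This is a sketch under assumptions: the object name ScaleAgeSketch, the helper scaleAge, and the local[*] master are made up for illustration, and the data is built directly from case classes rather than from the DataFrame in the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Option[Int], children: Int)

object ScaleAgeSketch {
  // Hypothetical helper: multiplies age by 12 when present, leaves None untouched.
  def scaleAge(p: Person): Person = p.copy(age = p.age.map(_ * 12))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration
      .appName("option-map-sketch")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(
      Person("miguel", None, 0),
      Person("luisa", Some(21), 1)
    ).toDS()

    // The result keeps the Person type, so later steps stay compile-time checked.
    val scaled: Dataset[Person] = people.map(scaleAge)
    scaled.show()

    spark.stop()
  }
}
```

Keeping the transformation in a plain function such as scaleAge also makes it easy to unit test without starting a Spark session.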

Spark Dataset - map Option[T] field