Question

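This has me confused. I'm testing with "spark-testing-base_2.11" % "2.0.0_0.5.0". Can anyone explain why the map function changes the schema when I use a Dataset, but works if I go through an RDD? Any insight is much appreciated.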

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.{ Encoders, SparkSession }
import org.scalatest.{ FunSpec, Matchers }

class TransformSpec extends FunSpec with Matchers with SharedSparkContext {
  describe("data transformation") {
    it("the rdd maintains the schema") {
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      val personEncoder = Encoders.product[TestPerson]
      val personDS = Seq(TestPerson("JoeBob", 29)).toDS
      personDS.schema shouldEqual personEncoder.schema

      val mappedSet = personDS.rdd.map { p: TestPerson => p.copy(age = p.age + 1) }.toDS
      personEncoder.schema shouldEqual mappedSet.schema
    }

    it("datasets choke on explicit schema") {
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      val personEncoder = Encoders.product[TestPerson]
      val personDS = Seq(TestPerson("JoeBob", 29)).toDS

      personDS.schema shouldEqual personEncoder.schema

      val mappedSet = personDS.map[TestPerson] { p: TestPerson => p.copy(age = p.age + 1) }
      personEncoder.schema shouldEqual mappedSet.schema
    }
  }
}

case class TestPerson(name: String, age: Int)

Answer 1

There are a few things conspiring against you here. Spark appears to special-case the types it considers nullable.

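// MyString and MyInt are assumed to be single-field wrappers,
// as the "value" fields in the printed schema suggest.
case class MyString(value: String)
case class MyInt(value: Int)

case class TestTypes(
  scalaString: String,
  javaString: java.lang.String,
  myString: MyString,
  scalaInt: Int,
  javaInt: java.lang.Integer,
  myInt: MyInt
)

Calling Encoders.product[TestTypes].schema.printTreeString prints: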
    root
     |-- scalaString: string (nullable = true)
     |-- javaString: string (nullable = true)
     |-- myString: struct (nullable = true)
     |    |-- value: string (nullable = true)
     |-- scalaInt: integer (nullable = false)
     |-- javaInt: integer (nullable = true)
     |-- myInt: struct (nullable = true)
     |    |-- value: integer (nullable = false)

But if you map over the typed Dataset, every field ends up nullable:

val testTypes: Seq[TestTypes] = Nil
val testDS = testTypes.toDS
testDS.map(foo => foo).schema.printTreeString

prints:
root
 |-- scalaString: string (nullable = true)
 |-- javaString: string (nullable = true)
 |-- myString: struct (nullable = true)
 |    |-- value: string (nullable = true)
 |-- scalaInt: integer (nullable = true)
 |-- javaInt: integer (nullable = true)
 |-- myInt: struct (nullable = true)
 |    |-- value: integer (nullable = true)

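Even if you force the schema to be correct, Spark explicitly ignores nullability when it compares schemas while applying one, which is why you lose several nullability guarantees when you convert back to the typed representation.

You can enrich your types to force a nonNull schema: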

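import org.apache.spark.sql.{ Dataset, Encoder }
import org.apache.spark.sql.types.StructType

implicit class StructImprovements(s: StructType) {
  // Mark every top-level field as non-nullable.
  def nonNull: StructType = StructType(s.map(_.copy(nullable = false)))
}

implicit class DsImprovements[T: Encoder](ds: Dataset[T]) {
  def nonNull: Dataset[T] = {
    val nnSchema = ds.schema.nonNull
    // One way to apply the schema: rebuild the DataFrame from its rows.
    ds.sparkSession.createDataFrame(ds.toDF.rdd, nnSchema).as[T]
  }
}

val mappedSet = personDS.map { p =>
  p.copy(age = p.age + 1)
}.nonNull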

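But you will find that it evaporates as soon as you apply any interesting operation. When the schemas are compared again, Spark lets them through as long as they have the same shape and differ only in nullability.

This appears to be by design: https://github.com/apache/spark/pull/11785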
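
Given that, one option on the test side is to compare schemas while ignoring nullability, rather than forcing nullability back onto the Dataset. The following is only a sketch: `SchemaCompare`, `asNullable`, and `sameShape` are hypothetical names introduced here, not part of Spark's public API or of spark-testing-base.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: compares two schemas while ignoring nullability.
object SchemaCompare {
  // Recursively mark every field, array element, and map value as nullable.
  def asNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = asNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(asNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
    case other => other
  }

  // True when both schemas have the same names and types, whatever their nullability.
  def sameShape(a: StructType, b: StructType): Boolean =
    asNullable(a) == asNullable(b)
}
```

In the failing test from the question, the last assertion could then read `SchemaCompare.sameShape(personEncoder.schema, mappedSet.schema) shouldBe true`.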

Answer 2

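map is a transformation on data. It takes an input and a function and applies that function to every element of the input. The output is the collection of that function's return values, so the schema of the output data depends on the function's return type. map is a fairly standard and heavily used operation in functional programming. If you want to read more, see https://en.m.wikipedia.org/wiki/Map_(higher-order_function).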
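
As a minimal illustration on plain Scala collections (no Spark involved), the element type of the result follows the return type of the function passed to map. `Person` and `MapDemo` are illustrative names, not taken from the question.

```scala
// Illustrative type, standing in for the question's TestPerson.
case class Person(name: String, age: Int)

object MapDemo {
  val people: List[Person] = List(Person("JoeBob", 29))

  // The function returns a Person, so the result is a List[Person].
  val older: List[Person] = people.map(p => p.copy(age = p.age + 1))

  // The function returns a String, so the result is a List[String].
  val names: List[String] = people.map(_.name)
}
```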

Spark map operation changes schema
