Question

我有多个如下所示的架构，具有不同的列名和数据类型。我想使用DataFrame和Scala为每个架构生成测试/模拟数据，并将其保存到镶木地板文件中。

下面是示例架构（来自示例json），用于通过其中的虚拟值动态生成数据。

val schema1 = StructType(
  List(
    StructField("a", DoubleType, true),
    StructField("aa", StringType, true)
    StructField("p", LongType, true),
    StructField("pp", StringType, true)
  )
)

我需要像这样的rdd / dataframe，每个rdd / dataframe都有1000行，具体取决于上述架构中的列数。

val data = Seq(
  Row(1d, "happy", 1L, "Iam"),
  Row(2d, "sad", 2L, "Iam"),
  Row(3d, "glad", 3L, "Iam")
)

基本上..像这样的200个数据集，我需要动态地生成数据，对于我来说，为每种方案编写单独的程序根本是不可能的。

请。帮助我提出您的想法或建议。因为我刚接触火花。

是否可以基于不同类型的架构生成动态数据？

Answer 1

使用@JacekLaskowski的建议，您可以根据期望的字段/类型，使用带有ScalaCheck（Gen）的生成器生成动态数据。

它看起来像这样：

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._

import scala.collection.JavaConverters._

val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
  ("a", DoubleType) -> Gen.choose(0.0, 100.0),
  ("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
  ("p", LongType) -> Gen.choose(0L, 10L),
  ("pp", StringType) -> Gen.oneOf("Iam", "You're")
)

val schemas = Map(
  "schema1" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )),
  "schema2" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("pp", StringType, true),
      StructField("p", LongType, true)
    )
  )
)

val numRecords = 1000

schemas.foreach {
  case (name, schema) =>
    // create a data frame
    spark.createDataFrame(
      // of #numRecords records
      (0 until numRecords).map { _ =>
        // each of them a row
        Row.fromSeq(schema.fields.map(field => {
          // with fields based on the schema's fieldname & type else null
          dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
        }))
      }.asJava, schema)
      // store to parquet
      .write.mode(SaveMode.Overwrite).parquet(name)
}

Answer 2

ScalaCheck是一个用于生成数据的框架，您可以使用自定义生成器根据架构生成原始数据。

访问ScalaCheck Documentation。

Answer 3

您可以这样做

import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.json4s
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods._

import scala.util.Random

object Test extends App {

  val structType: StructType = StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )
  )

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .config(new SparkConf())
    .getOrCreate()

  import spark.implicits._

  val df = createRandomDF(structType, 1000)

  def createRandomDF(structType: StructType, size: Int, rnd: Random = new Random()): DataFrame ={
    spark.read.schema(structType).json((0 to size).map { _ => compact(randomJson(rnd, structType))}.toDS())
  }

  def randomJson(rnd: Random, dataType: DataType): JValue = {

    dataType match {
      case v: DoubleType =>
        json4s.JDouble(rnd.nextDouble())
      case v: StringType =>
        JString(rnd.nextString(10))
      case v: IntegerType =>
        JInt(rnd.nextInt())
      case v: LongType =>
        JInt(rnd.nextLong())
      case v: FloatType =>
        JDouble(rnd.nextFloat())
      case v: BooleanType =>
        JBool(rnd.nextBoolean())
      case v: ArrayType =>
        val size = rnd.nextInt(10)
        JArray(
          (0 to size).map(_ => randomJson(rnd, v.elementType)).toList
        )
      case v: StructType =>
        JObject(
          v.fields.flatMap {
            f =>
              if (f.nullable && rnd.nextBoolean())
                None
              else
                Some(JField(f.name, randomJson(rnd, f.dataType)))
          }.toList
        )
    }
  }
}

如何基于模式动态生成数据集？

3 个答案: