Question

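I want to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far in, I was hoping someone could help me understand how DataFrames deal with data that has different schemas.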

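The idea is that messages will flow into Kafka with an Avro schema. We should be able to evolve the schema in backwards-compatible ways without having to restart the streaming application (the application logic will still work).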

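It appears trivial to deserialize new versions of messages using a schema registry and the schema ID embedded in the message, by using KafkaUtils to create a direct stream together with the KafkaAvroDecoder (from Confluent). That gets me as far as having a DStream.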

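Problem #1: Within that DStream there will be objects with different versions of the schema. So when I translate each one into a Row object, I should pass in a reader schema that is the latest one so the data is migrated properly, and I need to pass that same latest schema into the sqlContext.createDataFrame(rowRdd, schema) call. The objects in the DStream are of type GenericData.Record, and as far as I can tell there is no easy way to tell which is the most recent version. I see two possible solutions. One is to call the schema registry to get the latest version of the schema on every micro-batch. The other is to modify the decoder to attach the schema ID; I could then iterate over the RDD to find the highest ID and get the schema from a local cache.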
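A minimal sketch of the second option, using plain Scala collections as a stand-in for the RDD. `TaggedRecord`, `SchemaCache`, and `latestSchemaId` are hypothetical names, not part of Spark or Confluent's libraries, and the sketch assumes that a higher schema ID means a newer schema:

```scala
import scala.collection.mutable

// Hypothetical wrapper: a decoded payload tagged with the ID of the schema that wrote it.
final case class TaggedRecord(schemaId: Int, payload: String)

// Hypothetical local cache in front of a schema lookup (for example, a registry call).
// `fetch` is only invoked the first time a given ID is requested.
final class SchemaCache(fetch: Int => String) {
  private val cache = mutable.Map.empty[Int, String]
  def get(id: Int): String = cache.getOrElseUpdate(id, fetch(id))
}

object LatestSchemaSketch {
  // Returns the highest schema ID seen in the batch, or None for an empty batch.
  def latestSchemaId(batch: Iterable[TaggedRecord]): Option[Int] =
    if (batch.isEmpty) None else Some(batch.iterator.map(_.schemaId).max)
}
```

On a real RDD the same idea would look something like `rdd.map(_.schemaId).max()`, followed by one cache lookup on the driver for that ID.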

I was hoping someone had already solved this nicely in a reusable way.

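Problem #2: Spark will have a different executor pulling from Kafka for each partition. What happens to my application when one executor receives a different "latest" schema than the others? The DataFrame created by one executor would then have a different schema than another for the same time window. I don't actually know whether this is a real problem. I'm having trouble visualizing the flow of data and which kinds of operations would run into trouble. If it is a problem, it would imply that there needs to be some data sharing between executors, which sounds both complicated and inefficient.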

Do I need to worry about this? If I do, how do I resolve the schema differences?

Thanks, --Ben

Answer 1

I believe I have resolved this. I am using Confluent's schema registry and KafkaAvroDecoder. Simplified code looks like this:

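import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDecoder
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
// AvroSchemaConverter also needs an import; its package is not shown in this simplified code.

// Get the latest schema here. This schema will be used inside the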
// closure below to ensure that all executors are using the same 
// version for this time slice.
val sr : CachedSchemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000)
val m = sr.getLatestSchemaMetadata(subject)
val schemaId = m.getId
val schemaString = m.getSchema

val outRdd = rdd.mapPartitions(partitions => {
  // Note: we cannot use the schema registry from above because this code
  // will execute on remote machines, requiring the schema registry to be
  // serialized. We could use a pool of these.
  val schemaRegistry : CachedSchemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000)
  val decoder: KafkaAvroDecoder = new KafkaAvroDecoder(schemaRegistry)
  val parser = new Schema.Parser()
  val avroSchema = parser.parse(schemaString)
  val avroRecordConverter = AvroSchemaConverter.createConverterToSQL(avroSchema)

  partitions.map(input => {
    // Decode the message using the latest version of the schema.
    // This will apply Avro's standard schema evolution rules 
    // (for compatible schemas) to migrate the message to the 
    // latest version of the schema.
    // Assumption: each element `input` is the raw Avro-encoded message value (Array[Byte]).
    val record = decoder.fromBytes(input, avroSchema).asInstanceOf[GenericData.Record]
    // Convert the record into a Row with columns according to the schema
    avroRecordConverter(record).asInstanceOf[Row]
  })
})

// Get a Spark StructType representation of the schema to apply 
// to the DataFrame.
val sparkSchema = AvroSchemaConverter.toSqlType(
      new Schema.Parser().parse(schemaString)
    ).dataType.asInstanceOf[StructType]
sqlContext.createDataFrame(outRdd, sparkSchema)
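
The migration step that the decoder performs can be seen in isolation with plain Avro. The following is a minimal sketch, assuming Avro is on the classpath; the `User` schemas and the `SchemaEvolutionSketch` object are hypothetical and only illustrate how a record written with an older schema is resolved against a newer reader schema that adds a field with a default:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SchemaEvolutionSketch {
  // Hypothetical writer schema (v1) and reader schema (v2); v2 adds `age` with a default.
  // A separate parser is used for each because one parser rejects redefining a named type.
  val v1: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[
      |{"name":"name","type":"string"}]}""".stripMargin)
  val v2: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[
      |{"name":"name","type":"string"},
      |{"name":"age","type":"int","default":-1}]}""".stripMargin)

  // Serialize a record to Avro binary using the given schema.
  def write(record: GenericRecord, schema: Schema): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // Read bytes that were written with `writer`, resolving them to `reader`.
  def read(bytes: Array[Byte], writer: Schema, reader: Schema): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](writer, reader).read(null, decoder)
  }
}
```

Reading a v1 record with `read(bytes, v1, v2)` should yield a record that has the v2 shape, with `age` filled in from its default. Because every executor resolves to the same reader schema chosen on the driver, all Rows in the batch end up with the same columns.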

Answer 2

I accomplished this using only Structured Streaming.

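import java.util.Properties

import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.{AbstractKafkaAvroSerDeConfig, KafkaAvroDecoder, KafkaAvroDeserializerConfig}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
// Provides the encoder for the case class below; assumes a SparkSession named `spark`.
import spark.implicits._

case class DeserializedFromKafkaRecord(value: String)

val brokers = "...:9092"
val schemaRegistryURL = "...:8081"
val topicRead = "mytopic"

// Not referenced by the Structured Streaming reader below.
val kafkaParams = Map[String, String](
  "kafka.bootstrap.servers" -> brokers,
  "group.id" -> "structured-kafka",
  "failOnDataLoss" -> "false",
  "schema.registry.url" -> schemaRegistryURL
)

// Holds the decoder and the latest schema in an object so that each executor
// initializes its own copy instead of serializing them from the driver.
object topicDeserializerWrapper {
  val props = new Properties()
  props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL)
  props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")
  val vProps = new kafka.utils.VerifiableProperties(props)
  val deser = new KafkaAvroDecoder(vProps)
  val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(topicRead + "-value")
  val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
}

val df = spark
  .readStream
  .format("kafka")
  .option("subscribe", topicRead)
  .option("kafka.bootstrap.servers", brokers)
  .option("auto.offset.reset", "latest")
  .option("failOnDataLoss", false)
  .option("startingOffsets", "latest")
  .load()
  .map { x =>
    // Decode the Kafka value against the latest schema and keep its JSON-like string form.
    val bytes = x.getAs[Array[Byte]]("value")
    val record = topicDeserializerWrapper.deser
      .fromBytes(bytes, topicDeserializerWrapper.messageSchema)
      .asInstanceOf[GenericData.Record]
    DeserializedFromKafkaRecord(record.toString)
  }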
