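Kafka-Spark streaming job processes messages synchronously

Date: 2018-05-16 10:59:57

Tags: scala apache-spark apache-kafka spark-streaming confluent-kafka

I am trying a simple test that uses Kafka Connect and Spark.

I wrote a custom Kafka Connect connector that creates this source record: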

SourceRecord sr = new SourceRecord(null,
                    null,
                    destTopic,
                   Schema.STRING_SCHEMA,
                    cleanPath);
In Spark, I receive the messages like this:

val kafkaConsumerParams = Map[String, String](
      "metadata.broker.list" -> prop.getProperty("kafka_host"),
      "zookeeper.connect" -> prop.getProperty("zookeeper_host"),
      "group.id" -> prop.getProperty("kafka_group_id"),
      "schema.registry.url" -> prop.getProperty("schema_registry_url"),
      "auto.offset.reset" -> prop.getProperty("auto_offset_reset")
    )
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerParams, topicsSet)

val ds = messages.foreachRDD(rdd => {
  val toPrint = rdd.map(t => {
    val file_path = t._2
    val startTime = DateTime.now()
    // Simulate one minute of work per record
    Thread.sleep(1000 * 60)
    1
  }).sum()
  LogUtils.getLogger(classOf[DeviceManager]).info(" toPrint = " + toPrint + " (number of flows calculated)")
})
    }

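When I use the connector to send several messages to the target topic (which has 6 partitions in my test), the sleeping thread gets all the messages, but it processes them synchronously rather than asynchronously.

When I create a simple test producer instead, the sleeps run asynchronously.

I also created 2 simple consumers and tried them with both the connector and the producer, and in both cases the messages were consumed asynchronously. This means my problem lies in how Spark receives the messages sent from the connector. I don't understand why the tasks behave differently from when I send the messages from the producer.

I even printed the records that Spark receives, and they are exactly the same.

Records sent by the producer: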

1: {partition=2, offset=11, value=something, key=null}
2: {partition=5, offset=9, value=something2, key=null}

Records sent by the connector:

1: {partition=3, offset=9, value=something, key=null}

The versions used in my project are:

    <scala.version>2.11.7</scala.version>
    <confluent.version>4.0.0</confluent.version>
    <kafka.version>1.0.0</kafka.version>
    <java.version>1.8</java.version>
    <spark.version>2.0.0</spark.version>

Dependencies:

        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-avro-serializer</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-schema-registry-client</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.0.0-RC1</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-avro-serializer</artifactId>
            <version>${confluent.version}</version>
            <scope>${global.scope}</scope>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-connect-avro-converter</artifactId>
            <version>${confluent.version}</version>
            <scope>${global.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>connect-api</artifactId>
            <version>${kafka.version}</version>
        </dependency>

1 answer:

Answer 0 (score: 1)

We cannot run Spark-Kafka streaming jobs asynchronously, but we can run them in parallel, as Kafka consumers do. To do that, we need to set the following configuration in SparkConf():


The setting has a default value, but we can override it with a higher one.
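The answer describes a SparkConf setting with a default that can be raised. One property that fits this description is `spark.streaming.concurrentJobs`, an undocumented Spark Streaming setting whose default is 1. That this is the property meant here is an assumption, and the application name, the value 4, and the batch interval below are hypothetical placeholders. A minimal sketch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: "spark.streaming.concurrentJobs" is the setting the answer refers to.
    // It is undocumented and defaults to 1, so streaming jobs run one at a time.
    val sparkConf = new SparkConf()
      .setAppName("kafka-spark-test")             // hypothetical application name
      .set("spark.streaming.concurrentJobs", "4") // hypothetical value: up to 4 jobs at once

    // Hypothetical 10-second batch interval; the master URL is expected from spark-submit.
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // ... create the direct stream and register foreachRDD on ssc as in the question ...

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With more than one concurrent job, batches can overlap and finish out of order, so this is probably only suitable when the processing does not depend on batch ordering.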

I hope this helps!