Spark Structured Streaming cannot read messages from a Kafka topic

Asked: 2020-10-29 16:58:42

Tags: scala apache-spark apache-kafka spark-structured-streaming

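I am using the code below to read messages from a Kafka topic. When I run it on my cluster, it cannot read any messages from the topic. The same code works on my local machine against a local Kafka setup.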

My cluster's Spark version is: 20/10/29 13:15:20 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.2.0

My cluster's Kafka version is: 20/10/29 13:15:29 INFO utils.AppInfoParser: Kafka version : 2.1.0-cdh6.2.0

My Kafka reading code is:

import org.apache.spark.sql.SparkSession

object KafkaTest {
  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession
      .builder()
      .getOrCreate()

    // The log level must be a String; single quotes denote a Char in Scala
    spark.sparkContext.setLogLevel("DEBUG")

    // Batch read of the records currently in the topic
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "data1.company.com:9092")
      .option("subscribe", "onepartitiontopic")
      .option("startingOffsets", "earliest") // from the beginning of the topic
      .option("endingOffsets", "latest") // up to the latest available offset
      .load()

    df.printSchema()
    df.show()
  }
}
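The Kafka source returns key and value as binary columns, so df.show() prints them as byte arrays rather than readable text. The following is a minimal sketch of casting them to strings. It uses a small in-memory DataFrame as a stand-in for the Kafka one so that it runs without a broker; the object name DecodeKafkaColumns, the helper decode, the sample record, and the local master setting are all assumptions made for illustration, not part of the original program.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DecodeKafkaColumns {
  // Casts the binary key/value columns of a Kafka-shaped DataFrame to strings.
  def decode(df: DataFrame): DataFrame =
    df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DecodeKafkaColumns")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the Kafka DataFrame (hypothetical sample record)
    val sample = Seq(("k1".getBytes("UTF-8"), "hello".getBytes("UTF-8")))
      .toDF("key", "value")

    decode(sample).show(false) // false = do not truncate long cell values
    spark.stop()
  }
}
```

In the program above, the same cast could be applied as DecodeKafkaColumns.decode(df).show(false) in place of df.show().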

Here is my pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.test</groupId>
  <artifactId>kafkatest</artifactId>
  <version>0.0.1-SNAPSHOT</version>

        
  <packaging>jar</packaging>

  <name>kafkatest</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Here is the console output of the program on the cluster:

20/10/29 16:52:36 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

20/10/29 16:52:38 INFO consumer.ConsumerConfig: ConsumerConfig values:
        auto.commit.interval.ms = 5000
        auto.offset.reset = earliest
        bootstrap.servers = [data1.company.com:9092]
        check.crcs = true
        client.dns.lookup = default
        client.id =
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = false
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = spark-kafka-relation-9b6084ee-3efc-4ddc-ab81-5946d01576bb-driver-0
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        isolation.level = read_uncommitted
        key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 1
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = null
        sasl.jaas.config = null
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = GSSAPI
        security.protocol = PLAINTEXT
        send.buffer.bytes = 131072
        session.timeout.ms = 10000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        ssl.endpoint.identification.algorithm = null
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLS
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer

If anyone has a solution for this, please help.

1 Answer:

Answer 0 (score: 0)

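The problem you are facing is that you are printing the DataFrame locally on each Spark executor. Your logs, however, appear to come only from the driver, which runs on a different machine.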

When you run this code in local mode, there is no such problem.

If you only want to print the data to the console, you can do the following:

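// df must be a streaming DataFrame here: build it with spark.readStream instead
// of spark.read, and drop the endingOffsets option, which is valid only for batch queries.
val query = df
  .writeStream // no parentheses: writeStream is a parameterless method in Scala
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .start()

query.awaitTermination()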
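For reference, here is one way the pieces might fit together as a complete streaming job. This is a sketch under assumptions, not a verified fix for the cluster issue: the broker address and topic are copied from the question, while the object name KafkaConsoleStream and the helper startConsoleQuery are hypothetical. It omits checkpointLocation because the console sink falls back to a temporary checkpoint directory when none is given, and it casts key and value to strings so the output is readable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object KafkaConsoleStream {
  // Starts a console sink on any streaming DataFrame (hypothetical helper).
  def startConsoleQuery(streamingDf: DataFrame): StreamingQuery =
    streamingDf.writeStream
      .format("console")
      .outputMode("append")
      .option("truncate", "false") // print full cell values
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaConsoleStream")
      .getOrCreate()

    // readStream (not read) yields a streaming DataFrame;
    // endingOffsets is left out because it applies to batch queries only.
    val streamingDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "data1.company.com:9092") // from the question
      .option("subscribe", "onepartitiontopic")                    // from the question
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    val query = startConsoleQuery(streamingDf)
    query.awaitTermination() // blocks until the query is stopped or fails
  }
}
```

The console sink prints each micro-batch to the driver's standard output, so on a cluster the rows show up wherever the driver's stdout is collected.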