I am trying to get a very simple Kafka + Spark Streaming integration working.
On the Kafka side, I cloned this repository (https://github.com/confluentinc/cp-docker-images) and ran docker-compose to bring up ZooKeeper and Kafka instances. I created a topic called "foo" and added some messages to it. In this setup Kafka is listening on port 29092.
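(For reference, the messages were just a handful of test strings, produced with the tooling that ships with the Confluent images; a Scala producer roughly like the sketch below would do the same thing, assuming the kafka-clients jar that Spark pulls in is on the classpath. Treat it as illustrative only.)

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Illustrative only: push a few test strings into "foo" on localhost:29092.
object FooProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:29092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    (1 to 5).foreach { i =>
      producer.send(new ProducerRecord[String, String]("foo", s"message $i"))
    }
    producer.flush()
    producer.close()
  }
}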
On the Spark side, my build.sbt file looks like this:
name := "KafkaSpark"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)
I am able to run the following snippet and have it consume data sent from a terminal:
import org.apache.spark._
import org.apache.spark.streaming._

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()

    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
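(The lines were typed into a terminal listening on port 9999; if you prefer to drive it from Scala, a throwaway line server along these lines would also work. This is a hypothetical helper, not part of the actual test.)

import java.io.PrintWriter
import java.net.ServerSocket

// Hypothetical helper: serve a few lines on port 9999 so that
// socketTextStream("localhost", 9999) above has something to count.
object LineServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    val socket = server.accept()  // wait for the streaming job to connect
    val out = new PrintWriter(socket.getOutputStream, true)
    Seq("hello world", "hello spark", "spark streaming").foreach { line =>
      out.println(line)
      Thread.sleep(1000)          // spread the lines across batches
    }
    out.close()
    socket.close()
    server.close()
  }
}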
So Spark Streaming itself is working.
Now I have written the following to consume from Kafka:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object KafkaTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("Spark Word Count")
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(3))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:29092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "stream_group_id",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("foo")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { (rdd, time) =>
      val data = rdd.map(record => record.value)
      data.foreach(println)
      println(time)
    }

    ssc.start() // Start the computation
    ssc.awaitTermination()
  }
}
When I run this (in IntelliJ), I get the output below in the console, and the process simply hangs after the last line, "Subscribed to topic(s): foo". I tried subscribing to a topic that does not exist and got the same result, i.e. no error is raised even though the topic is missing. If I point it at a broker address that does not exist, I do get an error (Exception in thread "main" org.apache.kafka.common.KafkaException: Failed to construct kafka consumer), so it must be finding the broker and I am indeed using the correct port.
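One sanity check I can think of is running a plain Kafka consumer outside Spark against the same broker, roughly as sketched below (this assumes the same kafka-clients 0.10.0.1 that Spark pulls in transitively, and the group id is arbitrary). If this also sits there silently, the problem would be on the broker/listener side rather than in the Spark code.

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Sketch of a plain-consumer check: read whatever is already in "foo"
// from the beginning and print it (group id "plain_check" is made up).
object PlainConsumerCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:29092")
    props.put("group.id", "plain_check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest") // start from the beginning of the topic
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("foo"))
    val records = consumer.poll(10000)         // 0.10.x API: timeout in milliseconds
    records.asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
    consumer.close()
  }
}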
Any suggestions on how to fix this?
Here is the log output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/23 05:29:42 INFO SparkContext: Running Spark version 2.2.0
17/11/23 05:29:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/23 05:29:48 INFO SparkContext: Submitted application: Spark Word Count
17/11/23 05:29:48 INFO SecurityManager: Changing view acls to: jonathandick
17/11/23 05:29:48 INFO SecurityManager: Changing modify acls to: jonathandick
17/11/23 05:29:48 INFO SecurityManager: Changing view acls groups to:
17/11/23 05:29:48 INFO SecurityManager: Changing modify acls groups to:
17/11/23 05:29:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jonathandick); groups with view permissions: Set(); users with modify permissions: Set(jonathandick); groups with modify permissions: Set()
17/11/23 05:29:48 INFO Utils: Successfully started service 'sparkDriver' on port 59606.
17/11/23 05:29:48 DEBUG SparkEnv: Using serializer: class org.apache.spark.serializer.JavaSerializer
17/11/23 05:29:48 INFO SparkEnv: Registering MapOutputTracker
17/11/23 05:29:48 INFO SparkEnv: Registering BlockManagerMaster
17/11/23 05:29:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/23 05:29:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/23 05:29:48 INFO DiskBlockManager: Created local directory at /private/var/folders/w2/njgz3jnd097cdybxcvp9c2hw0000gn/T/blockmgr-3a3feb00-0fdb-4bc5-867d-808ac65d7c8f
17/11/23 05:29:48 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
17/11/23 05:29:48 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
17/11/23 05:29:49 INFO Utils: Successfully started service 'SparkUI' on port 4043.
17/11/23 05:29:49 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.67:4043
17/11/23 05:29:49 INFO Executor: Starting executor ID driver on host localhost
17/11/23 05:29:49 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59613.
17/11/23 05:29:49 INFO NettyBlockTransferService: Server created on 192.168.1.67:59613
17/11/23 05:29:49 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/23 05:29:49 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.67:59613 with 2004.6 MB RAM, BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/jonathandick/IdeaProjects/KafkaSpark/spark-warehouse/').
17/11/23 05:29:49 INFO SharedState: Warehouse path is 'file:/Users/jonathandick/IdeaProjects/KafkaSpark/spark-warehouse/'.
17/11/23 05:29:50 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/11/23 05:29:50 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
17/11/23 05:29:50 WARN KafkaUtils: overriding enable.auto.commit to false for executor
17/11/23 05:29:50 WARN KafkaUtils: overriding auto.offset.reset to none for executor
17/11/23 05:29:50 WARN KafkaUtils: overriding executor group.id to spark-executor-stream_group_id
17/11/23 05:29:50 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Slide time = 3000 ms
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Checkpoint interval = null
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Remember interval = 3000 ms
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@1a38eb73
17/11/23 05:29:50 INFO ForEachDStream: Slide time = 3000 ms
17/11/23 05:29:50 INFO ForEachDStream: Storage level = Serialized 1x Replicated
17/11/23 05:29:50 INFO ForEachDStream: Checkpoint interval = null
17/11/23 05:29:50 INFO ForEachDStream: Remember interval = 3000 ms
17/11/23 05:29:50 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1e801ce2
17/11/23 05:29:50 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [localhost:29092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id =
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = stream_group_id
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = latest
17/11/23 05:29:50 DEBUG KafkaConsumer: Starting the Kafka consumer
17/11/23 05:29:50 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [localhost:29092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-1
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = stream_group_id
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = latest
17/11/23 05:29:50 INFO AppInfoParser: Kafka version : 0.10.0.1
17/11/23 05:29:50 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
17/11/23 05:29:50 DEBUG KafkaConsumer: Kafka consumer created
17/11/23 05:29:50 DEBUG KafkaConsumer: Subscribed to topic(s): foo