I am trying to run a project that uses Apache Kafka to fetch tweets, Spark Streaming to process them, and finally stores the tweets in MongoDB (for educational purposes). The project is here: https://github.com/alonsoir/hello-kafka-twitter-scala
I am following all the instructions:
1) Start the Zookeeper server and the Kafka server.
2) Go to the project directory and run "sbt" (the Scala Build Tool), then "pack", which builds the project successfully.
3) Start .\twitter-producer, which starts displaying tweets in a new command prompt.
4) Start .\kafka-connector, which should initialize the Spark Streaming context, but it fails (full command sequence and output below).
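For reference, the full command sequence is roughly the following; the Kafka script paths and the sbt-pack output location are the usual defaults rather than something spelled out in the project README, so adjust them to your setup:

:: 1) from the Kafka installation directory (the standard Windows scripts)
bin\windows\zookeeper-server-start.bat config\zookeeper.properties
bin\windows\kafka-server-start.bat config\server.properties

:: 2) from the project directory: open the sbt shell and build
sbt
> pack

:: 3) and 4) from the directory with the generated launch scripts (sbt-pack puts them under target\pack\bin)
.\twitter-producer
.\kafka-connector 192.168.59.3:9092 Obq6c

Running step 4 produces the following output: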
.\kafka-connector 192.168.59.3:9092 Obq6c
Initializing Streaming Spark Context and kafka connector...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/25 15:00:53 INFO SparkContext: Running Spark version 1.6.1
19/05/25 15:00:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/25 15:00:53 INFO SecurityManager: Changing view acls to: Denis.Denchev
19/05/25 15:00:53 INFO SecurityManager: Changing modify acls to: Denis.Denchev
19/05/25 15:00:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Denis.Denchev); users with modify permissions: Set(Denis.Denchev)
19/05/25 15:00:54 INFO Utils: Successfully started service 'sparkDriver' on port 54008.
19/05/25 15:00:55 INFO Slf4jLogger: Slf4jLogger started
19/05/25 15:00:55 INFO Remoting: Starting remoting
19/05/25 15:00:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.100:54021]
19/05/25 15:00:55 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54021.
19/05/25 15:00:55 INFO SparkEnv: Registering MapOutputTracker
19/05/25 15:00:55 INFO SparkEnv: Registering BlockManagerMaster
19/05/25 15:00:55 INFO DiskBlockManager: Created local directory at C:\Users\Denis.Denchev\AppData\Local\Temp\blockmgr-221b28fc-2862-4b29-933d-d48c5f914759
19/05/25 15:00:55 INFO MemoryStore: MemoryStore started with capacity 1125.8 MB
19/05/25 15:00:55 INFO SparkEnv: Registering OutputCommitCoordinator
19/05/25 15:00:55 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/05/25 15:00:55 INFO SparkUI: Started SparkUI at http://192.168.0.100:4040
19/05/25 15:00:55 INFO Executor: Starting executor ID driver on host localhost
19/05/25 15:00:55 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54040.
19/05/25 15:00:55 INFO NettyBlockTransferService: Server created on 54040
19/05/25 15:00:55 INFO BlockManagerMaster: Trying to register BlockManager
19/05/25 15:00:55 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54040 with 1125.8 MB RAM, BlockManagerId(driver, localhost, 54040)
19/05/25 15:00:55 INFO BlockManagerMaster: Registered BlockManager
19/05/25 15:00:56 INFO VerifiableProperties: Verifying properties
19/05/25 15:00:56 INFO VerifiableProperties: Property group.id is overridden to
19/05/25 15:00:56 INFO VerifiableProperties: Property zookeeper.connect is overridden to
19/05/25 15:01:17 INFO SimpleConsumer: Reconnect due to socket error: java.nio.channels.ClosedChannelException
Exception in thread "main" org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at example.spark.KafkaConnector$.main(KafkaConnectorWithMongo.scala:87)
at example.spark.KafkaConnector.main(KafkaConnectorWithMongo.scala)
19/05/25 15:01:38 INFO SparkContext: Invoking stop() from shutdown hook
19/05/25 15:01:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.100:4040
19/05/25 15:01:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/05/25 15:01:38 INFO MemoryStore: MemoryStore cleared
19/05/25 15:01:38 INFO BlockManager: BlockManager stopped
19/05/25 15:01:38 INFO BlockManagerMaster: BlockManagerMaster stopped
19/05/25 15:01:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/05/25 15:01:38 INFO SparkContext: Successfully stopped SparkContext
19/05/25 15:01:38 INFO ShutdownHookManager: Shutdown hook called
19/05/25 15:01:38 INFO ShutdownHookManager: Deleting directory C:\Users\Denis.Denchev\AppData\Local\Temp\spark-10efc5c8-a803-49cd-b458-2044bd91c557
19/05/25 15:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
19/05/25 15:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Spark then shuts down, and nothing is written to the MongoDB collection.
This is the Scala class that is supposed to initialize the Spark context and connect to MongoDB:
package example.spark
import java.io.File
import java.util.Date
import com.google.gson.{Gson,GsonBuilder, JsonParser}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import com.mongodb.casbah.Imports._
import com.mongodb.QueryBuilder
import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.{MongoDBList, MongoDBObject}
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
//import com.github.nscala_time.time.Imports._
/**
 * Collect at least the specified number of json tweets into cassandra, mongo...
 *
 * on mongo shell:
 *   use alonsodb;
 *   db.tweets.find();
 */
object KafkaConnector {

  private var numTweetsCollected = 0L
  private var partNum = 0
  private val numTweetsToCollect = 10000000

  //this settings must be in reference.conf
  private val Database = "bigdata"
  private val Collection = "tweets"
  private val MongoHost = "127.0.0.1"
  private val MongoPort = 27017
  private val MongoProvider = "com.stratio.datasource.mongodb"

  private val jsonParser = new JsonParser()
  private val gson = new GsonBuilder().setPrettyPrinting().create()

  private def prepareMongoEnvironment(): MongoClient = {
    val mongoClient = MongoClient(MongoHost, MongoPort)
    mongoClient
  }

  private def closeMongoEnviroment(mongoClient : MongoClient) = {
    mongoClient.close()
    println("mongoclient closed!")
  }

  private def cleanMongoEnvironment(mongoClient: MongoClient) = {
    cleanMongoData(mongoClient)
    mongoClient.close()
  }

  private def cleanMongoData(client: MongoClient): Unit = {
    val collection = client(Database)(Collection)
    collection.dropCollection()
  }

  def main(args: Array[String]) {
    // Process program arguments and set properties
    if (args.length < 2) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        "<brokers> <topics>")
      System.exit(1)
    }

    val Array(brokers, topics) = args

    println("Initializing Streaming Spark Context and kafka connector...")

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("KafkaConnector").setMaster("local[4]").set("spark.driver.allowMultipleContexts", "true")
    // val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    println("Initialized Streaming Spark Context and kafka connector...")

    println("Initializing mongodb connector...")
    val mongoClient = prepareMongoEnvironment()
    val collection = mongoClient(Database)(Collection)
    println("Initialized mongodb connector...")

    try {
      /*val sqlContext = new SQLContext(sc)
      println("Creating temporary table in mongo instance...")
      sqlContext.sql(
        s"""|CREATE TEMPORARY TABLE $Collection
            |(id STRING, tweets STRING)
            |USING $MongoProvider
            |OPTIONS (
            |host '$MongoHost:$MongoPort',
            |database '$Database',
            |collection '$Collection'
            |)
         """.stripMargin.replaceAll("\n", " "))*/

      messages.foreachRDD(rdd => {
        val count = rdd.count()
        if (count > 0) {
          val topList = rdd.take(count.toInt)
          println("\nReading data from kafka broker... (%s total):".format(rdd.count()))
          topList.foreach(println)
          //println

          for (tweet <- topList) {
            collection.insert { MongoDBObject("id" -> new Date(), "tweets" -> tweet) }
          } //for (tweet <- topList)

          numTweetsCollected += count
          if (numTweetsCollected > numTweetsToCollect) {
            println
            println("numTweetsCollected > numTweetsToCollect condition is reached. Stopping..." + numTweetsCollected + " " + count)
            //cleanMongoEnvironment(mongoClient)
            closeMongoEnviroment(mongoClient)
            println("shutdown mongodb connector...")
            System.exit(0)
          }
        } //if(count>0)
      }) //messages.foreachRDD(rdd =>

      //studentsDF.where(studentsDF("age") > 15).groupBy(studentsDF("enrolled")).agg(avg("age"), max("age")).show(5)
      //!val tweetsDF = sqlContext.read.format("com.stratio.datasource.mongodb").table(s"$Collection")
      //tweetsDF.show(numTweetsCollected.toInt)
      //! tweetsDF.show(5)
      println("tested a mongodb connection with stratio library...")
    } finally {
      //sc.stop()
      println("finished withSQLContext...")
    }

    ssc.start()
    ssc.awaitTermination()

    println("Finished!")
  }
}
To clarify: the line
// val sc = new SparkContext(sparkConf)
is commented out because I previously got an error saying that only one SparkContext can run in this JVM.
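If I ever need the plain SparkContext again (for example, for the commented-out SQLContext block), my understanding is that I should reuse the context that the StreamingContext already creates instead of constructing a second one. A minimal sketch of that idea (the object name is just illustrative, this is not the project's actual code):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative only: keep a single SparkContext per JVM by reusing the one
// that the StreamingContext creates internally.
object SingleContextSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("KafkaConnector").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2)) // creates the SparkContext internally
    val sc = ssc.sparkContext                             // reuse it instead of new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)                   // would back the commented-out temporary-table code
    // ... create the Kafka direct stream on ssc exactly as in KafkaConnector ...
    ssc.start()
    ssc.awaitTermination()
  }
}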
I have no idea what could be causing this. Any direction would be greatly appreciated.