I want to save data to MongoDB while streaming from Twitter. Each RDD in the DStream holds an Array[String] of values, so I assign keys to those values and wrap them in org.bson.Document. When I try to write the Seq of Documents to MongoDB, I get an exception:
ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 9)
java.lang.IllegalArgumentException: clusterListener can not be null
I am using the Spark MongoDB connector, so here are the dependencies from the build.sbt file:
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.kafka" %% "kafka" % "1.1.0",
"org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-core" % "4.0.6",
"org.twitter4j" % "twitter4j-stream" % "4.0.6",
"com.twitter" %% "bijection-avro" % "0.9.6",
"org.mongodb.spark" %% "mongo-spark-connector" % "2.2.2",
"org.mongodb.scala" %% "mongo-scala-driver" % "2.2.0",
"org.json4s" %% "json4s-native" % "3.5.3"
)
I am also running MongoDB from a Docker image via this docker-compose file:
version: '3.3'
services:
  kafka:
    image: spotify/kafka
    ports:
      - "9092:9092"
    environment:
      - ADVERTISED_HOST=localhost
  mongo:
    image: mongo
    restart: always
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: pwd
  mongo-express:
    image: mongo-express
    restart: always
    ports:
      - 8081:8081
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: pwd
Here is the code that streams and writes to the database. wordsArrays here has type DStream[Array[String]]:
wordsArrays.foreachRDD(rdd => rdd.collect.foreach(
  record => {
    val docs = sparkContext.parallelize(Seq(
      new Document("tweetId", record(0)),
      new Document("text", record(1)),
      new Document("favoriteCount", record(1)),
      new Document("retweetCount", record(1)),
      new Document("geoLocation", record(1)),
      new Document("language", record(1)),
      new Document("createdAt", record(1))
    ))
    MongoSpark.save(docs)
  }
))
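Incidentally, the code above creates seven single-field Documents per tweet rather than one Document with seven fields (and every field after tweetId reads record(1)). Since a BSON Document is essentially a key/value map, the conversion can be sketched in pure Scala (no Spark or BSON dependency) by pairing each field name with the value at the same index; the field order 0..6 is an assumption, as the original only uses indices 0 and 1:

```scala
// Sketch, assuming the Array[String] holds the tweet fields in the
// same order as the field names used in the question's code.
object RecordToDoc {
  val fields = Seq("tweetId", "text", "favoriteCount", "retweetCount",
    "geoLocation", "language", "createdAt")

  // Pair each field name with the value at the same position,
  // producing one map (i.e. one document) per tweet record.
  def toDoc(record: Array[String]): Map[String, String] =
    fields.zip(record).toMap

  def main(args: Array[String]): Unit = {
    val record = Array("42", "hello", "3", "1", "0.0,0.0", "en", "2018-05-01")
    val doc = toDoc(record)
    println(doc("tweetId")) // 42
    println(doc.size)       // 7
  }
}
```

In the real pipeline each resulting map would become a single `new Document(...)` populated with `append` calls before saving.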
Answer 0 (score: 0)
Since each element of the DStream is an RDD, the MongoDB Spark connector provides an implicit method to write an RDD directly to MongoDB, using the database and collection from the SparkConf supplied when the SparkSession was created:

wordsArrays.foreachRDD(rdd => rdd.saveToMongoDB())

You can also pass the database, collection, connection URI, and write concern in a Map to a WriteConfig object and use it with the saveToMongoDB() helper method, as shown below (assuming your SparkSession object is called spark):
import com.mongodb.spark.config._

val writeConfig = WriteConfig(
  Map(
    "uri" -> "mongodb://",
    "database" -> "db_name",
    "collection" -> "collectionname",
    "writeConcern.w" -> "majority"),
  Some(WriteConfig(spark.sparkContext)))
wordsArrays.foreachRDD(rdd => rdd.saveToMongoDB(writeConfig))