I have the following code and I cannot get it to write data into Mongo. I do not even see the database or collection name appear in MongoDB, so something seems to be wrong. No exceptions are thrown when I run this code.
private SparkSession sparkSession;
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(Configuration.getConfig().getString("spark.master"));
sparkConf.set("spark.mongodb.input.uri", "mongodb://localhost/analytics.counters");
sparkConf.set("spark.mongodb.output.uri", "mongodb://localhost/analytics.counters");
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
sparkSession.sparkContext().setLogLevel("INFO");
this.sparkSession = sparkSession;
MongoConnector mongoConnector = MongoConnector.apply(sparkSession.sparkContext());
WriteConfig writeConfig = getMongoWriteConfig(sparkSession, "hello");
ReadConfig readConfig = getMongoReadConfig(sparkSession, "hello");
Dataset<String> jsonDS = newDS.select(to_json(struct(col("*")))).as(Encoders.STRING());
Dataset<Boolean> dataset = jsonDS
        .map(new MapFunction<String, Boolean>() {
            @Override
            public Boolean call(String kafkaPayload) throws Exception {
                System.out.println(kafkaPayload);
                Document jsonDocument = Document.parse(kafkaPayload);
                String id = jsonDocument.getString("ID");
                jsonDocument.put("_id", id);
                return mongoConnector.withCollectionDo(writeConfig, Document.class, new Function<MongoCollection<Document>, Boolean>() {
                    @Override
                    public Boolean call(MongoCollection<Document> collection) throws Exception {
                        return collection.replaceOne(and(eq("_id", id), lt("TIMESTAMP", jsonDocument.getString("TIMESTAMP"))),
                                jsonDocument, new UpdateOptions().upsert(true)).wasAcknowledged();
                    }
                });
            }
        }, Encoders.BOOLEAN());
StreamingQuery query1 = dataset
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .foreach(new KafkaSink("metrics"))
        .option("checkpointLocation", getCheckpointPath(CheckpointPath.LOCAL_WRITE) + "/metrics")
        .start();
query1.awaitTermination();
private static ReadConfig getMongoReadConfig(SparkSession sparkSession, String collectionName) {
    ReadConfig readConfig = ReadConfig.create(sparkSession);
    Map<String, String> readOverrides = new HashMap<String, String>();
    readOverrides.put("readConcern.level", "majority");
    readConfig.withOptions(readOverrides);
    return readConfig;
}

private static WriteConfig getMongoWriteConfig(SparkSession sparkSession, String collectionName) {
    WriteConfig writeConfig = WriteConfig.create(sparkSession);
    Map<String, String> writeOverrides = new HashMap<String, String>();
    writeOverrides.put("writeConcern.w", "majority");
    writeConfig.withOptions(writeOverrides);
    return writeConfig;
}
I submit the job with spark-submit, passing the following arguments:
spark-submit --master local[*] \
--driver-memory 4g \
--executor-memory 2g \
--class com.hello.stream.app.Hello \
--conf "spark.mongodb.input.uri=mongodb://localhost/analytics.counters" \
--conf "spark.mongodb.output.uri=mongodb://localhost/analytics.counters" \
build/libs/hello-stream.jar
Here is the list of jars (dependencies) I use:
def sparkVersion = '2.2.0'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.kafka', name: 'kafka-clients', version: '0.10.0.1'
compile group: 'org.mongodb.spark', name: 'mongo-spark-connector_2.11', version: sparkVersion
compile 'org.mongodb:mongodb-driver:3.0.4'
When I run the job I get the following output (a shortened version of my INFO logs):
17/09/12 10:16:12 INFO MongoClientCache: Closing MongoClient: [localhost:27017]
17/09/12 10:16:12 INFO connection: Closed connection [connectionId{localValue:2, serverValue:2897}] to localhost:27017 because the pool has been closed.
17/09/12 10:16:18 INFO StreamExecution: Streaming query made progress: {
"id" : "ddc38876-c44d-4370-a2e0-3c96974e6f24",
"runId" : "2ae73227-b9e1-4908-97d6-21d9067994c7",
"name" : null,
"timestamp" : "2017-09-12T17:16:18.001Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 2,
"triggerExecution" : 2
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[hello]]",
"startOffset" : {
"pn_ingestor_json" : {
"0" : 826404
}
},
"endOffset" : {
"pn_ingestor_json" : {
"0" : 826404
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ForeachSink@7656801e"
}
}
... and it keeps printing "INFO StreamExecution: Streaming query made progress:" but I never see any database or collection created in Mongo.
Answer 0 (score: 1)
You can't use map like that in Structured Streaming to perform the Mongo writes; I believe you should use the foreach method instead.
There is a Scala example in the repo, SparkStructuredStreams.scala, that might help!
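For illustration, here is a minimal Java sketch of that foreach approach: a ForeachWriter that parses each JSON payload and upserts it into analytics.counters using the plain MongoDB driver. The class name MongoUpsertWriter is made up for this example, the TIMESTAMP condition from the original replaceOne is omitted, and error handling is left out, so treat it as a starting point rather than a drop-in replacement.

import org.apache.spark.sql.ForeachWriter;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import static com.mongodb.client.model.Filters.eq;

public class MongoUpsertWriter extends ForeachWriter<String> {
    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public boolean open(long partitionId, long version) {
        // Runs on the executor, once per partition/epoch.
        client = new MongoClient(new MongoClientURI("mongodb://localhost/analytics"));
        collection = client.getDatabase("analytics").getCollection("counters");
        return true;
    }

    @Override
    public void process(String kafkaPayload) {
        // Parse the JSON string produced by to_json(struct(col("*"))) and upsert by ID.
        Document doc = Document.parse(kafkaPayload);
        doc.put("_id", doc.getString("ID"));
        collection.replaceOne(eq("_id", doc.get("_id")), doc, new UpdateOptions().upsert(true));
    }

    @Override
    public void close(Throwable errorOrNull) {
        if (client != null) {
            client.close();
        }
    }
}

The writer then goes directly on the stream in place of the map/withCollectionDo combination, for example:

StreamingQuery mongoQuery = jsonDS
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .foreach(new MongoUpsertWriter())
        .option("checkpointLocation", getCheckpointPath(CheckpointPath.LOCAL_WRITE) + "/mongo")  // reuse your checkpoint helper; "/mongo" is just an example suffix
        .start();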