Unable to write data to Mongo from Spark 2.2.0 Structured Streaming?

Asked: 2017-09-12 07:52:34

Tags: mongodb apache-spark

I have the following code, but I am unable to write any data to Mongo with it. I don't even see the database or collection being created in MongoDB. Something seems to be off, yet no exception is thrown when the code runs.

    private SparkSession sparkSession;

    SparkConf sparkConf = new SparkConf();
    sparkConf.setMaster(Configuration.getConfig().getString("spark.master"));
    sparkConf.set("spark.mongodb.input.uri", "mongodb://localhost/analytics.counters");
    sparkConf.set("spark.mongodb.output.uri", "mongodb://localhost/analytics.counters");
    SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
    sparkSession.sparkContext().setLogLevel("INFO");
    this.sparkSession = sparkSession;

    MongoConnector mongoConnector = MongoConnector.apply(sparkSession.sparkContext());
    WriteConfig writeConfig  = getMongoWriteConfig(sparkSession, "hello");
    ReadConfig readConfig = getMongoReadConfig(sparkSession, "hello");

    // newDS is the streaming Dataset read from Kafka (definition not shown)
    Dataset<String> jsonDS = newDS.select(to_json(struct(col("*")))).as(Encoders.STRING());

    // Upserts each JSON payload into Mongo from inside a map transformation
    Dataset<Boolean> dataset = jsonDS
            .map(new MapFunction<String, Boolean>() {
                @Override
                public Boolean call(String kafkaPayload) throws Exception {
                    System.out.println(kafkaPayload);
                    Document jsonDocument = Document.parse(kafkaPayload);
                    String id = jsonDocument.getString("ID");
                    jsonDocument.put("_id", id);
                    return mongoConnector.withCollectionDo(writeConfig, Document.class, new Function<MongoCollection<Document>, Boolean>() {
                        @Override
                        public Boolean call(MongoCollection<Document> collection) throws Exception {
                            return collection.replaceOne(and(eq("_id", id), lt("TIMESTAMP", jsonDocument.getString("TIMESTAMP"))),
                                    jsonDocument, new UpdateOptions().upsert(true)).wasAcknowledged();
                        }
                    });
                }
            }, Encoders.BOOLEAN());

    StreamingQuery query1 = dataset
                                .writeStream()
                                .trigger(Trigger.ProcessingTime(1000))
                                .foreach(new KafkaSink("metrics"))
                                .option("checkpointLocation", getCheckpointPath(CheckpointPath.LOCAL_WRITE) + "/metrics")
                                .start();
    query1.awaitTermination();

private static ReadConfig getMongoReadConfig(SparkSession sparkSession, String collectionName){
    ReadConfig readConfig = ReadConfig.create(sparkSession);
    Map<String, String> readOverrides = new HashMap<String, String>();
    readOverrides.put("readConcern.level", "majority");
    // withOptions returns a new ReadConfig; return it rather than discarding it
    return readConfig.withOptions(readOverrides);
}

private static WriteConfig getMongoWriteConfig(SparkSession sparkSession, String collectionName) {
    WriteConfig writeConfig = WriteConfig.create(sparkSession);
    Map<String, String> writeOverrides = new HashMap<String, String>();
    writeOverrides.put("writeConcern.w", "majority");
    // withOptions returns a new WriteConfig; return it rather than discarding it
    return writeConfig.withOptions(writeOverrides);
}

I run it with spark-submit and pass in the following arguments:

spark-submit --master local[*] \
  --driver-memory 4g \
  --executor-memory 2g \
  --class com.hello.stream.app.Hello \
  --conf "spark.mongodb.input.uri=mongodb://localhost/analytics.counters" \
  --conf "spark.mongodb.output.uri=mongodb://localhost/analytics.counters" \
  build/libs/hello-stream.jar 

Here is the list of jars (dependencies) I use:

def sparkVersion = '2.2.0'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.kafka', name: 'kafka-clients', version: '0.10.0.1'
compile group: 'org.mongodb.spark', name: 'mongo-spark-connector_2.11', version: sparkVersion
compile 'org.mongodb:mongodb-driver:3.0.4'

When I run the job I get the following output (a shortened version of my INFO logs):

17/09/12 10:16:12 INFO MongoClientCache: Closing MongoClient: [localhost:27017]
17/09/12 10:16:12 INFO connection: Closed connection [connectionId{localValue:2, serverValue:2897}] to localhost:27017 because the pool has been closed.
17/09/12 10:16:18 INFO StreamExecution: Streaming query made progress: {
  "id" : "ddc38876-c44d-4370-a2e0-3c96974e6f24",
  "runId" : "2ae73227-b9e1-4908-97d6-21d9067994c7",
  "name" : null,
  "timestamp" : "2017-09-12T17:16:18.001Z",
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 2,
    "triggerExecution" : 2
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[hello]]",
    "startOffset" : {
      "pn_ingestor_json" : {
        "0" : 826404
      }
    },
    "endOffset" : {
      "pn_ingestor_json" : {
        "0" : 826404
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ForeachSink@7656801e"
  }
}

...and it keeps printing INFO StreamExecution: Streaming query made progress, but I never see any database or collection being created in Mongo.

1 Answer:

Answer 0 (score: 1)

You cannot use map that way with Structured Streaming. I believe you should use the foreach method instead.

There is a Scala example in the repo - SparkStructuredStreams.scala - that may help!
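
For reference, here is a minimal Java sketch of that foreach approach using Spark's ForeachWriter. The class name MongoUpsertWriter, the hard-coded URI/database/collection, and the simple _id filter are illustrative assumptions rather than code from the post (the original TIMESTAMP guard could be added back into the filter):

    // Illustrative sketch: upsert each JSON payload into Mongo from a ForeachWriter.
    // Names and connection details are assumptions; adapt them to your setup.
    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientURI;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.UpdateOptions;
    import org.apache.spark.sql.ForeachWriter;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class MongoUpsertWriter extends ForeachWriter<String> {

        private transient MongoClient client;
        private transient MongoCollection<Document> collection;

        @Override
        public boolean open(long partitionId, long version) {
            // One client per partition/epoch, opened on the executor rather than the driver
            client = new MongoClient(new MongoClientURI("mongodb://localhost"));
            collection = client.getDatabase("analytics").getCollection("counters");
            return true;
        }

        @Override
        public void process(String kafkaPayload) {
            Document doc = Document.parse(kafkaPayload);
            doc.put("_id", doc.getString("ID"));
            // Replace the existing document with the same _id, inserting it if absent
            collection.replaceOne(eq("_id", doc.getString("ID")), doc,
                    new UpdateOptions().upsert(true));
        }

        @Override
        public void close(Throwable errorOrNull) {
            if (client != null) {
                client.close();
            }
        }
    }

The writer then replaces the map step entirely, e.g. jsonDS.writeStream().foreach(new MongoUpsertWriter()).trigger(Trigger.ProcessingTime(1000)).option("checkpointLocation", ...).start(), so the Mongo write happens inside the sink instead of inside a lazy transformation.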