I want to build two Datasets from two different Mongo databases. I am currently using the official MongoSpark connector. The SparkSession is started as follows:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test")
.set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
.set("spark.mongodb.input.uri", "mongodb://192.168.77.62/db1.coll1")
.set("spark.sql.crossJoin.enabled", "true");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
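With this configuration, a read without any overrides goes to the default input URI, i.e. db1.coll1. A minimal sketch of such a baseline read, assuming the MongoSpark.load(SparkSession, ReadConfig, Class) overload used further down and the Position bean class from this question:
ReadConfig defaultReadConfig = ReadConfig.create(sparkSession); // built from the spark.mongodb.input.* properties above
Dataset<Position> baseline = MongoSpark.load(sparkSession, defaultReadConfig, Position.class); // reads db1.coll1
baseline.show();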
If I want to change spark.mongodb.input.uri, how do I do that? I have already tried changing the sparkSession's runtimeConfig and using a ReadConfig with readOverrides, but neither works.
Approach 1:
sparkSession.conf().set("spark.mongodb.input.uri", "mongodb://192.168.77.63/db1.coll2");
Approach 2:
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("uri", "192.168.77.63/db1.coll2");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Position> ds = MongoSpark.load(sparkSession, readConfig, Position.class);
EDIT 1: As Karol suggested, I tried the following:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
Map<String, String> readOverrides1 = new HashMap<String, String>();
readOverrides1.put("uri", "mongodb://192.168.77.62:27017");
readOverrides1.put("database", "db1");
readOverrides1.put("collection", "coll1");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides1);
This fails at runtime with:
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
EDIT 2:
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder().appName("test")
            .config("spark.worker.cleanup.enabled", "true").config("spark.scheduler.mode", "FAIR").getOrCreate();
    String mongoURI2 = "mongodb://192.168.77.63:27017/db1.coll1";
    Map<String, String> readOverrides1 = new HashMap<String, String>();
    readOverrides1.put("uri", mongoURI2);
    ReadConfig readConfig1 = ReadConfig.create(sparkSession).withOptions(readOverrides1);
    MongoSpark.load(sparkSession, readConfig1, Position.class).show();
}
This still fails with the same exception as in the previous edit.
Answer 0 (score: 2)
build.sbt:
libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.0.0"
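The example below also assumes the matching Spark SQL dependency is on the classpath; something along these lines (the exact version is an assumption, typically provided by the cluster at runtime):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"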
package com.example.app
import com.mongodb.spark.config.{ReadConfig, WriteConfig}
import com.mongodb.spark.sql._
import org.apache.spark.sql.SparkSession
object App {

  def main(args: Array[String]): Unit = {

    val MongoUri1 = args(0).toString
    val MongoUri2 = args(1).toString
    val SparkMasterUri = args(2).toString

    // Build a full connection string of the form mongodb://host:port/database.collection
    def makeMongoURI(uri: String, database: String, collection: String) = s"${uri}/${database}.${collection}"

    val mongoURI1 = s"mongodb://${MongoUri1}:27017"
    val mongoURI2 = s"mongodb://${MongoUri2}:27017"

    val CONFdb1 = makeMongoURI(s"${mongoURI1}", "MyCollection1", "df")
    val CONFdb2 = makeMongoURI(s"${mongoURI2}", "MyCollection2", "df")

    // One read/write config per Mongo instance; the uri already contains
    // the database and collection, so no extra properties are needed.
    val WRITEdb1: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri" -> CONFdb1))
    val READdb1: ReadConfig = ReadConfig(Map("uri" -> CONFdb1))

    val WRITEdb2: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri" -> CONFdb2))
    val READdb2: ReadConfig = ReadConfig(Map("uri" -> CONFdb2))

    val spark = SparkSession
      .builder
      .appName("AppMongo")
      .config("spark.worker.cleanup.enabled", "true")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()

    // Read from each Mongo instance with its own config, then write back.
    val df1 = spark.read.mongo(READdb1)
    val df2 = spark.read.mongo(READdb2)

    df1.write.mode("overwrite").mongo(WRITEdb1)
    df2.write.mode("overwrite").mongo(WRITEdb2)
  }
}
You can now pass uri1 and uri2 as arguments to /usr/local/spark/bin/spark-submit pathToMyjar.app.jar MongoUri1 MongoUri2 sparkMasterUri, and then build one config per uri and read each of them with spark.read.mongo(READdb).
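For example, the submit command might look like this (the jar path and master URL are placeholders, not values taken from the question; the hosts are the ones used above):
/usr/local/spark/bin/spark-submit \
  --class com.example.app.App \
  --master spark://spark-master:7077 \
  /path/to/app.jar \
  192.168.77.62 192.168.77.63 spark://spark-master:7077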
Answer 1 (score: 0)
Setting the uri only in the ReadConfig does not help on its own. The Spark-Mongo connector already uses this information when ReadConfig.create() is called, so try setting it on the SparkContext before creating the ReadConfig.
Like this:
SparkContext context = spark.sparkContext();
context.conf().set("spark.mongodb.input.uri","mongodb://host:ip/database.collection");
JavaSparkContext jsc = new JavaSparkContext(context);
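Putting this together with the overrides from the question, a minimal (untested) Java sketch might look like the following. It reuses the hosts, database, and collection names from the question and the MongoSpark.load(SparkSession, ReadConfig, Class) overload shown above; here the default input uri is set when the session is built, which has the same effect as setting it on the SparkContext conf before the ReadConfig is created:
// Default source: first Mongo instance, set before any ReadConfig is created.
SparkSession spark = SparkSession.builder().appName("test")
        .config("spark.mongodb.input.uri", "mongodb://192.168.77.62/db1.coll1")
        .getOrCreate();

// First dataset comes from the default input uri.
Dataset<Position> ds1 = MongoSpark.load(spark, ReadConfig.create(spark), Position.class);

// Second dataset: override the uri to point at the other Mongo instance.
Map<String, String> overrides = new HashMap<String, String>();
overrides.put("uri", "mongodb://192.168.77.63/db1.coll2");
ReadConfig readConfig2 = ReadConfig.create(spark).withOptions(overrides);
Dataset<Position> ds2 = MongoSpark.load(spark, readConfig2, Position.class);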