I'm trying to write to MongoDB from a Spark RDD using mongo-spark-connector, and I'm facing two problems.
First, some setup: I'll build a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
// sc is the SparkContext; spread the JSON strings over 5 partitions
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
// parse each JSON string into a BSON Document
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work... In these examples, "url" looks like machine.unix.domain.org, and "ip" looks like... well, an IP address.
This is how the docs say to define the host... with every machine in the replica set:
val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"
I can't get either of these to work, using every permutation I can think of...
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}@${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}@${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
EDIT: more detail on the error messages. The errors come in two forms.
Form 1
These usually include java.net.UnknownHostException: machine.unix.domain.org. Note also that the server addresses come back in url form even when they were defined as IP addresses:
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
Form 2
(an authentication error... even though connecting to just the primary with the same credentials works fine):
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
END EDIT
Here's what DOES work... but only on the dummy data... more on that below...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}@${host}/${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri" -> uri,
"database" -> db,
"collection" -> collection_name,
"replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
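As a sanity check on the dummy data, reading the collection back should return the five documents. A quick sketch (assuming a ReadConfig built from the same options as the WriteConfig above):
import com.mongodb.spark.config.ReadConfig

val readConfig: ReadConfig =
  ReadConfig(Map(
    "uri" -> uri,
    "database" -> db,
    "collection" -> collection_name))
// expect count == 5
MongoSpark.load(sc, readConfig).count()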
This approach, connecting only to the primary, works great on the dummy data, but crashes on the real data (an RDD of 50-100GB with 2700 partitions). My guess is that it opens too many connections at once... it looks like it opens ~900 connections for the write (which is plausible, since the default parallelism of 2700 is based on 900 vcores with a parallelism factor of 3x).
I'm guessing I'd have better luck if I repartitioned so that fewer connections get opened, as sketched below... but I also suspect this has to do with writing only to the primary instead of spreading the writes across all instances.
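Something along these lines is what I have in mind for the repartitioning, though I haven't verified it at scale (the target of 100 partitions is just a guess to tune):
// coalesce to fewer partitions before saving, so the write opens fewer
// simultaneous connections against the primary
val rdd_coalesced: RDD[Document] = rdd_hex_bson.coalesce(100)
MongoSpark.save(rdd_coalesced, writeConfig)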
I've read everything I can find here... but most of the examples are for single-instance connections: https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration
Answer 0 (score: 0)
It turns out there were two problems here, referred to below as the "Form 1" and "Form 2" errors from the original question.
"Form 1" errors - solution
The root of the problem is a bug in mongo-spark-connector: it turns out that it can't connect to a replica set using IP addresses... it requires hostnames. Since the DNS servers in our cloud had no entries for these hosts, I got it working by modifying /etc/hosts on every executor and using a connection string of the following form:
val host = "URI1:27017,URI2:27017,URI3:27017"
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig =
  WriteConfig(Map(
    "uri" -> uri,
    "database" -> db,
    "collection" -> collection_name,
    "replicaSet" -> replset,
    "writeConcern.w" -> "majority"))
This first requires adding the following to /etc/hosts on every machine:
IP1 URI1
IP2 URI2
IP3 URI3
Of course, now I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts when the cluster spins up. But that's another question. (AWS EMR bootstrap action as sudo)
"Form 2" errors - solution
Adding &authSource=${db} to the uri solved this problem.
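For reference, the difference comes down to the authentication database. Per MongoDB's connection string rules, when no authSource is given the driver authenticates against the database in the path, or against admin if the path has none (hence the source='admin' in the "Form 2" error). A minimal illustration (uri_bad and uri_good are just names for this sketch):
// no database in the path and no authSource: authentication is attempted
// against "admin", where this user is not defined
val uri_bad = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"

// authSource explicitly targets the database where the user is defined
val uri_good = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"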