Writing to a Mongo replica set from Spark (in Scala)

Date: 2018-08-22 17:48:32

Tags: mongodb scala apache-spark

I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.

I'm facing two problems:

  • [main problem] If I define the host according to the documentation (with all instances in the mongo replica set), I can't connect to Mongo at all
  • [secondary/related problem] If I connect to the primary only, I can write... but that typically crashes the primary while writing the first collection

Environment:

  • mongo-spark-connector 1.1
  • spark 1.6
  • scala 2.10.5

First, I'll set up a dummy example to demonstrate...

import org.bson.Document 
import com.mongodb.spark.MongoSpark 
import com.mongodb.spark.config.WriteConfig

import org.apache.spark.rdd.RDD

/** 
  * fake json data
  */

val recs: List[String] = List(
  """{"a": 123, "b": 456, "c": "apple"}""",
  """{"a": 345, "b":  72, "c": "banana"}""",
  """{"a": 456, "b": 754, "c": "cat"}""",
  """{"a": 876, "b":  43, "c": "donut"}""",
  """{"a": 432, "b": 234, "c": "existential"}"""
)

val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))

Some values that won't change...

// credentials
val user = ???
val pwd  = ???

// fixed values
val db              = "db_name"
val replset         = "replset_name"
val collection_name = "collection_name"

Here's what does NOT work... In what follows, "url" stands for something like machine.unix.domain.org and "ip" stands for... well, an IP address.

This is how the documentation says to define the host, listing every machine in the replica set...

val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"

I can't get either of them to work, using every permutation of the URI I can think of...

val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}@${host}"       // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}@${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
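For reference, the pieces above combine like this; a pure-string sketch with a hypothetical `buildUri` helper and placeholder values (no driver involved):

```scala
// Hypothetical helper: assemble the replica-set connection string from its
// parts. The user/pwd/hosts/db/replset values below are placeholders.
def buildUri(user: String, pwd: String, hosts: Seq[String],
             db: String, replset: String): String =
  s"mongodb://${user}:${pwd}@${hosts.mkString(",")}/${db}?replicaSet=${replset}"

val demo = buildUri("user", "pwd", Seq("url1:27017", "url2:27017"),
                    "db_name", "replset_name")
// → "mongodb://user:pwd@url1:27017,url2:27017/db_name?replicaSet=replset_name"
```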

EDIT more detail on the error messages, which come in two forms.

Form 1

Usually includes java.net.UnknownHostException: machine.unix.domain.org

Also, the server addresses come back in url form even when they were defined as IP addresses

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting 
for a server that matches WritableServerSelector. Client view of cluster 
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}]

Form 2

(authentication error... even though connecting to the primary only with the same credentials works fine)

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting 
for a server that matches WritableServerSelector. Client view of cluster  
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,  
type=UNKNOWN, state=CONNECTING, exception= 
{com.mongodb.MongoSecurityException: Exception authenticating  
MongoCredential{mechanism=null, userName='xx', source='admin', password= 
<hidden>, mechanismProperties={}}}, caused by  
{com.mongodb.MongoCommandException: Command failed with error 18:  
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {  
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :  
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :  
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :  
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :  
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...

END EDIT

Here's what DOES work... but only on dummy data... more on that below...

val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}@${host}/${db}"

val writeConfig: WriteConfig = 
  WriteConfig(Map(
    "uri"        -> uri, 
    "database"   -> db, 
    "collection" -> collection_name, 
    "replicaSet" -> replset))

// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)

This... connecting to the primary only... works great for dummy data, but crashes the primary on real data (RDDs of 50 to 100GB with 2700 partitions). My guess is that it opens too many connections at once... it looks like it opens ~900 connections while writing (which is plausible, since the default parallelism of 2700 is based on 900 virtual cores with a 3x parallelism factor).

I'm guessing I'd have better luck if I repartitioned so that it opens fewer connections... but I'm also guessing this has to do with writing only to the primary instead of spreading the writes across all instances.
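A minimal sketch of that repartitioning idea (the cap of 100 connections is an assumed number, not something I've tested):

```scala
// Sketch (assumed numbers): cap the number of simultaneous writer tasks --
// and thus open Mongo connections -- by shrinking the partition count
// before the save.
def cappedPartitions(current: Int, maxConnections: Int): Int =
  math.min(current, maxConnections)

val target = cappedPartitions(2700, 100)  // → 100

// With a live SparkContext this would be applied before saving, e.g.:
// MongoSpark.save(rdd_hex_bson.coalesce(target), writeConfig)
```

`coalesce` avoids a full shuffle when only reducing the partition count, which is why it's sketched here instead of `repartition`.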

I've read everything I could find here... but most examples are for single-instance connections. https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration

1 Answer:

Answer 0 (score: 0)

It turns out there were two problems here. In the original question these are referred to as the "Form 1" and "Form 2" errors.

"Form 1" errors - solution

The root of the problem was a bug in the mongo-spark-connector: it turns out it can't connect to a replica set using IP addresses... it requires URIs. Since the DNS servers in our cloud didn't have these lookups, I got it working by modifying /etc/hosts on every executor and then using a connection string of this form:

val host = "URI1:27017,URI2:27017,URI3:27017"

val uri  = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"

val writeConfig: WriteConfig = 
  WriteConfig(Map(
    "uri"            -> uri, 
    "database"       -> db, 
    "collection"     -> collection_name, 
    "replicaSet"     -> replset, 
    "writeConcern.w" -> "majority"))

This first requires adding the following to /etc/hosts on every machine:

IP1 URI1
IP2 URI2
IP3 URI3

Of course, now I can't figure out how to use a bootstrap action in AWS EMR to update /etc/hosts when the cluster spins up. But that's a topic for another question. (AWS EMR bootstrap action as sudo)
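For what it's worth, a minimal sketch of what such a bootstrap script might look like (the IP/URI mappings are the placeholders from above; it writes to a local file by default, since the real write to /etc/hosts needs root, likely via sudo):

```shell
#!/bin/bash
# Sketch of an EMR bootstrap script that appends the replica-set hostname
# mappings (placeholders below) to a hosts file. The target path is a
# parameter so this can be tried without root; a real bootstrap action
# would pass /etc/hosts and run the append via sudo.
HOSTS_FILE="${1:-hosts.txt}"
{
  echo "IP1 URI1"
  echo "IP2 URI2"
  echo "IP3 URI3"
} >> "$HOSTS_FILE"
```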

"Form 2" errors - solution

Adding &authSource=${db} to the uri solved this.