Completing an RDD based on the data it contains

Date: 2017-05-19 09:06:20

Tags: algorithm scala apache-spark rdd

I am using Spark 2.1 on a YARN cluster. I have an RDD containing data that I would like to complete with data from other RDDs (which correspond to different mongo databases that I get via https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, but I don't think that matters here, I just mention it for completeness).
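For reference, each database RDD is obtained roughly like this (a minimal sketch following the linked wiki page; the URI and collection name below are placeholders, not my real configuration):

import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

// Placeholder URI: one such RDD per mongo database
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/A.someCollection")

// Records come back as (key, BSONObject) pairs
val mongoRDD = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])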

My problem is that the RDD I have to use to complete the data depends on the data itself, because the data contains the database to use. Here is a simplified example of what I have to do:

/*
 * The RDD which needs information from databases
 */
val RDDtoDevelop = sc.parallelize(Array(
    Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
    Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
    Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
    .cache()

/*
 * Artificial databases for the example. Actually, mongo-hadoop is used. https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
 * This means that generating these RDDs is COSTLY, so we don't want to generate all possible RDDs but only the needed ones
 */ 
val A = sc.parallelize(Array(
    Map("id" -> "id1", "data" -> "data1"),
    Map("id" -> "id8", "data" -> "data8")
    ))

val B = sc.parallelize(Array(
    Map("id" -> "id1", "data" -> "data1bis"),
    Map("id" -> "id5", "data" -> "data5")
    ))

val C = sc.parallelize(Array(
    Map("id" -> "id2", "data" -> "data2"),
    Map("id" -> "id6", "data" -> "data6")
    ))

val generateRDDfromdbName = Map("A" -> A, "B" -> B, "C" -> C)

and the desired output is:

Map(dbName -> A, id -> id8, other data -> some other other data, new data -> data8)
Map(dbName -> A, id -> id1, other data -> some data, new data -> data1)
Map(dbName -> C, id -> id6, other data -> some other data, new data -> data6)

Since nested RDDs are not possible, I want to find the best way to make as much use of Spark's parallelism as possible. I have thought of two solutions.

The first is to create a collection containing the contents of the needed databases, then convert it into an RDD to regain RDD scalability (if the collection does not fit in driver memory, I could do it in several passes). Finally, do a join on dbName and filter the content on id.

The second is to get one RDD per needed database, key them by (dbName, id), and then do the join.

Here is the Scala code:

Solution 1

// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()

// Fill a list with key value pairs as (dbName,db content)
var dbContents = List[(String,Array[Map[String,String]])]()
dbList.foreach(dbName => dbContents = (dbName,generateRDDfromdbName(dbName).collect()) :: dbContents)

// Generate a RDD from this list to benefit to advantages of RDD
val RDDdbs = sc.parallelize(dbContents)

// Key the initial RDD by dbName and join with the contents of dbs
val joinedRDD = RDDtoDevelop.keyBy(map => map("dbName")).join(RDDdbs)

// Check for matched ids between RDD data to develop and dbContents
val result = joinedRDD.map({ case (s,(maptoDevelop,content)) => maptoDevelop + ("new data" -> content.find(mapContent => mapContent("id") == maptoDevelop("id")).get("data"))})

Solution 2

val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()

// Create the list of the database RDDs keyed by (dbName, id)
var dbRDDList = List[RDD[((String,String),Map[String,String])]]()
dbList.foreach(dbName => dbRDDList = generateRDDfromdbName(dbName).keyBy(map => (dbName,map("id"))) :: dbRDDList)

// Create a RDD containing all dbRDD
val RDDdbs = sc.union(dbRDDList)

// Join the initial RDD based on the key with the dbRDDs
val joinedRDD = RDDtoDevelop.keyBy(map => (map("dbName"), map("id"))).join(RDDdbs)

// Reformate the result
val result = joinedRDD.map({ case ((dbName,id),(maptoDevelop,dbmap)) => maptoDevelop + ("new data" -> dbmap("data"))})

Both give the desired output. To me, the second seems better, since the matching of db and id makes use of Spark's parallelism, but I am not sure. Could you help me choose the best one, or even better, give me hints towards a better solution than mine?
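As an aside, Solution 1 could probably also be written with a broadcast variable instead of re-parallelizing the collected contents; a rough sketch (under the same assumption as Solution 1 that the needed contents fit in driver memory) would be:

// Collect the needed database contents into a plain lookup map keyed by (dbName, id)
val neededDbs = RDDtoDevelop.map(map => map("dbName")).distinct().collect()

val lookup: Map[(String, String), String] = neededDbs.flatMap { dbName =>
  generateRDDfromdbName(dbName).collect().map(m => ((dbName, m("id")), m("data")))
}.toMap

// Broadcast the lookup map to the executors and complete each record locally
val lookupBc = sc.broadcast(lookup)

val resultBroadcast = RDDtoDevelop.map { m =>
  lookupBc.value.get((m("dbName"), m("id"))) match {
    case Some(data) => m + ("new data" -> data)
    case None       => m
  }
}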

Any other comment is appreciated (this is my first question on the site ;)).

Thanks in advance,

Matt

1 Answer:

Answer 0 (score: 1):

I would suggest you convert your RDDs to DataFrames; then joins, distinct and the other functions you want to apply to the data become very simple.
The DataFrame API is distributed, and in addition to it you can use SQL queries. More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, you no longer need the foreach and collect calls, which make your code slow.

An example of converting RDDtoDevelop to a DataFrame is shown below.

Source of RDDtoDevelop:

val RDDtoDevelop = sc.parallelize(Array(
    Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
    Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
    Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
    .cache()

Converting the above RDD to a DataFrame:

val developColumns=RDDtoDevelop.take(1).flatMap(map=>map.keys)
val developDF = RDDtoDevelop.map{value=>
  val list=value.values.toList
  (list(0),list(1),list(2))
}.toDF(developColumns:_*)

The resulting DataFrame looks as follows:

+------+---+---------------------+
|dbName|id |other data           |
+------+---+---------------------+
|A     |id1|some data            |
|C     |id6|some other data      |
|A     |id8|some other other data|
+------+---+---------------------+

Converting your A RDD to a DataFrame works the same way.

Source code for A:

val A = sc.parallelize(Array(
    Map("id" -> "id1", "data" -> "data1"),
    Map("id" -> "id8", "data" -> "data8")
    ))

DataFrame code for A:

val aColumns=A.take(1).flatMap(map=>map.keys)
val aDF = A.map{value =>
  val list=value.values.toList
  (list(0),list(1))
}.toDF(aColumns:_*).withColumn("name", lit("A"))

A new column name is added with the database name so that the join with developDF at the end is correct.

Output for A:

+---+-----+----+
|id |data |name|
+---+-----+----+
|id1|data1|A   |
|id8|data8|A   |
+---+-----+----+

You can convert B and C in a similar way.

Source for B:

val B = sc.parallelize(Array(
    Map("id" -> "id1", "data" -> "data1bis"),
    Map("id" -> "id5", "data" -> "data5")
    ))

DataFrame code for B:

val bColumns=B.take(1).flatMap(map=>map.keys)
val bDF = B.map{value =>
  val list=value.values.toList
  (list(0),list(1))
}.toDF(bColumns:_*).withColumn("name", lit("B"))

Output for B:

+---+--------+----+
|id |data    |name|
+---+--------+----+
|id1|data1bis|B   |
|id5|data5   |B   |
+---+--------+----+

Source for C:

val C = sc.parallelize(Array(
  Map("id" -> "id2", "data" -> "data2"),
  Map("id" -> "id6", "data" -> "data6")
))

DataFrame code for C:

val cColumns=C.take(1).flatMap(map=>map.keys)
val cDF = C.map{value =>
  val list=value.values.toList
  (list(0),list(1))
}.toDF(cColumns:_*).withColumn("name", lit("C"))

Output for C:

+---+-----+----+
|id |data |name|
+---+-----+----+
|id2|data2|C   |
|id6|data6|C   |
+---+-----+----+

After the conversion, aDF, bDF and cDF can be merged with union:

var unionDF = aDF.union(bDF).union(cDF)

which gives

+---+--------+----+
|id |data    |name|
+---+--------+----+
|id1|data1   |A   |
|id8|data8   |A   |
|id1|data1bis|B   |
|id5|data5   |B   |
|id2|data2   |C   |
|id6|data6   |C   |
+---+--------+----+

Then it is just a matter of joining developDF with unionDF, after renaming the id column of unionDF to id1, and dropping the id1 and name columns after the join:

unionDF = unionDF.withColumnRenamed("id", "id1")
unionDF = developDF.join(unionDF, developDF("id") === unionDF("id1") && developDF("dbName") === unionDF("name"), "left").drop("id1", "name")

Finally we have

+------+---+---------------------+-----+
|dbName|id |other data           |data |
+------+---+---------------------+-----+
|A     |id1|some data            |data1|
|C     |id6|some other data      |data6|
|A     |id8|some other other data|data8|
+------+---+---------------------+-----+

After that you can do your processing.

Note: the lit function works with the following import (and toDF also needs import spark.implicits._ if you are not in the spark-shell, where it is imported automatically):

import org.apache.spark.sql.functions._
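One more note: building the tuples from value.values.toList works here because small Scala Maps iterate in insertion order, but that is not guaranteed in general. A sketch of an order-independent variant (looking the keys up explicitly, using the column names from your sample data) would be:

// Look the keys up explicitly instead of relying on the ordering of values.toList
val developDFSafe = RDDtoDevelop.map { m =>
  (m("dbName"), m("id"), m("other data"))
}.toDF("dbName", "id", "other data")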