I am using Spark 2.1 on a YARN cluster. I have an RDD containing data that I would like to complete with data from other RDDs (which correspond to different mongo databases that I obtain through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, but I don't think that matters, I just mention it for context). My problem is that the RDD I must use to complete the data depends on the data itself, because each record contains the database to use. Here is a simplified example of what I have to do:
/*
 * The RDD which needs information from the databases
 */
val RDDtoDevelop = sc.parallelize(Array(
  Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
  Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
  Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
  .cache()

/*
 * Artificial databases for the example. In reality, mongo-hadoop is used: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
 * This means that generating these RDDs is COSTLY, so we don't want to generate all possible RDDs, only the needed ones
 */
val A = sc.parallelize(Array(
  Map("id" -> "id1", "data" -> "data1"),
  Map("id" -> "id8", "data" -> "data8")
))

val B = sc.parallelize(Array(
  Map("id" -> "id1", "data" -> "data1bis"),
  Map("id" -> "id5", "data" -> "data5")
))

val C = sc.parallelize(Array(
  Map("id" -> "id2", "data" -> "data2"),
  Map("id" -> "id6", "data" -> "data6")
))

val generateRDDfromdbName = Map("A" -> A, "B" -> B, "C" -> C)
The desired output is:
Map(dbName -> A, id -> id8, other data -> some other other data, new data -> data8)
Map(dbName -> A, id -> id1, other data -> some data, new data -> data1)
Map(dbName -> C, id -> id6, other data -> some other data, new data -> data6)
Since nested RDDs are not possible, I would like to find the best way to make the most of Spark's parallelism.
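(Just to make the constraint concrete, the naive nested version sketched below is what Spark forbids: transformations and actions can only be invoked from the driver, so code like this fails at runtime.)

// NOT valid Spark: an RDD cannot be used inside another RDD's transformation.
// Transformations and actions can only be invoked by the driver, so this
// fails at runtime (typically with a NullPointerException on the executors).
val broken = RDDtoDevelop.map { record =>
  val db = generateRDDfromdbName(record("dbName")) // an RDD, captured on an executor
  db.filter(m => m("id") == record("id")).collect() // illegal nested action
}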
I have thought of two solutions. The first is to create a collection containing the contents of the needed databases, then convert it into an RDD to benefit from RDD scalability (if the collection does not fit into driver memory, I could do it in several passes). Finally, join and filter the contents on id.
The second is to get an RDD from every needed database, key them by dbName and id, and then do the join.
Here is the Scala code for the first solution:
// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Fill a list with key value pairs as (dbName,db content)
var dbContents = List[(String,Array[Map[String,String]])]()
dbList.foreach(dbName => dbContents = (dbName,generateRDDfromdbName(dbName).collect()) :: dbContents)
// Generate a RDD from this list to benefit to advantages of RDD
val RDDdbs = sc.parallelize(dbContents)
// Key the initial RDD by dbName and join with the contents of dbs
val joinedRDD = RDDtoDevelop.keyBy(map => map("dbName")).join(RDDdbs)
// Check for matched ids between RDD data to develop and dbContents
val result = joinedRDD.map({ case (s, (maptoDevelop, content)) => maptoDevelop + ("new data" -> content.find(mapContent => mapContent("id") == maptoDevelop("id")).get("data")) })
And here is the code for the second solution:

// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Create the list of the database RDDs keyed by (dbName, id)
var dbRDDList = List[RDD[((String,String),Map[String,String])]]()
dbList.foreach(dbName => dbRDDList = generateRDDfromdbName(dbName).keyBy(map => (dbName,map("id"))) :: dbRDDList)
// Create a RDD containing all dbRDD
val RDDdbs = sc.union(dbRDDList)
// Join the initial RDD based on the key with the dbRDDs
val joinedRDD = RDDtoDevelop.keyBy(map => (map("dbName"), map("id"))).join(RDDdbs)
// Reformate the result
val result = joinedRDD.map({ case ((dbName,id),(maptoDevelop,dbmap)) => maptoDevelop + ("new data" -> dbmap("data"))})
Both of them give the desired output. To my mind, the second one seems better since the matching of db and id uses Spark's parallelism, but I am not sure of that. Could you please help me choose the best one, or, even better, give me clues towards a better solution than mine? Any other comment is appreciated (this is my first question on the site ;) ).
Thanks in advance,
Matt
Answer 0 (score: 1)
I would suggest you convert your RDDs to dataframes; then the joins, distinct and other functions you want to apply to the data become very easy. The dataframe apis are distributed, and in addition to them, sql queries can be used (see the sketch at the end of this answer). More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, you don't need the collect and foreach calls, which make your code run slow.

As a reminder, your source RDDtoDevelop is:
val RDDtoDevelop = sc.parallelize(Array(
  Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
  Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
  Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
  .cache()
Converting RDDtoDevelop to a dataFrame is done as below. Note that this assumes every Map has the same keys in the same order, so that the values line up with the column names taken from the first record:

// Derive the column names from the keys of the first record
val developColumns = RDDtoDevelop.take(1).flatMap(map => map.keys)
val developDF = RDDtoDevelop.map { value =>
  val list = value.values.toList
  (list(0), list(1), list(2))
}.toDF(developColumns: _*)
The resulting dataFrame looks as below:
+------+---+---------------------+
|dbName|id |other data |
+------+---+---------------------+
|A |id1|some data |
|C |id6|some other data |
|A |id8|some other other data|
+------+---+---------------------+
Converting the A rdd to a dataframe works the same way.

Source code for A:

val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
DataFrame code for A:

val aColumns = A.take(1).flatMap(map => map.keys)
val aDF = A.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(aColumns: _*).withColumn("name", lit("A"))
A new column name containing the database name is added, so that the join with developDF at the end produces the correct DataFrame.

Output for A:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id1|data1|A |
|id8|data8|A |
+---+-----+----+
You can convert B and C in a similar way.

Source code for B:

val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
DataFrame code for B:

val bColumns = B.take(1).flatMap(map => map.keys)
val bDF = B.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(bColumns: _*).withColumn("name", lit("B"))
Output for B:
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1bis|B |
|id5|data5 |B |
+---+--------+----+
Source code for C:

val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
DataFrame code for C:

val cColumns = C.take(1).flatMap(map => map.keys)
val cDF = C.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(cColumns: _*).withColumn("name", lit("C"))
Output for C:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id2|data2|C |
|id6|data6|C |
+---+-----+----+
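Incidentally, the three conversions differ only in the database name, so the repetition can be factored into a small helper. This is just a sketch under the same assumptions as above (uniform two-key Maps with values in column order); the helper name toNamedDF is mine, and outside the spark-shell you would also need import spark.implicits._ for toDF:

// import spark.implicits._  // needed for toDF outside the spark-shell
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: converts an RDD of uniform two-key Maps into a
// DataFrame and tags every row with the database it came from.
def toNamedDF(rdd: RDD[Map[String, String]], dbName: String): DataFrame = {
  val columns = rdd.take(1).flatMap(map => map.keys)
  rdd.map { value =>
    val list = value.values.toList
    (list(0), list(1))
  }.toDF(columns: _*).withColumn("name", lit(dbName))
}

val aDF = toNamedDF(A, "A")
val bDF = toNamedDF(B, "B")
val cDF = toNamedDF(C, "C")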
After the conversion, A, B and C can be merged using union:

var unionDF = aDF.union(bDF).union(cDF)

which gives:
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1 |A |
|id8|data8 |A |
|id1|data1bis|B |
|id5|data5 |B |
|id2|data2 |C |
|id6|data6 |C |
+---+--------+----+
Then it is just a matter of joining developDF with unionDF, after renaming the id column of unionDF to avoid ambiguity, and dropping the helper columns after the join:
unionDF = unionDF.withColumnRenamed("id", "id1")
unionDF = developDF.join(unionDF, developDF("id") === unionDF("id1") && developDF("dbName") === unionDF("name"), "left").drop("id1", "name")
Finally we have:

+------+---+---------------------+-----+
|dbName|id |other data |data |
+------+---+---------------------+-----+
|A |id1|some data |data1|
|C |id6|some other data |data6|
|A |id8|some other other data|data8|
+------+---+---------------------+-----+
After that you can do whatever you need with the result. Note: the lit function used above requires the following import:

import org.apache.spark.sql.functions._
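As mentioned at the start, once everything is a DataFrame you can also express the final join as a sql query instead of using the dataframe api. A minimal sketch, assuming a SparkSession named spark (available by default in the Spark 2.1 shell) and the DataFrames built above, registered before the rename/drop step:

// Register the DataFrames as temporary views so they can be queried with SQL.
developDF.createOrReplaceTempView("develop")
aDF.union(bDF).union(cDF).createOrReplaceTempView("dbs")

// The same left join, written as a query; the `other data` column name
// contains a space, so it must be backtick-quoted in SQL.
val resultDF = spark.sql(
  """SELECT d.dbName, d.id, d.`other data`, u.data
    |FROM develop d
    |LEFT JOIN dbs u ON d.id = u.id AND d.dbName = u.name""".stripMargin)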