Here are two RDDs.

Table 1, (key, value) pairs:
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]

Table 2, arrays:
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]

I am trying to replace the elements of table2 (for example, "1") with their corresponding values from table1 (for example, "a"), using the keys and values of table1. The result I expect is as follows:

Array(Array("a", "b", "d"), Array("a", "c", "e"))

Is there a way to make this possible?

If so, will it be efficient with a huge dataset?

Answer 0 (score: 2)
I think we can do better with DataFrames while avoiding a join altogether, since a join is likely to shuffle the data.
import org.apache.spark.sql.functions.{udf, explode, collect_list}
import spark.implicits._

// Collect table1 to the driver as a plain Map: "1" -> "a", "2" -> "b", "3" -> "c"
val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()
// Broadcasting so that the mapping is available to all nodes
val broadcastedMapping = spark.sparkContext.broadcast(table1)

val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))

// Replace an element with its mapped value if one exists, otherwise keep it unchanged
def changeMapping(value: String): String = {
  broadcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_: String))

// Explode each array into one row per element, map every element through the UDF,
// then collect the mapped elements back into arrays (an RDD[Array[String]])
table2.toDF.withColumn("exploded", explode($"value"))
  .withColumn("new", changeMappingUDF($"exploded"))
  .groupBy("value")
  .agg(collect_list("new").as("mappedCol"))
  .select("mappedCol").rdd.map(r => r.getSeq[String](0).toArray)
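
Since the mapping is broadcast anyway, the same substitution can also be done without leaving RDDs at all. A minimal sketch (my addition, not part of the answer above); unlike collect_list, it is guaranteed to preserve the order of elements within each array:

// Map every element of every array through the broadcast lookup
val result = table2.map(_.map(v => broadcastedMapping.value.getOrElse(v, v)))
result.collect().foreach(arr => println(arr.mkString("Array(", ", ", ")")))
// Array(a, b, d)
// Array(a, c, e)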
Let me know if this meets your requirement; otherwise I can modify it accordingly.

Answer 1 (score: 1)
"Is there a way to make this possible?"

Yes. Use Datasets (not RDDs, which are less optimized and less expressive), join them together, and select the fields the way you like.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
+---+-----+
val table2 = sc.parallelize(
    Array(Array("1", "2", "d"), Array("1", "3", "e"))).
  toDF("a").
  select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
|  1|  2|  d|
|  1|  3|  e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
|  a|  2|  d|
|  a|  3|  e|
+---+---+---+
Repeat the join for the other a columns and union the results together. As you repeat the code, you will notice the pattern that would let you make it generic, as in the sketch below.
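
One way to make that concrete, as a sketch under my own assumptions rather than the answer's exact code (it uses one left join per column instead of a union, so values with no match in table1, such as "d", pass through unchanged):

import org.apache.spark.sql.functions.coalesce

// Fold over the columns, joining table1 once per column and
// substituting the mapped value wherever a key matches
val mapped = Seq("a0", "a1", "a2").foldLeft(table2) { (df, c) =>
  df.join(table1, df(c) === table1("key"), "left")
    .withColumn(c, coalesce($"value", df(c)))
    .drop("key", "value")
}
// mapped.show should yield the rows (a, b, d) and (a, c, e)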
"If so, will it be efficient with a huge dataset?"

Yes (again). We are talking about Spark here, and a huge dataset is exactly why you chose Spark, isn't it?

Answer 2 (score: 1)
You can do it in Datasets:

package dataframe

import org.apache.spark.sql.SparkSession
/**
* @author vaquar.khan@gmail.com
*/
object Test {

  case class table1Class(key: String, value: String)
  case class table2Class(key: String, value: String, value1: String)

  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()
    import spark.implicits._

    val table1 = Seq(
      table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
    val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
    df1.show()

    val table2 = Seq(
      table2Class("1", "2", "d"), table2Class("1", "3", "e"))
    val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
    df2.show()

    df1.createOrReplaceTempView("A")
    df2.createOrReplaceTempView("B")

    spark.sql("select d1.key, d1.value, d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()

    //TODO
    /* need to fix query
    spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
      " select A.value,B.value,B.value1 "+
      " from B "+
      " left join A "+
      " on B.key = A.key ) B2 "+
      " left join A " +
      " on B2.value = A.key" ).show()
    */
  }
}
Result:
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
|  1|    2|     d|
|  1|    3|     e|
+---+-----+------+
+---+-----+------+
|key|value|value1|
+---+-----+------+
|  1|    a|     d|
|  1|    a|     e|
+---+-----+------+
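
As for the commented-out TODO query, one possible fix (my guess at the intent, so a sketch rather than the author's final query: join A once per column of B that needs mapping) would be:

spark.sql(
  """select a1.value as v0, a2.value as v1, b.value1
    |from B b
    |left join A a1 on b.key = a1.key
    |left join A a2 on b.value = a2.key""".stripMargin).show()
// expected rows: (a, b, d) and (a, c, e)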