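I need to join two ordinary RDDs on one or more columns. Logically, this operation is equivalent to a database join of two tables. I would like to know whether this is only possible through Spark SQL, or whether there are other ways to do it.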
As a concrete example, consider RDD r1 with primary key ITEM_ID:
(ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)
and RDD r2 with primary key COMPANY_ID:
(COMPANY_ID, COMPANY_NAME, COMPANY_CITY)
I want to join r1 and r2.
How can this be done?
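Answer 0 (score: 25)
Soumya Simanta gave a good answer. However, the values in the joined RDD are Iterable, so the result may not look much like an ordinary table join.
Alternatively, you can key each RDD by the join column and join the resulting pair RDDs directly: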
val mappedItems = items.map(item => (item.companyId, item))
val mappedComp = companies.map(comp => (comp.companyId, comp))
mappedItems.join(mappedComp).take(10).foreach(println)
The output is:
(c1,(Item(1,first,2,c1),Company(c1,company-1,city-1)))
(c1,(Item(2,second,2,c1),Company(c1,company-1,city-1)))
(c2,(Item(3,third,2,c2),Company(c2,company-2,city-2)))
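If rows shaped like the table in the question are wanted, the nested pairs above can be flattened with one more map. The following is only a sketch, assuming Spark core is on the classpath and a local master; the object name FlatJoinSketch and the helper flatJoin are made up for illustration and are not part of the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3

// Hypothetical sketch: names below are illustrative, not from the original answer.
object FlatJoinSketch {
  case class Item(id: String, name: String, unit: Int, companyId: String)
  case class Company(companyId: String, name: String, city: String)

  // Returns one flat row per matching (item, company) pair, like a SQL inner join.
  def flatJoin(sc: SparkContext): Array[(String, String, Int, String, String, String)] = {
    val items = sc.parallelize(Seq(
      Item("1", "first", 2, "c1"),
      Item("2", "second", 2, "c1"),
      Item("3", "third", 2, "c2")))
    val companies = sc.parallelize(Seq(
      Company("c1", "company-1", "city-1"),
      Company("c2", "company-2", "city-2")))

    items.map(i => (i.companyId, i))               // key items by the join column
      .join(companies.map(c => (c.companyId, c)))  // inner join on companyId
      .map { case (_, (i, c)) => (i.id, i.name, i.unit, i.companyId, c.name, c.city) }
      .collect()
      .sortBy(_._1)                                // sort locally for stable printing
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flat-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try flatJoin(sc).foreach(println)
    finally sc.stop()
  }
}
```

For a join on several columns, the same pattern should work with a tuple as the key on both sides, for example (item.companyId, item.region), where region is a hypothetical second column.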
Answer 1 (score: 9)
Something like this should work: group both RDDs by the join column, then join the grouped RDDs.
scala> case class Item(id:String, name:String, unit:Int, companyId:String)
scala> case class Company(companyId:String, name:String, city:String)
scala> val i1 = Item("1", "first", 2, "c1")
scala> val i2 = i1.copy(id="2", name="second")
scala> val i3 = i1.copy(id="3", name="third", companyId="c2")
scala> val items = sc.parallelize(List(i1,i2,i3))
items: org.apache.spark.rdd.RDD[Item] = ParallelCollectionRDD[14] at parallelize at <console>:20
scala> val c1 = Company("c1", "company-1", "city-1")
scala> val c2 = Company("c2", "company-2", "city-2")
scala> val companies = sc.parallelize(List(c1,c2))
scala> val groupedItems = items.groupBy( x => x.companyId)
groupedItems: org.apache.spark.rdd.RDD[(String, Iterable[Item])] = ShuffledRDD[16] at groupBy at <console>:22
scala> val groupedComp = companies.groupBy(x => x.companyId)
groupedComp: org.apache.spark.rdd.RDD[(String, Iterable[Company])] = ShuffledRDD[18] at groupBy at <console>:20
scala> groupedItems.join(groupedComp).take(10).foreach(println)
14/12/12 00:52:32 INFO DAGScheduler: Job 5 finished: take at <console>:35, took 0.021870 s
(c1,(CompactBuffer(Item(1,first,2,c1), Item(2,second,2,c1)),CompactBuffer(Company(c1,company-1,city-1))))
(c2,(CompactBuffer(Item(3,third,2,c2)),CompactBuffer(Company(c2,company-2,city-2))))
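Each key in the output above carries two collections rather than one row per match. If flat pairs are preferred, the grouped result can be expanded with flatMap. This is a rough sketch under the same data as the session above, assuming Spark core on the classpath; ExpandGroupedSketch and expand are illustrative names only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3

// Hypothetical sketch: names below are illustrative, not from the original answer.
object ExpandGroupedSketch {
  case class Item(id: String, name: String, unit: Int, companyId: String)
  case class Company(companyId: String, name: String, city: String)

  // Expands (key, (Iterable[Item], Iterable[Company])) into one (itemId, companyName) per match.
  def expand(sc: SparkContext): Array[(String, String)] = {
    val items = sc.parallelize(Seq(
      Item("1", "first", 2, "c1"),
      Item("2", "second", 2, "c1"),
      Item("3", "third", 2, "c2")))
    val companies = sc.parallelize(Seq(
      Company("c1", "company-1", "city-1"),
      Company("c2", "company-2", "city-2")))

    val grouped = items.groupBy(_.companyId).join(companies.groupBy(_.companyId))

    grouped.flatMap { case (_, (is, cs)) =>
      // Cross product of the two groups that share the same key.
      for (i <- is; c <- cs) yield (i.id, c.name)
    }.collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("expand-grouped-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try expand(sc).foreach(println)
    finally sc.stop()
  }
}
```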
Answer 2 (score: 2)
Spark SQL can perform joins on Spark RDDs.
The code below performs a SQL join on the companies and items RDDs:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Note: this uses the early Spark 1.x SchemaRDD API (createSchemaRDD, registerAsTable).
object SparkSQLJoin {
  case class Item(id: String, name: String, unit: Int, companyId: String)
  case class Company(companyId: String, name: String, city: String)

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    // Implicitly converts an RDD of case classes into a SchemaRDD.
    import sqlContext.createSchemaRDD

    val i1 = Item("1", "first", 1, "c1")
    val i2 = Item("2", "second", 2, "c2")
    val i3 = Item("3", "third", 3, "c3") // no company "c3", so the inner join drops this item
    val c1 = Company("c1", "company-1", "city-1")
    val c2 = Company("c2", "company-2", "city-2")

    // Register both RDDs as tables so they can be referenced from SQL.
    val companies = sc.parallelize(List(c1, c2))
    companies.registerAsTable("companies")
    val items = sc.parallelize(List(i1, i2, i3))
    items.registerAsTable("items")

    val result = sqlContext.sql("SELECT * FROM companies C JOIN items I ON C.companyId = I.companyId").collect
    result.foreach(println)
  }
}
The output is:
[c1,company-1,city-1,1,first,1,c1]
[c2,company-2,city-2,2,second,2,c2]
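Only two rows come back because item 3 refers to company c3, which is not in the companies RDD, and an inner join drops it. If unmatched items should be kept, a left outer join on the pair RDDs is one option that does not need Spark SQL. The sketch below assumes Spark core on the classpath; LeftOuterJoinSketch and itemsWithCompany are illustrative names, not part of the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3

// Hypothetical sketch: names below are illustrative, not from the original answer.
object LeftOuterJoinSketch {
  case class Item(id: String, name: String, unit: Int, companyId: String)
  case class Company(companyId: String, name: String, city: String)

  // Keeps every item; the company name is None when no company matches.
  def itemsWithCompany(sc: SparkContext): Array[(String, Option[String])] = {
    val items = sc.parallelize(Seq(
      Item("1", "first", 1, "c1"),
      Item("2", "second", 2, "c2"),
      Item("3", "third", 3, "c3")))
    val companies = sc.parallelize(Seq(
      Company("c1", "company-1", "city-1"),
      Company("c2", "company-2", "city-2")))

    items.map(i => (i.companyId, i))
      .leftOuterJoin(companies.map(c => (c.companyId, c))) // right side becomes Option[Company]
      .map { case (_, (i, maybeCompany)) => (i.id, maybeCompany.map(_.name)) }
      .collect()
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try itemsWithCompany(sc).foreach(println)
    finally sc.stop()
  }
}
```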