How do I join two datasets in Scala?

Asked: 2018-10-04 09:02:16

Tags: scala apache-spark join

I have two datasets:

itemname       itemId       coupons
A               1            true
A               2            false


itemname      purchases
B               10
A               10
C               10

I need to get:

itemname   itemId   coupons  purchases
A             1       true      10
A             2       false     10

I am doing:

val mm = items.join(purchases, items("itemname") === purchases("itemname")).drop(items("itemname"))

Is this the correct way to do this in Spark Scala?

1 Answer:

Answer 0 (score: 1)

This code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}

val itemsSchema = List(
  StructField("itemname", StringType, nullable = false),
  StructField("itemid", IntegerType, nullable = false),
  StructField("coupons", BooleanType, nullable = false))

val purchasesSchema = List(
  StructField("itemname", StringType, nullable = false),
  StructField("purchases", IntegerType, nullable = false))


val items = Seq(Row("A", 1, true), Row("A", 2, false))
val purchases = Seq(Row("A", 10), Row("B", 10), Row("C", 10))

val itemsDF = spark.createDataFrame(
  spark.sparkContext.parallelize(items),
  StructType(itemsSchema)
)

val purchasesDF = spark.createDataFrame(
  spark.sparkContext.parallelize(purchases),
  StructType(purchasesSchema)
)

purchasesDF.join(itemsDF, Seq("itemname")).show(false)

gives:

+--------+---------+------+-------+
|itemname|purchases|itemid|coupons|
+--------+---------+------+-------+
|A       |10       |1     |true   |
|A       |10       |2     |false  |
+--------+---------+------+-------+
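To address the original question directly: the join-then-drop approach also works, but joining on a `Seq` of column names is usually cleaner, because Spark then keeps only a single `itemname` column automatically. A minimal sketch, assuming the `itemsDF` and `purchasesDF` defined above (the variable names `viaDrop` and `viaSeq` are illustrative):

```scala
// Equivalent to the OP's approach: an explicit join condition, then
// dropping the duplicate join column from one side.
val viaDrop = itemsDF
  .join(purchasesDF, itemsDF("itemname") === purchasesDF("itemname"))
  .drop(purchasesDF("itemname"))

// Joining on Seq("itemname") de-duplicates the key column automatically
// and defaults to an inner join; pass a third argument for other join
// types, e.g. "left_outer".
val viaSeq = itemsDF.join(purchasesDF, Seq("itemname"), "inner")

viaDrop.show(false) // same rows as viaSeq; column order may differ
```

Note that with the condition-based form, referring to `itemname` without qualification after the join is ambiguous until one copy is dropped, which is why the `Seq`-based form is generally preferred for equi-joins.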

Hope this helps.