I have two datasets:
itemname  itemId  coupons
A         1       true
A         2       false
itemname  purchases
B         10
A         10
C         10
I need to get:
itemname  itemId  coupons  purchases
A         1       true     10
A         2       false    10
Here is what I am doing:
val mm = items.join(purchases, items("itemname") === purchases("itemname")).drop(items("itemname"))
Is this the correct way to do this in Spark Scala?
Answer 0 (score: 1)
This code:
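// Assumes `spark` is an existing SparkSession (as in spark-shell).
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val itemsSchema = List(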
StructField("itemname", StringType, nullable = false),
StructField("itemid", IntegerType, nullable = false),
StructField("coupons", BooleanType, nullable = false))
val purchasesSchema = List(
StructField("itemname", StringType, nullable = false),
StructField("purchases", IntegerType, nullable = false))
val items = Seq(Row("A", 1, true), Row("A", 2, false))
val purchases = Seq(Row("A", 10), Row("B", 10), Row("C", 10))
val itemsDF = spark.createDataFrame(
spark.sparkContext.parallelize(items),
StructType(itemsSchema)
)
val purchasesDF = spark.createDataFrame(
spark.sparkContext.parallelize(purchases),
StructType(purchasesSchema)
)
purchasesDF.join(itemsDF, Seq("itemname")).show(false)
gives:
+--------+---------+------+-------+
|itemname|purchases|itemid|coupons|
+--------+---------+------+-------+
|A       |10       |1     |true   |
|A       |10       |2     |false  |
+--------+---------+------+-------+
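If the column order from the question matters (itemname, itemId, coupons, purchases), putting the items DataFrame on the left side of the join should produce it, because a join on a Seq of column names emits the key columns first, then the remaining left columns, then the remaining right columns. The following is only a minimal sketch for comparison, not a definitive implementation: the names viaExpr and viaUsing and the local SparkSession settings are assumptions made for illustration, and the data is the sample data from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data from the question.
val items = Seq(("A", 1, true), ("A", 2, false))
  .toDF("itemname", "itemId", "coupons")
val purchases = Seq(("B", 10), ("A", 10), ("C", 10))
  .toDF("itemname", "purchases")

// Expression join: the result carries both itemname columns,
// so one of them has to be dropped explicitly. Dropping the
// right-hand copy keeps itemname as the first column.
val viaExpr = items
  .join(purchases, items("itemname") === purchases("itemname"))
  .drop(purchases("itemname"))

// Join on column names: itemname appears only once in the result,
// so no drop is needed.
val viaUsing = items.join(purchases, Seq("itemname"))

viaExpr.show(false)
viaUsing.show(false)
```

Both variants are inner joins, so the B and C rows of purchases, which have no matching item, are left out. The approach in the question also works; it drops the left-hand itemname instead, so the remaining itemname column comes after coupons rather than first.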
Hope this helps.