Spark joining DataFrames and aggregation

Date: 2016-07-25 16:21:44

Tags: scala apache-spark

This is a contrived example, but it captures what I am trying to do using Spark/Scala.

Pet types

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val pets = Array(Row(1,"Cat"), Row(2,"Dog"))
val petsRDD = sc.parallelize(pets)
val petSchema = StructType(Array(StructField("id",IntegerType), StructField("type",StringType)))
val petsDF = sqlContext.createDataFrame(petsRDD, petSchema)

Pet names

val petNames = Array(Row(1,1,"Tigger","M"), Row(2,1,"Winston","M"), Row(3,1,"Snowball","F"), Row(4,2,"Spot","M"), Row(5,2,"Barf","M"), Row(6,2,"Snoopy","M"))
val petNamesRDD = sc.parallelize(petNames)
val petNamesSchema = StructType(Array(StructField("id",IntegerType), StructField("pet_id",IntegerType), StructField("name",StringType), StructField("gender",StringType)))
val petNamesDF = sqlContext.createDataFrame(petNamesRDD, petNamesSchema)

From here I can join the DataFrames...

val join = petsDF.join(petNamesDF, petsDF("id") === petNamesDF("pet_id"), "leftouter")

Result

+---+-----+---+-------+--------+------+
| id| type| id| pet_id|    name|gender|
+---+-----+---+-------+--------+------+
|  1|  Cat|  1|      1|  Tigger|     M|
|  1|  Cat|  2|      1| Winston|     M|
|  1|  Cat|  3|      1|Snowball|     F|
|  2|  Dog|  4|      2|    Spot|     M|
|  2|  Dog|  5|      2|    Barf|     M|
|  2|  Dog|  6|      2|  Snoopy|     M|
+---+-----+---+-------+--------+------+

I want to flatten the result so that it looks something like this, so that I can map it into something for further processing.

((1,"Cat"),(1,"Tigger","M"),(2,"Winston","M"),(3,"Snowball","F"))
((2,"Dog"),(1,"Spot","M"),(2,"barf","M"),(3,"Snoopy","F"))

I started looking at UserDefinedAggregateFunctions, but I couldn't really get one to work. I didn't try very hard, but it doesn't seem like a good fit anyway.
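
For reference, this is roughly what I was attempting with the 1.5+ UserDefinedAggregateFunction API: collect (id, name, gender) structs into an array column. A rough, untested sketch (the CollectNames class name is mine), and it feels heavyweight for what is essentially a group-by:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Untested sketch: accumulate (id, name, gender) rows into an array buffer.
class CollectNames extends UserDefinedAggregateFunction {
  private val nameStruct = StructType(Array(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("gender", StringType)))

  def inputSchema: StructType = nameStruct
  def bufferSchema: StructType = StructType(Array(StructField("names", ArrayType(nameStruct))))
  def dataType: DataType = ArrayType(nameStruct)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[Row]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Row](0) :+ input

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Row](0) ++ buffer2.getSeq[Row](0)

  def evaluate(buffer: Row): Any = buffer.getSeq[Row](0)
}

// usage sketch:
// join.groupBy(petsDF("id"), petsDF("type"))
//     .agg(new CollectNames()(petNamesDF("id"), petNamesDF("name"), petNamesDF("gender")))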

I also looked at using map to turn each petsDF row into a (pet, list of petNames) pair, but nested DataFrames are not allowed.
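
The DataFrame-side equivalent would be collect_list over a struct, which avoids nesting DataFrames by packing the name columns into a single array column. As far as I can tell this needs Spark 2.0+ (on the 1.x line collect_list comes from Hive and is more limited), so treat it as a sketch:

import org.apache.spark.sql.functions.{collect_list, struct}

// One row per pet, with the matching names collected as an array of structs.
val grouped = join
  .groupBy(petsDF("id"), petsDF("type"))
  .agg(collect_list(struct(petNamesDF("id"), petNamesDF("name"), petNamesDF("gender"))).as("names"))

grouped.show(false)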

I am hoping I am missing something built into Spark, or an idea on how to make this work. I am new to Spark/Scala. Thanks.

0 Answers:

No answers