This is a contrived example, but it captures what I am trying to do with Spark / Scala.
Pet types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val pets = Array(Row(1,"Cat"),Row(2,"Dog"))
val petsRDD = sc.parallelize(pets)
val petSchema = StructType(Array(StructField("id",IntegerType),StructField("type",StringType)))
val petsDF = sqlContext.createDataFrame(petsRDD,petSchema)
Pet names
val petNames = Array(Row(1,1,"Tigger","M"),Row(2,1,"Winston","M"),Row(3,1,"Snowball","F"),
  Row(4,2,"Spot","M"),Row(5,2,"Barf","M"),Row(6,2,"Snoopy","F"))
val petNamesRDD = sc.parallelize(petNames)
val petNameSchema = StructType(Array(StructField("id",IntegerType),StructField("pet_id",IntegerType),StructField("name",StringType),StructField("gender",StringType)))
val petNamesDF = sqlContext.createDataFrame(petNamesRDD,petNameSchema)
From here I can join the DataFrames...
val join = petsDF.join(petNamesDF, petsDF("id") === petNamesDF("pet_id"), "leftouter")
Result
+---+----+---+------+--------+------+
| id|type| id|pet_id|    name|gender|
+---+----+---+------+--------+------+
|  1| Cat|  1|     1|  Tigger|     M|
|  1| Cat|  2|     1| Winston|     M|
|  1| Cat|  3|     1|Snowball|     F|
|  2| Dog|  4|     2|    Spot|     M|
|  2| Dog|  5|     2|    Barf|     M|
|  2| Dog|  6|     2|  Snoopy|     F|
+---+----+---+------+--------+------+
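I want to flatten the result so that it looks like the following, with one record per pet type, so that I can map over it for further processing.
((1,"Cat"),(1,"Tigger","M"),(2,"Winston","M"),(3,"Snowball","F"))
((2,"Dog"),(1,"Spot","M"),(2,"Barf","M"),(3,"Snoopy","F"))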
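One possible way to get a shape like this is to group the joined rows by pet type and collect the name columns into a list of structs. The sketch below is not a confirmed solution; it assumes Spark 2.0 or later (it uses `SparkSession` and `toDF` instead of the `sqlContext` setup above), and the names `joined`, `flattened`, and the `names` column are made up for illustration. It keeps the original name ids (4, 5, 6 for the dogs) rather than renumbering them per group, and the order of entries inside each list is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Assumption: Spark 2.0+; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("flatten-pets").getOrCreate()
import spark.implicits._

// Same data as in the question, built with toDF for brevity.
val petsDF = Seq((1, "Cat"), (2, "Dog")).toDF("id", "type")
val petNamesDF = Seq(
  (1, 1, "Tigger", "M"), (2, 1, "Winston", "M"), (3, 1, "Snowball", "F"),
  (4, 2, "Spot", "M"), (5, 2, "Barf", "M"), (6, 2, "Snoopy", "F")
).toDF("id", "pet_id", "name", "gender")

val joined = petsDF.join(petNamesDF, petsDF("id") === petNamesDF("pet_id"), "leftouter")

// One output row per pet type; "names" (hypothetical column name) holds an
// array of (id, name, gender) structs taken from the pet-name side of the join.
val flattened = joined
  .groupBy(petsDF("id"), petsDF("type"))
  .agg(collect_list(struct(petNamesDF("id"), petNamesDF("name"), petNamesDF("gender"))).as("names"))

flattened.show(false)
```

Each row of `flattened` can then be mapped over with `flattened.rdd.map(...)` or `flattened.collect()`. Because the join is a left outer join, a pet type with no names would probably end up with a single all-null struct in its list rather than an empty list, so that case is worth checking before relying on it.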
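I started looking at UserDefinedAggregateFunctions, but I couldn't really get one working. I didn't try very hard, and it doesn't seem like a good fit anyway.
I also looked at using map to turn each petsDF row into a (pet, list of pet names) pair, but nested DataFrames are not allowed.
I'm hoping I've missed something built into Spark, or that someone has an idea for how to make this work. I'm new to Spark / Scala. Thanks.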