How can I join one frame as a nested column of another using Glue or PySpark?

Asked: 2019-07-04 07:46:49

Tags: pyspark aws-glue

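I have 2 flat datasets with a relationship between them, and I want to nest one inside the other (i.e. convert the relational tables into nested JSON).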

Sample data

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Join

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Create some sample data
TableA = spark.createDataFrame(
    schema = ['name', 'a_id'],
    data   = [('Pirate',1),('Monkey',2)]
)

TableB = spark.createDataFrame(
    schema = ['name', 'b_id', 'a_id'],
    data   = [('banana', 1, 1),('ball', 2, 1),('coffee', 3, 2),('plant', 4, 2)]
)

# wrap in Glue DynamicFrame
# note: pretend we started with DynamicFrames, since we're working with Glue ETL Jobs
dfA = DynamicFrame.fromDF(TableA, glueContext, "TableA")
dfA.toDF().show()

dfB = DynamicFrame.fromDF(TableB, glueContext, "TableB")
dfB.toDF().show()

Which prints:

+------+----+
|  name|a_id|
+------+----+
|Pirate|   1|
|Monkey|   2|
+------+----+

+------+----+----+
|  name|b_id|a_id|
+------+----+----+
|banana|   1|   1|
|  ball|   2|   1|
|coffee|   3|   2|
| plant|   4|   2|
+------+----+----+

Join attempt

What I tried: a Join, following the documentation

joined = Join.apply(dfA, dfB, 'a_id', 'a_id')
joined.toDF().show()

Prints:

+----+------+----+------+-----+
|b_id|  name|a_id| .name|.a_id|
+----+------+----+------+-----+
|   1|banana|   1|Pirate|    1|
|   2|  ball|   1|Pirate|    1|
|   3|coffee|   2|Monkey|    2|
|   4| plant|   2|Monkey|    2|
+----+------+----+------+-----+

Desired output

What I would like to see is something like one record per TableA row, with its matching TableB rows nested inside it.

I think this is a left join with the results grouped... but I don't know how to do it.

0 answers:

No answers yet