Combining the columns of three DataFrames into one DataFrame

Time: 2019-08-22 03:34:55

Tags: pyspark-sql

In PySpark, I created three DataFrames: B1, P1, and C1.

    DataFrame B1 has five columns (B_Num, B_Tin, B_Light, B_Dark, and B_White)
    DataFrame P1 has three columns (P_Prov, P_Tip, and P_Bye)
    DataFrame C1 has three columns (C_Cust, C_Addr1, and C_Addr2)

I tried to combine the three DataFrames, but the result is not what I wanted.

    from pyspark.sql.functions import monotonically_increasing_id

    B1 = B1.withColumn("id", monotonically_increasing_id())
    P1 = P1.withColumn("id", monotonically_increasing_id())
    C1 = C1.withColumn("id", monotonically_increasing_id())

    combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")
    display(combined)

Here is the column order of the combined output:

    B_Num, B_Tin, B_Light, B_Dark, B_White, P_Prov, P_Tip, P_Bye, C_Cust, C_Addr1, C_Addr2

I expected output like this:

    B_Num, P_Prov, B_Tin, C_Addr2, B_Light, P_Tip, C_Cust, B_Dark, B_White, P_Bye, C_Addr1

1 Answer:

Answer 0 (score: 0)

Since your problem is only the ordering of the columns (as noted in the comments), you can select them in the order you want:

    from pyspark.sql.functions import monotonically_increasing_id

    # Add a row id to each DataFrame to use as the join key
    B1 = B1.withColumn("id", monotonically_increasing_id())
    P1 = P1.withColumn("id", monotonically_increasing_id())
    C1 = C1.withColumn("id", monotonically_increasing_id())

    combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")

    # Select the columns in the desired order
    good_ordering = combined.select("B_Num", "P_Prov", "B_Tin", "C_Addr2", "B_Light", "P_Tip", "C_Cust", "B_Dark", "B_White", "P_Bye", "C_Addr1")
    display(good_ordering)
    # Column order: B_Num, P_Prov, B_Tin, C_Addr2, B_Light, P_Tip, C_Cust, B_Dark, B_White, P_Bye, C_Addr1
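
If the column list gets long, it may help to build and check the ordering before passing it to `select`. The following is only a minimal sketch in plain Python, not part of the original answer: the helper name `ordered_columns` is hypothetical, and the `available` list stands in for `combined.columns` so the snippet runs without a Spark session.

```python
def ordered_columns(available, desired):
    """Return the desired column order, validated against the available columns.

    Hypothetical helper for illustration; it is not a PySpark API.
    """
    # Fail early on a misspelled or missing column name.
    missing = [c for c in desired if c not in available]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    # Keep any columns not listed in `desired` at the end, in their
    # original order, so nothing is silently dropped.
    remaining = [c for c in available if c not in desired]
    return list(desired) + remaining


# Stand-in for combined.columns (the order produced by the joins above).
available = ["B_Num", "B_Tin", "B_Light", "B_Dark", "B_White",
             "P_Prov", "P_Tip", "P_Bye", "C_Cust", "C_Addr1", "C_Addr2"]

# The order asked for in the question.
desired = ["B_Num", "P_Prov", "B_Tin", "C_Addr2", "B_Light", "P_Tip",
           "C_Cust", "B_Dark", "B_White", "P_Bye", "C_Addr1"]

new_order = ordered_columns(available, desired)
print(new_order)
```

With a real DataFrame, the result could then be used as `combined.select(*new_order)`, since `select` accepts column names as separate arguments.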