Append lists to Spark dataframe columns based on matching keys in a second dataframe

Asked: 2020-09-12 16:11:41

Tags: pyspark apache-spark-sql user-defined-functions pyspark-dataframes

I have two Spark dataframes with the same column names, and I want to extend some of the list columns in the first df with the lists from the same columns in df2 wherever the key columns match.

df1:
+---+-----+-------+---------+-----+-------+-------+-------+
|k1 | k2  |list1  |list2    |list3|list4  |list5  |list6  |
+---+-----+-------+---------+-----+-------+-------+-------+
|  a| 121 |[car1] |[price1] |[1]  |[False]|[0.000]|[vfdvf]|
|  b| 11  |[car3] |[price3] |[2]  |[False]|[1.000]|[00000]|
|  c| 23  |[car3] |[price3] |[4]  |[False]|[2.500]|[fdabh]|
|  d| 250 |[car6] |[price6] |[6]  |[True] |[0.450]|[00000]|
+---+-----+-------+---------+-----+-------+-------+-------+


df2:
+---+-----+-------+---------+-----+-------+-------+-------+
|k1 | k2  |list1  |list2    |list3|list4  |list5  |list6  |
+---+-----+-------+---------+-----+-------+-------+-------+
|  m| 121 |[car5] |[price5] |[5]  |[False]|[3.000]|[vfdvf]|
|  b| 11  |[car8] |[price8] |[8]  |[False]|[2.000]|[mnfaf]|
|  c| 23  |[car7] |[price7] |[7]  |[False]|[1.500]|[00000]|
|  n| 250 |[car9] |[price9] |[9]  |[False]|[0.450]|[00000]|
+---+-----+-------+---------+-----+-------+-------+-------+

Since the list columns are related to each other, the order of their items must be preserved. Is there any way to append the whole lists from df2 to df1, but only where key1 and key2 match between the two dfs?

The result should look like this (I could not fit the list6 column in, but I would expect the same pattern there as in the other list columns):

+--+--+-----------+---------------+-----+-------------+-------------+
|k1|k2|list1      |list2          |list3|list4        |list5        |
+--+--+-----------+---------------+-----+-------------+-------------+
|b |11|[car3,car8]|[price3,price8]|[2,8]|[False,False]|[1.000,2.000]|
|c |23|[car3,car7]|[price3,price7]|[4,7]|[False,False]|[2.500,1.500]|
+--+--+-----------+---------------+-----+-------------+-------------+

I am still fairly new to UDFs and could not find a similar question on Stack Overflow. The only comparable one I found uses pandas (How to merge two list columns when merging DataFrames?), which is far too slow for my use case. Any insight would be greatly appreciated.
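To make the expected behaviour more concrete, this is roughly the shape of the operation I am after, sketched with the DataFrame API (untested, just to clarify the requirement; it assumes concat can be applied to array columns, as it can from Spark 2.4 on):

    from pyspark.sql import functions as F

    list_cols = ["list1", "list2", "list3", "list4", "list5", "list6"]

    # keep only rows whose k1 and k2 appear in both frames,
    # then append df2's list to df1's list, column by column
    result = (
        df1.alias("a")
        .join(df2.alias("b"), on=["k1", "k2"], how="inner")
        .select("k1", "k2",
                *[F.concat(F.col("a." + c), F.col("b." + c)).alias(c)
                  for c in list_cols]))
    result.show(truncate=False)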

2 Answers:

Answer 0 (score: 0)

First, you need to create your own schema as shown below; then your code will work. Please use my updated code.

Try this: you do not need a UDF. First do an inner join and then concatenate the lists.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType, ArrayType)

# assumes an active SparkSession is available as `spark`
# note: list4 is declared as a plain StringType here, so concat() does
# string concatenation for that column (hence the [False][False] output below)
table_schema = StructType([StructField('key1', StringType(), True),
                           StructField('key2', IntegerType(), True),
                           StructField('list1', ArrayType(StringType()), False),
                           StructField('list2', ArrayType(StringType()), False),
                           StructField('list3', ArrayType(IntegerType()), False),
                           StructField('list4', StringType(), False),
                           StructField('list5', ArrayType(FloatType()), False),
                           StructField('list6', ArrayType(StringType()), False)
                           ])

df = spark.createDataFrame(
    [
        ("a", 121, ["car1"], ["price1"], [1], ["False"], [0.000], ["vfdvf"]),
        ("b", 11,  ["car3"], ["price3"], [2], ["False"], [1.000], [00000]),
        ("c", 23,  ["car3"], ["price3"], [4], ["False"], [2.500], ["fdabh"]),
        ("d", 250, ["car6"], ["price6"], [6], ["True"],  [0.450], [00000])
    ], table_schema
)

df2 = spark.createDataFrame(
    [
        ("m", 121, ["car5"], ["price5"], [5], ["False"], [3.000], ["vfdvf"]),
        ("b", 11,  ["car8"], ["price8"], [8], ["False"], [2.000], ["mnfaf"]),
        ("c", 23,  ["car7"], ["price7"], [7], ["False"], [1.500], [00000]),
        ("n", 250, ["car9"], ["price9"], [9], ["False"], [0.450], [00000])
    ], table_schema
)

df.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")

# inner join on both key columns, then concat() each pair of list columns
spark.sql("select a.key1, a.key2, concat(a.list1,b.list1) List1, concat(a.list2,b.list2) List2, \
          concat(a.list3,b.list3) List3, concat(a.list4,b.list4) List4, \
          concat(a.list5,b.list5) List5, \
          concat(a.list6,b.list6) List6 \
          from A a inner join B b on a.key1 = b.key1 and a.key2 = b.key2 \
          order by a.key1").show(truncate=False)

+----+----+------------+----------------+------+--------------+----------+----------+
|key1|key2|List1       |List2           |List3 |List4         |List5     |List6     |
+----+----+------------+----------------+------+--------------+----------+----------+
|b   |11  |[car3, car8]|[price3, price8]|[2, 8]|[False][False]|[1.0, 2.0]|[0, mnfaf]|
|c   |23  |[car3, car7]|[price3, price7]|[4, 7]|[False][False]|[2.5, 1.5]|[fdabh, 0]|
+----+----+------------+----------------+------+--------------+----------+----------+
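
If there were many more listN columns, the same SQL could be assembled programmatically rather than typed out column by column; a rough sketch of that idea, reusing the views A and B registered above:

# build one concat(...) expression per non-key column
list_cols = [c for c in df.columns if c not in ("key1", "key2")]
select_expr = ", ".join("concat(a.{0}, b.{0}) as {0}".format(c) for c in list_cols)

query = ("select a.key1, a.key2, " + select_expr +
         " from A a inner join B b"
         " on a.key1 = b.key1 and a.key2 = b.key2"
         " order by a.key1")
spark.sql(query).show(truncate=False)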

Answer 1 (score: 0)

I found the answer to my question and am posting it here for future reference, in case it helps anyone else facing the same problem.

    from pyspark.sql.types import BooleanType
    from pyspark.sql.types import StringType
    from pyspark.sql.types import DoubleType
    from pyspark.sql.types import IntegerType
    from pyspark.sql.types import ArrayType
    from pyspark.sql.types import LongType
    from pyspark.sql.types import ByteType

    def concatTypesFunc(array1, array2):
        # append the second list to the first and return the combined list
        final_array = array1 + array2
        return final_array

    # register the function once per element type that appears in the list columns
    # (each call re-registers the same SQL name with a different return type)
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(BooleanType()))
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(StringType()))
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(DoubleType()))
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(IntegerType()))
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(LongType()))
    spark.udf.register("concat_types", concatTypesFunc, ArrayType(ByteType()))

    # table_schema: the same StructType defined in the previous answer
    df = spark.createDataFrame(
        [
            ("a", 121, ["car1"], ["price1"], [1], ["False"], [0.000], ["vfdvf"]),
            ("b", 11,  ["car3"], ["price3"], [2], ["False"], [1.000], [00000]),
            ("c", 23,  ["car3"], ["price3"], [4], ["False"], [2.500], ["fdabh"]),
            ("d", 250, ["car6"], ["price6"], [6], ["True"],  [0.450], [00000])
        ], table_schema
    )

    df2 = spark.createDataFrame(
        [
            ("m", 121, ["car5"], ["price5"], [5], ["False"], [3.000], ["vfdvf"]),
            ("b", 11,  ["car8"], ["price8"], [8], ["False"], [2.000], ["mnfaf"]),
            ("c", 23,  ["car7"], ["price7"], [7], ["False"], [1.500], [00000]),
            ("n", 250, ["car9"], ["price9"], [9], ["False"], [0.450], [00000])
        ], table_schema
    )
    df.createOrReplaceTempView("a")
    df2.createOrReplaceTempView("b")
    spark.sql("select a.key1, a.key2, concat_types(a.list1,b.list1)List1 ,concat_types(a.list2,b.list2)List2, \
    concat_types(a.list3,b.list3)List3 ,concat_types(a.list4,b.list4)List4,\
              concat_types(a.list5,b.list5)List5 ,\
              concat_types(a.list6,b.list6)List6 \
    from a inner join b on a.key1=b.key1 order by a.key1").show(truncate=False)

+----+----+------------+----------------+------+--------------+----------+----------+
|key1|key2|List1       |List2           |List3 |List4         |List5     |List6     |
+----+----+------------+----------------+------+--------------+----------+----------+
|b   |11  |[car3, car8]|[price3, price8]|[2, 8]|[False][False]|[1.0, 2.0]|[0, mnfaf]|
|c   |23  |[car3, car7]|[price3, price7]|[4, 7]|[False][False]|[2.5, 1.5]|[fdabh, 0]|
+----+----+------------+----------------+------+--------------+----------+----------+
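
The registered UDF can also be called from the DataFrame API through expr, which avoids the temp views; a short sketch assuming the same df, df2 and concat_types registration as above:

    from pyspark.sql import functions as F

    joined = df.alias("a").join(
        df2.alias("b"),
        on=[F.col("a.key1") == F.col("b.key1"),
            F.col("a.key2") == F.col("b.key2")],
        how="inner")

    # expr() lets a SQL-registered UDF be used inside a DataFrame select
    result = joined.select(
        F.col("a.key1").alias("key1"),
        F.col("a.key2").alias("key2"),
        *[F.expr("concat_types(a.{0}, b.{0})".format(c)).alias(c)
          for c in ["list1", "list2", "list3", "list4", "list5", "list6"]])
    result.show(truncate=False)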