What is the most efficient way to perform two different joins on the same two DataFrames in PySpark?

Asked: 2019-07-12 07:38:51

Tags: apache-spark pyspark pyspark-sql

I am trying to compare two DataFrames to find new records and updated records, which in turn will be used to create a third DataFrame. I am using PySpark 2.4.3.

Since I come from a SQL background (ASE), my initial thought was to do a left join to find the new records and use a != on a hash of all the columns to find the updates:

SELECT a.*
FROM Todays_Data a
Left Join Yesterdays_PK_And_Hash b on a.pk = b.pk
WHERE (b.pk IS NULL) --finds new records
OR (b.hashOfColumns != HASHBYTES('md5',<converted and concatenated columns>)) --updated records

I have been playing around with PySpark and came up with a script that achieves the result I am after:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit

sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)

sp = SparkSession \
    .builder \
    .appName("test App") \
    .getOrCreate()

df = sp.createDataFrame(
    [("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"),  # hashkey here is created based on  YOB of 1973.  To test for an update
     ("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
     ("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
    ["First_name", "Last_name", "hashkey"]
)

df_a = sp.createDataFrame(
    [("Fred", "Smith", "Adelaide", "Doctor", 1971),
     ("Fred", "Davis", "Melbourne", "Baker", 1970),
     ("Barry", "Clarke", "Sydney", "Scientist", 1975),
     ("Jane", "Hall", "Sydney", "Dentist", 1980)],
    ["First_name", "Last_name", "City", "Occupation", "YOB"]
)

df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))

df_ins = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
                              (col('a.Last_name') == col('b.Last_name')), 'left_anti') \
    .select(lit("Insert").alias("_action"), 'a.*') \
    .dropDuplicates()

df_up = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
                             (col('a.Last_name') == col('b.Last_name')) &
                             (col('a.hashkey') != col('b.hashkey')), 'inner') \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()

df_delta = df_ins.union(df_up).sort("YOB")

df_delta = df_delta.drop("hashkey")

df_delta.show(truncate=False)

This produces my final delta, for example:

+-------+----------+---------+--------+----------+----+
|_action|First_name|Last_name|City    |Occupation|YOB |
+-------+----------+---------+--------+----------+----+
|Update |Fred      |Smith    |Adelaide|Doctor    |1971|
|Insert |Jane      |Hall     |Sydney  |Dentist   |1980|
+-------+----------+---------+--------+----------+----+

While I am getting the result I want, I am not sure how efficient the above code is.

Ultimately, I would like to run similar patterns against datasets of 100+ million records.

Is there any way to make this more efficient?

Thanks

1 Answer:

Answer 0 (score: 0)

Have you explored broadcast joins? With 100M+ records the join statements are likely to become a problem. If dataset B is the smaller one, this is the minor modification I would try:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit, broadcast

sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)

sp = SparkSession \
    .builder \
    .appName("test App") \
    .getOrCreate()

df = sp.createDataFrame(
    [("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"),  # hashkey here is created based on  YOB of 1973.  To test for an update
     ("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
     ("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
    ["First_name", "Last_name", "hashkey"]
)

df_a = sp.createDataFrame(
    [("Fred", "Smith", "Adelaide", "Doctor", 1971),
     ("Fred", "Davis", "Melbourne", "Baker", 1970),
     ("Barry", "Clarke", "Sydney", "Scientist", 1975),
     ("Jane", "Hall", "Sydney", "Dentist", 1980)],
    ["First_name", "Last_name", "City", "Occupation", "YOB"]
)

df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))

df_ins = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
                              (col('a.Last_name') == col('b.Last_name')), 'left_anti') \
    .select(lit("Insert").alias("_action"), 'a.*') \
    .dropDuplicates()

df_up = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
                             (col('a.Last_name') == col('b.Last_name')) &
                             (col('a.hashkey') != col('b.hashkey')), 'inner') \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()

df_delta = df_ins.union(df_up).sort("YOB")
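Two further tweaks that may help at that scale (general Spark practice rather than anything required by the broadcast hint above): Spark already broadcasts automatically when the smaller side is below spark.sql.autoBroadcastJoinThreshold, and since df_a is scanned by both joins, caching it once the hash column is added avoids computing the md5 twice. A rough sketch, assuming yesterday's key/hash DataFrame fits comfortably in executor memory:

# Assumed tuning: raise the auto-broadcast threshold (in bytes) so the smaller
# DataFrame is shipped to every executor instead of being shuffled.
sp.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

# df_a feeds both the anti join and the inner join, so cache it after the
# hashkey column is added to avoid recomputing the md5 for each join.
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns))).cache()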

It might also be easier to follow if the code were rewritten more cleanly.

@Ash, from a readability point of view, there are a few things you could do:

  1. Use variables.
  2. Use functions (see the sketch at the end of this answer).
  3. Follow the PEP-8 style guide where possible (e.g. a maximum of 80 characters per line).
joinExpr = (col('a.First_name') == col('b.First_name')) & \
           (col('a.Last_name') == col('b.Last_name'))

# New records: rows in df_a whose key does not exist in df
df_ins = df_a.alias('a').join(broadcast(df.alias('b')), joinExpr, 'left_anti') \
    .select(lit("Insert").alias("_action"), 'a.*') \
    .dropDuplicates()

# Updated records: the key matches but the row hash differs
df_up = df_a.alias('a').join(broadcast(df.alias('b')),
                             joinExpr & (col('a.hashkey') != col('b.hashkey')),
                             'inner') \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()

This is still fairly long, but you get the idea.
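For point 2, a minimal sketch of how the shared join logic could be pulled into a helper function (the name tag_action and its signature are my own, not from the original code):

def tag_action(today_df, yesterday_df, join_expr, join_type, action):
    """Join today's data against yesterday's keys/hashes and label each
    surviving row with the given action ('Insert' or 'Update')."""
    return today_df.alias('a') \
        .join(broadcast(yesterday_df.alias('b')), join_expr, join_type) \
        .select(lit(action).alias("_action"), 'a.*') \
        .dropDuplicates()

df_ins = tag_action(df_a, df, joinExpr, 'left_anti', 'Insert')
df_up = tag_action(df_a, df,
                   joinExpr & (col('a.hashkey') != col('b.hashkey')),
                   'inner', 'Update')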